#1
  1. (retired)
    Devshed Supreme Being (6500+ posts)

    Join Date
    Dec 2003
    Location
    The Laboratory
    Posts
    10,101
    Rep Power
    0

    decomposing html entities?


    Evening all,

    Does anyone know of a way of decomposing HTML entities (e.g. &#62;) to text (UTF-8, Latin-1, whatever)?

    I can't see anything in the manuals I have and a google didn't turn up anything useful.

    Or will I have to write a string replace system?

    --Simon
  2. #2
  3. Hello World :)
    Devshed Frequenter (2500 - 2999 posts)

    Join Date
    Mar 2003
    Location
    Hull, UK
    Posts
    2,537
    Rep Power
    69
    A while back, when I first worked with HTML entities from Python (for my Net module below), I wrote several functions for converting special characters to entities. However, I've never needed a function to convert them back again.

    For anyone who's interested: http://forums.devshed.com/t129666/s.html&highlight=net+module

    But now I have the excuse, here's a simple function that should do what you want. It lacks any error checking but should work with all valid decimal [numeric] entities written with the trailing semicolon.

    Code:
    def convert(entity):
        # Strip the leading '&#' and the trailing ';', then map the code point to a character.
        return chr(int(entity[2:-1]))
    Here's the same thing as a list comprehension in the Python shell, just to show the conversion in action:

    Code:
    >>> entities = ('&#62;', '&#38;', '&#60;')
    >>> [chr(int(entity[2:-1])) for entity in entities]
    ['>', '&', '<']
    >>>
    As you can see, it's surprisingly easy! And converting back again isn't much more difficult once you know how it all works.
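
    For anyone who wants both directions on whole strings, here is a rough sketch with a little error checking added. It assumes Python 3 (where chr covers all of Unicode), and the names NUMERIC_ENTITY, decode_numeric and encode_numeric are made up for illustration; they are not part of the Net module.

```python
import re
from html import unescape  # standard library, Python 3.4+

# Matches decimal (&#62;) and hexadecimal (&#x3E;) numeric entities.
NUMERIC_ENTITY = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def decode_numeric(text):
    """Replace every numeric entity in text with its character."""
    def replace(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return chr(code)
        except (ValueError, OverflowError):
            # Code point out of range: leave the entity untouched.
            return match.group(0)
    return NUMERIC_ENTITY.sub(replace, text)


def encode_numeric(text, chars="<>&"):
    """Convert the listed characters back into decimal entities."""
    return "".join("&#%d;" % ord(c) if c in chars else c for c in text)


print(decode_numeric("a &#62; b &#x26; c"))
print(encode_numeric("1 < 2 & 3 > 2"))
print(unescape("&gt;&amp;&lt;"))
```

    The regular expression saves writing a string replace system by hand. If named entities such as &gt; need handling too, html.unescape from the standard library (Python 3.4 and later) decodes both named and numeric forms.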

    Hope this helps,

    Mark.
    Last edited by netytan; January 28th, 2005 at 12:51 PM. Reason: Added URL to my Net Module.
    programming language development: www.netytan.com Hula

  4. #3
  5. (retired)
    Devshed Supreme Being (6500+ posts)

    Join Date
    Dec 2003
    Location
    The Laboratory
    Posts
    10,101
    Rep Power
    0
    Mental Note: search devshed before asking

    Thanks Mark - this looks excellent.

    -Simon
