Page 1 of 2 12 Last
  • Jump to page:
    #1
  1. No Profile Picture
    I hate nerds
    Devshed Novice (500 - 999 posts)

    Join Date
    Jul 2003
    Posts
    540
    Rep Power
    0

    regular expression for html


    I need to extract tags and their contents, using any method. I think a good way is to use re, so here goes.

    r'<[a-zA-Z]+\W*>\W*</[a-zA-Z]+>'

    I THINK \W means everything under the sun, including spaces, but I'm not sure. If it doesn't, please let me know.

    One problem I can think of right away is that this pattern would also accept mismatched tags such as

    <a>sdfs</b>

    Is there a way to specify that one group must equal another group?
  2. #2
  3. Hello World :)
    Devshed Frequenter (2500 - 2999 posts)

    Join Date
    Mar 2003
    Location
    Hull, UK
    Posts
    2,537
    Rep Power
    69
    I think the trick with this one is to use backreferences, although the closest I've got is this regex:

    '<(.+?).*?>(.+?)</\\1>'

    with findall(). However, this won't handle nested tags at all, so it would have to be called again on its own results until all the data is retrieved.
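A minimal sketch of the backreference idea, written for Python 3. The pattern is a tightened variant rather than the exact regex above (it assumes tag names are plain letters and digits), and `sample` and `nested` are made-up inputs:

```python
import re

# \1 must repeat whatever group 1 (the tag name) captured.
PAIRED_TAG = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)[^>]*>(.*?)</\1>", re.DOTALL)

sample = "<a>link</b> <b>bold</b> <i>it<b>x</b></i>"
nested = "<b>a<b>b</b>c</b>"

# findall returns one (name, contents) tuple per non-overlapping match.
pairs = PAIRED_TAG.findall(sample)
nested_pairs = PAIRED_TAG.findall(nested)

print(pairs)         # [('b', 'bold'), ('i', 'it<b>x</b>')]
print(nested_pairs)  # [('b', 'a<b>b')]
```

The mismatched `<a>link</b>` is skipped, and the second result shows the nesting limitation: the lazy match stops at the first `</b>` it finds.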

    What about removing all the HTML and parsing the results? It really depends on what you're after.

    A period matches everything except newlines, and you can make it match those too by passing the re.DOTALL flag when you compile your regex (assuming you're compiling them). \W matches any non-alphanumeric character (strictly, anything other than a letter, digit or underscore).
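A short Python 3 sketch of both points, using a made-up two-line string. The flag can also be passed straight to `re.search` without compiling first:

```python
import re

text = "<p>line one\nline two</p>"

# Without DOTALL, "." stops at the newline, so nothing matches.
no_flag = re.search(r"<p>(.*)</p>", text)
# With DOTALL, "." also matches "\n".
with_flag = re.search(r"<p>(.*)</p>", text, re.DOTALL)

# \W matches one non-word character; word characters are
# letters, digits and the underscore.
non_word = re.findall(r"\W", "a b_c-1")

print(no_flag)             # None
print(with_flag.group(1))  # both lines, newline included
print(non_word)            # [' ', '-']
```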

    That said, for parsing HTML I'd suggest you use one of Python's built-in modules, since parsing HTML can be quite tricky! Which is part of the reason XHTML was created!

    http://www.python.org/doc/current/li...e-htmllib.html
    http://www.python.org/doc/current/li...TMLParser.html
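As a rough illustration of the standard-library route, here is a sketch for Python 3, where the parser lives in `html.parser` (the links above point at the older module names). `TagLister` and its `events` list are hypothetical names made up for this example:

```python
from html.parser import HTMLParser

class TagLister(HTMLParser):
    """Collect start tags, end tags and text in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only text between tags.
        if data.strip():
            self.events.append(("text", data.strip()))

lister = TagLister()
lister.feed("<p><b>Hey there!</b></p>")
lister.close()
print(lister.events)
```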

    Mark.
    programming language development: www.netytan.com Hula

  4. #3
  5. No Profile Picture
    I hate nerds
    Devshed Novice (500 - 999 posts)

    Join Date
    Jul 2003
    Posts
    540
    Rep Power
    0
    I'm new to Python (just started a few days ago), but I think what you wrote is all I need. Thanks!

    All I need is a regex that finds the first tag in a string of text and extracts the tag type and everything inside that tag, including nested tags.

    '<(?P<tagname>.+?).*?>(?P<contents>.+?)</\\1>'

    I assume that \\1 references the first group?

    Wouldn't '\\1' be a regex for \1?

    Aren't HTML tag names limited to starting with letters? I don't think there is a tag starting with a number. Could be wrong though.
  6. #4
  7. Hello World :)
    Devshed Frequenter (2500 - 2999 posts)

    Join Date
    Mar 2003
    Location
    Hull, UK
    Posts
    2,537
    Rep Power
    69
    Welcome to the Python world sad, I think you'll like it. It sounds pretty simple, so regex should work fine for this one.

    To answer your question: yes, \\1 was a reference to the first group (in this case the tag's name). You could use r'' instead, but I like to see what I'm doing when it comes to regex.
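To make the '\\1' versus r'' point concrete, a small Python 3 check. The pattern and input string are made up for illustration:

```python
import re

# Both literals hold the same two characters: a backslash and "1".
same_text = ("\\1" == r"\1")
# Without the doubled backslash or the r prefix, "\1" is an octal
# escape for the single character with code 1.
octal_escape = ("\1" == "\x01")

# Either spelling therefore gives the regex engine the backreference \1.
pattern = r"<(\w+)>.*?</\1>"
names = re.findall(pattern, "<b>x</b> <i>y</b>")

print(same_text, octal_escape)  # True True
print(names)                    # ['b']
```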

    Nope, there aren't any HTML tags which start with a number, but you do have things like H1 etc.

    Let me know how it goes

    Mark.

  8. #5
  9. No Profile Picture
    I hate nerds
    Devshed Novice (500 - 999 posts)

    Join Date
    Jul 2003
    Posts
    540
    Rep Power
    0
    Thanks Mark. Perhaps you could help me again... something's weird.
    This is my re pattern:

    htmlparse = r'(?P<tag><(?P<tagname>[!a-zA-Z][^\s>-]*)[^>]*>((?P<contents>.*)</\2>)?)|(?P<textnode>\s*[^<\s][^<]*\s*)'

    basically i check first for an html tag, taking into account that tags can be like

    <meta>
    <b></b>

    The second part takes text nodes into account, because I'm trying to extract the DOM model.

    Anyway, here is one instance where it does not work:

    "<!-- <b></b><meta></html>"

    ('<!-- <b>', '!', None, None, None)

    The first group is 'tag', and it shows a match for '<!-- <b>', but how can that be?
    The second group, tagname, is '!', and with that backreference there should be no match, since there is no instance of '</!>'.
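One possible explanation for the match above, sketched for Python 3: the `((?P<contents>.*)</\2>)?` part is optional, so when no `</!>` exists the engine simply skips it and the opening tag alone still matches. `STRICT` is a hypothetical variant of the same pattern with the trailing `?` removed, used only to show the difference:

```python
import re

HTMLPARSE = re.compile(
    r'(?P<tag><(?P<tagname>[!a-zA-Z][^\s>-]*)[^>]*>((?P<contents>.*)</\2>)?)'
    r'|(?P<textnode>\s*[^<\s][^<]*\s*)'
)
# Same pattern, but the contents-plus-closing-tag group is now required.
STRICT = re.compile(
    r'(?P<tag><(?P<tagname>[!a-zA-Z][^\s>-]*)[^>]*>((?P<contents>.*)</\2>))'
    r'|(?P<textnode>\s*[^<\s][^<]*\s*)'
)

sample = "<!-- <b></b><meta></html>"

loose = HTMLPARSE.match(sample)
strict = STRICT.match(sample)

print(loose.groups())  # ('<!-- <b>', '!', None, None, None)
print(strict)          # None
```

Making the group required would also reject tags like `<meta>` that have no closing tag, so the optional group is doing what it was written to do; the backreference is only consulted when the optional part is attempted.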
  10. #6
  11. Hello World :)
    Devshed Frequenter (2500 - 2999 posts)

    Join Date
    Mar 2003
    Location
    Hull, UK
    Posts
    2,537
    Rep Power
    69
    Yeah, I'd be happy to help. You're going to have to explain something though: what is with all the ?P's and <tagname>'s? I've never seen them before. I assumed they were just some kind of comment, but I'm not too sure now.

    Anyway, do you have an example of what you want to parse, so we can work out the best way to do this?

    Mark.

  12. #7
  13. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2003
    Posts
    133
    Rep Power
    12
    There is a nice way to use the re.split function with html code. It's a bit simple, but it can be useful.

    Normally, split will remove the token on which you split, but if you surround the pattern with parentheses (a capturing group), the matched tokens are included in the result.
    Code:
    >>> import re
    >>> html = """
    ... <html>
    ... <body>
    ... <p><b>Hey there!</b></p>
    ... </body>
    ... </html>
    ... """
    >>> for item in re.split(r"(<.*?>)", html):
    ...   print item,
    ...
    
    <html>
    <body>
    <p>  <b> Hey there! </b>  </p>
    </body>
    </html>
    >>>
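The same split as a plain script, sketched for Python 3 (where `print` is a function). Filtering out the empty and whitespace-only pieces is an addition made here for readability, not part of the session above:

```python
import re

html = "<html>\n<body>\n<p><b>Hey there!</b></p>\n</body>\n</html>\n"

# The capturing group keeps each tag in the result list; pieces that
# are empty or only whitespace are dropped.
pieces = [p for p in re.split(r"(<.*?>)", html) if p.strip()]
print(pieces)
```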
  14. #8
  15. No Profile Picture
    I hate nerds
    Devshed Novice (500 - 999 posts)

    Join Date
    Jul 2003
    Posts
    540
    Rep Power
    0
    Hi guys, thank you both for your help.

    netytan:
    The (?P<mygroupname>...) syntax gives the group a name, so that you can call matcher.group('tagname') instead of matcher.group(1).
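A small Python 3 sketch of named groups, including `(?P=tagname)`, the named form of a backreference. The pattern and the input string are made up for illustration:

```python
import re

NAMED = re.compile(
    r"<(?P<tagname>\w+)[^>]*>(?P<contents>.*?)</(?P=tagname)>"
)

m = NAMED.search('see <a href="x.html">the page</a> here')

# A named group is still reachable by its number.
print(m.group("tagname"), m.group(1))  # a a
print(m.group("contents"))             # the page
print(m.groupdict())
```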

    OK, basically this is what I am doing: I am parsing an HTML or XML document and building the DOM tree.

    The part of my algorithm that is weak is just the parsing function (go figure!). This function takes a string of HTML or XML and returns a tuple containing the name of the first tag encountered, the contents of that tag, and the index in the string where the tag ended. It works for the most part, except when the HTML is malformed like this:

    "<a sdfs <b>sdfs</b>"
    "<!-- safsdafsadf <b>sadfasd</b>"
  16. #9
  17. Hello World :)
    Devshed Frequenter (2500 - 2999 posts)

    Join Date
    Mar 2003
    Location
    Hull, UK
    Posts
    2,537
    Rep Power
    69
    Hehe, that's the problem with HTML, sad: it's just not strict enough!
    XML, on the other hand, is a lot easier to parse because it is!

    Thanks for the info, I must have skipped that part in the regex tutorial. I think maybe you could fix that by specifying the tag names, i.e.

    <(a|b|img|etc).*?>

    This way only valid HTML tags would match (or that's the theory).
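A quick Python 3 comparison of the two ways such a tag list could be written. Square brackets would form a character class (any one of the listed characters), whereas a group with `|` gives real alternation. The tag list and the `html` string are made up, and the `\b` is an extra assumption added here so that `b` does not also match `<body>`:

```python
import re

# Character class: exactly one character out of a, |, b, i, m, g.
char_class = re.compile(r"<[a|b|img].*?>")
# Alternation inside a non-capturing group, followed by a word boundary.
alternation = re.compile(r"<(?:a|b|img)\b[^>]*>")

html = '<body><a href="#">x</a><img src="y.png"><b>z</b>'

class_hits = char_class.findall(html)
alt_hits = alternation.findall(html)

print(class_hits)  # also picks up '<body>'
print(alt_hits)
```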

    Mark.

  18. #10
  19. No Profile Picture
    I hate nerds
    Devshed Novice (500 - 999 posts)

    Join Date
    Jul 2003
    Posts
    540
    Rep Power
    0
    Yeah, I guess I could. But I don't understand why that backreference doesn't work.

    I have the definitive guide to Python, but it said nothing about backreferencing. Do you have a link to a tutorial on it?

    BTW, I have another small issue, about dictionaries.
    I'm going to post another thread. Perhaps you could help? Thanks.
  20. #11
  21. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2001
    Location
    Houston, TX
    Posts
    383
    Rep Power
    13
    Why not just use Python's standard HTML Parser?

    http://www.python.org/doc/current/li...TMLParser.html
  22. #12
  23. No Profile Picture
    I hate nerds
    Devshed Novice (500 - 999 posts)

    Join Date
    Jul 2003
    Posts
    540
    Rep Power
    0
    I can't. For reasons not too long to explain, I must write my own.

    thanks though.
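For a hand-written version of the parsing function described earlier in the thread (first tag name, its contents, and the index where it ends), one possible sketch in Python 3 follows. `first_tag` is a hypothetical helper, and its policies are assumptions rather than anything settled in this thread: a broken opening tag such as `<a sdfs ` is skipped, comments are not handled specially, and an element with no closing tag is reported with `None` as its contents.

```python
import re

def first_tag(text):
    """Return (name, contents, end_index) for the first element, or None."""
    opening = re.search(r"<([a-zA-Z][a-zA-Z0-9]*)\b[^<>]*>", text)
    if opening is None:
        return None
    name = opening.group(1)

    # Every opening or closing tag with the same name, ignoring case.
    same_name = re.compile(r"<(/?)%s\b[^<>]*>" % re.escape(name), re.IGNORECASE)

    depth = 1  # how many elements of this name are currently open
    for tag in same_name.finditer(text, opening.end()):
        depth += -1 if tag.group(1) else 1
        if depth == 0:
            return name, text[opening.end():tag.start()], tag.end()

    # No matching close tag: treat it as an empty element like <meta>.
    return name, None, opening.end()

print(first_tag("<b>bold <b>inner</b> text</b> tail"))
print(first_tag("<a sdfs <b>sdfs</b>"))
print(first_tag("<meta><b>x</b>"))
```

Counting depth is what lets the first call return the whole outer `<b>` element instead of stopping at the inner `</b>`.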
  24. #13
  25. Banned ;)
    Devshed Supreme Being (6500+ posts)

    Join Date
    Nov 2001
    Location
    Woodland Hills, Los Angeles County, California, USA
    Posts
    9,607
    Rep Power
    4247
    My attempt at a parser. Note that this doesn't use regexps at all, but it is case-insensitive. Also, it doesn't support attributes, but it can easily be modified to handle them as well.
    Code:
    xml = """
    <html>
      <head><TiTLE>This is a test</title></HEAD>
      <body>
        This is some <b>bold and <i>italicized</i> text</b>
      </body>
    </html>
    """
    
    def xmlparse(input, output):
        # Start at the first "<" in the input
        i = input.find("<")
        while i > -1:
            # First find the start tag
            j = input.find(">", i)
            if j == -1:
                break
            starttag = input[i + 1: j]
    
            # Now compute the end tag and look for it, ignoring case
            endtag = "</" + starttag + ">"
            k = input.lower().find(endtag.lower(), i)
            if k == -1:
                break
    
            # Now figure out the text between the tags
            text = input[j + 1: k]
    
            # Add the tags/text to the list
            output.append(["<" + starttag + ">", text, endtag])
    
            # If the text has more tags within it, recurse into the function
            if text.find("<") > -1:
                xmlparse(text, output)
    
            # Find the next tag
            i = input.find("<", k + 1)
    
    dom = []
    xmlparse(xml, dom)
    for item in dom:
        print("\n******* Item *********")
        print(item)
        # Can also split the item, if needed.
        # (opentag, text, closetag) = item
        # print("Tag = ", opentag)
        # print("InnerText = ", text)
        # print("Closetag = ", closetag)
    [edit] Added comments
    Last edited by Scorpions4ever; November 24th, 2003 at 01:11 AM.
    Up the Irons
    What Would Jimi Do? Smash amps. Burn guitar. Take the groupies home.
    "Death Before Dishonour, my Friends!!" - Bruce Dickinson, Iron Maiden Aug 20, 2005 @ OzzFest
    Down with Sharon Osbourne

    "I wouldn't hire a butcher to fix my car. I also wouldn't hire a marketing firm to build my website." - Nilpo
  26. #14
  27. No Profile Picture
    Junior Member
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2003
    Posts
    28
    Rep Power
    0
    Scorpions4ever:
    My attempt at a parser...
    Nice code, the only problem is that it doesn't work with HTML - your parser assumes the document to be well-formed, which might be wrong with HTML.

    Try, for example, to parse the following snippet:
    Code:
    <html>
      <head><TiTLE>This is a HTML test</title></HEAD>
      <body>
        Line <p>
        More lines <br>
        <ul>
            <li>Item
            <li>Yet More Item
        </ul>
        End
      </body>
    </html>
    So, to make it work, one eventually has to reinvent a poor man's htmllib. What a pathetic task.
    Last edited by Igor Pechersky; November 24th, 2003 at 10:28 AM.
  28. #15
  29. Banned ;)
    Devshed Supreme Being (6500+ posts)

    Join Date
    Nov 2001
    Location
    Woodland Hills, Los Angeles County, California, USA
    Posts
    9,607
    Rep Power
    4247
    Not only that, it also doesn't work with nested tags, if the same tag is nested. For example, it will fail on this:
    <table>
    <tr>
    <td>
    <table>
    <tr><td>Some column</td></tr>
    </table>
    </td>
    </tr>
    </table>

    The above is well-formed, but the parser won't handle it correctly. Actually, the HTMLParser module is the real way to go, but I guess my routine might do the trick for sadmachine's purposes, since he doesn't want to install the Python module.
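For comparison, a rough Python 3 sketch of how the nested-table case could be handled with the standard parser and an explicit stack of open elements. `TreeBuilder` and its `[tag, children]` node layout are hypothetical choices made for this example. Unclosed tags such as `<p>` or `<li>` are not given real HTML semantics here; they simply stay open until an enclosing element closes.

```python
from html.parser import HTMLParser

class TreeBuilder(HTMLParser):
    """Build nested [tag, children] lists; text nodes are plain strings."""

    def __init__(self):
        super().__init__()
        self.root = ["#document", []]
        self.stack = [self.root]  # innermost open element is last

    def handle_starttag(self, tag, attrs):
        node = [tag, []]
        self.stack[-1][1].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the innermost open element with this name, if any,
        # together with anything still open inside it.
        for depth in range(len(self.stack) - 1, 0, -1):
            if self.stack[depth][0] == tag:
                del self.stack[depth:]
                break

    def handle_data(self, data):
        if data.strip():
            self.stack[-1][1].append(data.strip())

builder = TreeBuilder()
builder.feed(
    "<table><tr><td><table><tr><td>Some column</td></tr></table>"
    "</td></tr></table>"
)
builder.close()
print(builder.root)
```

Because each end tag closes the innermost open element of that name, the inner `</table>` pairs with the inner `<table>` rather than with the outer one.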
    Last edited by Scorpions4ever; November 24th, 2003 at 01:25 PM.