Python Programming
 
  #1  
November 19th, 2003, 11:33 AM
sad.machine
regular expression for html

I need to extract tags and their contents, using any method. I think a good way is to use re, so here goes.

r'<[a-zA-Z]+/W*>/W*</[a-zA-Z]>'

I THINK /W means everything under the sun including spaces... but I'm not sure. If it doesn't, please let me know.

One problem I can think of right away is that this will also match mismatched tags like

<a>sdfs</b>

Is there a way to specify that one group must equal another group?

  #2  
November 19th, 2003, 04:51 PM
netytan
I think the trick with this one is to use backreferences, although the closest I've got is this regex:

'<(.+?).*?>(.+?)</\\1>'

with findall(). However, this won't handle nested tags at all, so it would have to be called again on your results until all the data is retrieved.
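
A minimal sketch of the backreference idea, assuming Python 3 (the thread itself predates it) and made-up sample strings. `\1` has to repeat whatever group 1 captured, so a mismatched pair is skipped, and a nested tag ends up inside the outer match rather than being reported on its own:

```python
import re

# Group 1 captures the tag name; \1 requires the same text in the closing tag.
pattern = re.compile(r"<(.+?).*?>(.+?)</\1>")

flat = pattern.findall("<p>one</p><b>two</b><a>bad</b>")
print(flat)    # [('p', 'one'), ('b', 'two')]; <a>bad</b> is rejected

nested = pattern.findall("<div><b>x</b></div>")
print(nested)  # [('div', '<b>x</b>')]; the inner <b> is not listed separately

# To reach the inner tag, run the pattern again on the captured contents.
print(pattern.findall(nested[0][1]))  # [('b', 'x')]
```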

What about removing all the HTML and parsing the results? It really depends on what you're after.

A period matches everything except newlines, and you can make it match those too with the re.DOTALL flag when you compile your regex (assuming you're compiling them). \W matches any character that is not alphanumeric or an underscore.
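
A quick way to check both points, sketched for Python 3 with an invented sample string:

```python
import re

text = "<b>line one\nline two</b>"

# Without DOTALL, "." stops at the newline, so the closing tag is never reached.
without_flag = re.search(r"<b>(.*)</b>", text)
print(without_flag)  # None

# With DOTALL, "." also matches "\n".
with_flag = re.search(r"<b>(.*)</b>", text, re.DOTALL)
print(repr(with_flag.group(1)))  # 'line one\nline two'

# \W: anything that is not a letter, digit, or underscore.
non_word = re.findall(r"\W", "a b_c-d!")
print(non_word)  # [' ', '-', '!']
```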

Although for parsing HTML I'd suggest you use one of Python's built-in modules, since parsing HTML can be quite tricky! Which is part of the reason XHTML was created!

http://www.python.org/doc/current/l...le-htmllib.html
http://www.python.org/doc/current/l...HTMLParser.html
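
For readers on Python 3: htmllib no longer exists there, and the HTMLParser class lives in html.parser. A small sketch of the event-style API follows; the class name `TagCollector` and the sample markup are only illustrative:

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record start tags, end tags, and non-blank text in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        if data.strip():
            self.events.append(("text", data.strip()))


collector = TagCollector()
collector.feed("<p><b>Hey there!</b></p>")
collector.close()
for event in collector.events:
    print(event)
```

Building a tree from these events is then a matter of keeping a stack of open elements.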

Mark.

  #3  
November 19th, 2003, 06:24 PM
sad.machine
I'm new to Python (just started a few days ago), but I think what you wrote is all I need. Thanks!

All I need is a regex that finds the first tag in a string of text and extracts the tag type and everything inside that tag, including nested tags.

'<(?P<tagname>.+?).*?>(?P<contents>.+?)</\\1>'

I assume that \\1 references the first group?

Wouldn't '\\1' be a regex for \1?

Aren't HTML tags limited to starting with letters? I don't think there is a tag that starts with a number. Could be wrong though.

  #4  
November 19th, 2003, 07:04 PM
netytan
Welcome to the Python world sad, I think you'll like it... It sounds pretty simple, so regex should work fine for this one.

To answer your question: yes, \\1 was a reference to the first group (in this case the tag's name). You could use r'' but I like to see what I'm doing when it comes to regex.
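
To make the escaping concrete, here is a short Python 3 check (the sample tag is made up). In an ordinary string literal `'\\1'` is a backslash followed by `1`, which is exactly what the raw string `r'\1'` contains, and that is what the regex engine reads as "group 1":

```python
import re

escaped = "\\1"   # ordinary literal: a backslash followed by the digit 1
raw = r"\1"       # raw literal: the same two characters
print(escaped == raw)   # True
print(len(escaped))     # 2

# Without the doubled backslash or the r prefix, "\1" is a single control character.
print("\1" == chr(1))   # True

match = re.search("<(b)>text</\\1>", "<b>text</b>")
print(match.group(0))   # <b>text</b>
```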

Nope, there aren't any HTML tags which start with a number, but you do have things like H1 etc.

Let me know how it goes.

Mark.

  #5  
November 20th, 2003, 01:45 AM
sad.machine
Thanks Mark. Perhaps you could help me again... something's weird.
This is my re pattern:

htmlparse = r'(?P<tag><(?P<tagname>[!a-zA-Z][^\s>-]*)[^>]*>((?P<contents>.*)</\2>)?)|(?P<textnode>\s*[^<\s][^<]*\s*)'

Basically I check first for an HTML tag, taking into account that tags can be like

<meta>
<b></b>

The second part takes text nodes into account, because I'm trying to extract the DOM model.

Anyway... here is one instance where it does not work:

"<!-- <b></b><meta></html>"

('<!-- <b>', '!', None, None, None)

The first group is 'tag', and it shows a match for '<!-- <b>', but how can that be?
The second group, tagname, is '!', and with that backreference there should be no match since there is no instance of '</!>'.
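
One likely explanation, which the thread itself never spells out: the whole "contents plus closing tag" part sits inside a group followed by `?`, so it is optional. When no `</!>` can be found, the engine simply skips that group and the opening tag alone counts as a match. The Python 3 sketch below reproduces the result; the `required` variant is a hypothetical edit for comparison only, and it would also stop matching lone tags such as `<meta>`:

```python
import re

htmlparse = r'(?P<tag><(?P<tagname>[!a-zA-Z][^\s>-]*)[^>]*>((?P<contents>.*)</\2>)?)|(?P<textnode>\s*[^<\s][^<]*\s*)'
sample = "<!-- <b></b><meta></html>"

# The optional group is skipped, so only the "opening tag" part matches.
optional = re.match(htmlparse, sample)
print(optional.groups())  # ('<!-- <b>', '!', None, None, None)

# Hypothetical variant: drop the "?" so the closing tag becomes mandatory.
required = htmlparse.replace("</\\2>)?)", "</\\2>))")
print(re.match(required, sample))  # None, because there is no '</!>'
```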

  #6  
November 20th, 2003, 02:23 AM
netytan
Yeah, I'd be happy to help. You're going to have to explain something though: what is with all the ?P's and <tagname>'s? I've never seen them before. I assumed they were just some kind of comment, but I'm not too sure now.

Anyway, do you have an example of what you want to parse, so we can work out the best way to do this?

Mark.

  #7  
November 20th, 2003, 09:26 AM
percivall
There is a nice way to use the re.split function with HTML code. It's a bit simple, but it can be useful.

Normally, split will remove the token on which you split, but if you surround the pattern with parentheses, the matched tokens are included in the result.
Code:
>>> import re
>>> html = """
... <html>
... <body>
... <p><b>Hey there!</b></p>
... </body>
... </html>
... """
>>> for item in re.split(r"(<.*?>)", html):
...   print item,
...

<html>
<body>
<p>  <b> Hey there! </b>  </p>
</body>
</html>
>>>
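
The same trick in Python 3, as a small sketch with a one-line sample. Splitting on a capturing group leaves empty strings between adjacent tags, so this version filters out blank pieces:

```python
import re

html = "<p><b>Hey there!</b></p>"

# The capturing parentheses keep each tag in the result list.
pieces = re.split(r"(<.*?>)", html)
print(pieces)
# ['', '<p>', '', '<b>', 'Hey there!', '</b>', '', '</p>', '']

# Drop the empty or whitespace-only pieces to get a flat token list.
tokens = [piece for piece in pieces if piece.strip()]
print(tokens)
# ['<p>', '<b>', 'Hey there!', '</b>', '</p>']
```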

  #8  
November 20th, 2003, 12:31 PM
sad.machine
Hi guys, thank you both for your help.

netytan:
The (?P<mygroupname>...) syntax specifies the group name so that you can call matcher.group('tagname') instead of matcher.group(1).
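
A short Python 3 illustration of named groups, using a simplified pattern that is not the one from the earlier post. `(?P=tagname)` is the named form of a backreference:

```python
import re

pattern = re.compile(r"<(?P<tagname>[a-zA-Z]\w*)>(?P<contents>.*?)</(?P=tagname)>")
matcher = pattern.search("<b>bold</b>")

# A named group is still numbered, so both lookups return the same text.
print(matcher.group("tagname"))  # b
print(matcher.group(1))          # b
print(matcher.groupdict())       # {'tagname': 'b', 'contents': 'bold'}
```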

OK, basically this is what I am doing: I am parsing an HTML or XML document and building the DOM tree.

The part of my algorithm that is weak is just the parsing function (go figure!). This function takes a string of HTML or XML and returns a tuple containing the name of the first tag encountered, the contents of that tag, and the index in the string where the tag ended. It works for the most part, except when the HTML is malformed like this:

"<a sdfs <b>sdfs</b>"
"<!-- safsdafsadf <b>sadfasd</b>"
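
A rough sketch of a function with that shape, assuming Python 3. The name `first_tag` and both patterns are hypothetical, not the poster's actual code. Excluding `<` from the attribute part keeps a broken opener such as `<a sdfs ` from swallowing the next tag, and requiring a letter after `<` skips the unterminated comment:

```python
import re

# A tag name starts with a letter; neither "<" nor ">" may appear before the closing ">".
OPEN_TAG = re.compile(r"<([a-zA-Z][^\s<>]*)[^<>]*>")


def first_tag(text):
    """Return (name, contents, end_index) for the first usable tag, or None.

    contents is None when no closing tag is found. Nested tags with the
    same name are not handled by this sketch.
    """
    opener = OPEN_TAG.search(text)
    if opener is None:
        return None
    name = opener.group(1)
    closer = re.compile(r"</%s\s*>" % re.escape(name), re.IGNORECASE)
    close = closer.search(text, opener.end())
    if close is None:
        return (name, None, opener.end())
    return (name, text[opener.end():close.start()], close.end())


print(first_tag("<a sdfs <b>sdfs</b>"))              # ('b', 'sdfs', 19)
print(first_tag("<!-- safsdafsadf <b>sadfasd</b>"))  # ('b', 'sadfasd', 31)
```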

  #9  
November 20th, 2003, 01:01 PM
netytan
Hehe, that's the problem with HTML sad, it's just not strict enough!
XML on the other hand is a lot easier to parse because it is!

Thanks for the info, I must have skipped that part in the regex tutorial. I think maybe you could fix that by specifying the tag names, i.e.

<(a|b|img|etc).*?>

This way only valid HTML tags would match (or that's the theory).
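
One detail worth checking when listing tag names: square brackets define a character class (one character out of the set), while alternation between whole names needs a group. The Python 3 sketch below compares the two on an invented sample; the `\b` is an extra assumption added so that `b` does not also match the start of `<body>`:

```python
import re

sample = "<img src='x.png'> <meta> <body>"

# Character class: any ONE of the characters a, |, b, i, m, g may follow "<".
char_class = re.compile(r"<[a|b|img].*?>")
print(char_class.findall(sample))
# ["<img src='x.png'>", '<meta>', '<body>']

# Alternation in a non-capturing group, with a word boundary after the name.
alternation = re.compile(r"<(?:a|b|img)\b.*?>")
print(alternation.findall(sample))
# ["<img src='x.png'>"]
```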

Mark.

  #10  
November 21st, 2003, 12:45 PM
sad.machine
Yeah, I guess I could. But I don't understand why that backreference doesn't work.

I have the definitive guide to Python, but it said nothing about backreferencing. Do you have a link to a tutorial on it?

BTW, another small issue about dictionaries.
I'm gonna post another thread. Perhaps you could help? Thanks.

  #11  
November 23rd, 2003, 11:41 PM
Strike
Why not just use Python's standard HTML Parser?

http://www.python.org/doc/current/l...HTMLParser.html

  #12  
November 24th, 2003, 12:01 AM
sad.machine
I can't. For reasons not too long to explain, I must write my own.

Thanks though.

  #13  
November 24th, 2003, 01:00 AM
Scorpions4ever
My attempt at a parser. Note that this doesn't use regexps at all, but is case insensitive. Also, it doesn't support attributes, but can be easily modified to handle them as well.
Code:
xml = """
<html>
  <head><TiTLE>This is a test</title></HEAD>
  <body>
    This is some <b>bold and <i>italicized</i> text</b>
  </body>
</html>
"""

def xmlparse(source, output):
    # Find the first "<" in the source
    i = source.find("<")
    while i > -1:
        # First find the start tag
        j = source.find(">", i)
        if j == -1:
            break
        starttag = source[i + 1:j]

        # Now compute the end tag and look for it, ignoring case
        endtag = "</" + starttag + ">"
        k = source.lower().find(endtag.lower(), i)
        if k == -1:
            break

        # Now figure out the text between the tags
        text = source[j + 1:k]

        # Add the tags/text to the list
        output.append(["<" + starttag + ">", text, endtag])

        # If the text has more tags within it, recurse into the function
        if text.find("<") > -1:
            xmlparse(text, output)

        # Find the next tag
        i = source.find("<", k + 1)

dom = []
xmlparse(xml, dom)
for item in dom:
    print("\n******* Item *********")
    print(item)
    # Can also split the item, if needed.
    # (opentag, text, closetag) = item
    # print("Tag = " + opentag)
    # print("InnerText = " + text)
    # print("Closetag = " + closetag)


[edit] Added comments

Last edited by Scorpions4ever : November 24th, 2003 at 01:11 AM.

  #14  
November 24th, 2003, 10:23 AM
Igor Pechersky
Quote:
Scorpions4ever:
My attempt at a parser...
Nice code, the only problem is that it doesn't work with HTML: your parser assumes the document is well-formed, which is often not the case with HTML.

Try, for example, to parse the following snippet:
Code:
<html>
  <head><TiTLE>This is a HTML test</title></HEAD>
  <body>
    Line <p>
    More lines <br>
    <ul>
        <li>Item
        <li>Yet More Item
    </ul>
    End
  </body>
</html>


So, to make it work, one eventually has to invent a poor man's htmllib. What a pathetic task.
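
To see how many tags in that snippet never get a closing tag, here is a small Python 3 sketch built on html.parser (the modern home of HTMLParser). The class name `TagCounter` is made up for illustration; it only counts start and end tags and does not build a tree:

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count start and end tags; tag names arrive already lowercased."""

    def __init__(self):
        super().__init__()
        self.opened = Counter()
        self.closed = Counter()

    def handle_starttag(self, tag, attrs):
        self.opened[tag] += 1

    def handle_endtag(self, tag):
        self.closed[tag] += 1


snippet = """
<html>
  <head><TiTLE>This is a HTML test</title></HEAD>
  <body>
    Line <p>
    More lines <br>
    <ul>
        <li>Item
        <li>Yet More Item
    </ul>
    End
  </body>
</html>
"""

counter = TagCounter()
counter.feed(snippet)
counter.close()

# Tags opened more often than they were closed.
unclosed = {
    tag: count - counter.closed[tag]
    for tag, count in counter.opened.items()
    if count > counter.closed[tag]
}
print(unclosed)  # {'p': 1, 'br': 1, 'li': 2}
```

Any parser that searches for a literal closing tag will give up on `<p>`, `<br>`, and `<li>` here.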

Last edited by Igor Pechersky : November 24th, 2003 at 10:28 AM.

  #15  
November 24th, 2003, 11:49 AM
Scorpions4ever
Not only that, it also doesn't work with nested tags, if the same tag is nested. For example, it will fail on this:
<table>
<tr>
<td>
<table>
<tr><td>Some column</td></tr>
</table>
</td>
</tr>
</table>

The above is well-formed, but the parser won't handle it correctly. Actually, the HTML Parser module is the real way to go, but I guess my routine might do the trick for sad.machine's purposes, since he doesn't want to install the Python module.
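
One common way around the same-tag nesting problem is to count depth: each further opening tag of the same name adds one, each closing tag subtracts one, and the matching close is the one that brings the count back to zero. The Python 3 sketch below assumes a hypothetical helper `find_matching_close`; it is not part of the routine posted earlier:

```python
import re


def find_matching_close(source, name, start):
    """Return (start, end) of the closing tag that matches an opening
    <name> tag whose contents begin at index start, or None."""
    tag = re.compile(r"<(/?)%s\b[^<>]*>" % re.escape(name), re.IGNORECASE)
    depth = 1
    for m in tag.finditer(source, start):
        if m.group(1):          # a closing tag
            depth -= 1
            if depth == 0:
                return m.start(), m.end()
        else:                   # another opening tag of the same name
            depth += 1
    return None


table = ("<table><tr><td><table><tr><td>Some column</td></tr></table>"
         "</td></tr></table>")
inner_start = len("<table>")

# A plain search stops at the inner table's closing tag.
naive = table.find("</table>", inner_start)

span = find_matching_close(table, "table", inner_start)
contents = table[inner_start:span[0]]
print(naive < span[0])  # True
print(contents)
```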

Last edited by Scorpions4ever : November 24th, 2003 at 01:25 PM.
