#1
regular expression for html
I need to extract tags and their contents, using any method. I think a good way is using re, so here goes:
r'<[a-zA-Z]+/W*>/W*</[a-zA-Z]>'
I THINK /W means everything under the sun, including spaces, but I'm not sure. If it doesn't, please let me know. One problem I can think of right away is that this would also accept <a>sdfs</b>. Is there a way to specify that a certain group must equal a different group?
#2
I think the trick with this one is to use backreferences, although the closest I've got is this regex:
'<(.+?).*?>(.+?)</\\1>'
used with findall(). However, this won't handle nested tags at all, so it would have to be called again on your results until all the data is retrieved. What about removing all the HTML and parsing the results? It really depends on what you're after.
A period matches everything except newlines, and you can make it match even those by passing the re.DOTALL flag when you compile your regex (assuming you're compiling them). \W matches any character that is not a letter, digit, or underscore. For parsing HTML, though, I'd suggest you use one of Python's built-in modules, since parsing HTML can be quite tricky! Which is part of the reason XHTML was created!
http://www.python.org/doc/current/l...le-htmllib.html
http://www.python.org/doc/current/l...HTMLParser.html
Mark.
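To make the backreference idea concrete, here is a minimal sketch. The pattern is a tightened variant of the one above and is an assumption, not the poster's exact regex: the tag name is restricted to a letter followed by letters or digits, and re.DOTALL lets the contents span newlines. The sample string is made up for illustration.

```python
import re

# Group 1 captures the tag name; \1 requires the closing tag to repeat it.
# re.DOTALL makes "." match newlines too, so contents may span lines.
TAG_RE = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)[^>]*>(.*?)</\1>", re.DOTALL)

# Hypothetical sample: two well-formed elements and one mismatched pair.
sample = "<a href='x'>link</a> <b>bold\ntext</b> <i>broken</b>"

matches = TAG_RE.findall(sample)
for name, contents in matches:
    print(name, repr(contents))
```

The mismatched <i>...</b> pair is skipped because nothing satisfies the backreference. Nested tags of the same name are still not handled, as noted above.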
#3
I'm new to Python (just started a few days ago), but I think what you wrote is all I need. Thanks!
All I need is a regex that finds the first tag in a string of text and extracts the tag type and everything in between that tag, including nested tags:
'<(?P<tagname>.+?).*?>(?P<contents>.+?)</\\1>'
I assume that \\1 references the first group? Wouldn't '\\1' be a regex for a literal \1? Aren't HTML tags limited to starting with letters? I don't think there is a tag starting with a number. I could be wrong though.
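On the '\\1' question, a quick sketch may help. In an ordinary string literal, '\\1' is the two characters backslash and 1, which is the same text as the raw string r'\1', so the regex engine sees a backreference either way. The pattern below is an illustrative variant (tag names limited to letters and digits), and (?P=tagname) is the named form of the same backreference.

```python
import re

# '\\1' (normal literal) and r'\1' (raw literal) are the same two characters.
same_text = ('\\1' == r'\1')

# Named group plus a named backreference, (?P=tagname), instead of \1.
FIRST_TAG = re.compile(
    r"<(?P<tagname>[a-zA-Z][a-zA-Z0-9]*)[^>]*>(?P<contents>.*?)</(?P=tagname)>",
    re.DOTALL,
)

m = FIRST_TAG.search("intro <p>some <b>bold</b> text</p> outro")
first = (m.group("tagname"), m.group("contents"))
print(same_text, first)
```

Because the contents group is non-greedy, an element nested inside another element of the same name would be cut off at the first closing tag.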
#4
Welcome to the Python world, sad. I think you'll like it.
It sounds pretty simple, so a regex should work fine for this one. To answer your question: yes, \\1 was a reference to the first group (in this case the tag's name). You could use r'', but I like to see what I'm doing when it comes to regexes. Nope, there aren't any HTML tags which start with a number, but you do have things like H1 etc. Let me know how it goes.
Mark.
#5
Thanks Mark. Perhaps you could help me again... something's weird.
This is my re pattern:
htmlparse = r'(?P<tag><(?P<tagname>[!a-zA-Z][^\s>-]*)[^>]*>((?P<contents>.*)</\2>)?)|(?P<textnode>\s*[^<\s][^<]*\s*)'
Basically, I check first for an HTML tag, taking into account that tags can be like <meta> or <b></b>. The second part takes text nodes into account, because I'm trying to extract the DOM model. Anyway, here is one instance where it does not work:
"<!-- <b></b><meta></html>"
('<!-- <b>', '!', None, None, None)
The first group is 'tag', and it shows a match for '<!-- <b>', but how can that be? The second group, tagname, is '!', and with that backreference there should be no match, since there is no instance of '</!>'.
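A likely explanation, sketched below: the backreference sits inside a group that ends with "?", so the whole "contents plus closing tag" part is optional. When '</!>' cannot be found, the engine skips that group and still reports a match for the opening tag alone. The `strict` pattern is a hypothetical variant, not part of the original post, that makes the closing tag mandatory.

```python
import re

htmlparse = (r'(?P<tag><(?P<tagname>[!a-zA-Z][^\s>-]*)[^>]*>'
             r'((?P<contents>.*)</\2>)?)|(?P<textnode>\s*[^<\s][^<]*\s*)')

text = "<!-- <b></b><meta></html>"

# The optional group is skipped, so only the opening part matches.
optional_groups = re.match(htmlparse, text).groups()
print(optional_groups)

# Hypothetical strict variant: the closing tag is now required.
strict = r'<(?P<tagname>[!a-zA-Z][^\s>-]*)[^>]*>(?P<contents>.*)</(?P=tagname)>'
strict_match = re.match(strict, text)
print(strict_match)
```

The trade-off is that the strict form no longer accepts standalone tags such as <meta>, which is presumably why the group was made optional in the first place.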
#6
Yeah, I'd be happy to help. You're going to have to explain something though: what is with all the ?P's and <tagname>'s? I've never seen them before. I assumed they were just some kind of comment, but I'm not too sure now. Anyway, do you have an example of what you want to parse, so we can work out the best way to do this?
Mark.
#7
There is a nice way to use the re.split function with HTML code. It's a bit simple, but it can be useful.
Normally, split will remove the token on which you split, but if you surround it with parentheses, it will include it.
Code:
>>> html = """
... <html>
... <body>
... <p><b>Hey there!</b></p>
... </body>
... </html>
... """
>>> for item in re.split(r"(<.*?>)", html):
...     print item,
...
<html> <body> <p> <b> Hey there! </b> </p> </body> </html>
>>>
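For anyone on Python 3, the same split idea can be sketched as follows. Dropping the whitespace-only pieces is an added assumption for readability and is not part of the session above.

```python
import re

html = """
<html>
<body>
<p><b>Hey there!</b></p>
</body>
</html>
"""

# The parentheses keep the tags themselves in the result of re.split.
# Whitespace-only pieces between tags are dropped here.
tokens = [piece.strip() for piece in re.split(r"(<.*?>)", html) if piece.strip()]
print(tokens)
```

The result is a flat token list of tags and text, which still has to be paired up into a tree by other means.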
#8
Hi guys, thank you both for your help.
neytan: the (?P<mygroupname>) syntax specifies a group name, so that you can call matcher.group('tagname') instead of matcher.group(1).
OK, basically this is what I am doing. I am parsing an HTML or XML document and building the DOM tree. The weak part of my algorithm is just the parsing function (go figure!). This function takes a string of HTML or XML and returns a tuple containing the name of the first tag encountered, the contents of that tag, and the index in the string where the tag ended. The function works for the most part, except when the HTML is malformed like this:
"<a sdfs <b>sdfs</b>"
"<!-- safsdafsadf <b>sadfasd</b>"
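A function with that contract could look roughly like the sketch below. The name `parse_first_tag` and the pattern are hypothetical, not the poster's code. One design choice worth noting: a malformed opening such as "<a sdfs " is skipped, and the search continues with the next well-formed tag.

```python
import re

# Hypothetical pattern: a tag name, optional attributes that contain
# no angle brackets, lazy contents, and a closing tag of the same name.
_FIRST_TAG = re.compile(
    r"<(?P<tagname>[a-zA-Z][a-zA-Z0-9]*)(?:\s[^<>]*)?>"
    r"(?P<contents>.*?)</(?P=tagname)\s*>",
    re.DOTALL,
)

def parse_first_tag(text):
    """Return (tag name, contents, end index) or None if no element is found."""
    m = _FIRST_TAG.search(text)
    if m is None:
        return None
    return m.group("tagname"), m.group("contents"), m.end()

print(parse_first_tag("<p>hi <b>x</b></p> tail"))
print(parse_first_tag("<a sdfs <b>sdfs</b>"))
```

This sketch knows nothing about comments, so an element inside "<!-- ... -->" is still reported.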
#9
Hehe, that's the problem with HTML, sad: it's just not strict enough!
XML, on the other hand, is a lot easier to parse because it is! Thanks for the info, I must have skipped that part in the regex tutorial. I think maybe you could fix that by specifying the tag names, i.e. <(a|b|img|etc).*?>, so that only valid HTML tags would match (or that's the theory).
Mark.
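The whitelist idea can be sketched like this. Note that alternation needs parentheses: square brackets would define a character class instead. The tag list is a hypothetical example, and `[^<>]*` is used rather than `.*?` so that a malformed opening cannot swallow the next tag.

```python
import re

# Hypothetical whitelist of tag names to accept.
KNOWN_TAGS = ["a", "b", "img", "p"]

# (a|b|img|p) is an alternation; \b stops "b" from matching "<body>".
name_alt = "|".join(re.escape(name) for name in KNOWN_TAGS)
OPEN_TAG = re.compile(r"<(%s)\b[^<>]*>" % name_alt, re.IGNORECASE)

found = OPEN_TAG.findall("<a sdfs <b>sdfs</b> <!-- x --> <IMG src='y'>")
print(found)
```

Here the broken "<a sdfs " and the comment are ignored, while <b> and <IMG ...> are recognised.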
#10
Yeah, I guess I could. But I don't understand why that backreference doesn't work.
I have the definitive guide to Python, but it said nothing about backreferencing. Do you have a link to a tutorial on it? BTW, another small issue about dictionaries: I'm going to post another thread. Perhaps you could help? Thanks.
#11
Why not just use Python's standard HTML Parser?
http://www.python.org/doc/current/l...HTMLParser.html
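As a rough illustration of that route: in current Python 3 the parser lives in html.parser (it was the HTMLParser module in Python 2). The `EventCollector` class below is a hypothetical example that just records what the parser reports.

```python
from html.parser import HTMLParser

class EventCollector(HTMLParser):
    """Hypothetical subclass that records start tags, end tags and text."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Ignore whitespace-only text between tags.
        if data.strip():
            self.events.append(("text", data.strip()))

collector = EventCollector()
collector.feed("<p><b>Hey there!</b></p>")
collector.close()
print(collector.events)
```

A DOM tree could be built on top of these callbacks by keeping a stack of open elements.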
#12
I can't. For reasons too long to explain, I must write my own.
Thanks though.
#13
My attempt at a parser. Note that this doesn't use regexps at all, but is case insensitive. Also, it doesn't support attributes, but can be easily modified to handle them as well.
Code:
xml = """
<html>
<head><TiTLE>This is a test</title></HEAD>
<body>
This is some <b>bold and <i>italicized</i> text</b>
</body>
</html>
"""

def xmlparse(source, output):
    """Append [start tag, inner text, end tag] for every element in source."""
    # Lowercased copy, used only for the case-insensitive end tag search
    lowered = source.lower()
    i = source.find("<")
    while i > -1:
        # First find the start tag
        j = source.find(">", i)
        if j == -1:
            break
        starttag = source[i + 1:j]
        # Now compute the end tag
        endtag = "</" + starttag + ">"
        k = lowered.find(endtag.lower(), i)
        if k == -1:
            break
        # Now figure out the text between the tags
        text = source[j + 1:k]
        # Add the tags/text to the list
        output.append(["<" + starttag + ">", text, endtag])
        # If the text has more tags within it, recurse into the function
        if text.find("<") > -1:
            xmlparse(text, output)
        # Find the next tag
        i = source.find("<", k + 1)

dom = []
xmlparse(xml, dom)
for item in dom:
    print("\n******* Item *********")
    print(item)
    # Can also split the item, if needed.
    # (opentag, text, closetag) = item
    # print("Tag = " + opentag)
    # print("InnerText = " + text)
    # print("Closetag = " + closetag)
[edit] Added comments
Last edited by Scorpions4ever : November 24th, 2003 at 01:11 AM.
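On the remark that attributes could be added easily, one possible sketch: split the text between "<" and ">" into a name and the rest, then build the end tag from the name only. The helper `split_start_tag` is hypothetical and not part of the posted parser.

```python
def split_start_tag(starttag):
    """Split 'a href="x.html"' into the tag name and the raw attribute text."""
    parts = starttag.split(None, 1)  # split on the first run of whitespace
    name = parts[0] if parts else ""
    attributes = parts[1] if len(parts) > 1 else ""
    return name, attributes

name, attributes = split_start_tag('a href="x.html" class="nav"')
endtag = "</" + name + ">"  # the end tag is derived from the name alone
print(name, attributes, endtag)
```

The attribute text is left unparsed here. Turning it into name/value pairs would be a further step.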
#14
Quote:
Try, for example, to parse the following snippet:
Code:
<html>
<head><TiTLE>This is a HTML test</title></HEAD>
<body>
Line <p>
More lines <br>
<ul>
<li>Item
<li>Yet More Item
</ul>
End
</body>
</html>
So, to make it work, one eventually has to invent a poor man's htmllib. What a pathetic task.
Last edited by Igor Pechersky : November 24th, 2003 at 10:28 AM.
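To see why this snippet trips up any parser that expects every start tag to have a matching end tag, the sketch below counts opens and closes with Python 3's html.parser. The `UnclosedTagFinder` class is a hypothetical helper for illustration.

```python
from html.parser import HTMLParser

class UnclosedTagFinder(HTMLParser):
    """Hypothetical helper: counts start tags minus end tags per tag name."""

    def __init__(self):
        super().__init__()
        self.open_counts = {}

    def handle_starttag(self, tag, attrs):
        self.open_counts[tag] = self.open_counts.get(tag, 0) + 1

    def handle_endtag(self, tag):
        self.open_counts[tag] = self.open_counts.get(tag, 0) - 1

snippet = """<html>
<head><TiTLE>This is a HTML test</title></HEAD>
<body>
Line <p>
More lines <br>
<ul>
<li>Item
<li>Yet More Item
</ul>
End
</body>
</html>"""

finder = UnclosedTagFinder()
finder.feed(snippet)
finder.close()
unclosed = sorted(tag for tag, count in finder.open_counts.items() if count > 0)
print(unclosed)
```

The tags <p>, <br> and <li> never get an end tag in this snippet, which is legal HTML, so a hand-written parser needs per-tag rules for them.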