Parsing HTML file help
December 18th, 2003, 10:47 AM
I am trying to parse some static HTML files into a database. I need help with a regular expression in Python that will extract the content between specific tags.
Example:
<tag> info </tag>
<tag2> info <anothertag> info </anothertag></tag2>
I need to extract what's between <tag> and </tag> and assign it to a variable that will be used in the DB. I also need to extract what's between <tag2> and </tag2>, ignoring any other tags in between, and insert that into the DB.
I have looked at HTML parsers, but none of them are making any sense to me at the moment.
Thanks in advance,

December 18th, 2003, 04:14 PM
OK, not so hard, and something that has popped up before. Anyway, I'd use this regex with findall():
r'<(tag[0-9]*)>(.*?)</\1>'
Note: This assumes that your tags are named "tag" or "tag" followed by a number, not any arbitrary name inside angle brackets.
Code:
>>> import re
>>> # re.S lets "." match newlines, so a tag's contents may span several lines
>>> regex = re.compile(r'<(tag[0-9]*)>(.*?)</\1>', re.S)
>>> string = r'''
... <tag> info </tag>
... <tag2> info <anothertag> info </anothertag></tag2>
... '''
>>> regex.findall(string)
[('tag', ' info '), ('tag2', ' info <anothertag> info </anothertag>')]
>>>
Mark.
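For the database step, here is a minimal sketch that combines the regex above with a second pattern that strips any nested tags, so each value ends up as plain text. The helper name extract_fields and the whitespace normalisation are assumptions added for illustration, not part of the original answer, and the sketch targets Python 3.

```python
import re

# Pattern from the post above: a tag named "tag" plus optional digits,
# its contents (non-greedy), and the matching closing tag.
# re.S lets the contents span several lines.
TAG_RE = re.compile(r'<(tag[0-9]*)>(.*?)</\1>', re.S)

# Anything that looks like a tag; used to drop nested markup.
INNER_TAG_RE = re.compile(r'<[^>]+>')


def extract_fields(html):
    """Return {tag name: text with nested tags removed} (hypothetical helper)."""
    fields = {}
    for name, body in TAG_RE.findall(html):
        text = INNER_TAG_RE.sub('', body)
        # Collapse runs of whitespace left behind by the removed tags.
        fields[name] = ' '.join(text.split())
    return fields


sample = """
<tag> info </tag>
<tag2> info <anothertag> info </anothertag></tag2>
"""
print(extract_fields(sample))
```

Each value in the returned dict can then be passed to the database insert as an ordinary string. If two tags share a name, the later one overwrites the earlier one in this sketch.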

December 21st, 2003, 10:05 AM
This is ugly....
Both of these are ugly (my doing--I haven't used either much), but they should help show some of the usage of HTMLParser and SGMLParser.
Code:
#!/usr/bin/python
# Python 2 module name; in Python 3 the class lives in html.parser
from HTMLParser import HTMLParser

# create our own "parser"
class MyParser(HTMLParser):
    def __init__(self):
        # need to init HTMLParser, as well as add our
        # own flags
        HTMLParser.__init__(self)
        # flags for tag and tag2; a flag stays set while we are
        # inside its tag, so data from tags nested inside
        # tag/tag2 is captured too (as in your example)
        self.TAG, self.TAG2 = 0, 0
        # lists to hold our tag/tag2 data
        self.TAGDATA, self.TAG2DATA = [], []

    # override the handle_starttag method
    def handle_starttag(self, tag, attrs):
        # in your example, we only have two
        # cases we need to check against
        # turn the tag's flag ON, so we
        # know to capture the data
        if tag == "tag":
            self.TAG = 1
        elif tag == "tag2":
            self.TAG2 = 1

    # override the handle_endtag method
    # use this to turn the flag OFF, so
    # we know to stop capturing data
    def handle_endtag(self, tag):
        if tag == "tag":
            self.TAG = 0
        elif tag == "tag2":
            self.TAG2 = 0

    # override the handle_data method
    # we use the state of our flags to
    # decide which tag's data we're retrieving
    # this method gets called for each run of text between tags
    def handle_data(self, data):
        if self.TAG == 1:
            self.TAGDATA.append(data)
        elif self.TAG2 == 1:
            self.TAG2DATA.append(data)

    # create our own method, to print
    # the data out
    def tagprint(self):
        print "TAG : ", "".join(self.TAGDATA)
        print "TAG2 : ", "".join(self.TAG2DATA)

def main():
    n = """<html><tag>info</tag>
<tag2> info <anothertag>info</anothertag> some more data here</tag2></html>"""
    parser = MyParser()
    parser.feed(n)
    # close() processes any buffered input, so call it before printing
    parser.close()
    parser.tagprint()

if __name__ == "__main__":
    main()
You could also use sgmllib:
Code:
#!/usr/bin/python
# sgmllib is a Python 2 module; it was removed in Python 3
from sgmllib import SGMLParser

class MyParser(SGMLParser):
    def __init__(self):
        SGMLParser.__init__(self)
        # flags for our tag1 and tag 2
        self.TAG, self.TAG2 = 0, 0
        # holds our tag1 and tag2 data
        self.TAGDATA, self.TAG2DATA = [], []

    def start_tag(self, attrs):
        self.TAG = 1

    def end_tag(self):
        self.TAG = 0

    def start_tag2(self, attrs):
        self.TAG2 = 1

    def end_tag2(self):
        self.TAG2 = 0

    # additional tags would follow the format above
    # i.e. start_TAGNAME, end_TAGNAME
    # to capture the A tag, you could use
    # start_a() and end_a()

    def handle_data(self, data):
        if self.TAG == 1:
            self.TAGDATA.append(data)
        if self.TAG2 == 1:
            self.TAG2DATA.append(data)

    def tagprint(self):
        print "TAG : ", "".join(self.TAGDATA)
        print "TAG2 : ", "".join(self.TAG2DATA)

def main():
    n = """<html><tag>info</tag>
<tag2> info <anothertag>info</anothertag> some more data here</tag2></html>"""
    parser = MyParser()
    parser.feed(n)
    # close() processes any buffered input, so call it before printing
    parser.close()
    parser.tagprint()

if __name__ == "__main__":
    main()
I prefer using sgmllib, but it's up to you really (I don't know enough about either parser or your situation to say which, if either, is better). Both of the above examples capture the data inside "anothertag", which would take slightly more complex processing with a regex. You'll either have to come up with a second regex to remove the "anothertag" tags, or write a more complex regex with lookaheads and backreferences to remove them.
Using a second regex would take less code than either of the above two examples if you just need to remove all HTML between the tag2 tags:
Code:
>>> import re
>>> n = "info <anothertag> info </anothertag>"
>>> reg = re.compile("<.+?>")
>>> reg.sub("", n)
'info  info '
>>>
If, however, you need a little bit more than grabbing two tags, the parser examples above may come in handy.
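As a rough sketch of how the parser approach might be generalised beyond two hard-coded tags, the version below keeps a stack of the wanted tags that are currently open and files each run of text under the innermost one. It assumes Python 3, where the class is imported from html.parser; the class name TagTextParser, its wanted argument, and the result() method are made up for illustration.

```python
from html.parser import HTMLParser


class TagTextParser(HTMLParser):
    """Collect the text found inside each tag of interest (hypothetical class)."""

    def __init__(self, wanted):
        super().__init__()
        wanted = list(wanted)
        self.wanted = set(wanted)
        self.open_tags = []                      # wanted tags currently open
        self.text = {name: [] for name in wanted}

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        # Text inside nested, unwanted tags still arrives here while a
        # wanted tag is open, so the markup is dropped but the text is kept.
        if self.open_tags:
            self.text[self.open_tags[-1]].append(data)

    def result(self):
        # Join the pieces and collapse runs of whitespace.
        return {name: " ".join("".join(parts).split())
                for name, parts in self.text.items()}


html = """<html><tag>info</tag>
<tag2> info <anothertag>info</anothertag> some more data here</tag2></html>"""

parser = TagTextParser(["tag", "tag2"])
parser.feed(html)
parser.close()
print(parser.result())
```

Adding another tag then only means passing one more name in the list, rather than writing a new pair of flag and handler branches.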