#1
Parsing HTML file help
I am trying to parse some static HTML files into a database. I need help with a regular expression in Python that will extract content between specific tags.
Example: <tag> info </tag> <tag2> info <anothertag> info </anothertag></tag2>
I need to extract what's between <tag> and </tag> and assign it to a variable that will be used in the DB. I also need to extract what's between <tag2> and </tag2>, ignoring any other tags in between, and insert that into the DB. I have looked at HTML parsers, but none of them are making any sense to me at the moment. Thanks in advance,
#2
OK, not so hard, and something that has popped up before. Anyway, I'd use this regex with findall():
r'<(tag[0-9]*)>(.*?)</\1>'
Note: This assumes that your tags are named "tag", or "tag" followed by a number, and not an arbitrary name inside angle brackets.
Code:
>>> import re
>>> regex = re.compile(r'<(tag[0-9]*)>(.*?)</\1>', re.M)
>>> string = r'''
<tag> info </tag>
<tag2> info <anothertag> info </anothertag></tag2>
'''
>>> regex.findall(string)
[('tag', ' info '), ('tag2', ' info <anothertag> info </anothertag>')]
>>>
Mark.
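One thing worth noting about the session above: re.M only changes how ^ and $ behave, so "." still stops at a newline and content that spans several lines would not match. A minimal sketch of the same regex with re.DOTALL, assigning the results to variables for the DB step, might look like this. The names TAG_RE, extract_tags, tag_value and tag2_value are made up for illustration, and the dict assumes each tag name appears only once in the file (a repeated tag would overwrite the earlier value).

```python
import re

# re.DOTALL lets "." match newlines, so content spanning lines is captured.
TAG_RE = re.compile(r'<(tag[0-9]*)>(.*?)</\1>', re.DOTALL)

def extract_tags(html):
    """Return a dict mapping each tag name to its raw inner content."""
    return {name: content for name, content in TAG_RE.findall(html)}

sample = """
<tag> info </tag>
<tag2> info
<anothertag> info </anothertag></tag2>
"""

fields = extract_tags(sample)
# Values ready to hand to whatever DB layer is in use.
tag_value = fields['tag'].strip()
tag2_value = fields['tag2'].strip()
```

Note that tag2_value still contains the nested <anothertag> markup at this point; removing it needs a second step.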
#3
This is ugly....
Both of these are ugly (my doing--I haven't used either much), but they should help show some of the usage of HTMLParser and SGMLParser.
Code:
#!/usr/bin/python
from HTMLParser import HTMLParser

# create our own "parser"
class MyParser(HTMLParser):
    def __init__(self):
        # need to init HTMLParser, as well as add our
        # own flags
        HTMLParser.__init__(self)
        # flags for our tag and tag2
        # set these just in case there are
        # tags inside of tag/tag2 that have
        # data that you want (as in your ex.)
        self.TAG, self.TAG2 = 0, 0
        # lists to hold our tag/tag2 data
        self.TAGDATA, self.TAG2DATA = [], []

    # override the handle_starttag method
    def handle_starttag(self, tag, attrs):
        # in your example, we only have two
        # cases we need to check against
        # turn the tag's flag ON, so we
        # know to capture the data
        if tag == "tag":
            self.TAG = 1
        elif tag == "tag2":
            self.TAG2 = 1

    # override the handle_endtag method
    # use this to turn the flag OFF, so
    # we know to stop capturing data
    def handle_endtag(self, tag):
        if tag == "tag":
            self.TAG = 0
        elif tag == "tag2":
            self.TAG2 = 0

    # override the handle_data method
    # we use the state of our flags to
    # decide which tag's data we're retrieving
    # this method gets called for each chunk of
    # text found between tags
    def handle_data(self, data):
        if self.TAG == 1:
            self.TAGDATA.append(data)
        elif self.TAG2 == 1:
            self.TAG2DATA.append(data)

    # create our own method, to print
    # the data out
    def tagprint(self):
        print "TAG : ", "".join(self.TAGDATA)
        print "TAG2 : ", "".join(self.TAG2DATA)

def main():
    n = """<html><tag>info</tag>
<tag2> info <anothertag>info</anothertag> some more data here</tag2></html>"""
    parser = MyParser()
    parser.feed(n)
    parser.tagprint()
    parser.close()

if __name__ == "__main__":
    main()
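For anyone reading this on Python 3, where the module lives at html.parser rather than HTMLParser, the same flag idea can be sketched roughly as follows. This is only an illustration under that assumption, not a drop-in replacement: TagCollector and collect are hypothetical names, and a single "current tag" attribute stands in for the two separate flags used above.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Collect the text found inside <tag> and <tag2> elements."""

    def __init__(self):
        super().__init__()
        self.current = None                  # wanted tag we are inside, if any
        self.data = {"tag": [], "tag2": []}  # text chunks per wanted tag

    def handle_starttag(self, tag, attrs):
        # Only start capturing at a wanted tag; nested tags are ignored.
        if self.current is None and tag in self.data:
            self.current = tag

    def handle_endtag(self, tag):
        # Stop capturing when the wanted tag closes.
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        # Text inside nested tags still arrives here, minus the markup.
        if self.current is not None:
            self.data[self.current].append(data)

def collect(html):
    """Feed html to a fresh parser and return the joined text per tag."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return {name: "".join(parts) for name, parts in parser.data.items()}

html = """<html><tag>info</tag>
<tag2> info <anothertag>info</anothertag> some more data here</tag2></html>"""
result = collect(html)
```

Returning a dict instead of printing makes it easier to pass the values on to the database code.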
You could also use sgmllib:
Code:
#!/usr/bin/python
from sgmllib import SGMLParser

class MyParser(SGMLParser):
    def __init__(self):
        SGMLParser.__init__(self)
        # flags for our tag and tag2
        self.TAG, self.TAG2 = 0, 0
        # holds our tag and tag2 data
        self.TAGDATA, self.TAG2DATA = [], []

    def start_tag(self, attrs):
        self.TAG = 1

    def end_tag(self):
        self.TAG = 0

    def start_tag2(self, attrs):
        self.TAG2 = 1

    def end_tag2(self):
        self.TAG2 = 0

    # additional tags would follow the format above
    # i.e. start_TAGNAME, end_TAGNAME
    # to capture the A tag, you could use
    # start_a() and end_a()

    def handle_data(self, data):
        if self.TAG == 1:
            self.TAGDATA.append(data)
        if self.TAG2 == 1:
            self.TAG2DATA.append(data)

    def tagprint(self):
        print "TAG : ", "".join(self.TAGDATA)
        print "TAG2 : ", "".join(self.TAG2DATA)

def main():
    n = """<html><tag>info</tag>
<tag2> info <anothertag>info</anothertag> some more data here</tag2></html>"""
    parser = MyParser()
    parser.feed(n)
    parser.tagprint()
    parser.close()

if __name__ == "__main__":
    main()
I prefer using sgmllib, but it's up to you really (I don't know enough about either module or your situation to say which, if either, is better). Both of the above examples capture the text inside "anothertag" without the tag markup itself, which would take slightly more complex processing with a regex. You'd either have to come up with a second regex to remove the "anothertag" markup, or write a more complex regex with lookaheads and backreferences to remove it.
If you just need to remove all HTML inside tag2, the second-regex approach takes less code than either of the above two examples:
Code:
>>> import re
>>> n = "info <anothertag> info </anothertag>"
>>> reg = re.compile("<.+?>")
>>> reg.sub("", n)
'info  info '
>>>
If, however, you need a little bit more than grabbing two tags, the parser examples above may come in handy.
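To tie the two regex steps together, here is a rough sketch that first pulls out each tag's content with the findall() pattern from earlier in the thread and then strips any nested markup with the "<.+?>" pattern. TAG_RE, INNER_TAG_RE and extract_text are names invented for this example, and it assumes the nested tags never contain a literal ">" inside an attribute value.

```python
import re

# Step 1: capture the tag name and everything up to its matching close tag.
TAG_RE = re.compile(r'<(tag[0-9]*)>(.*?)</\1>', re.DOTALL)
# Step 2: match any remaining markup inside the captured content.
INNER_TAG_RE = re.compile(r'<.+?>')

def extract_text(html):
    """Return (tag name, content with nested markup removed) pairs."""
    return [(name, INNER_TAG_RE.sub('', content))
            for name, content in TAG_RE.findall(html)]

sample = '<tag> info </tag> <tag2> info <anothertag> info </anothertag></tag2>'
pairs = extract_text(sample)
```

The whitespace that surrounded the removed tags is left in place, so a strip() or a whitespace-collapsing step may be wanted before the values go into the DB.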