Python Programming
 
  #1  
December 18th, 2003, 10:47 AM
eddiembabaali
Parsing HTML file help

I am trying to parse some static HTML files into a database. I need help with a regular expression in Python that will extract the content between specific tags.
Example:

<tag> info </tag>
<tag2> info <anothertag> info </anothertag></tag2>

I need to extract what's between <tag> and </tag> and assign it to a variable that will be used in the DB. I also need to extract what's between <tag2> and </tag2>, ignoring any other tags in between, so that I can insert it into the DB as well.

I have looked at HTML parsers, but none of them make sense to me at the moment.


Thanks in advance,

  #2  
December 18th, 2003, 04:14 PM
netytan
OK, this isn't so hard, and it's something that has come up before. I'd use this regex with findall():

r'<(tag[0-9]*)>(.*?)</\1>'

Note: This assumes that your tags are named tag, optionally followed by a number (tag, tag2, ...), and not just any name inside angle brackets.

Code:
>>> import re
>>> regex = re.compile(r'<(tag[0-9]*)>(.*?)</\1>', re.M)
>>> string = r'''
<tag> info </tag>
<tag2> info <anothertag> info </anothertag></tag2>
'''
>>> regex.findall(string)
[('tag', ' info '), ('tag2', ' info <anothertag> info </anothertag>')]
>>>
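
One caveat, offered as a sketch rather than a correction: re.M only changes how ^ and $ behave, and this pattern uses neither. If the text between the tags can span several lines, which seems likely in real HTML files, the dot has to match newlines too, and that is what re.S (re.DOTALL) is for. The sample text and variable names below are made up for illustration.

```python
import re

# Same pattern as above; only the flag differs.
PATTERN = r'<(tag[0-9]*)>(.*?)</\1>'

# Hypothetical sample in which the content of <tag2> spans two lines.
sample = '<tag> info </tag>\n<tag2> line one\nline two </tag2>\n'

multiline_only = re.compile(PATTERN, re.M).findall(sample)
dotall = re.compile(PATTERN, re.S).findall(sample)

print(multiline_only)  # <tag2> is missed: "." stops at the newline
print(dotall)          # both tags are found
```

With re.M the second tag is silently skipped, so the flag is worth checking before loading anything into the DB.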


Mark.
  #3  
December 21st, 2003, 10:05 AM
oxygenthief
This is ugly....

Both of these are ugly (my doing--I haven't used either much), but they should help show some of the usage of HTMLParser and SGMLParser.

Code:
#!/usr/bin/python
from HTMLParser import HTMLParser

# create our own "parser"
class MyParser(HTMLParser):
    def __init__(self):
        # need to init HTMLParser, as well as add our
        # own flags
        HTMLParser.__init__(self)
        # flags for tag and tag2; we set these so
        # that data in tags nested inside tag/tag2
        # is captured as well (as in your example)
        self.TAG, self.TAG2 = 0, 0
        # list to hold our tag/tag2 data
        self.TAGDATA, self.TAG2DATA = [], []
    # override the handle_starttag method
    def handle_starttag(self, tag, attrs):
        # in your example, we only have two
        # cases we need to check against
        # turn the tag's flag ON, so we
        # know to capture the data
        if tag == "tag":
            self.TAG = 1
        elif tag == "tag2":
            self.TAG2 = 1
    # override the handle_endtag method
    # use this to turn the flag OFF, so
    # we know to stop capturing data
    def handle_endtag(self, tag):
        if tag == "tag":
            self.TAG = 0
        elif tag == "tag2":
            self.TAG2 = 0
    # override the handle_data method
    # we use the state of our flags to
    # decide which tag's data we're retrieving
    # this method gets called for each run of text between tags
    def handle_data(self, data):
        if self.TAG == 1:
            self.TAGDATA.append(data)
        elif self.TAG2 == 1:
            self.TAG2DATA.append(data)
    # create our own method, to print
    # the data out
    def tagprint(self):
        print "TAG : ", "".join(self.TAGDATA)
        print "TAG2 : ", "".join(self.TAG2DATA)
def main():
    n = """<html><tag>info</tag>
<tag2> info <anothertag>info</anothertag> some more data here</tag2></html>"""
    parser = MyParser()
    parser.feed(n)
    parser.tagprint()
    parser.close()

if __name__ == "__main__":
    main()
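
For anyone reading this on Python 3, a rough port of the same flag-based approach follows. Two differences are certain: the class now lives in html.parser, and print is a function. The class name TagCollector and the attribute names are hypothetical; the logic is meant to mirror the listing above, not improve on it.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect the text inside <tag> and <tag2>, as in the listing above."""

    def __init__(self):
        super().__init__()
        # One on/off flag and one list of text chunks per tag of interest.
        self.in_tag = False
        self.in_tag2 = False
        self.tag_data = []
        self.tag2_data = []

    def handle_starttag(self, tag, attrs):
        if tag == "tag":
            self.in_tag = True
        elif tag == "tag2":
            self.in_tag2 = True

    def handle_endtag(self, tag):
        if tag == "tag":
            self.in_tag = False
        elif tag == "tag2":
            self.in_tag2 = False

    def handle_data(self, data):
        # Text inside nested tags (e.g. <anothertag>) is captured too,
        # because the outer flag is still on when it arrives.
        if self.in_tag:
            self.tag_data.append(data)
        elif self.in_tag2:
            self.tag2_data.append(data)


doc = ("<html><tag>info</tag>\n"
       "<tag2> info <anothertag>info</anothertag> some more data here</tag2></html>")
collector = TagCollector()
collector.feed(doc)
collector.close()

tag_text = "".join(collector.tag_data)
tag2_text = "".join(collector.tag2_data)
print("TAG :", tag_text)
print("TAG2 :", tag2_text)
```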

You could also use sgmllib:

Code:
#!/usr/bin/python
from sgmllib import SGMLParser

class MyParser(SGMLParser):
    def __init__(self):
        SGMLParser.__init__(self)
        # flags for our tag1 and tag 2
        self.TAG, self.TAG2 = 0, 0
        # holds our tag1 and tag2 data
        self.TAGDATA, self.TAG2DATA = [], []
    def start_tag(self, attrs):
        self.TAG = 1
    def end_tag(self):
        self.TAG = 0
    def start_tag2(self, attrs):
        self.TAG2 = 1
    def end_tag2(self):
        self.TAG2 = 0
    # additional tags would follow the format above
    # i.e. start_TAGNAME, end_TAGNAME
    # to capture the A tag, you could use
    # start_a() and end_a()
    def handle_data(self, data):
        if self.TAG == 1:
            self.TAGDATA.append(data)
        if self.TAG2 == 1:
            self.TAG2DATA.append(data)
    def tagprint(self):
        print "TAG : ", "".join(self.TAGDATA)
        print "TAG2 : ", "".join(self.TAG2DATA)
def main():
    n = """<html><tag>info</tag>
<tag2> info <anothertag>info</anothertag> some more data here</tag2></html>"""
    parser = MyParser()
    parser.feed(n)
    parser.tagprint()
    parser.close()

if __name__ == "__main__":
    main()
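
A note for Python 3 readers: sgmllib was removed in Python 3, so the listing above only runs on Python 2. The start_TAGNAME / end_TAGNAME convention can be approximated on top of html.parser, though. This is a minimal sketch, assuming a hypothetical base class DispatchingParser that looks handler methods up by name with getattr; it is not a drop-in replacement for SGMLParser.

```python
from html.parser import HTMLParser


class DispatchingParser(HTMLParser):
    """Hypothetical helper: route tags to start_NAME / end_NAME methods."""

    def handle_starttag(self, tag, attrs):
        handler = getattr(self, "start_" + tag, None)
        if handler is not None:
            handler(attrs)

    def handle_endtag(self, tag):
        handler = getattr(self, "end_" + tag, None)
        if handler is not None:
            handler()


class TagParser(DispatchingParser):
    def __init__(self):
        super().__init__()
        self.in_tag = False
        self.in_tag2 = False
        self.tag_data = []
        self.tag2_data = []

    # Same naming scheme as the sgmllib listing: start_TAGNAME / end_TAGNAME.
    def start_tag(self, attrs):
        self.in_tag = True

    def end_tag(self):
        self.in_tag = False

    def start_tag2(self, attrs):
        self.in_tag2 = True

    def end_tag2(self):
        self.in_tag2 = False

    def handle_data(self, data):
        if self.in_tag:
            self.tag_data.append(data)
        if self.in_tag2:
            self.tag2_data.append(data)


doc = ("<html><tag>info</tag>\n"
       "<tag2> info <anothertag>info</anothertag> some more data here</tag2></html>")
parser = TagParser()
parser.feed(doc)
parser.close()

tag_text = "".join(parser.tag_data)
tag2_text = "".join(parser.tag2_data)
print("TAG :", tag_text)
print("TAG2 :", tag2_text)
```

Tags without a matching start_/end_ method are simply ignored, which is why <html> and <anothertag> need no handlers here.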


I prefer using sgmllib, but it's up to you really (I don't know enough about either module, or about your situation, to say which, if either, is better). Both of the above examples capture the data inside "anothertag", which takes a little more processing to get with a regex. You'll either have to come up with a second regex to remove the "anothertag" markup, or write a more complex regex with lookaheads and backreferences to remove it.

Using a second regex would take less code than either of the two examples above if you just need to remove all HTML inside tag2:

Code:
>>> import re
>>> n = "info <anothertag> info </anothertag>"
>>> reg = re.compile("<.+?>")
>>> reg.sub("", n)
'info  info '
>>>
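
Putting the two regex steps together might look like the sketch below: the first pattern picks out each tag and its raw content, and the second strips whatever markup is left inside. The function name extract_fields and the choice to return a dict are assumptions for illustration, and re.S is used on the assumption that content may span lines.

```python
import re

# Step 1: <tag>, <tag2>, ... together with their raw content.
OUTER = re.compile(r'<(tag[0-9]*)>(.*?)</\1>', re.S)
# Step 2: any markup remaining inside that content.
INNER = re.compile(r'<.+?>')


def extract_fields(html):
    """Return {tag name: content with nested markup removed} (hypothetical helper)."""
    return {name: INNER.sub('', body) for name, body in OUTER.findall(html)}


fields = extract_fields(
    '<tag> info </tag>\n'
    '<tag2> info <anothertag> info </anothertag></tag2>\n'
)
print(fields)
```

Note that the nested text is kept and only the markup is dropped, and that a tag name appearing twice would overwrite the earlier entry in the dict.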


If, however, you need a little bit more than grabbing two tags, the parser examples above may come in handy.
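
As one illustration of "a little bit more", here is a Python 3 sketch of a parser that collects text for any set of tag names and drops text sitting inside nested tags, which neither listing above does. The names FieldParser, wanted and result are hypothetical, and the sketch assumes every nested tag is explicitly closed and that wanted tags are not nested inside each other.

```python
from html.parser import HTMLParser


class FieldParser(HTMLParser):
    """Hypothetical parser: collect only the direct text of the wanted tags."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.current = None   # name of the wanted tag we are inside, if any
        self.depth = 0        # how many other tags are open inside it
        self.fields = {}      # tag name -> list of text chunks

    def handle_starttag(self, tag, attrs):
        if self.current is None:
            if tag in self.wanted:
                self.current = tag
                self.fields.setdefault(tag, [])
        else:
            # Any tag opened inside a wanted tag counts as nesting.
            self.depth += 1

    def handle_endtag(self, tag):
        if self.current is None:
            return
        if self.depth == 0 and tag == self.current:
            self.current = None
        elif self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only when it sits directly inside the wanted tag.
        if self.current is not None and self.depth == 0:
            self.fields[self.current].append(data)

    def result(self):
        return {name: "".join(chunks) for name, chunks in self.fields.items()}


parser = FieldParser(["tag", "tag2"])
parser.feed("<html><tag>info</tag>\n"
            "<tag2> info <anothertag>info</anothertag> some more data here</tag2></html>")
parser.close()
fields = parser.result()
print(fields)
```

The resulting dict maps each tag name to a single string, which should be straightforward to pass on to whatever DB layer is in use.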

© 2003-2008 by Developer Shed. All rights reserved. DS Cluster 4 hosted by Hostway