#1
  1. No Profile Picture
    Junior Member
    Devshed Newbie (0 - 499 posts)

    Join Date
    Oct 2001
    Posts
    10
    Rep Power
    0

    Parsing HTML file help


    I am trying to parse some static HTML files into a database. I need help with a regular expression in Python that will extract the content between specific tags.
    Example:

    <tag> info </tag>
    <tag2> info <anothertag> info </anothertag></tag2>

    I need to extract what's between <tag> and </tag> and assign it to a variable that will be used in the DB. I also need to extract what's between <tag2> and </tag2>, ignoring any other tags in between, and insert that into the DB as well.

    I have looked at HTML parsers, but none of them are making any sense to me at the moment.


    Thanks in advance,
  2. #2
  3. Hello World :)
    Devshed Frequenter (2500 - 2999 posts)

    Join Date
    Mar 2003
    Location
    Hull, UK
    Posts
    2,537
    Rep Power
    69
    OK, this isn't so hard, and it's something that has come up before. Anyway, I'd use this regex with findall():

    r'<(tag[0-9]*)>(.*?)</\1>'

    Note: This assumes that your tags are named tag, or tag followed by a number (tag2, tag3, ...), and not arbitrary names inside angle brackets.

    Code:
    >>> import re
    >>> # re.S lets '.' match newlines, so tag content can span several lines
    >>> regex = re.compile(r'<(tag[0-9]*)>(.*?)</\1>', re.S)
    >>> string = r'''
    <tag> info </tag>
    <tag2> info <anothertag> info </anothertag></tag2>
    '''
    >>> regex.findall(string)
    [('tag', ' info '), ('tag2', ' info <anothertag> info </anothertag>')]
    >>>
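    To get from the findall() result to variables you can hand to the DB layer, something like the following might work. It's only a sketch: extract_tags, TAG_RE and fields are made-up names, it's written for Python 3, and it assumes each tag name appears at most once per file (a repeated tag would overwrite the earlier value).

```python
import re

# Same pattern as above; re.S lets '.' match newlines so content may span lines.
TAG_RE = re.compile(r'<(tag[0-9]*)>(.*?)</\1>', re.S)


def extract_tags(html):
    """Return a dict mapping each matched tag name to its stripped content."""
    return {name: content.strip() for name, content in TAG_RE.findall(html)}


sample = """
<tag> info </tag>
<tag2> info <anothertag> info </anothertag></tag2>
"""

fields = extract_tags(sample)
tag_value = fields["tag"]      # content of <tag>, whitespace stripped
tag2_value = fields["tag2"]    # content of <tag2>, inner markup still present
```

    Note that tag2_value still contains the <anothertag> markup at this point; stripping it is a separate step.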
    Mark.
    programming language development: www.netytan.com Hula

  4. #3
  5. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2003
    Posts
    35
    Rep Power
    11

    This is ugly....


    Both of these are ugly (my doing--I haven't used either much), but they should help show some of the usage of HTMLParser and SGMLParser.

    Code:
    #!/usr/bin/python
    from HTMLParser import HTMLParser
    
    # create our own "parser"
    class MyParser(HTMLParser):
        def __init__(self):
            # need to init HTMLParser, as well as add our
            # own flags
            HTMLParser.__init__(self)
            # flags for tag and tag2: while a flag is set,
            # handle_data() keeps collecting text, including
            # text inside nested tags (as in your example)
            self.TAG, self.TAG2 = 0, 0
            # list to hold our tag/tag2 data
            self.TAGDATA, self.TAG2DATA = [], []
        # override the handle_starttag method
        def handle_starttag(self, tag, attrs):
            # in your example, we only have two
            # cases we need to check against;
            # turn the tag's flag ON, so we
            # know to capture the data
            if tag == "tag":
                self.TAG = 1
            elif tag == "tag2":
                self.TAG2 = 1
        # override the handle_endtag method
        # use this to turn the flag OFF, so
        # we know to stop capturing data
        def handle_endtag(self, tag):
            if tag == "tag":
                self.TAG = 0
            elif tag == "tag2":
                self.TAG2 = 0
        # override the handle_data method
        # we use the state of our flags to
        # decide which tag's data we're retrieving
        # the parser calls this method for each run of text between tags
        def handle_data(self, data):
            if self.TAG == 1:
                self.TAGDATA.append(data)
            elif self.TAG2 == 1:
                self.TAG2DATA.append(data)
        # create our own method, to print
        # the data out
        def tagprint(self):
            print "TAG : ", "".join(self.TAGDATA)
            print "TAG2 : ", "".join(self.TAG2DATA)
    def main():
        n = """<html><tag>info</tag>
    <tag2> info <anothertag>info</anothertag> some more data here</tag2></html>"""
        parser = MyParser()
        parser.feed(n)
        # close() flushes any buffered data, so call it before printing
        parser.close()
        parser.tagprint()
    
    if __name__ == "__main__":
        main()
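    For reference, here's a rough sketch of the same flag idea for Python 3, where the module is called html.parser. The TagTextParser class, its wanted argument and the text() helper are names made up for illustration; instead of one flag per tag it keeps a list of the wanted tags that are currently open.

```python
from html.parser import HTMLParser


class TagTextParser(HTMLParser):
    """Collect the text found inside each tag named in `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.open_tags = []  # wanted tags that are currently open
        self.data = {name: [] for name in self.wanted}

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            self.open_tags.remove(tag)

    def handle_data(self, data):
        # Text inside nested tags (e.g. <anothertag>) is captured too,
        # because the enclosing wanted tag is still open.
        for tag in self.open_tags:
            self.data[tag].append(data)

    def text(self, tag):
        return "".join(self.data[tag])


sample = """<html><tag>info</tag>
<tag2> info <anothertag>info</anothertag> some more data here</tag2></html>"""

parser = TagTextParser(["tag", "tag2"])
parser.feed(sample)
parser.close()
print("TAG :", parser.text("tag"))
print("TAG2 :", parser.text("tag2"))
```

    Adding another tag is then just a matter of passing its name in the list, rather than writing another flag and another branch.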
    You could also use sgmllib:

    Code:
    #!/usr/bin/python
    from sgmllib import SGMLParser
    
    class MyParser(SGMLParser):
        def __init__(self):
            SGMLParser.__init__(self)
            # flags for our tag1 and tag 2
            self.TAG, self.TAG2 = 0, 0
            # holds our tag1 and tag2 data
            self.TAGDATA, self.TAG2DATA = [], []
        def start_tag(self, attrs):
            self.TAG = 1
        def end_tag(self):
            self.TAG = 0
        def start_tag2(self, attrs):
            self.TAG2 = 1
        def end_tag2(self):
            self.TAG2 = 0
        # additional tags would follow the format above
        # i.e. start_TAGNAME, end_TAGNAME
        # to capture the A tag, you could use
        # start_a() and end_a()
        def handle_data(self, data):
            if self.TAG == 1:
                self.TAGDATA.append(data)
            if self.TAG2 == 1:
                self.TAG2DATA.append(data)
        def tagprint(self):
            print "TAG : ", "".join(self.TAGDATA)
            print "TAG2 : ", "".join(self.TAG2DATA)
    def main():
        n = """<html><tag>info</tag>
    <tag2> info <anothertag>info</anothertag> some more data here</tag2></html>"""
        parser = MyParser()
        parser.feed(n)
        # close() flushes any buffered data, so call it before printing
        parser.close()
        parser.tagprint()
    
    if __name__ == "__main__":
        main()
    I prefer using sgmllib, but it's up to you really (I don't know enough about either module or your situation to say which, if either, is better). Both of the above examples also capture the text inside "anothertag" while dropping the tag markup itself, which takes a little more processing with a regex: you'd either have to apply a second regex to strip the "anothertag" markup from the match, or write a more complex regex with lookaheads and backreferences to do it in one pass.

    Using a second regex would be less code than either of the two examples above if you just need to remove all HTML inside tag2:

    Code:
    >>> import re
    >>> n = "info <anothertag> info </anothertag>"
    >>> reg = re.compile("<.+?>")
    >>> reg.sub("", n)
    'info  info '
    >>>
    If, however, you need a little bit more than grabbing two tags, the parser examples above may come in handy.
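    Putting the two regex steps together might look roughly like this. It's a sketch, not a tested solution for real-world HTML: tag2_text, TAG2_RE and MARKUP_RE are hypothetical names, it's written for Python 3, and it only looks at the first <tag2> in the input.

```python
import re

TAG2_RE = re.compile(r'<tag2>(.*?)</tag2>', re.S)  # step 1: content of <tag2>
MARKUP_RE = re.compile(r'<.+?>')                   # step 2: any remaining tag


def tag2_text(html):
    """Return the text of the first <tag2> with inner markup removed, or None."""
    match = TAG2_RE.search(html)
    if match is None:
        return None
    return MARKUP_RE.sub("", match.group(1))


result = tag2_text("<tag2> info <anothertag> info </anothertag></tag2>")
print(repr(result))
```

    The whitespace around the removed tags is left alone, so you may want to strip() or collapse it before inserting into the DB.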
