#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2013
    Posts
    1
    Rep Power
    0

    Need a little help with html parsing.


    So I am a noob at pretty much everything. I'll get that out of the way.

    I am looking at lxml as a way to manipulate HTML documents. Right now I am trying to do something simple: remove all of the <h1> headings from an HTML doc. Here is what I have so far:

    Code:
    import urllib2
    from sys import argv
    from lxml import etree
    import lxml.html
    
    n=0
    
    f = open ("testsite.html","r")
    data = f.read()
    
    h = lxml.html.fromstring(data)
    
    for hdr in h.xpath("//h1"):
        hdr.getparent().remove(hdr)
        print n
        n = n + 1
        
    print n
    print h
    What I cannot seem to wrap my head around at this point is which variable holds the string with the edited file (so I can write it back to another text file). Any recommendations?
  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2013
    Posts
    138
    Rep Power
    2
    Originally Posted by sona1111
    What I cannot seem to wrap my head around at this point is which variable holds the string with the edited file (so I can write it back to another text file). Any recommendations?
    The updated HTML tree is held in the variable 'h' in your example above. But 'h' is an element object, not a string, so to write it to another file you must first serialize it to a text string.

    Here is an updated version of your example. I've included the HTML data inline, but it won't make a difference if you read it from a file instead.

    Code:
    from lxml import etree
    from StringIO import StringIO
    
    data = """<html><head></head><body>
    <h1>Heading 1</h1>
    <h2>Heading 2</h2>
    Text<br>
    More text<br>
    </body></html>"""
    
    parser = etree.HTMLParser()
    # Create a tree from the html text string
    tree = etree.parse(StringIO(data), parser)
    
    # Iterate over all '<h1>' elements and remove them
    for h in tree.xpath("//h1"):
        h.getparent().remove(h)
    
    # Finally, create a string representation of the tree again
    text = etree.tostring(tree.getroot(), pretty_print=True, method="html")
    print text
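
For writing the result back to a file, something along these lines might work. This is only a sketch, written for Python 3 rather than the Python 2 used above, and the helper names strip_tags and strip_file, the output file name, and the UTF-8 encoding are my own assumptions, not anything from the original post. It uses lxml.html's drop_tree() instead of getparent().remove(), because remove() also discards any loose text that directly follows the removed element (its "tail"), which may or may not be what you want.

```python
import lxml.html


def strip_tags(html_text, tag="h1"):
    """Return html_text with every <tag> element removed (hypothetical helper)."""
    root = lxml.html.fromstring(html_text)
    # xpath() returns a plain list, so removing nodes while looping is safe.
    for el in root.xpath("//" + tag):
        # drop_tree() removes the element and its children but keeps the
        # text that follows it; getparent().remove(el) would drop that too.
        el.drop_tree()
    # Serialize the tree back to text; encoding="unicode" gives str, not bytes.
    return lxml.html.tostring(root, encoding="unicode", method="html")


def strip_file(src_path, dst_path, tag="h1"):
    """Read src_path, strip <tag> elements, write the result to dst_path."""
    # Assumption: both files are UTF-8 encoded.
    with open(src_path, encoding="utf-8") as src:
        cleaned = strip_tags(src.read(), tag)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(cleaned)
```

With the file from the first post, a call such as strip_file("testsite.html", "testsite_no_h1.html") would write the edited document to a second file, where the second name is just a placeholder.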
