Thread: again htmllib

    #1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2004
    Posts
    10
    Rep Power
    0

    again htmllib


    How can I parse a webpage so that I can get all the links on it? Is it something like this:

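    import formatter
    import htmllib
    import urllib

    # Python 2 only: htmllib and urllib.urlopen were removed in Python 3
    file = urllib.urlopen("http://www.python.org")
    html = file.read()
    file.close()

    # HTMLParser requires a formatter; NullFormatter discards rendered output
    p = htmllib.HTMLParser(formatter.NullFormatter())
    p.feed(html)
    p.close()

    # anchorlist collects the href of every <a> tag the parser saw
    for v in p.anchorlist:
        print v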

    (My problem is that I've only been learning Python for 2 days, so this is all somewhat new to me.)
  2. #2
  3. Hello World :)
    Devshed Frequenter (2500 - 2999 posts)

    Join Date
    Mar 2003
    Location
    Hull, UK
    Posts
    2,537
    Rep Power
    69
    Never had much reason to use htmllib, but I think you might need to call the anchor_bgn() method to tell htmllib what you want to collect. Anyway, here's an example using regex with urllib.

    Code:
    >>> import re, urllib
    >>> re.findall('<a href="(.+?)">', urllib.urlopen('http://www.python.org/').read())
    ['./', './search/', './download/', './doc/', './Help.html', './dev/', './community/', './sigs/', 'doc/Summary.html', 'doc/faq/', '2.3.3/', 'doc/2.3.3/', '2.2.3/', 'doc/2.2.3/', 'download/download_mac.html', 'http://www.jython.org/', 'http://www.python.org/pypi', ...
    >>>
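One caveat with the regex above: it only matches anchors written exactly as `<a href="...">`, so tags that carry other attributes first, or that use single quotes, are skipped. As a rough sketch of a parser-based alternative, the block below uses `html.parser` from Python 3 (a swapped-in module, since htmllib only exists in Python 2). The `LinkCollector` class, the `extract_links` function, and the `SAMPLE` markup are all made up for illustration and are not part of any library.

```python
# Python 3 sketch: htmllib was removed in Python 3, so this uses html.parser.
from html.parser import HTMLParser
import re


class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and names are lowercased
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the href values of all anchors in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample markup, so the example runs without network access
SAMPLE = (
    '<a href="./doc/">Docs</a> '
    '<a class="nav" href="./download/">Download</a> '
    "<a href='./dev/'>Dev</a>"
)

parser_links = extract_links(SAMPLE)
regex_links = re.findall('<a href="(.+?)">', SAMPLE)

print(parser_links)  # all three anchors
print(regex_links)   # only the first anchor fits the regex
```

To run this against a live page in Python 3, the HTML string would presumably come from `urllib.request.urlopen(url).read()` decoded to text, in place of `SAMPLE`.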
    Mark.
    programming language development: www.netytan.com Hula

