#1
again htmllib
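How can I parse a web page so that I can get all the links on it? Is it something like this?
Code:
import formatter
import htmllib
import urllib

# Python 2: download the page source
page = urllib.urlopen("http://www.python.org")
html = page.read()
page.close()

# htmllib.HTMLParser needs a formatter; NullFormatter discards rendered text
p = htmllib.HTMLParser(formatter.NullFormatter())
p.feed(html)
p.close()

# anchorlist collects the href of each <a> tag the parser saw
for v in p.anchorlist:
    print v
(My problem is that I've only been learning Python for 2 days, so this is all somewhat new to me.)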
#2
I've never had much reason to use htmllib, but I think you might need to call the anchor_bgn() method to tell htmllib what you want to collect. Anyway, here's an example using a regex with urllib.
Code:
>>> import re, urllib
>>> re.findall('<a href="(.+?)">', urllib.urlopen('http://www.python.org/').read())
['./', './search/', './download/', './doc/', './Help.html', './dev/', './community/', './sigs/', 'doc/Summary.html', 'doc/faq/', '2.3.3/', 'doc/2.3.3/', '2.2.3/', 'doc/2.2.3/', 'download/download_mac.html', 'http://www.jython.org/', 'http://www.python.org/pypi', ...
>>>
Mark.
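For anyone reading this on Python 3, where htmllib no longer exists, here is a rough sketch of the same idea using the standard library's html.parser instead. The names LinkCollector and extract_links and the sample HTML string are made up for illustration and are not from the posts above. A parser also picks up links the regex would miss, such as anchors with other attributes before href or with single-quoted values.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag it is fed."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links such as "./doc/" against the base URL
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url=""):
    """Return all link targets found in an HTML string, in document order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample input so the sketch runs without network access
sample = (
    '<p><a class="nav" href="./doc/">Docs</a> '
    "<a href='http://www.jython.org/'>Jython</a></p>"
)
for link in extract_links(sample, "http://www.python.org/"):
    print(link)
```

To run it against a live page, the HTML string could come from urllib.request.urlopen(url).read().decode(), assuming the page's encoding is known or can be guessed.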