
March 1st, 2004, 10:38 PM
Quote: Originally Posted by 7imz
I'm trying to create a web crawler where, given a website, I would follow all the links up to three levels deep... the good thing is I can do all of this, but I want to ignore all the dead ends (links leading to jpg files, pdf files, etc.)... any suggestions would be greatly appreciated.
Here's the fragment of code that I'm mainly relying on:
import re
import urllib
pattern = '<a href="(.+?)">'
links = re.findall(pattern, urllib.urlopen('http://www.python.org/').read())
for l in links:
    print l
Just checking the extension of each link would probably work. If a link doesn't have an extension (a link like python.org/search/), or its extension is in a list of valid extensions (which you would need to specify), then keep it as a valid link.
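Something like this is what I have in mind. It's only a rough sketch, written for Python 3 (the quoted fragment is Python 2, where `urlparse` lives in the `urlparse` module instead of `urllib.parse`). The names `VALID_EXTENSIONS` and `is_crawlable` and the extensions in the list are my own placeholders, not anything from your code, so swap in whatever the sites you crawl actually use.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical whitelist of page-like extensions; adjust to taste.
VALID_EXTENSIONS = {".html", ".htm", ".php", ".asp"}

def is_crawlable(link):
    """Return True if the link has no extension or a whitelisted one."""
    # Look only at the path, so a query string or fragment
    # (e.g. "?file=x.pdf" or "#top") cannot confuse the check.
    path = urlparse(link).path
    # splitext only considers the last path component, so the dots in
    # a host name written as part of the path ("python.org/search/")
    # are ignored as long as something follows the host.
    ext = posixpath.splitext(path)[1].lower()
    return ext == "" or ext in VALID_EXTENSIONS

# Example: filter a list of links such as the one re.findall returns.
links = [
    "http://www.python.org/",
    "python.org/search/",
    "/doc/index.html?ref=manual.pdf",
    "http://www.python.org/images/logo.JPG",
    "files/paper.pdf#page=2",
]
crawlable = [l for l in links if is_crawlable(l)]
```

One thing to watch: a bare host with no scheme and no trailing slash (just "python.org") would look like a file with a ".org" extension, so it is probably worth resolving relative links against the page URL before filtering.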