March 1st, 2004, 08:47 PM
trying to create a webcrawler
i'm trying to create a webcrawler that, given a website, follows all the links up to three levels deep... the good thing is i can do all of this, but i want to ignore all the dead ends (links leading to jpg files, pdf files, etc.)... any suggestions would be greatly appreciated
here's the fragment of code that i'm mainly relying on
import re, urllib

pattern = '<a href="(.+?)">'   # capture the target of each <a href="..."> tag
links = re.findall(pattern, urllib.urlopen('http://www.python.org/').read())
for l in links:
    print l
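the rest is basically a recursive function that calls itself on each link until it hits three levels, roughly like this (simplified sketch, urljoin just takes care of relative links):

import re, urllib, urlparse

def crawl(url, depth=0):
    if depth >= 3:          # stop after three levels
        return
    html = urllib.urlopen(url).read()
    for link in re.findall('<a href="(.+?)">', html):
        full = urlparse.urljoin(url, link)   # turn relative links into full urls
        # this is the spot where i want to skip the dead ends (jpg, pdf, ...)
        crawl(full, depth + 1)

crawl('http://www.python.org/')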
March 1st, 2004, 10:38 PM
just checking the extension of each link would probably work: if it doesn't have an extension (a link like python.org/search/), or the extension is in a list of valid extensions (which you'd need to specify), keep it as a valid link.
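something like this would do it (os.path.splitext pulls the extension off; the list of valid extensions is just a guess, add whatever you want to allow):

import os.path

page_exts = ['', '.html', '.htm', '.php', '.asp', '.cgi']   # '' covers links with no extension

def is_page(link):
    # keep the link if its extension is empty or in the list above
    return os.path.splitext(link)[1].lower() in page_exts

links = ['http://www.python.org/search/', 'logo.jpg', 'docs/tut.html', 'paper.pdf']
links = [l for l in links if is_page(l)]
print links   # ['http://www.python.org/search/', 'docs/tut.html']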
March 2nd, 2004, 05:14 AM
Originally Posted by 7imz
It's probably better (and easier) to check against a list of common unwanted file types. That way you don't exclude a possibly valid page. You can then delete unwanted entries from the list using the del statement. Then obviously you're going to need to store your results somewhere.
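for example (the list of unwanted types is just a sample, and the loop runs backwards so del doesn't skip entries while the list shrinks):

import os.path

unwanted = ['.jpg', '.jpeg', '.gif', '.png', '.pdf', '.zip', '.exe']

links = ['index.html', 'photo.jpg', 'search/', 'report.pdf']

# walk the list from the end so del doesn't throw the indexes off
for i in range(len(links) - 1, -1, -1):
    if os.path.splitext(links[i])[1].lower() in unwanted:
        del links[i]

print links   # ['index.html', 'search/']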