#1
trying to create a webcrawler
I'm trying to create a web crawler that, given a website, follows all the links up to three levels deep. The good thing is I can already do all of this, but I want to ignore the dead ends (links leading to JPG files, PDF files, etc.). Any suggestions would be greatly appreciated.
Here's the fragment of code that I'm mainly relying on:

    import re
    import urllib

    pattern = '<a href="(.+?)">'
    links = re.findall(pattern, urllib.urlopen('http://www.python.org/').read())
    for l in links:
        print l
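For comparison, here is a minimal sketch of the same link-extraction step in Python 3, using the standard library's `html.parser` instead of a regular expression. This is not the original poster's code: the names `LinkCollector` and `extract_links` are made up for illustration. The idea is that a parser also picks up `<a>` tags that carry other attributes besides `href`, which the pattern `<a href="(.+?)">` would miss, and `urljoin` turns relative links into absolute ones.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links such as "/about/" against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all link targets found in an HTML string (hypothetical helper)."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Illustrative HTML; a real crawler would pass in the downloaded page text.
sample = '<a class="nav" href="/about/">About</a> <a href="logo.jpg">Logo</a>'
print(extract_links(sample, "http://www.python.org/"))
```

In Python 3 the page itself would be fetched with `urllib.request.urlopen(url).read()` and decoded to text before being passed to `extract_links`.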
#2
Just checking the extension of each link would probably work. If a link has no extension (a link like python.org/search/), or its extension is in a list of valid extensions (which you would need to specify), then keep it as a valid link.
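A rough sketch of that whitelist check in Python 3 might look like the following. The set `VALID_EXTENSIONS` and the function names are assumptions for illustration, and the list of extensions is only a guess at what counts as a followable page. The extension is taken from the path part of the URL so that a query string does not get in the way.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical whitelist; adjust it to the kinds of pages you want to follow.
VALID_EXTENSIONS = {".html", ".htm", ".php", ".asp", ".aspx", ".jsp"}


def link_extension(url):
    """Return the lowercased extension of the URL's path, or '' if it has none."""
    path = urlparse(url).path
    return posixpath.splitext(path)[1].lower()


def is_followable(url):
    """Keep links with no extension or with an extension on the whitelist."""
    ext = link_extension(url)
    return ext == "" or ext in VALID_EXTENSIONS


for url in ("http://www.python.org/search/",
            "http://www.python.org/index.html",
            "http://www.python.org/logo.JPG?size=2"):
    print(url, is_followable(url))
```

This assumes the links are already absolute URLs with a scheme. A bare string like `python.org` would be read as a path ending in `.org`, so relative or scheme-less links should be resolved (for example with `urllib.parse.urljoin`) before the check.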
#3
It's probably better (and easier) to check against a list of common unwanted file types. This way you don't exclude a possibly valid page. You can then delete the unwanted entries from your list of links using the del statement. Then, obviously, you're going to need to store your results.
Mark.
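As a sketch of the blacklist approach described in this reply (Python 3, with made-up names and example URLs), the code below checks each link against a set of unwanted file types, removes the matches in place with `del`, and stores what is left in a set. `UNWANTED_EXTENSIONS`, `is_dead_end`, and `prune_dead_ends` are assumptions for illustration, not anything from the thread.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical blacklist of dead-end file types; extend it as needed.
UNWANTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".pdf", ".zip", ".css", ".js"}


def is_dead_end(url):
    """True if the URL's path ends in one of the unwanted extensions."""
    ext = posixpath.splitext(urlparse(url).path)[1].lower()
    return ext in UNWANTED_EXTENSIONS


def prune_dead_ends(links):
    """Remove dead-end links from the list in place using del."""
    # Walk backwards so deleting an entry does not shift the ones still to check.
    for i in range(len(links) - 1, -1, -1):
        if is_dead_end(links[i]):
            del links[i]
    return links


# Example links, invented for this sketch.
links = [
    "http://www.python.org/about/",
    "http://www.python.org/images/logo.jpg",
    "http://www.python.org/doc/guide.pdf",
    "http://www.python.org/download.html",
]
prune_dead_ends(links)

# Store the surviving links; a set also drops duplicates across pages.
visited = set(links)
print(links)
```

Building a new list with a list comprehension would work just as well as `del`. The backwards loop is shown only because deleting while iterating forwards skips entries.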