#1
    7imz
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date: Feb 2004
    Posts: 10
    Rep Power: 0

    Trying to create a web crawler


    I'm trying to create a web crawler where, given a website, I follow all the links up to three levels deep. The good news is I can do all of this, but I want to ignore all the dead ends (links leading to .jpg files, .pdf files, etc.). Any suggestions would be greatly appreciated.

    Here's the fragment of code that I'm mainly relying on:

    import re
    import urllib

    # Quick-and-dirty link extraction with a regex; an HTML parser would be
    # more robust, but this works for simple pages.
    pattern = '<a href="(.+?)">'
    links = re.findall(pattern, urllib.urlopen('http://www.python.org/').read())

    for l in links:
        print l
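
    For context, the three-level part looks roughly like this (a rough sketch; crawl is just a placeholder name, and it doesn't filter dead ends yet):

    import re
    import urllib
    from urlparse import urljoin

    pattern = '<a href="(.+?)">'

    def crawl(url, depth):
        # Stop recursing once we've gone the requested number of levels.
        if depth == 0:
            return
        try:
            page = urllib.urlopen(url).read()
        except IOError:
            return  # unreachable URL; skip it
        for link in re.findall(pattern, page):
            link = urljoin(url, link)  # resolve relative hrefs
            print link
            crawl(link, depth - 1)

    crawl('http://www.python.org/', 3)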
#2
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date: Jan 2004
    Posts: 84
    Rep Power: 11
    Originally Posted by 7imz
    I'm trying to create a web crawler... I want to ignore all the dead ends (links leading to .jpg files, .pdf files, etc.)... any suggestions would be greatly appreciated.
    Just checking the extension of each link would probably work. If it doesn't have an extension (a link like python.org/search/), or the extension is in a list of valid extensions (which you would need to specify), then keep it as a valid link.
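
    Something like this sketch, perhaps (the whitelist here is just an example; the empty extension keeps pages like python.org/search/):

    from urlparse import urlparse
    import posixpath

    # Example whitelist; '' keeps extensionless links like python.org/search/.
    valid_extensions = ('', '.html', '.htm', '.php', '.asp')

    def is_valid_link(url):
        path = urlparse(url)[2]                    # the path part of the URL
        ext = posixpath.splitext(path)[1].lower()  # e.g. '.jpg', or '' if none
        return ext in valid_extensions

    links = [l for l in links if is_valid_link(l)]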
#3
    Hello World :)
    Devshed Frequenter (2500 - 2999 posts)

    Join Date: Mar 2003
    Location: Hull, UK
    Posts: 2,537
    Rep Power: 69
    It's probably better (and easier) to check against a list of common unwanted file types; that way you don't exclude a possibly valid page. You can then delete unwanted entries from the list using the del statement. Then, obviously, you're going to need to store your results.
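
    For example (a rough sketch; the list of unwanted types is only illustrative):

    from urlparse import urlparse
    import posixpath

    # Common unwanted file types (illustrative, not exhaustive).
    unwanted = ['.jpg', '.jpeg', '.gif', '.png', '.pdf', '.zip', '.exe']

    # Walk the list backwards so del doesn't shift indexes we still need to visit.
    for i in range(len(links) - 1, -1, -1):
        ext = posixpath.splitext(urlparse(links[i])[2])[1].lower()
        if ext in unwanted:
            del links[i]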

    Mark.
    programming language development: www.netytan.com Hula

