  1. #1
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jan 2003
    Location
    el paso, texas
    Posts
    9
    Rep Power
    0

    how to write a spider in python?


    I need to write a program that will retrieve the text from web sites: I'd supply a list of URLs, and it would get me all the text under each of them.
  2. #2
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2002
    Posts
    3
    Rep Power
    0
    You can use the urllib module to retrieve text from a web page.
    The urlopen function returns a file-like object; calling read() on it gives you the HTML code of the specified web page.

    Example taken from the Python manual:

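    import urllib
    # Encode the form fields as a query string.
    params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
    # Passing params as the second argument sends them as a POST request.
    f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query", params)
    # f is a file-like object; read() returns the page's HTML.
    print f.read()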
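
To go from one page to a whole list of URLs and get plain text rather than raw HTML, something along these lines might work. This is only a rough sketch: it assumes Python 3, where `urlopen` lives in `urllib.request` and the parser class in `html.parser`, and it assumes the pages decode as UTF-8. The names `TextExtractor`, `html_to_text`, `fetch_text` and `fetch_all` are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collects the text between tags, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the text of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def fetch_text(url):
    """Download one page and return its text (assumes UTF-8 content)."""
    with urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    return html_to_text(html)


def fetch_all(urls):
    """Return a dict mapping each URL in the list to its page text."""
    return {url: fetch_text(url) for url in urls}
```

You would then call `fetch_all` with your own list of URLs. There is no error handling here, so one unreachable site stops the whole run; wrapping the `fetch_text` call in a `try`/`except` is probably the first thing to add.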
  4. #3
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jan 2003
    Location
    el paso, texas
    Posts
    9
    Rep Power
    0
    thank you so much!

    I will check out urllib in the manual.

    I also need to know about following links from the web page; I assume I will find the info in the urllib module.
  6. #4
    Junior Member
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2001
    Location
    Delft
    Posts
    1
    Rep Power
    0
    urllib will only return the HTML of the URL you feed it.

    If you want to follow links from that webpage you can get them out with either regular expressions (re module) or a parser (HTMLParser module).
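
For the parser route, a minimal sketch could look like this. It assumes Python 3, where the HTMLParser class is imported from `html.parser`; `LinkExtractor` and `extract_links` are hypothetical names, and `urljoin` is used so that relative links come out as full URLs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links such as "page2.html" into full URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the absolute URLs linked from an HTML string."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

The base URL should be the address the page was downloaded from, otherwise relative links resolve to the wrong place.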

    Unless you have a compelling reason to use a parser, I'd recommend ripping the URLs out quick and dirty with a regular expression. You can then feed them to urllib again.
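
The regular-expression approach, with the links fed back into urllib, might be sketched like this. It assumes Python 3 (`urllib.request`, `urllib.parse`) and UTF-8 pages. The pattern, the names `find_links` and `crawl`, and the `max_pages` and `fetch` parameters are all illustrative choices, not anything from the manual; `fetch` is only there so the download step can be swapped out.

```python
import re
from urllib.parse import urljoin
from urllib.request import urlopen

# Quick and dirty: matches href="..." or href='...'. It misses unquoted
# attributes and may also pick up links inside comments or scripts.
HREF_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def find_links(html, base_url):
    """Return the absolute URLs of all quoted href values in html."""
    return [urljoin(base_url, href) for href in HREF_RE.findall(html)]


def download(url):
    """Fetch one page as text (assumes UTF-8 content)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_urls, max_pages=10, fetch=download):
    """Breadth-first crawl; returns {url: html} for the pages fetched."""
    queue = list(start_urls)
    seen = set(queue)  # URLs already queued, so nothing is fetched twice
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        try:
            html = fetch(url)
        except (OSError, ValueError):
            continue  # skip pages that fail to download
        pages[url] = html
        for link in find_links(html, url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

The `max_pages` limit matters: without some cap, following every link will wander off across the whole web. You may also want to keep only links that stay on the sites you started from.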
