January 6th, 2003, 05:16 PM
how to write a spider in python?
I need to write a program that will retrieve the text from web sites; I'd supply a list, it would get me all the text under the given URLs.
January 9th, 2003, 12:10 AM
You can use the urllib module to retrieve text from a web page.
The urlopen function returns the HTML code of the specified web page.
Example take from the Python manual:
January 9th, 2003, 05:42 AM
thank you so much!
I will check out the urllib in the manual.
I also need to know about following links from the web page; I assume I will find the info in the urllib module.
January 11th, 2003, 04:54 PM
The urllib wil only return the HTML of the url you feed it.
If you want to follow links from that webpage you can get them out with either regular expressions (re module) or a parser (HTMLParser module).
Unless you have a compelling reason I'd recommend ripping the urls out quick and dirty with a regular expression. You can then feed them to urllib again.