July 30th, 2003, 10:28 AM
webcrawler in python
Hey everybody, I'm new here and pretty new to Python. I need to write a small script that can fetch web pages and submit information to them. Could anybody point me in the right direction, please?
Thanks, your help is greatly appreciated,
July 30th, 2003, 12:06 PM
July 30th, 2003, 03:32 PM
I'd also check out urllib's urlopen, which provides an easy way of getting a web page's source.
I've got a small webcrawler that I never finished that you can have a look at, but I'll post that later.
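To make the urlopen suggestion concrete, here is a minimal sketch of both halves of the original question: fetching a page and submitting form data. It assumes a current Python 3 interpreter, where urlopen lives in urllib.request rather than directly in urllib. The function names are made up for illustration, and the exact form fields depend entirely on the site you are talking to.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def fetch_page(url, timeout=10):
    """Return the raw bytes of the page at `url`."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def build_form_request(url, fields):
    """Build (but do not send) a POST request carrying `fields` as form data."""
    # urlencode turns {'q': 'a b'} into 'q=a+b'; the body must be bytes.
    body = urlencode(fields).encode("ascii")
    # Passing `data` makes urllib send the request as a POST.
    return Request(url, data=body)


def submit_form(url, fields, timeout=10):
    """Send the form and return the raw bytes of the server's reply."""
    with urlopen(build_form_request(url, fields), timeout=timeout) as response:
        return response.read()
```

Something like fetch_page('http://www.python.org') would give you the page source, and submit_form(some_url, {'name': 'value'}) would post a form, where some_url and the field names are placeholders you would read off the target page's HTML.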
August 1st, 2003, 10:40 AM
Here's the script I wrote. It goes through a list of stored websites and checks whether they have changed since they were last checked. If one has, it stores the new MD5 value for that site to be checked against later. The websites can be viewed in the template. Not exactly what you want, but the basics are there.
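The script itself isn't reproduced in this post, so the following is only a rough sketch of the idea as described, not the actual code. It assumes Python 3, keeps the stored checksums in a plain dictionary, and takes the download step as a parameter so the comparison can be tried without a network connection. The names checksum and find_changed are hypothetical.

```python
import hashlib


def checksum(page_bytes):
    """Reduce a page's source to a printable MD5 checksum."""
    return hashlib.md5(page_bytes).hexdigest()


def find_changed(stored, fetch):
    """Check every stored site and return the URLs whose content changed.

    stored -- dict mapping URL to the checksum seen at the last check
    fetch  -- callable that takes a URL and returns the page as bytes
    """
    changed = []
    for url in list(stored):
        current = checksum(fetch(url))
        if stored[url] != current:
            # Remember the new value so the next run compares against it.
            stored[url] = current
            changed.append(url)
    return changed
```

In real use, fetch would be something that downloads the page, and stored would be loaded from and saved back to wherever the site list is kept.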
Hope this helps,
Last edited by netytan; August 8th, 2003 at 11:39 AM.
August 3rd, 2003, 08:38 AM
August 8th, 2003, 10:09 AM
I'll take a look at your code, netytan, and use python.org as a resource.
August 8th, 2003, 11:54 AM
Please disregard the code I posted. It seems I posted the wrong version, which doesn't actually work; it is just the template side of it. Sorry if I wasted any of your time.
I will try to find the right code for you a little later tonight. In the meantime you can take a look at this code.
You will need to create md5.txt before running it, but it should get the source code of any web page sent to it. This is then converted to an MD5 checksum and stored for comparison. If the page has changed, the new checksum is stored and the 'Page has been changed' line is printed.
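import urllib, md5
# Fetch the page source and reduce it to a printable MD5 checksum.
page = urllib.urlopen('http://australianit.news.com.au').read()
checksum = md5.new(page).hexdigest()
# Compare against the stored checksum and save the new one if it differs.
if open('md5.txt', 'r').read().strip() != checksum:
    open('md5.txt', 'w').write(checksum)
    print 'Page has been changed\n'
else:
    print 'Page has not been changed\n'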
Hope this is of more help,
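For anyone reading this on Python 3, where urllib.urlopen and the md5 module no longer exist, roughly the same check could look like the sketch below. It is an adaptation, not the code above: the function names are made up, fetching is split from comparing so the comparison can be tested offline, and a missing md5.txt is treated as "changed" instead of having to be created by hand.

```python
import hashlib
import os
from urllib.request import urlopen


def page_checksum(url, timeout=10):
    """Download `url` and return the MD5 checksum of its source as hex text."""
    with urlopen(url, timeout=timeout) as response:
        return hashlib.md5(response.read()).hexdigest()


def has_changed(checksum, path="md5.txt"):
    """Compare `checksum` with the one stored in `path`.

    Returns True and stores the new checksum if they differ,
    otherwise returns False and leaves the file alone.
    """
    old = ""
    if os.path.exists(path):
        with open(path) as stored:
            old = stored.read().strip()
    if old == checksum:
        return False
    with open(path, "w") as stored:
        stored.write(checksum)
    return True
```

A run would then be along the lines of has_changed(page_checksum('http://australianit.news.com.au')), printing whichever message you like depending on the result.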