#1
  1. No Profile Picture
    Junior Member
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2003
    Posts
    10
    Rep Power
    0

    webcrawler in python


    Hey everybody, I'm new here and pretty new to Python. I need to write a small script that can fetch web pages and submit information to them. Could anybody point me in the right direction, please?

    Thanks, your help is greatly appreciated,

    Corey

    edit: I would also like information on client sessions (how to use cookies on the client side).
  2. #2
  3. Banned ;)
    Devshed Supreme Being (6500+ posts)

    Join Date
    Nov 2001
    Location
    Woodland Hills, Los Angeles County, California, USA
    Posts
    9,592
    Rep Power
    4207
  4. #3
  5. Hello World :)
    Devshed Frequenter (2500 - 2999 posts)

    Join Date
    Mar 2003
    Location
    Hull, UK
    Posts
    2,537
    Rep Power
    69
    I'd also check out urllib's urlopen, which provides an easy way of getting a web page's source.

    I've got a small web crawler that I never finished for you to have a look at, but I'll post that later.

    Have fun,
    Mark.
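
    A minimal sketch of the urlopen approach mentioned above, extended to the two other things the original question asks about: submitting form data and keeping cookies on the client side. It assumes Python 3, where urlopen lives in urllib.request rather than directly in urllib. The helper names (fetch, build_form_request, make_session) are made up for illustration and are not from the thread.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def fetch(target, opener=None):
    """Return the raw bytes of a page.

    `target` may be a URL string or a urllib.request.Request.
    If `opener` is given, it is used instead of the default urlopen.
    """
    open_url = opener.open if opener is not None else urllib.request.urlopen
    with open_url(target) as response:
        return response.read()


def build_form_request(url, fields):
    """Build (but do not send) a POST request carrying form fields."""
    # urlencode turns {"a": "b c"} into "a=b+c"; the body must be bytes.
    data = urllib.parse.urlencode(fields).encode("ascii")
    # Supplying `data` makes urllib send the request as a POST.
    return urllib.request.Request(url, data=data)


def make_session():
    """Return (opener, jar); the opener stores and resends cookies via jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar
```

    Usage would look roughly like this, with a placeholder URL and field names: call opener, jar = make_session(), then fetch(build_form_request("http://example.com/login", {"user": "corey"}), opener). Any cookies the server sets end up in jar and are sent back on later requests made through the same opener.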
  6. #4
  7. Hello World :)
    Devshed Frequenter (2500 - 2999 posts)

    Join Date
    Mar 2003
    Location
    Hull, UK
    Posts
    2,537
    Rep Power
    69
    Hi,

    Here's the script I wrote. It goes through a list of stored websites and checks whether they have changed since they were last checked. If one has, it stores the new MD5 value for that site to be checked against later. The websites can be viewed in the template. Not exactly what you want, but the basics are there.

    Hope this helps,

    Take care,
    Mark.
    Last edited by netytan; August 8th, 2003 at 11:39 AM.
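
    The attached script is not reproduced in this thread, so the following is only a rough Python 3 sketch of the idea described above: keep one MD5 per site and report the sites whose checksum has changed. The JSON storage format and all function names are assumptions, and the pages are assumed to have been fetched already (a dict mapping each URL to its raw bytes).

```python
import hashlib
import json
import os


def checksum(page):
    """Return the MD5 hex digest of a page's raw bytes."""
    return hashlib.md5(page).hexdigest()


def load_checksums(path):
    """Load the stored {url: checksum} mapping, or {} on the first run."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)


def save_checksums(path, stored):
    """Write the {url: checksum} mapping back to disk as JSON."""
    with open(path, "w") as f:
        json.dump(stored, f, indent=2)


def find_changed(pages, stored):
    """Return URLs whose checksum differs from the stored one.

    `stored` is updated in place so it can be saved afterwards.
    A URL that has never been seen before counts as changed.
    """
    changed = []
    for url, page in pages.items():
        new = checksum(page)
        if stored.get(url) != new:
            changed.append(url)
            stored[url] = new
    return changed
```

    A run would load the mapping, call find_changed with the freshly fetched pages, and save the mapping again.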
  8. #5
  9. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2003
    Posts
    133
    Rep Power
    11
    Also, have a look at HarvestMan and spider.
  10. #6
  11. No Profile Picture
    Junior Member
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jul 2003
    Posts
    10
    Rep Power
    0
    Thanks!

    I'll take a look at your code, netytan, and use python.org as a resource.
  12. #7
  13. Hello World :)
    Devshed Frequenter (2500 - 2999 posts)

    Join Date
    Mar 2003
    Location
    Hull, UK
    Posts
    2,537
    Rep Power
    69
    Hi Alpha,

    Please disregard the code I posted; it seems I posted the wrong version, which doesn't actually work. It is just the template side of it. Sorry if I wasted any of your time.

    I will try to find the right code for you a little later tonight. In the meantime, you can take a look at this code.

    Code:
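    #!/usr/bin/env python
    # Python 2: uses the old urllib.urlopen and the md5 module.

    import urllib, md5

    # Fetch the raw source of the page.
    page = urllib.urlopen('http://australianit.news.com.au').read()

    # hexdigest() returns printable hex text. The raw bytes from digest()
    # can contain whitespace that strip() would remove on the next read.
    checksum = md5.new(page).hexdigest()

    # Compare with the stored checksum and update it if the page changed.
    if open('md5.txt', 'r').read().strip() != checksum:
        print 'Page has been changed\n'
        open('md5.txt', 'w').write(checksum)
    else:
        print 'Page has not been changed\n'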
    You will need to create md5.txt before running it, but it should fetch the source of any web page you give it. The source is then converted to an MD5 checksum and compared with the stored one. If the page has changed, the new checksum is stored and the 'Page has been changed' line is printed.

    Hope this is of more help,
    Mark.
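
    For anyone on Python 3, where the md5 module is gone and hashlib takes its place, the same check could be sketched as below. This is an adaptation under assumptions, not Mark's code: the function name page_changed is invented, the page is passed in as bytes rather than fetched inside the function, and a missing md5.txt is treated as "changed" so the file does not have to be created by hand.

```python
import hashlib
import os


def page_changed(page, path="md5.txt"):
    """Compare the MD5 of `page` (bytes) with the checksum stored at `path`.

    Returns True and stores the new checksum when they differ, including
    on the first run when `path` does not exist yet. Returns False when
    the stored checksum matches.
    """
    new = hashlib.md5(page).hexdigest()

    old = None
    if os.path.exists(path):
        with open(path) as f:
            old = f.read().strip()

    if old == new:
        return False

    # Remember the new checksum for the next comparison.
    with open(path, "w") as f:
        f.write(new)
    return True
```

    To check a live page, pass it the bytes returned by urllib.request.urlopen(url).read().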
