#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jan 2005
    Posts
    1
    Rep Power
    0

    Find size of a web page


    What I really want to do is check that a web page has not been updated. However, not all pages include the Last-Modified header. So...
    I've been trying to find the size of a web page so I can compare past and current sizes. Unfortunately, I keep running into issues. There's no convenient getsize() method, and I tried using the HTTP header 'Content-Length', but not every page sends it. Is there any way I can find out the size of a web page without needing to download it? Or is there a better way to verify whether a web page has been updated?
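For reference, the header check described in this post could be sketched roughly as below. This is only an illustration: `change_hints` and `fetch_hints` are made-up names, and looking at `ETag` is an extra assumption that the post does not mention. A HEAD request asks the server for the headers only, so the body is not downloaded, but any of these headers may simply be missing.

```python
import urllib.request


def change_hints(headers):
    """Pick out the response headers that hint at a change, if present."""
    length = headers.get("Content-Length")
    return {
        "last_modified": headers.get("Last-Modified"),
        "etag": headers.get("ETag"),  # assumption: not mentioned in the thread
        "content_length": int(length) if length is not None else None,
    }


def fetch_hints(url):
    """Send a HEAD request so only the headers are transferred."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        return change_hints(response.headers)
```

If every value comes back as `None`, the headers tell you nothing and the page has to be downloaded and compared some other way.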
  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2004
    Posts
    461
    Rep Power
    25
    Well, the only way I can see to find out if a page has been updated is to download it and test the size. You could also MD5 the page to make sure that it is exactly the same; if the hash differs, you know the page has been updated.
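A minimal sketch of the MD5 idea, assuming the digest from the previous check was stored somewhere. The function names `fetch`, `digest` and `has_changed` are hypothetical, not from any library:

```python
import hashlib
import urllib.request


def fetch(url):
    """Download the whole page body as bytes."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read()


def digest(body):
    """Return the MD5 hex digest of the page body."""
    return hashlib.md5(body).hexdigest()


def has_changed(body, previous_digest):
    """True if the body no longer matches the digest saved last time."""
    return digest(body) != previous_digest
```

Typical use would be something like `has_changed(fetch(url), saved_digest)`, saving the new digest afterwards. MD5 is used here only as a change detector, not for anything security related.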

    Just with a bit of logic: if you are going to be downloading the page anyway to test whether it has been updated, I think the best bet would be to just take the whole page and show it again. That way you don't use clock cycles checking for updates when you could have already displayed the content.

    However, I am not sure that there isn't a better way, but I don't think there is.
  4. #3
  5. Hello World :)
    Devshed Frequenter (2500 - 2999 posts)

    Join Date
    Mar 2003
    Location
    Hull, UK
    Posts
    2,537
    Rep Power
    69
    MD5 is the only real way to do this since, as you mentioned, not all pages have the same headers. Still, there is a problem doing this with dynamic pages, particularly those that include dynamically generated ads, since the page will be different every time even though the content that you're actually interested in won't have changed.
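One possible way around the dynamic-ads problem is to hash only the part of the page you care about. The sketch below assumes the interesting content sits between two known marker strings, which is purely an assumption about the page; `region_digest` and the markers are hypothetical:

```python
import hashlib


def region_digest(html, start_marker, end_marker):
    """MD5 of the text between two markers, or of the whole page if absent."""
    start = html.find(start_marker)
    end = -1
    if start != -1:
        end = html.find(end_marker, start + len(start_marker))
    if start == -1 or end == -1:
        region = html  # markers missing: fall back to hashing everything
    else:
        region = html[start:end + len(end_marker)]
    return hashlib.md5(region.encode("utf-8")).hexdigest()
```

Two downloads that differ only outside the markers then produce the same digest, while a change inside the markers produces a different one.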

    You could download the page and use a module like filecmp or difflib to find the exact changes. This could even make it possible for you to select the parts of the page you are interested in (on a per-change basis). Support for marking a page as dynamic would also be handy.

    http://www.python.org/doc/2.4/lib/module-difflib.html
    http://www.python.org/doc/2.4/lib/module-filecmp.html
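As a rough sketch of the difflib suggestion, assuming both the saved copy and the fresh download are available as text (`page_changes` is a hypothetical helper name):

```python
import difflib


def page_changes(old_text, new_text):
    """Return only the lines that were removed ('- ') or added ('+ ')."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    # ndiff also emits unchanged lines ('  ') and hint lines ('? '); skip them.
    return [line for line in diff if line.startswith(("+ ", "- "))]
```

An empty list means the two copies are identical line for line. Looking at the individual changed lines is what would let you ignore changes in parts of the page you don't care about.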

    It's a tricky problem to solve, but a good one if you can solve it well!

    Hope this helps,

    Mark.
    programming language development: www.netytan.com Hula

  6. #4
  7. Hello World :)
    Devshed Frequenter (2500 - 2999 posts)

    Join Date
    Mar 2003
    Location
    Hull, UK
    Posts
    2,537
    Rep Power
    69
    You might be interested in this command-line utility; it is written in Python and seems to work much as I suggested above.

    http://gertjan.freezope.org/g-jutils/g-urlmon/

    Mark.

