January 27th, 2005, 07:34 PM
Find size of a web page
What I really want to do is check whether a web page has been updated. However, not all pages include the Last-Modified header. So...
I've been trying to find the size of a web page so I can compare past and current sizes. Unfortunately, I keep running into issues. There's no convenient getsize() method, and I tried using the HTTP 'Content-Length' header, but not every page sends it. Is there any way I can find out the size of a web page without downloading it? Or is there a better way to verify whether a web page has been updated?
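For reference, the header check described above could look roughly like the sketch below. It sends a HEAD request, so the body is not downloaded, and returns whatever the server chose to report. The function name fetch_change_hints and the dictionary keys are made up for illustration, and either value can be None because servers are not required to send these headers.

```python
import urllib.request


def fetch_change_hints(url, timeout=10):
    """Ask the server for headers only and return the change-related ones.

    Hypothetical helper: any value may be None, because a server is free
    to omit Last-Modified and Content-Length.
    """
    # method="HEAD" requests the headers without the page body.
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        headers = response.headers
        return {
            "last_modified": headers.get("Last-Modified"),
            "content_length": headers.get("Content-Length"),
        }
```

If both values come back as None, there is nothing to compare and the page has to be downloaded anyway.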
January 27th, 2005, 10:31 PM
Well, the only way I can see to find out if a page has been updated is to download it and test the size. You could also MD5 the page to check that it is exactly the same; if the hash differs, you know the page has been updated.
Just with a bit of logic: if you are going to be downloading the page anyway to test whether it has been updated, I think the best bet would be to just take the whole page and show it again. That way you don't spend clock cycles checking for updates when you could already have displayed the content.
However, I am not sure there isn't a better way, but I don't think there is.
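A minimal sketch of the MD5 idea, assuming you already have the downloaded page as bytes and have stored the hash from the previous visit somewhere. The names page_fingerprint and has_changed are just placeholders.

```python
import hashlib


def page_fingerprint(body: bytes) -> str:
    """Return the MD5 hex digest of the raw page bytes."""
    return hashlib.md5(body).hexdigest()


def has_changed(previous_fingerprint: str, body: bytes) -> bool:
    """True if the page bytes no longer match the stored fingerprint."""
    return page_fingerprint(body) != previous_fingerprint


# Example: store the fingerprint on the first visit, compare on the next.
first_visit = b"<html><body>Hello</body></html>"
stored = page_fingerprint(first_visit)
print(has_changed(stored, first_visit))                          # same bytes
print(has_changed(stored, b"<html><body>Hello!</body></html>"))  # edited page
```

Only the short digest needs to be kept between checks, not the whole page.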
January 28th, 2005, 10:36 AM
MD5 is the only real way to do this since, as you mentioned, not all pages send the same headers. Still, there is a problem doing this with dynamic pages, particularly those that include dynamically generated ads, since the page will be different every time but the content that you're actually interested in won't have changed.
You could download the page and use a module like filecmp or difflib to find the exact changes. This could even make it possible for you to select the parts of the page you are interested in (on a per-change basis). Support for marking a page as Dynamic would also be handy.
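Something along these lines is what I have in mind for the difflib route. It is only a sketch: changed_lines is a made-up name, and it assumes you have kept the text of the previous download so there is something to compare against.

```python
import difflib


def changed_lines(old_text: str, new_text: str) -> list:
    """Return only the lines that were removed ("- ") or added ("+ ")."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    # ndiff also emits unchanged lines ("  ") and hint lines ("? "); drop those.
    return [line for line in diff if line.startswith(("- ", "+ "))]


old_page = "Headline\nAd: buy shoes\nArticle body"
new_page = "Headline\nAd: buy hats\nArticle body"
for line in changed_lines(old_page, new_page):
    print(line)
```

With the changes listed line by line, you can decide whether a difference is in a part you care about or just a rotating ad.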
It's a tricky problem to solve, but a good one if you can solve it well!
Hope this helps,
January 30th, 2005, 02:51 PM
You might be interested in this command line utility; it is written in Python and seems to work much as I suggested above.