March 23rd, 2014, 04:38 PM
Design My Own Web Archiving Software
I am investigating the possibility of designing my own program to solve a rather esoteric problem for which I have not found a solution. I am wondering whether it will be possible, and whether anyone can validate my idea before I go through the effort of working on it.
I am an avid web browser; I view on average 300 unique webpages per day. I also keep a LOT of tabs open across dozens of windows in 4 different browsers to keep track of things I don't want to forget but don't want archived yet - my workflow requires them to stay 'in front of me'. This often results in the browsers crashing, a problem I have tried in vain to fix but mostly just tolerate, relying on addons like Session Manager to help me recover when they crash.
1. Related to this problem is that I want to be able to archive entire webpages so that if they are taken down at some point, I have a copy of them - this has happened a LOT to me in the past.
2. Also related is that I want to have a complete archived HISTORY record of every webpage/URL I have ever visited.
3. These 2 problems currently have only partial solutions, despite my extensive research. Browsers do not store a very full history (I have always found that pages are missing; I suspect it depends on whether they were accessed through cross-links, in web applets, etc., etc.). Saving webpages is of course easily possible, but it requires many clicks, the format of the page is often destroyed when saved as HTML, and the links frequently do not work.
The BEST solution to this bookmarking/history/web-archive problem I have found so far, after MUCH work, is PINBOARD.IN. This service lets you one-click a page you're on, and it saves the link like a bookmark, BUT ALSO CRAWLS THE PAGE and archives it for you on its cloud storage (you pay for this component).
I like it, but it lacks two components. First, you have to click the bookmarklet every time you're on a page (this is annoying and/or I forget, and I don't want to have to click it on EVERY page I EVER visit). Second, it doesn't keep a record of EVERY page I visit, which is what I want.
I want to build a program (desktop or browser-based, I don't care which) that:
1. Extracts a COMPLETE URL history of every single URL visited on the computer - people have told me I can get this from my router with some sort of URL logging.
2. I then want a crawler/spider program to comb through every page in THAT URL record (every page my computer visits) and ARCHIVE a complete working copy of the page with the formatting, links, etc. maintained. I don't care how much space this takes up. I do NOT need to save entire websites - just the webpages visited.
3. ADDITIONAL FEATURES: I would like some ability to search/organize this archive of complete webpages. Another would be the ability to 'map' or visualize the 'breadcrumb pathway' by which I browsed to each page (i.e. did I get there via a link from a Google search, a Wikipedia link, or... etc.)
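Features 1 and 3 may already be partly solvable from the browser side rather than the router. Firefox, for instance, keeps its history in a SQLite file (places.sqlite in the profile directory), where the moz_historyvisits table has a from_visit column linking each visit to the visit it came from - which is essentially the 'breadcrumb pathway'. A rough Python sketch of walking that chain, demonstrated against a tiny in-memory copy of the schema (the real table has more columns than shown here):

```python
import sqlite3

def visit_chain(conn, visit_id):
    """Follow from_visit links back to the start of a browsing trail.
    Returns the URLs in the order they were visited."""
    chain = []
    while visit_id:  # from_visit is 0 for the first visit in a trail
        row = conn.execute(
            """SELECT p.url, v.from_visit
               FROM moz_historyvisits v
               JOIN moz_places p ON p.id = v.place_id
               WHERE v.id = ?""",
            (visit_id,),
        ).fetchone()
        if row is None:
            break
        chain.append(row[0])
        visit_id = row[1]
    return list(reversed(chain))

# Demo against an in-memory database mimicking the Firefox schema.
# Against a real profile you would instead open something like:
#   sqlite3.connect("/path/to/profile/places.sqlite")
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE moz_places (id INTEGER PRIMARY KEY, url TEXT);
    CREATE TABLE moz_historyvisits (
        id INTEGER PRIMARY KEY, from_visit INTEGER, place_id INTEGER);
    INSERT INTO moz_places VALUES
        (1, 'https://www.google.com/search?q=web+archiving'),
        (2, 'https://en.wikipedia.org/wiki/Web_archiving'),
        (3, 'https://pinboard.in/tour/');
    -- visit 3 came from visit 2, which came from visit 1
    INSERT INTO moz_historyvisits VALUES (1, 0, 1), (2, 1, 2), (3, 2, 3);
""")

print(visit_chain(conn, 3))
```

With the demo data this prints the Google search, then the Wikipedia page, then the Pinboard page - the path by which visit 3 was reached. Note that even Firefox's own database will miss some pages (private windows, other browsers), so this is a complement to, not a replacement for, router-level logging.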
Given that virtually all of these things are possible MANUALLY - I assume all of this is eminently possible given enough effort.
I have minimal (pathetic) programming experience, and have only written Perl scripts for bioinformatics work, but I am willing to learn whatever I need to learn to do this. I have similarly minimal knowledge of Python.
------Does anyone have any commentary, ideas, suggestions, or helpful advice that might aid me in this quest?
Thanks in advance for any replies.
March 21st, 2016, 06:56 AM
I'm not sure if you are asking for a component/addon that does this, or if you want to build the component/addon yourself?
I don't know of any addons that do this for you.
But if you want to build it yourself, you should write an addon for the browser you are working in.
You will need access to a lot of storage space, just FYI. Then, every time you open a URL in the browser, the addon simply saves a copy of the displayed data locally on your machine.
This has the disadvantage, especially if the page is taken down, that all the other information the site offered will be gone; only the precise pages you visited, with the information they displayed at the time, will be available.
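The save-a-copy step above could be sketched as follows, assuming you already have the page's HTML in hand (an addon would hand it to you; a standalone script could fetch it with urllib.request). The web_archive directory and the timestamp-plus-URL filename scheme are just illustrative choices, one simple way to avoid collisions between snapshots:

```python
import re
import time
from pathlib import Path

ARCHIVE_DIR = Path("web_archive")  # assumed location; pick anything

def archive_page(url, html, when=None):
    """Write one snapshot of a visited page to local disk.
    Returns the path of the saved file."""
    when = when or time.time()
    # Turn the URL into a filesystem-safe name, capped at 150 chars.
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", url).strip("_")[:150]
    ARCHIVE_DIR.mkdir(parents=True, exist_ok=True)
    path = ARCHIVE_DIR / f"{int(when)}_{safe}.html"
    path.write_text(html, encoding="utf-8")
    return path

saved = archive_page("https://example.com/article?id=42",
                     "<html><body>hello</body></html>")
print(saved)
```

This only captures the HTML, not images, stylesheets, or scripts; preserving those (so the saved page keeps its formatting) is the genuinely hard part of the original request and is what dedicated tools like wget's page-requisites mode or WARC-based archivers are built for.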
You could also develop a crawler that starts from the base URL of every site you visit and crawls through every page reachable from that site. This will store a lot more data, much of it likely redundant, and you are still left with a static snapshot from the time you actually viewed the page, but you will gain some of the site's functionality should you wish to view pages you didn't actively view the first time you went there.
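A toy version of that base-URL crawler might look like this. The fetch function is deliberately pluggable so the logic can be shown without network access; in practice you would pass in something that does a real HTTP GET (e.g. via urllib.request). Restricting links to the starting host and capping the page count keeps it from wandering across the whole web:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkParser(HTMLParser):
    """Collect the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl_site(start_url, fetch, max_pages=50):
    """Breadth-first crawl of one site. `fetch(url)` returns HTML or None.
    Returns a dict of {url: html} for every page reached."""
    host = urlparse(start_url).netloc
    seen, queue, pages = {start_url}, [start_url], {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages

# Demo with a fake two-page site instead of real HTTP requests.
fake_site = {
    "https://example.com/":
        '<a href="/about">about</a> <a href="https://other.com/">x</a>',
    "https://example.com/about": '<a href="/">home</a>',
}
pages = crawl_site("https://example.com/", fake_site.get)
print(sorted(pages))
```

In the demo the off-site link to other.com is skipped and both example.com pages are archived exactly once. A real deployment would also want politeness features this sketch omits: respecting robots.txt, rate limiting, and deduplicating URLs that differ only in fragments or tracking parameters.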