December 4th, 2012, 05:30 PM
Web Crawler - Initial Design Question
I am an old programmer now making my return to DevShed - old habits die hard.
I used to work with C++ under Linux but haven't programmed in a few years since moving more into design and management. However, I have a business idea I want to try out.
I am looking to crawl a specific website and all its sub-domains, search it for keywords, count the occurrences of each keyword under the domain, and store the result in a database.
For example, consider IBM's portal (a massive website): I want to check how many webpages contain the word "ThinkPad".
I have no idea where to start. Should I be looking at things like GNU Wget, Abot, or something else? Or am I looking at writing a search engine? When you enter a word in Google, it tells you the number of results and the time taken, like "2,999 in 0.003 seconds".
In simple terms - like running grep from the command line on a list of files and piping the output into wc (word count) - except I want to run it on a website and all its sub-domains and files. I would like to be able to define my criteria in an XML file or rules file - something I can enhance and manipulate over time.
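To make the grep-and-wc idea with a rules file concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the XML layout (`<rules>` and `<keyword>`), the function names, and the sample pages, which stand in for text that has already been downloaded.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rules format; the element names are made up, not a standard.
RULES_XML = """
<rules>
  <keyword>ThinkPad</keyword>
  <keyword>Lenovo</keyword>
</rules>
"""

def load_keywords(xml_text):
    """Return the keywords listed in the rules XML."""
    root = ET.fromstring(xml_text.strip())
    return [el.text.strip() for el in root.findall("keyword") if el.text]

def count_pages_with_keyword(pages, keyword):
    """Count pages whose text contains the keyword as a whole word,
    ignoring case (roughly `grep -liw keyword files | wc -l`)."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return sum(1 for text in pages.values() if pattern.search(text))

# Made-up sample data: URL -> page text.
SAMPLE_PAGES = {
    "http://example.com/a": "The ThinkPad line of laptops.",
    "http://example.com/b": "Servers and mainframes.",
    "http://example.com/c": "thinkpad accessories and ThinkPad docks.",
}

if __name__ == "__main__":
    for kw in load_keywords(RULES_XML):
        print(kw, count_pages_with_keyword(SAMPLE_PAGES, kw))
```

Keeping the keywords in a separate file means the criteria can grow over time without touching the counting code.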
Where should I start?
Last edited by cbf28; December 4th, 2012 at 05:36 PM.
Reason: More details
December 4th, 2012, 07:14 PM
If the data you need is just a word count for each unique word under a domain, you could take your scripting or programming language of choice and write a simple spider that crawls the site and creates a database table with the columns: word | domain | count
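A minimal sketch of such a spider, using only the Python standard library and SQLite, might look like the following. The names (`crawl`, `in_scope`, `word_counts`), the `max_pages` limit, and the choice to count total occurrences rather than pages are all assumptions; a real crawler would also want robots.txt handling and polite delays between requests.

```python
import re
import sqlite3
from collections import Counter, deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen

class PageParser(HTMLParser):
    """Collects link targets and visible text from one HTML page."""
    def __init__(self):
        super().__init__()
        self.links, self.chunks, self._skip = [], [], 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:  # ignore text inside <script> and <style>
            self.chunks.append(data)

def fetch(url):
    """Download a page as text (replaceable, e.g. for testing)."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def in_scope(url, root_domain):
    """True for the root domain and any of its sub-domains."""
    host = urlparse(url).hostname or ""
    return host == root_domain or host.endswith("." + root_domain)

def crawl(start_url, root_domain, db, fetch=fetch, max_pages=100):
    """Breadth-first crawl; writes word | domain | count rows to db."""
    db.execute("CREATE TABLE IF NOT EXISTS word_counts ("
               "word TEXT, domain TEXT, count INTEGER, "
               "PRIMARY KEY (word, domain))")
    queue, seen = deque([start_url]), {start_url}
    totals, crawled = Counter(), 0
    while queue and crawled < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to download
        crawled += 1
        parser = PageParser()
        parser.feed(html)
        domain = urlparse(url).hostname
        text = " ".join(parser.chunks).lower()
        for word in re.findall(r"[a-z0-9]+", text):
            totals[(word, domain)] += 1
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url  # absolute, no #fragment
            if link not in seen and in_scope(link, root_domain):
                seen.add(link)
                queue.append(link)
    db.executemany("INSERT OR REPLACE INTO word_counts VALUES (?, ?, ?)",
                   [(w, d, n) for (w, d), n in totals.items()])
    db.commit()
    return crawled
```

Something like `crawl("http://www.example.com/", "example.com", sqlite3.connect("index.db"))` would then fill the table, after which `SELECT domain, count FROM word_counts WHERE word = 'thinkpad'` gives the per-domain numbers.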
You'd need a secondary table as well to keep track of which pages you've already crawled and possibly which words those pages contain if you want the ability to selectively update the index.
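One possible shape for that bookkeeping, again as a hedged sketch with SQLite: the table names (`pages`, `page_words`), the use of a content hash to detect changes, and the helper functions are illustrative assumptions, not a fixed design.

```python
import hashlib
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS pages (
    url          TEXT PRIMARY KEY,
    content_hash TEXT NOT NULL,
    crawled_at   TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS page_words (
    url   TEXT NOT NULL,
    word  TEXT NOT NULL,
    count INTEGER NOT NULL,
    PRIMARY KEY (url, word)
);
"""

def open_index(path=":memory:"):
    """Open (or create) the index database and make sure the tables exist."""
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db

def record_page(db, url, text, word_counts):
    """Store one page's words. Returns False, and changes nothing,
    if the page content is identical to the last crawl."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    row = db.execute("SELECT content_hash FROM pages WHERE url = ?",
                     (url,)).fetchone()
    if row and row[0] == digest:
        return False
    # Page is new or changed: replace its old word list with the new one.
    db.execute("DELETE FROM page_words WHERE url = ?", (url,))
    db.execute("INSERT OR REPLACE INTO pages (url, content_hash) VALUES (?, ?)",
               (url, digest))
    db.executemany("INSERT INTO page_words VALUES (?, ?, ?)",
                   [(url, w, n) for w, n in word_counts.items()])
    db.commit()
    return True

def pages_containing(db, word):
    """Number of crawled pages that contain the given word."""
    return db.execute("SELECT COUNT(*) FROM page_words WHERE word = ?",
                      (word,)).fetchone()[0]
```

Storing words per page is what makes a selective update possible: when one page changes, only its rows are rewritten, and per-domain totals can be recomputed from `page_words` instead of re-crawling the whole site.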
If you want to pull down an entire mirror of the site, you could use something like HTTrack.
There is a massive difference between this and what Google does, though. If you're trying to write a search engine and not just get a count for individual words, then you should look into something like Apache Solr as a back-end, combined with a custom crawler to build the index.