
December 4th, 2012, 04:30 PM
Registered User
Join Date: Dec 2012
Posts: 1
Web Crawler - Initial Design Question
Hi all,
I am an old programmer now making my return to DevShed - old habits die hard.
I used to work with C++ under Linux but haven't programmed in a few years since moving more into design and management. However, I have a business idea I want to try out.
I am looking to crawl a specific website and all its sub-domains, search the pages for keywords, count the occurrences of each keyword under the domain, and store the results in a database.
For example, consider IBM's portal (a massive website): I want to check how many web pages contain the word "ThinkPad".
I have no idea where to start. Should I be looking at tools like GNU Wget or Abot, or something else? Or am I looking at writing a search engine? When you enter a word in Google, it tells you the number of results and the time taken, like "2,999 in 0.003 seconds".
In simple terms, it is like running grep on a list of files from the command line and piping the output into wc (word count), except I want to run it on a website and all its sub-domains and files. I would like to be able to define my criteria in an XML file or rules file, something I can enhance and manipulate over time.
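To make that a bit more concrete, here is a rough Python sketch of the counting step only. It assumes the pages have already been fetched into memory somehow (the crawling itself is the part I am asking about). The rules format, the function names, the example.com URLs and the second keyword are all made up for illustration; this is not a real design, just the shape of what I am after.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rules file: the element names are invented for this sketch.
RULES_XML = """
<rules>
  <keyword>ThinkPad</keyword>
  <keyword>server</keyword>
</rules>
"""


def load_keywords(xml_text):
    """Read the list of keywords to search for from the rules XML."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.findall("keyword") if el.text]


def count_pages_with_keyword(pages, keyword):
    """Count the pages that contain the keyword as a whole word, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return sum(1 for text in pages.values() if pattern.search(text))


def count_all(pages, xml_text):
    """Map each keyword from the rules to the number of pages containing it."""
    return {kw: count_pages_with_keyword(pages, kw) for kw in load_keywords(xml_text)}


if __name__ == "__main__":
    # Stand-in for crawled content: URL -> page text (placeholder data).
    pages = {
        "http://example.com/a": "The ThinkPad T430 is here",
        "http://example.com/b": "Nothing relevant here",
        "http://example.com/c": "thinkpad and server specs",
    }
    for keyword, page_count in count_all(pages, RULES_XML).items():
        print(keyword, page_count)
```

The resulting keyword/count pairs are what I would then store in the database, and the rules file is the piece I would expect to grow over time.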
Where should I start?
Thanks,
cbf28
Last edited by cbf28 : December 4th, 2012 at 04:36 PM.
Reason: More details