#1
    Registered User, Devshed Newbie (0 - 499 posts)
    Join Date: Dec 2012 | Posts: 1 | Rep Power: 0

    Web Crawler - Initial Design Question


    Hi all,

    I am an old programmer now making my return to DevShed - old habits die hard

    I used to work with C++ under Linux, but I haven't programmed in a few years since moving more into design and management. However, I have a business idea I want to try out.

    I am looking to crawl a specific website and all of its sub-domains, trawl it for keywords, count the occurrences of each keyword under the domain, and store the results in a database.

    For example, consider IBM's portal (a massive website): I want to check how many web pages contain the word "ThinkPad".

    I have no idea where to start. Should I be looking at things like GNU Wget or Abot, or am I effectively writing a search engine? When you enter a word in Google, it tells you the number of results and the time taken, like "2,999 results in 0.003 seconds".

    In simple terms, it's like running grep over a list of files on the command line and piping the output into wc (word count), except I want to run it over a website and all of its sub-domains and files. I would also like to be able to define my criteria in an XML or rules file, something I can enhance and manipulate over time.
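
    To illustrate what I mean by a rules file, I'm imagining something I could load and loop over, roughly like this (the format and the names here are completely made up, just to show the idea):

    Code:
    # Made-up rules format -- only to illustrate the idea, not a real spec
    import xml.etree.ElementTree as ET

    RULES = """
    <rules>
        <site domain="ibm.com">
            <keyword>ThinkPad</keyword>
            <keyword>ThinkCentre</keyword>
        </site>
    </rules>
    """

    for site in ET.fromstring(RULES).findall("site"):
        domain = site.get("domain")
        for kw in site.findall("keyword"):
            print("count pages under", domain, "containing", kw.text)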

    Where should I start?

    Thanks,

    cbf28
    Last edited by cbf28; December 4th, 2012 at 04:36 PM. Reason: More details
#2
    Lost in code, Devshed Supreme Being (6500+ posts)
    Join Date: Dec 2004 | Posts: 8,301 | Rep Power: 7170
    If the data you need is just a word count for each unique word under a domain, you could take your scripting/programming language of choice and write a simple spider that crawls the site and builds a database table with: word | domain | count

    You'd need a secondary table as well to keep track of which pages you've already crawled and possibly which words those pages contain if you want the ability to selectively update the index.
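
    As a very rough sketch of that design (Python 3 and SQLite here just because they're handy; the table and column names are made up, it's single-threaded, and there's no robots.txt or politeness handling):

    Code:
    # Rough sketch: count occurrences of one keyword across a site and record
    # which pages have been crawled. Not production quality.
    import re
    import sqlite3
    import urllib.request
    from urllib.parse import urljoin, urlparse

    START_URL = "http://example.com/"      # placeholder start page
    DOMAIN = urlparse(START_URL).netloc    # only follow links on this host or its sub-domains
    KEYWORD = "thinkpad"                   # word to count

    db = sqlite3.connect("crawl.db")
    db.execute("CREATE TABLE IF NOT EXISTS word_counts (word TEXT, domain TEXT, count INTEGER)")
    db.execute("CREATE TABLE IF NOT EXISTS crawled_pages (url TEXT PRIMARY KEY)")

    queue, total = [START_URL], 0
    while queue:
        url = queue.pop()
        # skip pages we've already indexed (the "secondary table")
        if db.execute("SELECT 1 FROM crawled_pages WHERE url = ?", (url,)).fetchone():
            continue
        try:
            html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        except Exception:
            continue
        db.execute("INSERT INTO crawled_pages (url) VALUES (?)", (url,))
        text = re.sub(r"<[^>]+>", " ", html)   # crude tag stripping
        total += len(re.findall(r"\b%s\b" % re.escape(KEYWORD), text, re.IGNORECASE))
        # queue links that stay under the same domain (roughly)
        for link in re.findall(r'href="([^"#]+)"', html):
            link = urljoin(url, link)
            if urlparse(link).netloc.endswith(DOMAIN):
                queue.append(link)

    db.execute("INSERT INTO word_counts (word, domain, count) VALUES (?, ?, ?)",
               (KEYWORD, DOMAIN, total))
    db.commit()
    print(KEYWORD, DOMAIN, total)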

    If you want to pull down an entire mirror of the site, you could use something like HTTrack.

    There is a massive difference between this and what Google does, though. If you're trying to write a search engine, and not just get a word count for individual words, then you should look into something like Apache Solr as a back-end, combined with a custom crawler to build the index.
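
    In that case the crawler just pushes each page into Solr instead of into your own tables. As a very rough sketch (the host, core name, and field names below are placeholders, and the exact update URL depends on your Solr version):

    Code:
    # Placeholder sketch: push one crawled page into a local Solr core named "pages"
    import json
    import urllib.request

    doc = {"id": "http://example.com/thinkpad.html",
           "body_txt": "page text extracted by the crawler goes here"}

    req = urllib.request.Request(
        "http://localhost:8983/solr/pages/update?commit=true",  # adjust core name / URL
        data=json.dumps([doc]).encode("utf-8"),
        headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)
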
    PHP FAQ

    Originally Posted by Spad
    Ah USB, the only rectangular connector where you have to make 3 attempts before you get it the right way around
