Software Design
 
Forums: » Register « |  User CP |  Games |  Calendar |  Members |  FAQs |  Sitemap |  Support | 
User Name:
Password:
Remember me

The Shed is going Social! Join us on FaceBook and Twitter and chime in on the conversation.

Go Back   Dev Shed ForumsProgramming Languages - MoreSoftware Design

Reply
Add This Thread To:
  Del.icio.us   Digg   Google   Spurl   Blink   Furl   Simpy   Y! MyWeb 
Thread Tools Search this Thread Rate Thread Display Modes
 
Unread Dev Shed Forums Sponsor:
  #1  
Old December 4th, 2012, 04:30 PM
cbf28 cbf28 is offline
Registered User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Dec 2012
Posts: 1 cbf28 User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: 35 m 12 sec
Reputation Power: 0
Web Crawler - Initial Design Question

Hi all,

I am an old programmer now making my return to DevShed - old habits die hard

I used to work with C++ under Linux but haven't programmed in a few years since moving more to design and management. However, have a business idea I want to try out.

I am looking to crawl a specific website and all its sub-domains and trawl it for keywords, and do a count of the occurence of that word under the domain, and store the result in a database.

For example , consider IBM's portal (a massive website), and I want to check how many webpages have the word "ThinkPad".

I have no idea where to start. Should I be looking at things like GNU Wget , Abot or what? Or, am I looking at writing a search engine? When you enter a word in google - it tells you the number of results and the time like "2,999 in 0.003second".

In simpleton terms - like running a command line to do a grep on a list of files and piping it into a wc (word count) - except i want to run it on a website and all its domains and files. I would like to be able to define my criteria in an XML file or rules file - something i can enhance and manipulate over time.

Where should I start?

Thanks,

cbf28

Last edited by cbf28 : December 4th, 2012 at 04:36 PM. Reason: More details

Reply With Quote
  #2  
Old December 4th, 2012, 06:14 PM
E-Oreo's Avatar
E-Oreo E-Oreo is offline
Lost in code
Dev Shed God 7th Plane (8000 - 8499 posts)
 
Join Date: Dec 2004
Posts: 8,063 E-Oreo User rank is General 92nd Grade (Above 100000 Reputation Level)E-Oreo User rank is General 92nd Grade (Above 100000 Reputation Level)E-Oreo User rank is General 92nd Grade (Above 100000 Reputation Level)E-Oreo User rank is General 92nd Grade (Above 100000 Reputation Level)E-Oreo User rank is General 92nd Grade (Above 100000 Reputation Level)E-Oreo User rank is General 92nd Grade (Above 100000 Reputation Level)E-Oreo User rank is General 92nd Grade (Above 100000 Reputation Level)E-Oreo User rank is General 92nd Grade (Above 100000 Reputation Level)E-Oreo User rank is General 92nd Grade (Above 100000 Reputation Level)E-Oreo User rank is General 92nd Grade (Above 100000 Reputation Level)E-Oreo User rank is General 92nd Grade (Above 100000 Reputation Level)E-Oreo User rank is General 92nd Grade (Above 100000 Reputation Level)E-Oreo User rank is General 92nd Grade (Above 100000 Reputation Level)E-Oreo User rank is General 92nd Grade (Above 100000 Reputation Level)E-Oreo User rank is General 92nd Grade (Above 100000 Reputation Level)E-Oreo User rank is General 92nd Grade (Above 100000 Reputation Level)  Folding Points: 945 Folding Title: Novice Folder
Time spent in forums: 2 Months 1 Day 6 h 49 m 31 sec
Reputation Power: 7104
If the data you need is just a word count for each unique word under a domain, you could take your scripting / programming languages of choice and write a simple spider to crawls the site and creates a database table with: word | domain | count

You'd need a secondary table as well to keep track of which pages you've already crawled and possibly which words those pages contain if you want the ability to selectively update the index.

If you want to pull down an entire mirror of the site you could use something like httrack.

There is a massive difference between this and what Google does though. If you're trying to write a search engine and not just get a word count for individual words then you should look into something like SOLR as a back-end combined with a custom crawler to build the index.
__________________
PHP FAQ
How to program a basic, secure login system using PHP
Connect with me on LinkedIn


Quote:
Originally Posted by Spad
Ah USB, the only rectangular connector where you have to make 3 attempts before you get it the right way around

Reply With Quote
Reply

Viewing: Dev Shed ForumsProgramming Languages - MoreSoftware Design > Web Crawler - Initial Design Question

Developer Shed Advertisers and Affiliates



Thread Tools  Search this Thread 
Search this Thread:

Advanced Search
Display Modes  Rate This Thread 
Rate This Thread:


Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
View Your Warnings | New Posts | Latest News | Latest Threads | Shoutbox
Forum Jump

Forums: » Register « |  User CP |  Games |  Calendar |  Members |  FAQs |  Sitemap |  Support | 
  
 


Powered by: vBulletin Version 3.0.5
Copyright ©2000 - 2013, Jelsoft Enterprises Ltd.

© 2003-2013 by Developer Shed. All rights reserved. DS Cluster - Follow our Sitemap