Dev Shed Lounge
 
#1
June 11th, 2001, 04:17 PM
Karate_Chick
Internet Spiders

Does anyone know how an Internet spider works? I understand the general concept: the spider goes out and "crawls" Internet sites, looking at meta tags, HTML code, and things of that nature. I know that it follows links from one page to another, depending on how deep the spider software is told to search. My question is, how does it know where to search? How does it actually find the pages?

#2
June 14th, 2001, 10:33 AM
Pressly
My Best Guess

Since URLs are just convenient handles that resolve to numeric IP addresses, I would guess the spider starts with a valid IP address and incrementally steps through the sequence of possible addresses until it finds a page, then parses the text for more URLs to follow. When it has seen all it can see, it starts again at the next IP address in the sequence.

#3
June 14th, 2001, 10:47 AM
rod k
That sounds efficient on the surface, Pressly, but it's not. What if no machine sits at that IP? What if (like my home computer) I have a firewall that won't respond to an unapproved request? Then your process hangs there waiting for a response until it times out.

Also, many firewall systems will treat that as a hack attempt.

Not to mention that the majority of IPs do NOT have an HTTP server.

All this leads up to your server wasting a lot of resources doing nothing.
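
To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The 5-second timeout and the 1% share of addresses running a web server are invented assumptions for illustration, not measurements; only the 2^32 size of the IPv4 address space is a hard fact. It also pessimistically treats every address without a web server as a full timeout.

```python
# Rough cost of probing every IPv4 address one at a time.
# ASSUMPTIONS (illustrative only): the timeout and the HTTP share are invented.

IPV4_ADDRESSES = 2 ** 32       # size of the IPv4 address space
TIMEOUT_SECONDS = 5            # assumed wait before giving up on a silent host
ASSUMED_HTTP_SHARE = 0.01      # assumed fraction of addresses answering HTTP


def wasted_seconds(addresses, timeout, http_share):
    """Seconds spent waiting on addresses that never answer."""
    silent = addresses * (1 - http_share)
    return silent * timeout


seconds = wasted_seconds(IPV4_ADDRESSES, TIMEOUT_SECONDS, ASSUMED_HTTP_SHARE)
years = seconds / (60 * 60 * 24 * 365)
print(f"about {years:.0f} years of waiting for a single sequential scanner")
```

Under those assumptions a single sequential scanner would spend centuries just waiting on timeouts. Running many probes in parallel shrinks the wall-clock time, but the wasted requests are still there.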

KC,

As well as following internal website links, spiders will also harvest external links and follow those. Once it gets started, a spider could theoretically crawl forever, following links from site to site.

Of course, you'll have to give it a list of URLs to start with; try to pick link-rich sites.
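
The seed-list idea can be sketched roughly like this in Python. This is only an illustration, not how any particular search engine does it. The names `crawl`, `LinkExtractor`, `fetch`, and `FAKE_WEB` are made up for the example, and the "web" is a small in-memory dictionary so it runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl starting from a list of seed URLs.

    `fetch` is a hypothetical callable: URL -> HTML string, or None on failure.
    """
    frontier = deque(seeds)   # URLs waiting to be visited
    seen = set(seeds)         # URLs already queued, so nothing is queued twice
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue          # unreachable page: skip it and move on
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)   # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return visited


# Tiny in-memory "web" (hypothetical pages) standing in for real HTTP fetches.
FAKE_WEB = {
    "http://a.example/": '<a href="/about">About</a> <a href="http://b.example/">B</a>',
    "http://a.example/about": '<a href="/">Home</a>',
    "http://b.example/": '<a href="http://a.example/">A</a> <a href="http://c.example/">C</a>',
}

order = crawl(["http://a.example/"], FAKE_WEB.get)
print(order)
```

Starting from one seed, the spider reaches b.example only because a.example links to it, and it never finds a page that nothing links to. In a real spider, `fetch` would make an HTTP request instead of a dictionary lookup, and `max_pages` is the kind of limit that keeps it from crawling forever.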

#4
June 16th, 2001, 10:27 PM
bumperbox
Try this site; it has some good info on spiders:
http://www.robotstxt.org/wc/robots.html
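
Among other things, that site covers robots.txt, the file a site uses to tell spiders which paths to stay out of. Here is a minimal sketch of checking it with Python's standard `urllib.robotparser` module. The robots.txt content, the `ExampleSpider` agent name, and the URLs are all made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real spider would download this
# from the site's /robots.txt before crawling it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())


def allowed(url, agent="ExampleSpider"):
    """True if the parsed rules let this (hypothetical) agent fetch the URL."""
    return rules.can_fetch(agent, url)


print(allowed("http://www.example.com/index.html"))
print(allowed("http://www.example.com/private/notes.html"))
```

A polite spider would run a check like this before fetching each URL it has queued.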

#5
June 18th, 2001, 07:34 AM
Karate_Chick

Thanks for the site, bumperbox.

© 2003-2008 by Developer Shed. All rights reserved. DS Cluster 4 hosted by Hostway