Dev Shed Lounge
 
  #1  
June 11th, 2001, 04:17 PM
Karate_Chick
Internet Spiders

Does anyone know how an Internet spider works? I understand the general concept: the spider goes out and "crawls" Internet sites, looking for meta tags, HTML code, and things of that nature. I know that it follows links from one page to another, depending on how deep the spider software is told to search. My question is: how does it know where to search? How does it actually find the pages?

  #2  
June 14th, 2001, 10:33 AM
Pressly
My Best Guess

Since URLs are just convenient handles that resolve to numeric IP addresses, I would guess the spider starts with a valid IP number and incrementally steps through the sequence of possible addresses until it finds a page, then parses the text for more URLs to follow. When it's seen all it can see, it can start back at the next IP number in the sequence.

  #3  
June 14th, 2001, 10:47 AM
rod k
That sounds efficient on the surface, Pressly, but it's not. What if no machine sits at that IP? What if (like my home computer) I have a firewall that won't respond to an unapproved request? Then your process hangs there waiting for a response until it times out.

Also, many firewall systems will treat that as a hack attempt.

Not to mention that the majority of IPs do NOT have an HTTP server.

All this leads up to your server wasting a lot of resources doing nothing.
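
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes a single sequential scanner, the full IPv4 address space, and a hypothetical 5-second timeout on every address that does not answer, so it is a worst-case upper bound rather than a measurement.

```python
# Worst-case cost of stepping through every IPv4 address one at a time.
# TIMEOUT_SECONDS is an assumed value, not taken from any real spider.
TOTAL_IPV4_ADDRESSES = 2 ** 32
TIMEOUT_SECONDS = 5
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

worst_case_seconds = TOTAL_IPV4_ADDRESSES * TIMEOUT_SECONDS
worst_case_years = worst_case_seconds / SECONDS_PER_YEAR

print(f"{TOTAL_IPV4_ADDRESSES:,} addresses")
print(f"about {worst_case_years:.0f} years if every probe timed out")
```

Running probes in parallel would shrink the wall-clock time, but the wasted requests stay the same.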

KC,

Besides following a site's internal links, spiders also harvest external links and follow those. Once it gets started, a spider could theoretically crawl forever, following links from site to site.

Of course, you'll have to give it a list of URLs to start from; try to pick link-rich sites.
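
A minimal sketch of that idea in Python, assuming a breadth-first crawl from a seed list. The names `LinkExtractor`, `crawl`, `fetch`, and `FAKE_WEB` are made up for illustration, and the pages live in an in-memory dictionary so the example runs without touching the network; a real spider would download each page over HTTP instead.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl starting from the seed URLs.

    `fetch` is a caller-supplied (hypothetical) function that takes a URL
    and returns the page's HTML as a string, or None if it is unavailable.
    """
    queue = deque(seeds)
    seen = set(seeds)      # every URL ever queued, so nothing is fetched twice
    visited = []           # URLs actually retrieved, in crawl order
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute.startswith(("http://", "https://")) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Tiny in-memory "web" standing in for real sites.
FAKE_WEB = {
    "http://a.example/": '<a href="/about">About</a> <a href="http://b.example/">B</a>',
    "http://a.example/about": '<a href="/">Home</a>',
    "http://b.example/": '<a href="http://a.example/">A</a> <a href="http://c.example/">C</a>',
}

order = crawl(["http://a.example/"], FAKE_WEB.get)
print(order)
```

The `seen` set is what keeps the spider from looping forever between pages that link to each other, and `max_pages` is one simple way to cap how far it goes.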

  #4  
June 16th, 2001, 10:27 PM
bumperbox
Try this site; it has some good info on spiders:
http://www.robotstxt.org/wc/robots.html
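
That site covers the robots.txt convention, which lets a site tell spiders which paths to stay out of. As a rough sketch of how a spider might honor it, Python's standard `urllib.robotparser` module can parse the rules and answer "may I fetch this URL?". The rules, the `ExampleSpider` user-agent name, and the example.com URLs below are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt; a real spider would download it from the site root.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

allowed = parser.can_fetch("ExampleSpider", "http://www.example.com/index.html")
blocked = parser.can_fetch("ExampleSpider", "http://www.example.com/private/data.html")

print(allowed, blocked)
```

A polite spider would run a check like this before requesting each page and skip anything the site has disallowed.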

  #5  
June 18th, 2001, 07:34 AM
Karate_Chick

Thanks for the site, bumperbox.
