#1
  1. No Profile Picture
    Junior Member
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2000
    Posts
    12
    Rep Power
    0

    Internet Spiders


    Does anyone know how an Internet spider works? I understand the general concept: the spider goes out and "crawls" Internet sites, looking at meta tags, HTML code, and things of that nature. I know that it follows links from one page to another, to whatever depth the spider software is told to search. My question is: how does it know where to search? How does it actually find the pages?
  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    May 2001
    Posts
    48
    Rep Power
    14

    My Best Guess


    Since URLs are just convenient handles that resolve to numeric IP addresses, I would guess the spider starts with a valid IP number and incrementally steps through the sequence of possible addresses until it finds a page, then parses the text for more URLs to follow. When it has seen all it can see, it can start again at the next IP number in the sequence.
  4. #3
  5. No Profile Picture
    Apprentice Deity
    Devshed Loyal (3000 - 3499 posts)

    Join Date
    Jul 1999
    Location
    Niagara Falls (On the wrong side of the gorge)
    Posts
    3,237
    Rep Power
    19
    That sounds efficient on the surface, Pressly, but it's not. What if no machine sits at that IP? What if (like my home computer) I have a firewall that won't respond to an unapproved request? Then your process hangs there waiting for a response until it times out.

    Also, many firewall systems will treat that as a hack attempt.

    Not to mention that the majority of IPs do NOT have an HTTP server.

    All this leads up to your server wasting a lot of resources doing nothing.
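
    To put a rough number on that waste, here is a back-of-the-envelope sketch. The 5-second timeout and the 99% share of addresses that never answer are illustrative assumptions, not measurements, and `sweep_days` is a made-up helper name.

```python
# Rough cost of probing every IPv4 address in sequence.
# The timeout and the share of silent addresses are assumptions
# chosen only to show the order of magnitude.
TOTAL_IPV4 = 2 ** 32  # size of the whole IPv4 address space


def sweep_days(timeout_s: float, silent_fraction: float, workers: int = 1) -> float:
    """Days spent purely waiting on addresses that never answer."""
    wasted_seconds = TOTAL_IPV4 * silent_fraction * timeout_s
    return wasted_seconds / workers / 86_400  # 86,400 seconds per day


if __name__ == "__main__":
    print(f"{sweep_days(5.0, 0.99):,.0f} days with one connection at a time")
    print(f"{sweep_days(5.0, 0.99, workers=1000):,.0f} days with 1000 in parallel")
```

    Under those assumptions a single sequential sweep spends centuries waiting on timeouts, and even a thousand parallel connections still need most of a year.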

    KC,

    As well as following internal website links, spiders also harvest external links and follow those. Once it gets started, a spider could theoretically crawl forever, following links from site to site.

    Of course, you'll have to give it a list of URLs to start with. Try to pick link-rich sites.
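
    The seed-list approach can be sketched roughly as below. This is only a minimal illustration, not how any particular search engine does it: the names `crawl`, `fetch_html`, and `LinkExtractor` are made up for the example, and it assumes a simple breadth-first walk with a depth limit and a set of already-seen URLs.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_html(url):
    """Default fetcher: download a page, or return "" on common errors."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        return ""


def crawl(seeds, max_depth=2, fetch=fetch_html):
    """Breadth-first crawl from the seed URLs; returns URLs in visit order."""
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    queue = deque((url, 0) for url in seeds)
    seen = set(seeds)  # every URL ever queued, so nothing is fetched twice
    visited = []
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        visited.append(url)
        if depth >= max_depth:
            continue  # deep enough: keep the page but do not follow its links
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any "#fragment" part.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute.startswith(("http://", "https://")) and absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return visited
```

    Calling `crawl(["http://www.example.com/"], max_depth=1)` would fetch the seed page and every page it links to directly, internal or external. The `seen` set is what stops the spider from looping when two sites link to each other.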
  6. #4
  7. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Apr 2001
    Location
    Tauranga, NZ
    Posts
    349
    Rep Power
    14
    Try this site; it has some good info on spiders:
    http://www.robotstxt.org/wc/robots.html
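
    robotstxt.org documents the robots.txt exclusion convention, which a well-behaved spider checks before fetching a page. A small sketch using Python's standard `urllib.robotparser`; the robots.txt body, the "ExampleSpider" user agent, and the `allowed` helper are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; a real spider would download it
# from the root of the site it is about to crawl.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(url, user_agent="ExampleSpider", robots_txt=ROBOTS_TXT):
    """Return True if the given robots.txt rules permit fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(allowed("http://www.example.com/index.html"))
    print(allowed("http://www.example.com/private/data.html"))
```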
  8. #5
  9. No Profile Picture
    Junior Member
    Devshed Newbie (0 - 499 posts)

    Join Date
    Dec 2000
    Posts
    12
    Rep Power
    0

    Cool


    Thanks for the site, bumperbox.
