
January 18th, 2006, 01:54 AM
robots.txt Spider Crawler Question
Hello,
Now that my Apache is successfully serving more than one host via virtual hosting, I have been working on the new website. Today I manually submitted it to several search engines and, after reading several posts saying you should have a robots.txt, I created one.
While searching through my referer_log and access_log, I found a robots.txt request within a few hours. I still don't know how to really identify the bots yet (still researching), but I used my access_log and referer_log together, matching timestamps. Any tips on how to identify the bot's name?
The weird thing is that when I do a reverse lookup on the IP from the access_log, it does not match the domain widexl.com from the referer_log.
widexl is a CGI scripting company, while the IP address appears to resolve to a mortgage loan company. Virtual hosting may explain the shared IP, but why would two businesses be probing for my robots.txt file?
I can see the mortgage company trying to harvest email addresses for spam, but I do not use mailto links on my website and always use a server-side mail form.
Does anyone have any ideas on this? Does this look out of place?
Also, how would these companies get my domain information so quickly after I submitted to search engines? I did use a couple of third-party sites that submit to about 40 engines at a time. Could one of these be the culprit?
My two logs are posted below.
Thanks
referer_log
[17/Jan/2006:16:50:17 -0700] - -> /robots.txt
[17/Jan/2006:16:50:17 -0700] http://www.widexl.com/cgi-bin/remotely/meta/meta.pl -> /
access_log
69.59.172.52 - - [17/Jan/2006:16:50:17 -0700] "GET /robots.txt HTTP/1.1" 404 297
69.59.172.52 - - [17/Jan/2006:16:50:17 -0700] "GET / HTTP/1.1" 200 6515
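For anyone wanting to repeat the timestamp matching without doing it by eye, here is a rough Python sketch of the idea. It assumes the two log formats are exactly as shown in this post; the regular expressions and the function name `correlate` are my own for illustration, not anything Apache provides, and a different LogFormat would need different patterns.

```python
import re

# Sample lines copied from the logs in this post.
REFERER_LOG = """\
[17/Jan/2006:16:50:17 -0700] - -> /robots.txt
[17/Jan/2006:16:50:17 -0700] http://www.widexl.com/cgi-bin/remotely/meta/meta.pl -> /
"""
ACCESS_LOG = """\
69.59.172.52 - - [17/Jan/2006:16:50:17 -0700] "GET /robots.txt HTTP/1.1" 404 297
69.59.172.52 - - [17/Jan/2006:16:50:17 -0700] "GET / HTTP/1.1" 200 6515
"""

# Assumed layout: [timestamp] referer -> path
REFERER_RE = re.compile(r'^\[(?P<ts>[^\]]+)\] (?P<referer>\S+) -> (?P<path>\S+)$')
# Assumed layout: ip ident user [timestamp] "METHOD path PROTO" status size
ACCESS_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)$'
)


def correlate(access_text, referer_text):
    """Pair each access_log entry with the referer logged for the same
    timestamp and path. Returns (ip, timestamp, path, status, referer) tuples."""
    referers = {}
    for line in referer_text.splitlines():
        m = REFERER_RE.match(line.strip())
        if m:
            # Caveat: two hits on the same path within the same second
            # overwrite each other here; good enough for a quiet site.
            referers[(m["ts"], m["path"])] = m["referer"]

    rows = []
    for line in access_text.splitlines():
        m = ACCESS_RE.match(line.strip())
        if m:
            key = (m["ts"], m["path"])
            rows.append(
                (m["ip"], m["ts"], m["path"], int(m["status"]), referers.get(key, "-"))
            )
    return rows


if __name__ == "__main__":
    for row in correlate(ACCESS_LOG, REFERER_LOG):
        print(row)
```

As for the bot's name, crawlers normally send it in the User-Agent request header. Apache's stock "combined" log format records that header along with the referer on one line, which would make this kind of joining unnecessary.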
reverse DNS lookup
# host 69.59.172.52
52.172.59.69.in-addr.arpa. domain name pointer california-loans.biz.
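The same lookup can be scripted. Below is a minimal Python sketch, assuming only the standard library; the helper names `ptr_name` and `reverse_lookup` are made up for this example. A PTR record returns whatever single name the owner of the address block has configured, not a list of every site hosted on that IP, so a mismatch with the referring domain is not unusual on shared hosting.

```python
import ipaddress
import socket


def ptr_name(ip):
    """Build the name that a reverse (PTR) query asks for,
    e.g. the in-addr.arpa form printed by the `host` command."""
    return ipaddress.ip_address(ip).reverse_pointer


def reverse_lookup(ip):
    """Return the hostname from the PTR record, or None if the lookup fails.
    This needs a working resolver and network access."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
        return hostname
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        return None


if __name__ == "__main__":
    # Only the name construction is shown here; reverse_lookup() performs
    # a live DNS query, and its answer may have changed since 2006.
    print(ptr_name("69.59.172.52"))
```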