#1
Robot.txt Spider Crawler Question
Hello,
Now that Apache is successfully serving more than one host via virtual hosting, I have been working on the new website. Today I manually submitted it to several search engines and, after reading several posts saying you should have a robot.txt, I created one. While searching through my referer_log and access_log, I found a robot.txt request within a few hours. I still don't know how to identify the bots (still researching), but I used my access_log and referer_log together, matching time stamps. Any tips on how to identify the bot's name?

The weird thing is that when I do a reverse lookup on the IP from the access_log, it does not match the domain widexl.com from the referer_log. widexl is a CGI scripting company, and the IP address appears to resolve to a mortgage loan company. Virtual hosting may explain the shared IP, but why would two businesses be probing for my robot.txt file? I can see the mortgage company trying to harvest addresses for spam, but I do not use mailto links on my website and always use a server-side mail form.

Does anyone have any ideas on this? Does this look out of the ordinary? Also, how would these companies get my domain information so quickly after I submitted to search engines? I did use a couple of third-party sites that submit to about 40 engines at a time. Could one of these be the culprit? My two logs are posted below.

Thanks

referer_log
[17/Jan/2006:16:50:17 -0700] - -> /robots.txt
[17/Jan/2006:16:50:17 -0700] http://www.widexl.com/cgi-bin/remotely/meta/meta.pl -> /

access_log
69.59.172.52 - - [17/Jan/2006:16:50:17 -0700] "GET /robots.txt HTTP/1.1" 404 297
69.59.172.52 - - [17/Jan/2006:16:50:17 -0700] "GET / HTTP/1.1" 200 6515

reverse dns look up
# host 69.59.172.52
52.172.59.69.in-addr.arpa. domain name pointer california-loans.biz.
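The time-stamp matching described above can be automated. The following is only a rough sketch in Python, assuming the two log layouts look exactly like the samples in this post (a common-format access_log and a referer_log of the form `[time] referer -> path`). The function name `correlate` and both regular expressions are made up for illustration and may need adjusting for other log formats.

```python
import re
from collections import defaultdict

# Common log format: ip ident user [time] "METHOD path PROTO" status size
ACCESS_RE = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\S+)'
)
# Referer log layout as shown in this post: [time] referer -> path
REFERER_RE = re.compile(r'^\[([^\]]+)\] (\S+) -> (\S+)')


def correlate(access_lines, referer_lines):
    """Attach a referer to each access_log hit by matching time stamp and path."""
    referers = defaultdict(dict)
    for line in referer_lines:
        m = REFERER_RE.match(line.strip())
        if m:
            ts, referer, path = m.groups()
            referers[ts][path] = referer

    rows = []
    for line in access_lines:
        m = ACCESS_RE.match(line.strip())
        if m:
            ip, ts, _method, path, status, _size = m.groups()
            # "-" means no referer entry was found for this time stamp and path
            referer = referers.get(ts, {}).get(path, "-")
            rows.append((ip, ts, path, status, referer))
    return rows


if __name__ == "__main__":
    access = [
        '69.59.172.52 - - [17/Jan/2006:16:50:17 -0700] "GET /robots.txt HTTP/1.1" 404 297',
        '69.59.172.52 - - [17/Jan/2006:16:50:17 -0700] "GET / HTTP/1.1" 200 6515',
    ]
    referer = [
        '[17/Jan/2006:16:50:17 -0700] - -> /robots.txt',
        '[17/Jan/2006:16:50:17 -0700] http://www.widexl.com/cgi-bin/remotely/meta/meta.pl -> /',
    ]
    for row in correlate(access, referer):
        print(row)
```

Matching on time stamp plus path is a heuristic: two different clients requesting the same path in the same second would be indistinguishable, so a single log that records the referer alongside the IP is more reliable.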
#2
Your logs are not complete; they don't show the User-Agent.
Check httpd.conf.
#3
A normal robots.txt GET request will look something like:
Code:
209.191.65.252 - - [18/Jan/2006:12:20:00 -0700] "GET /robots.txt HTTP/1.0" 200 113 "-" "Yahoo-MMCrawler/3.x (mms dash mmcrawler dash support at yahoo dash inc dot com)"
or
66.249.64.36 - - [17/Jan/2006:22:22:44 -0700] "GET /robots.txt HTTP/1.0" 200 113 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"
but all on one line.
My httpd.conf entry for this VirtualHost is:
#4
Quote:
I have a separate referer_log and access_log. My httpd.conf uses the common format. Would combined give me the user agent field, or do I need to add that too? The two lines almost look like two separate hits. Thanks
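For reference, the example lines in post #3 carry two extra quoted fields after the size (referer, then user agent), while the common-format lines in post #1 stop at the size. Below is a minimal Python sketch for pulling the agent out of either kind of line. It assumes the log lines look like the samples in this thread, and `parse_line` and its regular expression are illustrative names, not part of Apache.

```python
import re

# Common log format, optionally followed by the two quoted fields
# ("referer" and "user agent") seen in the post #3 examples.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?$'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match.

    For a common-format line, 'referer' and 'agent' are None.
    """
    m = LOG_RE.match(line.strip())
    return m.groupdict() if m else None


if __name__ == "__main__":
    sample = (
        '66.249.64.36 - - [17/Jan/2006:22:22:44 -0700] '
        '"GET /robots.txt HTTP/1.0" 200 113 "-" '
        '"Googlebot/2.1 (+http://www.google.com/bot.html)"'
    )
    fields = parse_line(sample)
    print(fields["ip"], fields["request"], fields["agent"])
```

Keep in mind that the user agent string is supplied by the client and can be forged, so it identifies a bot only as far as the bot chooses to identify itself.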