Search Engine Optimization
 
Dev Shed Forums > Web Design > Search Engine Optimization

  #1  
January 18th, 2006, 12:54 AM
AZAmusements
Robot.txt Spider Crawler Question

Hello,

Now that I have Apache successfully serving more than one host via virtual hosting, I have been working on the new website. Today I manually submitted the site to several search engines and, after reading several posts saying you should have a robot.txt, I created one.

While searching through my referer_log and access_log, I found a robot.txt request within a few hours. I don't yet know how to identify the bots (still researching), but I used my access_log and referer_log together, matching time stamps. Any tips on how to identify the bot's name?

The weird thing is that when I do a reverse lookup on the IP from the access_log, it does not match the domain widexl.com from the referer_log.

widexl is a CGI scripting company, and it looks like the root of the IP address is a mortgage loan company. Virtual hosting may explain the shared IP, but why would two businesses be probing for my robot.txt file?

I can see the mortgage company trying to harvest addresses for spam, but I do not use mailto links on my website and always use a server-side mail form.

Does anyone have any ideas on this? Does this look out of place?
Also, how would these companies get my domain information so quickly after I submitted to search engines? I did use a couple of third-party sites that submit to about 40 engines at a time. Could one of these be the culprit?

My two logs are posted below.

Thanks

referer_log
[17/Jan/2006:16:50:17 -0700] - -> /robots.txt
[17/Jan/2006:16:50:17 -0700] http://www.widexl.com/cgi-bin/remotely/meta/meta.pl -> /

access_log
69.59.172.52 - - [17/Jan/2006:16:50:17 -0700] "GET /robots.txt HTTP/1.1" 404 297
69.59.172.52 - - [17/Jan/2006:16:50:17 -0700] "GET / HTTP/1.1" 200 6515

reverse dns look up
# host 69.59.172.52
52.172.59.69.in-addr.arpa. domain name pointer california-loans.biz.
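
The timestamp matching described above can be sketched in a few lines of Python. This is only an illustration, assuming the two log layouts shown in this post; the names `join_logs`, `ACCESS_RE` and `REFERER_RE` are made up for the example. It pairs each access_log hit with the referer_log entry that has the same timestamp and path. Note that the Referer value is supplied by the client and names the page that pointed at your site, so it does not have to match the reverse DNS name of the requesting IP.

```python
import re

# Layout of the referer_log lines shown above: [time] referer -> path
REFERER_RE = re.compile(r'^\[(?P<time>[^\]]+)\] (?P<referer>\S+) -> (?P<path>\S+)$')

# Layout of the access_log lines shown above (Apache "common" format)
ACCESS_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

def join_logs(access_lines, referer_lines):
    """Return (ip, path, status, referer) tuples, matched on timestamp and path."""
    referers = {}
    for line in referer_lines:
        m = REFERER_RE.match(line.strip())
        if m:
            # Caveat: two hits on the same path in the same second would collide.
            referers[(m["time"], m["path"])] = m["referer"]

    joined = []
    for line in access_lines:
        m = ACCESS_RE.match(line.strip())
        if m:
            key = (m["time"], m["path"])
            joined.append((m["ip"], m["path"], int(m["status"]), referers.get(key, "-")))
    return joined
```

Run against the four log lines above, this ties the widexl.com referer to the `GET /` hit and shows no referer for the `/robots.txt` hit, both from 69.59.172.52. The 404 on `/robots.txt` also suggests the server did not find a file by that name at the time of the request.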

  #2  
January 18th, 2006, 04:46 PM
rehash
Your logs are not complete; they don't show the User-Agent.
Check httpd.conf.
Comments on this post
stdunbar agrees: Exactly

  #3  
January 18th, 2006, 04:54 PM
stdunbar
A normal robots.txt GET request will look something like:

Code:
209.191.65.252 - - [18/Jan/2006:12:20:00 -0700] "GET /robots.txt HTTP/1.0" 200 113 "-"
"Yahoo-MMCrawler/3.x (mms dash mmcrawler dash support at yahoo dash inc dot com)"

or 

66.249.64.36 - - [17/Jan/2006:22:22:44 -0700] "GET /robots.txt HTTP/1.0" 200 113 "-"
"Googlebot/2.1 (+http://www.google.com/bot.html)"


but all on one line. My httpd.conf entry for this VirtualHost is:

apache Code:
CustomLog /some/path/to/my/log/file.log combined
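
Once the log is in combined format, pulling the bot's name out of a line is straightforward. Here is a rough Python sketch, assuming each entry is on one line and that the quoted fields contain no embedded double quotes; the helper names `user_agent` and `agent_product` are made up for this example.

```python
import re

# A combined-format line has three double-quoted fields, in this order:
# the request line, the Referer, and the User-Agent.
QUOTED_FIELD = re.compile(r'"([^"]*)"')

def user_agent(line):
    """Return the User-Agent field of a combined-format line, or None."""
    fields = QUOTED_FIELD.findall(line)
    return fields[2] if len(fields) >= 3 else None

def agent_product(line):
    """Return the leading product token of the User-Agent, e.g. 'Googlebot'."""
    agent = user_agent(line)
    if not agent or agent == "-":
        return None
    # Take the text before the first '/' or space, whichever comes first.
    return agent.split("/", 1)[0].split(" ", 1)[0]
```

For the two example entries above, `agent_product` gives `Yahoo-MMCrawler` and `Googlebot`. A line logged in common format has only one quoted field, so both helpers return `None` for it. Keep in mind the User-Agent is whatever the client chooses to send, so treat it as a hint rather than proof of identity.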

  #4  
January 18th, 2006, 05:31 PM
AZAmusements
I have a separate referer_log and access_log.
In my httpd.conf the access log uses common. Would combined give me the user agent variable, or do I need to add that too?

The two lines almost look like two separate hits.

Thanks
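
On the common versus combined question: in the stock httpd.conf, common is defined as `%h %l %u %t \"%r\" %>s %b`, and combined is the same string with `\"%{Referer}i\" \"%{User-Agent}i\"` appended. So, assuming those stock LogFormat definitions are still in place, switching the CustomLog line to combined should add both the referer and the user agent to each entry. The Python sketch below shows the difference from the log reader's side; `LINE_RE` and `parse_line` are names made up for the example.

```python
import re

# Matches a "common" line, with the two extra quoted fields that
# "combined" appends (Referer and User-Agent) treated as optional.
LINE_RE = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?\s*$'
)

def parse_line(line):
    """Return a dict of fields; 'referer' and 'agent' are None for common-format lines."""
    m = LINE_RE.match(line)
    return m.groupdict() if m else None
```

Parsing one of the access_log lines from the first post leaves `agent` as `None`, while the Yahoo or Googlebot examples fill it in. That is the whole difference between the two formats.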

© 2003-2008 by Developer Shed. All rights reserved. DS Cluster 5 hosted by Hostway