Search Engine Optimization
 
  #1  
January 18th, 2006, 01:54 AM
AZAmusements
Robots.txt Spider Crawler Question

Hello,

Now that my Apache is successfully serving more than one host via virtual hosting, I have been working on the new website. Today I manually submitted it to several search engines and, after reading several posts saying you should have a robots.txt, I decided to create one.

While searching through my referer_log and access_log, I found a robots.txt request made within a few hours of submitting. I don't really know how to identify the bots yet (still researching), but I matched time stamps between my access_log and referer_log. Any tips on how to identify the bot's name?

The weird thing is that when I do a reverse lookup on the IP from the access_log, it does not match the domain widexl.com from the referer_log.

widexl is a CGI scripting company, and the IP address appears to resolve to a mortgage loan company. Virtual hosting may explain the shared IP, but why would two businesses be probing for my robots.txt file?

I can see the mortgage company trying to harvest addresses for spam, but I do not use mailto links on my website and always use a server-side mail form.

Does anyone have any ideas on this? Does this look out of the ordinary?
Also, how would these companies get my domain information so quickly after I submitted to search engines? I did use a couple of third-party sites that submit to about 40 engines at a time. Could one of these be the culprit?

My two logs are posted below.

Thanks

referer_log
[17/Jan/2006:16:50:17 -0700] - -> /robots.txt
[17/Jan/2006:16:50:17 -0700] http://www.widexl.com/cgi-bin/remotely/meta/meta.pl -> /

access_log
69.59.172.52 - - [17/Jan/2006:16:50:17 -0700] "GET /robots.txt HTTP/1.1" 404 297
69.59.172.52 - - [17/Jan/2006:16:50:17 -0700] "GET / HTTP/1.1" 200 6515
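The manual matching of time stamps described above can also be scripted. The following is only a minimal sketch, assuming the two log layouts are exactly as pasted in this post; the regular expressions and the `correlate` helper are illustrative names, not part of Apache.

```python
import re
from collections import defaultdict

# Sample lines copied from the two logs in this post.
REFERER_LOG = """\
[17/Jan/2006:16:50:17 -0700] - -> /robots.txt
[17/Jan/2006:16:50:17 -0700] http://www.widexl.com/cgi-bin/remotely/meta/meta.pl -> /
"""

ACCESS_LOG = """\
69.59.172.52 - - [17/Jan/2006:16:50:17 -0700] "GET /robots.txt HTTP/1.1" 404 297
69.59.172.52 - - [17/Jan/2006:16:50:17 -0700] "GET / HTTP/1.1" 200 6515
"""

# Assumed layout of a referer_log line: [time] referer -> path
REFERER_RE = re.compile(r'^\[(?P<time>[^\]]+)\] (?P<referer>\S+) -> (?P<path>\S+)$')

# Assumed layout of an access_log line in the "common" format.
ACCESS_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)$'
)


def correlate(access_text, referer_text):
    """Join access and referer entries on (timestamp, requested path)."""
    referers = defaultdict(list)
    for line in referer_text.splitlines():
        m = REFERER_RE.match(line.strip())
        if m:
            referers[(m["time"], m["path"])].append(m["referer"])

    rows = []
    for line in access_text.splitlines():
        m = ACCESS_RE.match(line.strip())
        if m:
            # "?" marks an access entry with no referer entry at that second.
            for ref in referers.get((m["time"], m["path"]), ["?"]):
                rows.append((m["ip"], m["path"], int(m["status"]), ref))
    return rows


if __name__ == "__main__":
    for row in correlate(ACCESS_LOG, REFERER_LOG):
        print(row)
```

Matching on time stamp and path is a heuristic: two different clients requesting the same path within the same second would be indistinguishable. Note also that the 404 status on the /robots.txt entry shows the file was not yet in place when it was requested.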

reverse dns look up
# host 69.59.172.52
52.172.59.69.in-addr.arpa. domain name pointer california-loans.biz.
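A reverse lookup returns only the PTR name registered for an address, so on a shared host it will generally not list every site served from that IP. One way to test the virtual-hosting explanation is to resolve the referer's hostname forward and compare it with the address from the access_log. This is a minimal sketch; `resolves_to` is a hypothetical helper name, and current DNS records may no longer match what they were in 2006.

```python
import socket


def resolves_to(hostname, ip, resolve=socket.gethostbyname_ex):
    """Return True if a forward lookup of `hostname` includes `ip`.

    `resolve` defaults to the standard-library resolver and is a parameter
    only so the check can be exercised without network access.
    """
    try:
        _name, _aliases, addresses = resolve(hostname)
    except OSError:
        # Covers lookup failures such as an unknown host.
        return False
    return ip in addresses


# Example call (performs a live DNS query, so the result is not shown here):
# resolves_to("www.widexl.com", "69.59.172.52")
```

If the forward lookup of the referer's hostname returns the same address, both names are simply served from one machine, and the single visitor was one client, not two businesses.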

  #2  
January 18th, 2006, 05:46 PM
rehash
Your logs are not complete; they don't show the User-Agent.
Check your httpd.conf.
Comments on this post
stdunbar agrees: Exactly

  #3  
January 18th, 2006, 05:54 PM
stdunbar
A normal robots.txt GET request will look something like:

Code:
209.191.65.252 - - [18/Jan/2006:12:20:00 -0700] "GET /robots.txt HTTP/1.0" 200 113 "-"
"Yahoo-MMCrawler/3.x (mms dash mmcrawler dash support at yahoo dash inc dot com)"

or 

66.249.64.36 - - [17/Jan/2006:22:22:44 -0700] "GET /robots.txt HTTP/1.0" 200 113 "-"
"Googlebot/2.1 (+http://www.google.com/bot.html)"


but all on one line. My httpd.conf entry for this VirtualHost is:

Code:
CustomLog /some/path/to/my/log/file.log combined
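Once the log carries the User-Agent, pulling out who asked for robots.txt is straightforward. The following is a minimal sketch, assuming Apache's usual combined layout (host, ident, user, time, request, status, size, referer, user agent) and using the two sample entries above joined onto one line each; `robots_requesters` is an illustrative name.

```python
import re

# Assumed field order of the "combined" log format.
COMBINED_RE = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# The two sample entries from this post, each joined onto a single line.
SAMPLE_LINES = [
    '209.191.65.252 - - [18/Jan/2006:12:20:00 -0700] "GET /robots.txt HTTP/1.0" 200 113 "-" '
    '"Yahoo-MMCrawler/3.x (mms dash mmcrawler dash support at yahoo dash inc dot com)"',
    '66.249.64.36 - - [17/Jan/2006:22:22:44 -0700] "GET /robots.txt HTTP/1.0" 200 113 "-" '
    '"Googlebot/2.1 (+http://www.google.com/bot.html)"',
]


def robots_requesters(lines):
    """Return (host, user agent) for every request whose path is /robots.txt."""
    found = []
    for line in lines:
        m = COMBINED_RE.match(line.strip())
        if not m:
            continue  # Not in combined format, e.g. a "common" log line.
        parts = m["request"].split()
        if len(parts) >= 2 and parts[1] == "/robots.txt":
            found.append((m["host"], m["agent"]))
    return found


if __name__ == "__main__":
    for host, agent in robots_requesters(SAMPLE_LINES):
        print(host, "->", agent)
```

The pattern is deliberately simple and will not handle user-agent strings that contain escaped quotes. The User-Agent is also supplied by the client, so it identifies well-behaved crawlers but can be set to anything by a misbehaving one.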
  #4  
January 18th, 2006, 06:31 PM
AZAmusements

I have a separate referer_log and access_log.
My httpd.conf uses the common format. Would combined give me the User-Agent field, or do I need to add that too?

The two lines almost look like two separate hits.

Thanks
