| Originally Posted by Donna from pixel2life.com
Why do I need this robots.txt file anyway?
A great reason to use a robots.txt file is actually the fact that many search engines, including Google, post suggestions for the public to make use of this tool. Why is it such a big deal that Google teaches people about the robots.txt? Well, because nowadays, search engines are not a playground for scientists and geeks anymore, but large corporate enterprises. Google is one of the most secretive search engines out there. Very little is known to the public about how it operates, how it indexes, how it searches, how it creates its rankings, etc. In fact, if you do a careful search in specialized forums, or wherever else these issues are discussed, nobody really agrees on whether Google puts more emphasis on this or that element to create its rankings. And when people don't agree on things as precise as a ranking algorithm, it means two things: that Google constantly changes its methods, and that it does not make it very clear or very public. There's only one thing that I believe to be crystal clear. If they recommend that you use a robots.txt ("Make use of the robots.txt file on your web server" - Google Technical Guidelines), then do it. It might not help your ranking, but it will definitely not hurt you.
There are other reasons to use the robots.txt file. If you use your error logs to tweak your site and keep it free of errors, you will notice that most errors refer to someone or something not finding the robots.txt file. All you have to do is create a basic blank page (use Notepad in Windows, or the simplest text editor in Linux or on a Mac), name it robots.txt and upload it to the root of your server (that's where your home page is).
On a different note, nowadays, all search engines look for the robots.txt file as soon as their robots arrive on your site. There are unconfirmed rumors that some robots might even 'get annoyed' and leave, if they don't find it. Not sure how true that is, but hey, why not be on the safe side?
Again, even if you don't intend to block anything or just don't want to bother with this stuff at all, having a blank robots.txt is still a good idea, as it can actually act as an invitation into your site.
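To see what a blank robots.txt means to a well-behaved robot, here is a minimal sketch using Python's standard urllib.robotparser module. The robot name "ExampleBot" and the paths are made up for illustration; real robots may interpret things slightly differently.

```python
from urllib.robotparser import RobotFileParser

# A blank robots.txt contains no rules at all.
parser = RobotFileParser()
parser.parse("".splitlines())

# With no rules, this parser treats every path as open to every robot.
# "ExampleBot" and both paths are hypothetical.
blank_allows_home = parser.can_fetch("ExampleBot", "/index.html")
blank_allows_cgi = parser.can_fetch("ExampleBot", "/cgi-bin/form.cgi")
print(blank_allows_home, blank_allows_cgi)
```

In other words, a blank file blocks nothing. It mainly keeps the "file not found" entries out of your error log.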
Don't I want my site indexed? Why stop robots?
Some robots are well designed, professionally operated, cause no harm and provide a valuable service to mankind (don't we all like to "google"?). Some robots are written by amateurs (remember, a robot is just a program). Poorly written robots can cause network overload, security problems, etc. The bottom line here is that robots are devised and operated by humans and are prone to the human error factor. Consequently, robots are not inherently bad, nor inherently brilliant, and need careful attention. This is another case where the robots.txt file comes in handy - robot control.
Now, I'm sure your main goal in life, as a webmaster or site owner, is to get on the first page of Google. So why in the world would you want to block robots?
Here are some scenarios:
1. Unfinished site
You are still building your site, or portions of it, and don't want unfinished pages to appear in search engines. It is said that some search engines even penalize sites with pages that have been "under construction" for a long time.
2. Security
Always block your cgi-bin directory from robots. In most cases, cgi-bin contains applications, configuration files for those applications (that might actually have sensitive information), etc. Even if you don't currently use any CGI scripts or programs, block it anyway, better safe than sorry.
3. Privacy
You might have some directories on your website where you keep stuff that you don't want the entire Galaxy to see, such as pictures of a friend who forgot to put clothes on, etc.
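The first three scenarios all come down to blocking directories for every robot. Here is a rough sketch of what such a robots.txt could look like, checked with Python's standard urllib.robotparser. Only /cgi-bin/ comes from the text above; /under-construction/, /private/ and the robot name "ExampleBot" are made-up examples.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: /under-construction/ and /private/ are invented paths.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /under-construction/
Disallow: /private/
"""

def allowed(path, agent="ExampleBot"):
    """Return True if the rules above let the given robot fetch the path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)

print(allowed("/index.html"))         # True
print(allowed("/cgi-bin/form.cgi"))   # False
print(allowed("/private/photo.jpg"))  # False
```

Keep in mind that robots.txt is a request, not a lock: anyone can read the file, and only robots that choose to obey it will stay out of the listed directories.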
4. Doorway pages
Besides illicit attempts to increase rankings by blasting doorways all over the internet, doorway pages actually do have a very morally sound usage. They are similar pages, but each one is optimized for a specific search engine. In this case, you must make sure that individual robots do not have access to all of them. This is extremely important, in order to avoid being penalized for spamming a search engine with a series of extremely similar pages.
5. Bad bot, bad bot, what'cha gonna do...
You might want to exclude robots whose known purpose is to collect email addresses, or other robots whose activity does not agree with your beliefs on the world.
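Shutting out one specific robot while leaving the door open for everyone else could look something like the sketch below, again checked with Python's standard urllib.robotparser. "EmailHarvester" and "ExampleBot" are invented robot names, not real ones, and a robot only stays out if it actually honors robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: "EmailHarvester" is a made-up robot name.
ROBOTS_TXT = """\
User-agent: EmailHarvester
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The named robot is refused everywhere; an empty Disallow allows everything.
harvester_allowed = parser.can_fetch("EmailHarvester", "/contact.html")
other_allowed = parser.can_fetch("ExampleBot", "/contact.html")
print(harvester_allowed, other_allowed)  # False True
```

The same per-robot sections are what you would use to show different pages to different robots, as in the doorway scenario.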
6. Your site gets overwhelmed
In rare situations, a robot goes through your site too fast, eating your bandwidth or slowing down your server. This is called "rapid-fire" and you'll notice it if you read your access log file. A medium performance server should not slow down. You may, however, have problems if you have a low performance site, such as one running off your personal PC or Mac, if you run poor server software, or if you have heavy scripts or huge documents. In these cases, you'll see dropped connections, heavy slowdowns and, in extreme cases, even a complete system crash. If this ever happens to you, read your logs, try to get the robot's IP or name, read the list of active robots and try to identify and block it. |
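For the "read your logs" step, here is a rough sketch of how you might spot the busiest client. It assumes your server writes the common "combined" access log format; the pattern, the sample lines, the IP addresses and the robot name "FastBot" are all made up for illustration, so adjust them to whatever your own logs look like. It only counts requests, which is a simplification of real rapid-fire detection.

```python
import re
from collections import Counter

# Assumes the "combined" access log format; other formats need another pattern.
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def count_clients(lines):
    """Count requests per (IP, user-agent) pair, skipping unparseable lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[(match["ip"], match["agent"])] += 1
    return counts

# Invented sample lines; in practice, read them from your access log file.
SAMPLE = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET /a.html HTTP/1.1" 200 512 "-" "FastBot/1.0"',
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET /b.html HTTP/1.1" 200 734 "-" "FastBot/1.0"',
    '198.51.100.7 - - [10/Oct/2023:13:55:37 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
    '203.0.113.5 - - [10/Oct/2023:13:55:37 +0000] "GET /c.html HTTP/1.1" 200 290 "-" "FastBot/1.0"',
]

busiest = count_clients(SAMPLE).most_common(1)[0]
print(busiest)  # (('203.0.113.5', 'FastBot/1.0'), 3)
```

Once you have the IP or name of the busiest client, you can compare it against a list of known robots and decide whether to block it.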