Configuring a robots.txt
These days all big search engines use robots called web crawlers or spiders to grab the content of websites and store it in the search engine's database. These spiders are designed to look for links (internal and external) in a page and follow them to find new pages. To keep the content up to date the spider will revisit the page after some time. It's easy to see how the robot got its name: it crawls through a web of links. When a robot is just set loose on the web it can consume quite a lot of traffic in a short timespan, which may overload a network or server. Therefore most commercial and open source crawlers stick to a couple of conventions, e.g. not visiting a domain more than once in a short timespan. This however doesn't prevent a spider from downloading your complete website and thereby consuming a lot of traffic, which is why the Robot Exclusion Standard has been set up. It allows you to deny robots that use this standard1 access to parts of your site that you don't want to have indexed.
In this article I will show some basics of setting up a custom robots.txt for your website. I will begin by explaining the syntax and then show how to apply it to a basic website like my domain and to a forum system like the one I use for the XOL DOG UT Servers.
Robots.txt syntax
The robots.txt is a plain text file that is put in your website's root folder, e.g. http://www.my-domain.com/robots.txt. The syntax is fairly simple: it consists of a header and a set of disallow rules. The header starts with "User-agent:" followed by the name of a user-agent. A disallow rule starts with "Disallow: " followed by the beginning of the URL without the domain name. One user-agent can have multiple rules, and there is no limit to the number of user-agents in the robots.txt. These are the standard rules for a robots.txt, however some spiders like Googlebot may support additional directives.
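To give an idea of what that looks like, here is a minimal sketch of a robots.txt; the bot name and paths are made up for this example:

# keep HypotheticalBot out of two directories (example names only)
User-agent: HypotheticalBot
Disallow: /downloads/
Disallow: /stats/

This tells the spider that identifies itself as HypotheticalBot to stay out of everything under /downloads/ and /stats/; any other spider is not affected because no other user-agent is listed.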
The user-agent is the name of a bot. A typical way to find it is to look up the IP address's hostname on a site like ip2country.org; for example 66.249.66.1 will return the hostname "crawl-66-249-66-1.googlebot.com", which tells you the bot is from Google. You can then find the user-agent of the bot on Google's website. This strategy is only useful, however, if your site's statistics show that one IP uses a lot of bandwidth; in that case you can look up the IP as above or just search the web. There's a big chance more people have had problems with it and you'll be able to track down the user-agent that way. For other, normal, spiders just use the general list with the user-agent *; these rules will be used by all spiders not mentioned elsewhere in the robots.txt. Some spiders read the robots.txt top down and use the first match for their user-agent they find, therefore it's best to put the * section last in the file.
As stated above the disallow rule only needs the first part of an address. For example "Disallow: /downloads" will block any address that starts with http://www.my-domain.com/downloads, including:
http://www.my-domain.com/downloads.html
http://www.my-domain.com/downloads_of_games

If you only want to block the downloads directory, simply end the rule with a forward slash: "Disallow: /downloads/". Now downloads.html and downloads_of_games will be indexed normally. Depending on what you want to block you can use partial or complete addresses.
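Side by side the two variants look like this (they are alternatives, not meant to be used together):

# blocks everything starting with /downloads, including downloads.html
Disallow: /downloads
# blocks only the /downloads/ directory itself
Disallow: /downloads/

The version without the trailing slash is the broader one; the version with the slash limits the rule to that one directory.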
If you look at my robots.txt you will see two headers, one for Yandex and one for all other spiders. Below Yandex you see "Disallow: /", which means I told that bot to ignore my complete website. Note that this is not what you'd normally want, however this bot was indexing my entire site every day. Below the header for all other spiders is a longer list with the parts of this site that are blocked; I will discuss why and how I blocked them below.
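Simplified, the structure of that file comes down to something like this (the paths in the second section are placeholders here, not my actual list):

# Yandex was downloading the whole site every day, so it is banned completely
User-agent: Yandex
Disallow: /

# every other spider only has to skip a few uninteresting parts
User-agent: *
Disallow: /stats/
Disallow: /temp/

Because some spiders use the first matching header they find, the specific Yandex section comes first and the catch-all * section comes last.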
General sites
Most sites come with parts that may be interesting for the webmaster or users but are not vital to attract new visitors, or that change faster than the search engine revisits them. For my site these parts are mainly the statistics of my website and the statistics of the UT server I maintain; especially the utstats update daily and a big part is removed monthly, so allowing them to be indexed would probably only consume a lot of data traffic. Another typical example is a directory you use to send files to other people, which are only hosted for that purpose and removed afterwards; although these files typically aren't linked to, it doesn't harm to deny the spider access.
If you use a content management system (CMS) it may come with help pages, a page that tells you more about the system, maybe a credits page. These are interesting for people who want to use the same system as you do, but people who search for the CMS won't be interested in your page; they will rather be looking for the CMS's main website. Therefore feel free to block these parts too.
Other parts of your website that you may want to disallow access to are parts that you are going to remove; it's a nice way of telling the search engine to stop showing links to that part of the site. On the other hand you might just remove the part and show the spider that the page doesn't exist anymore, so it will remove it from its database.
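Put together, the rules for this kind of content might look something like the sketch below; the directory names are purely illustrative, so substitute whatever your own site, statistics package or CMS actually uses:

User-agent: *
# website and game server statistics that change daily anyway
Disallow: /webstats/
Disallow: /utstats/
# files temporarily hosted for other people
Disallow: /files/
# documentation that ships with the CMS
Disallow: /docs/
# a section that is about to be removed
Disallow: /old/

Each rule simply keeps the spider away from pages nobody will use to find your site.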
You may think now that this would be the way to keep a "secret part" of your website secret; however it should be obvious that the robots.txt itself is very easy to find, and with it your secret part. The only way to keep it secret is to make absolutely no link to it, including referrers.
A note on forum systems2
Have you ever found a forum post on Google and clicked "More results from this site"? There's quite a chance that the same forum post shows up more than once. This happens because a forum has more than one URL for the same topic: one to just read it, one to add a comment, one for a printer-friendly version, maybe an archive, and all of them are listed in the search engine.
Preventing a spider from indexing these pages is fairly simple: look at your page and hover your mouse over the links that you don't want to have indexed so they show up in the status bar. They usually consist of a fixed part and a part that refers to the thread, typically a number. The trick is to add the common part of the address to your robots.txt. For example, if I hover over the print link on my forums I see the address:
http://www.rork.nl/ut/index.php?action=printpage;topic=39.0
If I compare this with the same link on other pages, only the 39.0 changes; the other part stays the same and is therefore the base you want to block. In some cases, however, this common part might be the same as for the main topic page, so be careful not to block that as well. It might be worth making an overview of all links on the page, sorting them alphabetically and seeing what you can block. Keep in mind though that a spider visits your forum as a guest, so some links that show up when you are logged in are not shown to it, or might look a little different.
The most important part of a forum is the topics; they most likely contain the information someone is looking for. Therefore in my robots.txt most other parts of the forum are blocked: members, help texts, statistics etc. And to prevent topics from showing up three times I also blocked the pages where you can reply to a topic and the printer-friendly versions.
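Based on the print link shown above, the forum rules come down to something like this; only the printpage action is taken from my own forum, the other action names are typical examples, so check the links on your own forum before copying them:

User-agent: *
# printer-friendly copies of topics
Disallow: /ut/index.php?action=printpage
# reply forms (example action name)
Disallow: /ut/index.php?action=post
# member profiles, help texts and statistics (example action names)
Disallow: /ut/index.php?action=profile
Disallow: /ut/index.php?action=help
Disallow: /ut/index.php?action=stats

Because a rule only needs the beginning of an address, the topic number at the end doesn't matter: every printer-friendly or reply page matches, while the normal topic pages stay available for indexing.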
Conclusion
As you have read, configuring your robots.txt can make spiders friendlier to your website, because you limit the number of pages they have to download and thereby save server load and traffic. At the same time you make your website friendlier to the search engine and your users, because by blocking uninteresting or duplicate pages more valid results are returned by the search engine.
I hope this article increases your understanding of how and why to use a robots.txt, and that it aids you in deciding which parts of your website to block. When this is done properly you may save yourself quite some bandwidth and your site may look better on search engine result pages.
1 Most commercial and open source spiders will use this standard; however amateur spiders or spiders that harvest e-mail addresses might not.
2 In this article I use a forum because that is what I have experience with; however this applies equally to all website systems that use comments, like weblogs or photo galleries.