Author Topic: Site indexing part 1: Robots.txt
rork
Guest
« on: September 21, 2010, 11:04:09 am »

When a search engine wants to index your website it uses a robot called a web spider. These web spiders start at your main page and follow the links they find to crawl through your website and index all your pages. This process is repeated every once in a while to keep the search engine's database up to date.

But what if you want only some parts of your website to be indexed? Currently there are two major techniques for controlling which pages a web spider indexes: a robots.txt file and meta tags. In this article I will explain how to set up a simple robots.txt, using my own website as an example. I will deal with meta tags and my views on which pages to index in parts 2 and 3 of this series.

When should you use a robots.txt?
The only thing you need is access to the root directory of your website's domain. If you use a free hosting provider this might not be the case, and web spiders won't find your robots.txt.

Because robots.txt blocks access completely, it can be used to prevent web spiders from downloading parts of your site with many pages, like documentation or statistics. A drawback is that the web spider then can't use a blocked page as a hub to find other pages, so index pages shouldn't be blocked.

Syntax
The robots.txt is a plain text file stored in your domain's root directory, e.g. http://www.your-domain.com/robots.txt. The syntax is fairly simple: the file consists of one or more blocks, each with a header followed by a number of rules.
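
A very small robots.txt could look like this (the /private/ directory is just an example path, not one from my own site):

  User-agent: *
  Disallow: /private/

The next two paragraphs explain what the header and the rule mean.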

The header User-agent: tells which web spider the following rules apply to. You can use the name of a bot, or * to make a set of general rules that apply to all other web spiders. There are no conventions about the order in which the user agents should be listed: some web spiders may use the first set that applies to them, while others may look through the whole robots.txt for their own specific rules and fall back to the general rules if they can't find any. To be safe you can place the general rules at the end of the file.

The rules tell the web spider what it can, or rather can't, do. There is one standard rule, Disallow:, which tells the web spider not to visit URLs that start with the phrase that follows it. For example, Disallow: / tells the web spider that it can't visit any URL on your website, and Disallow: /stats tells it that it can't visit pages like /stats.php, /stats-extra.php, /stats/index.php, etc. Some web spiders may support additional rules.
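
To make the matching a bit more concrete: with the rule

  User-agent: *
  Disallow: /stats

a web spider that follows the standard will skip /stats.php, /stats-extra.php and /stats/index.php, because they all start with /stats, but it will still visit pages like /index.php or /about.php.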

If you look at my robots.txt you'll see two headers: one for the Yandex web spider, blocking all access to my website, and one for general use, only blocking specific pages which I don't think need indexing.
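
Such a file could look something like the sketch below; the paths under the general header are only placeholders, not the actual URLs from my site:

  User-agent: Yandex
  Disallow: /

  User-agent: *
  Disallow: /stats
  Disallow: /docs/

Note that the general rules come last, as suggested above.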

Retrieving web spider names
In order to set up web spider specific rules you have to know the names of the web spiders visiting your site. Modern counters will show you a list of web spiders indexing your site, but sometimes you'll have to make do with a list of IP addresses. To check whether an IP address belongs to a bot you can simply look it up with your favourite search engine, or use an IP-to-hostname converter like Arul's converter. For example, if your website is often visited by IP 66.249.66.1 and you convert it to a hostname, the result is crawl-66-249-66-1.googlebot.com. This obviously is the Googlebot. Keywords like bot, crawl, spider or a search engine's name generally indicate it's a web spider.
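
If you'd rather not depend on an online converter, a reverse DNS lookup does the same job. A quick sketch in Python, using the IP address from the example above:

  import socket

  # Reverse DNS lookup: returns (hostname, list of aliases, list of IP addresses)
  hostname, aliases, ips = socket.gethostbyaddr('66.249.66.1')
  print(hostname)  # e.g. crawl-66-249-66-1.googlebot.com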

Specifying URLs
The next step is to set up a series of rules, but for this you'll need to find the URLs of the pages you want to block. For single files or directories this is simple: just add them to the list. However, many sites use a CMS in which every page is loaded from one file, and arguments are passed to that file to navigate through the pages. This site uses a number of such pages: index.php?catid=1, index.php?itemid=3, index.php?archivelist=1. These navigate through the list of articles for a category, a blog post and the archive. If I want to block the categories I can add /index.php?catid= to robots.txt and they won't be indexed.

Most CMSs use a standard order of arguments, starting with an action and then the id of the page to show. If you add the URL including the action but without the page id, you'll block all of these pages. The only thing you have to do is look around your website, click the links and spot patterns in the URLs.
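
Applied to the URLs mentioned above, such rules could look like this (which of them you actually want to block is of course up to you):

  User-agent: *
  Disallow: /index.php?catid=
  Disallow: /index.php?archivelist=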

Be aware, however, that the web spider looks for a match at the start of the URL, so make sure you don't accidentally block pages further down the website's tree that you do want indexed.
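
For example (the paths here are hypothetical), the rule

  User-agent: *
  Disallow: /doc

doesn't just block a /doc/ directory, it also blocks /doc.php and /documents/index.php, because they all start with /doc. If you only want to block the directory itself, end the rule with a slash: Disallow: /doc/.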