Pages: [1]
Author Topic: Site indexing part 3: To index or not to index?  (Read 1209 times)
« on: October 06, 2010, 09:31:00 am »

'This is the third and last part of my series about site indexing by web spiders. In contrast to the first two parts, which covered the technical implementation of robots.txt and meta tags, this part gives my personal view on which pages should be indexed, which should only be followed, and which should not be visited at all.

I will discuss three kinds of software and how web spiders should handle them. The first is a standard weblog like this one, the second a forum system, and the third documentation or help files.

General thoughts
There are two reasons to block pages: for your own good, and for the good of search engines and their users. Blocking pages saves you traffic, which may save you some money and server load.

For search engines (and the people who use them) it's important to find the information they are looking for. Some pages are likely to generate cross-article search results: when someone searches for two terms, e.g. linux and nvidia, they may well end up on an index page that merely links to two separate articles, one covering linux and another covering nvidia.

Blocking certain pages can also work both ways. Many pages have a "print this" button. If the article is indexed, the "print this" link is followed and indexed as well, resulting in a duplicate of the article being stored in the search engine's database. Blocking the "print page" will save you some traffic, and your site will show up better in the search engines.
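As a minimal sketch of the "print page" case, the Python snippet below uses the standard library's urllib.robotparser to check how a spider would read a rule that blocks print versions. The /print/ path and the example.com URLs are assumptions made for illustration only; your CMS may well use a different URL scheme.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: assumes all print versions live under /print/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
"""

def allowed(url: str, agent: str = "*") -> bool:
    """Return True if the rules above allow `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)

print(allowed("http://example.com/articles/linux-nvidia"))  # True
print(allowed("http://example.com/print/linux-nvidia"))     # False
```

The article itself stays reachable while its print duplicate is never requested, which is where the traffic saving comes from.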

Of course the web spider could also be programmed to skip index files and "print pages". However, with all the different and custom CMSs around, it could be hard for a spider to decide what to index and what not. It should be easier for web developers to implement a set of simple rules.

In my view the most important part of a weblog is the published articles together with their comments. These are the only pages that have to be indexed. Other pages found on a standard weblog, such as the index page, categories and the archive, are prone to producing cross-article search results. Besides that, these pages contain (partial) duplicates of the articles, which only fill up search results without contributing anything new compared to the article itself.

Of course the articles have to be found, so the index page should be marked to be followed, but it doesn't have to be indexed.
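The weblog rules above could be sketched roughly like this in Python. The page type names and the robots_meta helper are hypothetical, and falling back to "noindex, follow" for unknown page types is my own assumption, not a rule from the earlier parts.

```python
# Hypothetical page types for a weblog; the names are illustrative only.
ROBOTS_POLICY = {
    "article": "index, follow",     # articles with comments: index them
    "index": "noindex, follow",     # front page: only follow the links
    "category": "noindex, follow",  # category listings: only follow
    "archive": "noindex, follow",   # archive listings: only follow
}

def robots_meta(page_type: str) -> str:
    """Build the robots meta tag for a page type.

    Unknown page types fall back to "noindex, follow" (an assumption).
    """
    content = ROBOTS_POLICY.get(page_type, "noindex, follow")
    return f'<meta name="robots" content="{content}">'

print(robots_meta("article"))
print(robots_meta("index"))
```

A template would call robots_meta once per page and place the result in the page head, so the policy lives in one table instead of being scattered over the templates.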

Much the same applies to forums as to weblogs: topics are the single most important pieces of information and should therefore be indexed. The index tree leading to the separate forums is required to lead the web spider to the topics and should be followed. Indexing the index tree is prone to cross-topic pollution of search results and therefore shouldn't be allowed.

If pages like "print page" or reply are indexed, they only add duplicates to the search engine's database. They don't lead to the topics or to new, interesting information, and can therefore be blocked altogether.

Most forum systems provide a list of members, some statistics and some help pages about how the forum should be used. In my opinion all of these are of interest to current forum members only, and they are unlikely to attract many new users. Therefore these can be blocked as well.
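For a forum, the list of pages to block altogether could be turned into robots.txt rules along these lines. This is only a sketch: the URL prefixes and both helper functions are hypothetical, and real forum software uses its own URL scheme. The board index is deliberately not listed, because it still has to be followed; keeping it out of the search results is a job for the robots meta tag instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical URL prefixes for pages that can be blocked altogether.
BLOCKED_PREFIXES = [
    "/forum/print/",    # print versions of topics
    "/forum/reply/",    # reply forms
    "/forum/members/",  # member list
    "/forum/stats/",    # statistics
    "/forum/help/",     # help pages
]

def build_robots_txt(prefixes):
    """Build a robots.txt that disallows every given prefix for all spiders."""
    lines = ["User-agent: *"]
    lines += [f"Disallow: {prefix}" for prefix in prefixes]
    return "\n".join(lines) + "\n"

def is_blocked(url, robots_txt):
    """Return True if the given rules forbid any spider from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch("*", url)

rules = build_robots_txt(BLOCKED_PREFIXES)
print(rules)
print(is_blocked("http://example.com/forum/members/42", rules))  # True
print(is_blocked("http://example.com/forum/topic/123", rules))   # False
```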

Documentation can be anything from a couple of help pages to the complete Perl documentation consisting of thousands of pages. If it's your own documentation, or a mirror that is frequently updated, index it; if it isn't, block it. Documentation can probably draw a lot of visitors, but it would generally be the documentation they're after, not your website. Besides that, if you don't keep it up to date you may be serving them outdated documentation.

In these three articles I've written about techniques to control web spiders and my views on which pages to index. There are more techniques to control web spiders, and more advanced ones will be developed, giving you even more control over them. With this last article I hope I have made clear why you should control web spiders and which pages to block. Good luck implementing this on your website.'