Blocked Content
Blocked content refers to pages that are kept out of search engines for various reasons. These can be pages that should not be indexed by search engines, such as pages in beta status or pages with duplicate content. There are several methods for blocking such pages from search engines.
The most common methods are:
- Robots.txt
- IP blocking
- Meta robots
Robots.txt
Robots.txt (also known as the Robots Exclusion Protocol) is a text file for robots that is stored in the root directory of a website. Before crawling a page, a robot checks whether a robots.txt file exists and what instructions it contains. Specific pages or entire directories can be excluded with the robots.txt file; compliant search engine bots will ignore them and not crawl or index them. There are times, however, when pages get included in the index despite the instructions in the robots.txt file. This happens especially when pages are accessible from other pages, in other words, linked to by other pages.
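For illustration, a minimal robots.txt could look like the following sketch; the directory and file names are placeholders:
# robots.txt in the root directory of the website
User-agent: *
Disallow: /beta/
Disallow: /duplicate-page.html
The first line addresses all robots, and each Disallow line names a path that compliant bots should not crawl.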
IP blocking
IP blocking can also prevent pages from being included in the search engine index. Certain user agents (for example, search engine robots or spam bots) are denied access through a .htaccess file. However, this method is only useful if the name of the bot trying to access the page or its IP address is known. Since search engine bots sometimes disguise themselves temporarily as other bots, exclusion from the index is not necessarily guaranteed.
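A sketch of such an exclusion in a .htaccess file, assuming an Apache 2.4 server; the bot name and IP address are placeholders:
# Deny access for a specific user agent and a specific IP address
SetEnvIfNoCase User-Agent "ExampleBot" blocked_bot
<RequireAll>
Require all granted
Require not env=blocked_bot
Require not ip 203.0.113.42
</RequireAll>
Requests whose user agent contains "ExampleBot" or that come from the listed IP address are refused, while all other visitors are allowed.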
Independently of this, Google Analytics can be configured to anonymize IP addresses so that Google Analytics does not store the full IP address of a visitor.
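As a sketch, assuming the Universal Analytics tracking code (analytics.js), IP anonymization is enabled with the anonymizeIp setting; the property ID is a placeholder:
ga('create', 'UA-XXXXX-Y', 'auto');
ga('set', 'anonymizeIp', true); // the last part of the IP address is removed before storage
ga('send', 'pageview');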
Meta robots
The third and probably the most effective method to exclude website content from being indexed by search engines is the use of meta robots. Meta robots is an HTML meta tag that gives search engine bots specific directions on whether the page should be included in the search engine index and whether the links on the page should be followed. This meta tag is declared in the header of a page. If you want to exclude the content of the page from the index and keep its links from being followed, the instructions in the robots tag would be:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
Recommendation
When blocking pages, it is particularly important to exclude the correct content. You have to ensure that important pages are well linked internally and are not accidentally blocked. If valuable pages are blocked, they can neither be indexed nor pass on valuable link juice.