Canonical Tag


An important prerequisite for website content to be indexed properly is that it is accessible under only one URL. If the same content is available under other URLs as well, those URLs must refer to the original resource; otherwise, duplicate content arises. If two URLs contain the same content, Google will not know which one to index and may index the less important page. To avoid this, so-called canonical URLs are used. These are declared by adding a canonical tag to the head area of the HTML code.

Canonical tag definition

The Canonical tag is a specification in the source code of a website. It refers to a standard resource - the canonical URL - for websites with the same or almost identical content. If a canonical URL is correctly marked, search engines will index this source only, meaning that duplicate content can be avoided. Search engines rate duplicate content negatively because there is no added value for the Internet user. A duplicate content checker can be used to detect duplicate content.

The canonical tag should be used when content exists on more than one URL. Sometimes this cannot be avoided, for the following reasons:

  • The homepage can be reached from different URLs (for example www.domain.com, domain.com, www.domain.com/index.html and so on).
  • Pages can be reached with and without trailing slashes (“/”) and with different capitalization
  • Because of URL rewriting, the server only pays attention to one ID and accepts variations of the address
  • Parameters (such as session IDs or product filters) are used that don’t change the content
  • Content is presented in different versions (e.g. print version, PDF etc.)
  • There are HTTPS variants of the site
  • The URL is still available under an HTTP version without SSL encryption
  • Additional content is being published on other, external websites

A self-referencing canonical should be included on every subpage of a website to prevent unexpected errors.
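
The URL variants listed above can be collapsed programmatically before a self-referencing canonical is written into a page. The following is a minimal Python sketch, not a definitive implementation. The preferred scheme and host, the list of ignored parameters, the index file names and the choice to lower-case paths and drop trailing slashes are all assumptions made for illustration; a real site has to pick its own conventions.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical policy choices, not taken from the article:
PREFERRED_SCHEME = "https"
PREFERRED_HOST = "www.domain.com"
IGNORED_PARAMS = {"session_id", "sid"}      # parameters that do not change the content
INDEX_FILES = {"index.html", "index.htm"}   # file names equivalent to the directory URL


def canonical_url(url):
    """Map an absolute URL variant to the one URL that should be declared canonical."""
    parts = urlsplit(url)
    # Treat differently capitalized paths as the same page (assumed policy).
    path = parts.path.lower()
    # Drop a trailing index file name: /index.html -> /
    head, _, last = path.rpartition("/")
    if last in INDEX_FILES:
        path = head + "/"
    # Use one trailing-slash convention: none, except for the root.
    if len(path) > 1 and path.endswith("/"):
        path = path.rstrip("/")
    if not path:
        path = "/"
    # Keep only the parameters that actually change the content.
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in IGNORED_PARAMS]
    return urlunsplit((PREFERRED_SCHEME, PREFERRED_HOST, path, urlencode(kept), ""))


for variant in ("http://domain.com",
                "http://www.domain.com/index.html",
                "https://www.domain.com/Shop/?session_id=xyz"):
    print(variant, "->", canonical_url(variant))
```

With these assumptions, the first two variants both map to https://www.domain.com/, and the session URL maps to https://www.domain.com/shop.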

Examples of canonical URLs

In general, there are two ways of indicating a canonical URL. In both cases, Google recommends absolute URLs – meaning the entire web address.

  • The syntax of the first case looks like this:
 <link rel="canonical" href="https://www.example.com/example.htm">

The link element containing the rel="canonical" attribute is placed in the head element of the source code and complements the document’s metadata. It refers to the standard page, and is only needed on pages that show identical content but are not treated as the original resource.

Let’s assume there are the following two websites:

http://www.example.com/examplepage.htm
http://www.example.com/examplepage/?session_id=xyz.htm

The first one is our standard resource. The second one is a session URL, as commonly used by online shops to store user-related data such as items in the shopping cart. The canonical tag should be integrated into the head element of the second page. It contains a reference to the standard resource, which is the first page. This indicates to Google and other search engines which page is the more important one and should be included in the index.
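
To check what a page actually declares, the canonical reference can be read back out of its markup. The sketch below uses Python's standard html.parser module; the class name CanonicalFinder, the helper find_canonicals and the sample markup in SESSION_PAGE are hypothetical names introduced for illustration, and the markup is an assumed version of the session page from the example.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect the href of every <link rel="canonical"> found inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "link" and self.in_head:
            attr = dict(attrs)
            # rel may hold several space-separated values; compare case-insensitively.
            rel = (attr.get("rel") or "").lower().split()
            if "canonical" in rel and attr.get("href"):
                self.canonicals.append(attr["href"])

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def find_canonicals(html):
    """Return all canonical targets declared in the head of an HTML document."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonicals


# Assumed markup of the session URL; it points to the standard resource.
SESSION_PAGE = """<html><head>
<title>Example page</title>
<link rel="canonical" href="http://www.example.com/examplepage.htm">
</head><body><p>Shop content</p></body></html>"""

print(find_canonicals(SESSION_PAGE))
```

A result with exactly one entry is the expected case. An empty list means no canonical was declared in the head, and more than one entry means the tag was used repeatedly.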

  • If the standard resource is a PDF document or another file type supported by Google, the canonical reference needs to be included in the HTTP header of the server response. The syntax is different, and setting it up requires knowledge of the Hypertext Transfer Protocol (HTTP):
 Link: <http://www.example.com/examplepage.pdf>; rel="canonical"

This is not just a marker in the document, but part of the server’s HTTP response: if a client (e.g. a browser or search engine) sends a request, the server replies that this URL is the canonical URL. The server may need to be reconfigured for this.

Let’s now assume there are these two websites:

http://www.example.com/examplepage.htm
http://www.example.com/examplepage.pdf

The second one should be the standard resource. As it is a PDF file, the canonical reference needs to be sent in the HTTP header. It refers to the PDF itself and tells Google that the PDF document serves as the standard for indexing.
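
How the header gets onto the response depends on the server in use. As a rough illustration only, the following Python sketch builds the Link header and attaches it in a handler based on the standard http.server module. The names canonical_link_header, PdfHandler and serve, the port, and the placeholder PDF body are assumptions made for this example; a production setup would normally configure the header in the web server itself.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

CANONICAL_PDF = "http://www.example.com/examplepage.pdf"


def canonical_link_header(url):
    """Return the (name, value) pair of an HTTP Link header declaring url canonical."""
    return ("Link", '<{}>; rel="canonical"'.format(url))


class PdfHandler(BaseHTTPRequestHandler):
    """Serve a placeholder PDF and declare its canonical URL in the response header."""

    def do_GET(self):
        body = b"%PDF-1.4 placeholder"  # stand-in bytes, not a valid PDF document
        self.send_response(200)
        self.send_header("Content-Type", "application/pdf")
        self.send_header(*canonical_link_header(CANONICAL_PDF))
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Start the demo server; blocks until interrupted. Not called automatically."""
    HTTPServer(("127.0.0.1", port), PdfHandler).serve_forever()


print(": ".join(canonical_link_header(CANONICAL_PDF)))
```

Calling serve() and requesting any path from the local server should show the Link line among the response headers, for example in the browser's network tools.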

Background

With the canonical tag, website operators can tell search engines which of the pages with identical content should be handled as the standard resource. Using a canonical tag is the best and easiest way to resolve duplicate content issues. As a consequence, webmasters can influence the link popularity of pages with identical content, as the domain authority can be focussed on a single URL.

Use cases

  • Canonical tags and pagination: When paginating websites with rel="next" and rel="prev", each page should refer to itself via canonical, or there should be a "view-all" page where all products are visible in one overview. When using rel="next" and rel="prev", the best case would be to not use canonical tags at all. Instead, add a robots meta tag to the head of each paginated page (from the second page onwards) and exclude these subpages from indexing.
  • Canonical tags and hreflang: If a website uses hreflang, the URLs should either refer to themselves with a canonical tag or not use a canonical tag at all. If hreflang is combined with a canonical tag that points to a different URL, Google receives conflicting signals: while the hreflang tag shows that there is another language version available, the canonical tag would make this version the original URL.
  • Canonical tags and noindex: With the noindex tag, webmasters can convey to Google that a URL should not be indexed. If a canonical tag refers to this page, Google receives unclear signals, as a canonical URL is the relevant page a webmaster wants to be indexed. Webmasters should therefore decide between a canonical and a noindex tag.

Frequent errors

Canonical tags are powerful. If applied incorrectly, pages may be completely ignored by Google, which could be a disaster for traffic and sales. A webmaster should first decide whether the content is in fact identical or almost identical; canonical tags only make sense in this case.

Frequent errors are:

  • Pages are paginated / numbered with rel="next" and rel="prev". Canonical tags don’t make sense here because, technically speaking, the pages do not have identical content.
  • The page referred to by a canonical tag is not available. The target must exist, otherwise a 404 error page will be returned.
  • “noindex”, “disallow” or “nofollow” directives are combined with canonical URLs. This combination is explicitly discouraged.[1]
  • The canonical tag is placed in the document’s body or used repeatedly in the metadata. It belongs in the head and may appear only once.
  • A relative path is specified as the canonical link target. This may cause the Googlebot to misinterpret the tag, which then loses its effect. For this reason, the link should always be specified as a complete URL in the canonical tag.
  • The syntax is ignored. It makes a difference whether the canonical tag refers to https://page.com/ or https://page.com. Therefore, every character should be taken into account when specifying the URL. The same applies to the protocol. For example, the canonical tag should not refer from HTTPS to HTTP. In January 2017, Google stated that the use of a secure HTTPS connection would become an important ranking factor for websites. Since then, Google has preferred HTTPS pages as canonical URLs.[2] The canonical tag should therefore point from the HTTP version to the HTTPS page, not vice versa.
  • The canonical tag of every page refers to the home page of the domain. In this case, only the start page will be interpreted as a canonical URL. As a result, Google may index only the start page in the medium term.
  • The canonical tag of a paginated page refers to the homepage of a website. The tag would be set incorrectly because it indicates that the page is a duplicate. With pagination, the contents of the pages and the URLs are not the same. Google is merely informed that the relevant paginated page is part of a series of pages in the same category.
  • Incorrect use of the canonical tag results in canonical chains or cross-references. Target pages of a canonical link should not refer to other canonicals.
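
Several of the errors above can be detected automatically once the canonical target of each page is known. The following Python sketch is one possible approach under stated assumptions, not an established tool. The function names check_canonical and resolve_canonical are hypothetical, and the canonicals dictionary (page URL mapped to the URL named in its canonical tag) is assumed to come from a crawl that is not shown here.

```python
from urllib.parse import urlsplit


def check_canonical(page_url, canonical):
    """Return a list of problems with one page's canonical target (empty if none found)."""
    problems = []
    target = urlsplit(canonical)
    if not target.scheme or not target.netloc:
        problems.append("relative path instead of a complete URL")
    elif urlsplit(page_url).scheme == "https" and target.scheme == "http":
        problems.append("points from https to http")
    return problems


def resolve_canonical(start, canonicals):
    """Follow canonical references beginning at start.

    Returns (final_url, hops, is_loop). More than one hop indicates a
    canonical chain; is_loop is True for cross-references such as A -> B -> A.
    """
    seen = [start]
    current = start
    while True:
        # A page without an entry is treated as self-referencing.
        target = canonicals.get(current, current)
        if target == current:
            return current, len(seen) - 1, False
        if target in seen:
            return target, len(seen), True
        seen.append(target)
        current = target


# Hypothetical crawl result: A points to B, and B points on to C.
canonicals = {
    "https://page.com/a": "https://page.com/b",
    "https://page.com/b": "https://page.com/c",
    "https://page.com/c": "https://page.com/c",
}
print(resolve_canonical("https://page.com/a", canonicals))
print(check_canonical("https://page.com/a", "/b"))
```

In this sample data, page A reaches its final target only after two hops, which is the chain the last bullet warns about; pointing A directly at C would remove it.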

Alternatives

With the Google Search Console, webmasters can specify how Google should handle parameters of a website. This can cause the Googlebot to ignore certain URLs of a page.

References

  1. Mueller (Google) regarding the combination of noindex and canonical. reddit.com. Accessed on November 28, 2018.
  2. General Guidelines for All Canonicalization Methods. support.google.com. Accessed on November 28, 2018.

Web Links