Duplicate Content


The term duplicate content comes from search engine optimization. Duplicate content is created when the same content can be accessed and indexed under different URLs. Indexing websites with duplicate content can have a negative effect on their ranking in the SERPs.

Types of duplicate content

Duplicate content can arise if:

  • Content is syndicated, sold or illegally copied, so that different websites use the same content. In this case, duplicate content can harm the creator.
  • The content of a website is accidentally served on different domains or subdomains (e.g., with and without “www”).
  • Content is used twice in different categories. This can happen, for example, if the contents of a URL are also published in a news area.
  • The content management system cannot assign unique URLs to the content.
  • Different attribute filters in online shops produce the same product lists.

“Near duplicate content” is very similar content that can also lead to problems. Text blocks that are copied frequently (such as teasers or recurring texts on every page) may be treated as duplicate content by search engines.

Background

Google has made various adjustments to its algorithms so that the search engine can filter out duplicate content reliably. Both the Brandy Update in 2004 and the Bourbon Update in 2005 improved Google’s ability to detect duplicate content.

Consequences of duplicate content

Duplicate content presents search engines with a problem: they have to decide which of the duplicate pages is most relevant to a search query. Google emphasizes that "duplicated content on a website [...] is not a reason for taking action against this website". However, the search engine provider reserves the right to impose penalties for manipulative intent: "In the rare cases where we have to assume that duplicate content is displayed with the intention of manipulating the ranking or deceiving our users, we make the appropriate corrections to the index and ranking of the websites concerned." Webmasters should not leave it to Google to decide whether duplicate content is unintentional or deliberately created - they should simply avoid duplicate content.


Technical causes of duplicate content

Duplicate content can have different causes, which are often rooted in incorrect server configuration.

Duplicate content due to poor server configuration

The basis for preventing duplicate content within one’s own website lies in the server configuration. The following problems can be solved quite easily:

Duplicate content due to a catch-all / wildcard subdomain

One of the most basic on-page SEO errors arises when a domain responds to all subdomains simultaneously. This can easily be tested by visiting

"http://www.DOMAIN.com” followed by “http://DOMAIN.com” (i.e., without “www”)

If the same content is shown in both cases (and the address bar still shows the entered domain), one should act quickly. In the worst case, the server responds to all subdomains – including a subdomain such as

“http://potatoe.DOMAIN.com”

These other pages with the same content are referred to as doublets. To make it easier for search engines to decide which URL is relevant, the server should be configured correctly. For the commonly used Apache server, this can be done with the mod_rewrite module, for instance. With an .htaccess file in the root directory of the website, the following code instructs the server, via a 301 redirect, to respond only to the correct domain and to automatically redirect the usual subdomains to it:


RewriteEngine On
# ! Please remember to replace “DOMAIN” with the respective domain of your project !
# Any host that does not match www.DOMAIN.com is redirected there with a 301
RewriteCond %{HTTP_HOST} !^www\.DOMAIN\.com$ [NC]
RewriteRule (.*) http://www.DOMAIN.com/$1 [R=301,L]

As a preliminary consideration, one should first decide what the main domain should be, i.e., with or without “www”. For international websites, country identification as a subdomain should also be taken into account, for example:

http://en.DOMAIN.com/

Duplicate content due to missing trailing slashes

Another widespread form of duplicate content arises from the handling of trailing slashes. These are URLs that do not contain file names but rather point to directories. For example:

http://www.DOMAIN.com/register_a/register_b/

This (usually) opens the index file of the “register_b” subfolder. Depending on the configuration, the following URL also responds in a similar manner:

http://www.DOMAIN.com/register_a/register_b 

In the example above, the last slash is missing. The server first tries to find the file “register_b”, which does not exist, but then realizes that a folder with this name exists. Since the server does not want to return an unnecessary error message (“file does not exist”), the index file of this folder is displayed instead. In principle, this is a good thing, but it unfortunately results in duplicate content (as soon as a link points to the “false” URL). This problem can be handled in different ways:

The best way to handle it is a 301 redirect via .htaccess, combined with rectifying the faulty links. This saves Google unnecessary crawling effort, which can in turn benefit the website elsewhere. A sketch of such a redirect is shown below.
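
The following .htaccess sketch appends the missing trailing slash for requests that point to an existing directory. It assumes mod_rewrite is enabled and the rules are placed in the root directory; the domain is a placeholder:

RewriteEngine On
# If the requested path is an existing directory but the URL lacks a trailing slash,
# redirect (301) to the same URL with the slash appended
RewriteCond %{REQUEST_FILENAME} -d
RewriteCond %{REQUEST_URI} !/$
RewriteRule (.*) http://www.DOMAIN.com/$1/ [R=301,L]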

Dealing with duplicate content

The tasks of on-page optimization include not only avoiding duplicate content but also identifying it and acting appropriately. A so-called Duplicate Content Checker can help here: it lists the URLs that show similar content. It is particularly important for webmasters and SEOs to act quickly in the case of duplicate content. Since indexing by search engine robots is getting ever faster, identical content also spreads across the web faster. This carries the risk of poor rankings or even swift exclusion from the index.

Text uniqueness

Duplicate content often affects online shops that take over product texts 1:1 from manufacturers and also use them for price comparison portals. Matt Cutts has already expressed his opinion on this topic. [1] You should, therefore, create different texts for your own website and for price comparisons or external shopping portals. Even though it may seem a troublesome task, individualized texts for different pages pay off: firstly, your own website and brand will be strengthened, and secondly, the price comparison portals will receive individualized and therefore more interesting texts, both for Google and for the user.

In order to avoid near duplicate content on one's own site, webmasters should check their content carefully and consider whether some categories can be merged. In some cases, it may also be useful to mark filter pages with the tag "noindex, follow", for example, as sketched below. Search engines do not index these pages, but follow the links on them.
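
A minimal sketch of such a robots meta tag in the <head> of a filter page (the page itself is an assumed example):

<!-- Keep this filter page out of the index, but let crawlers follow its links -->
<meta name="robots" content="noindex, follow" />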

To create unique content, tools are available that take the TF*IDF formula into account.
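
One common variant of this weighting (a sketch; tools may use slightly different formulas) multiplies the term frequency by the inverse document frequency:

TF*IDF(t, d) = tf(t, d) * log(N / df(t))

Here tf(t, d) is the frequency of term t in document d, N is the total number of documents considered, and df(t) is the number of documents containing t. Terms with a high TF*IDF value characterize a text particularly well.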

Content theft

Should external duplicate content arise as a result of “content theft”, the respective webmaster must be contacted immediately and asked to either include the original source of the text or remove the text. In most cases, a simple request is enough. In extreme cases, a warning may also be issued. In addition, webmasters have the possibility to report pages to Google that violate copyright by copying content. The corresponding form can be submitted via Google Search Console.

301 redirection

If external duplicate content arises because a webmaster is operating the same content on two or more domains, a 301 redirect is often enough to prevent the duplicate content, as sketched below.
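
A minimal .htaccess sketch for such a case, assuming a secondary domain OLD-DOMAIN.com (a placeholder) should be redirected entirely to the main domain:

RewriteEngine On
# Redirect all requests for the secondary domain to the main domain with a 301
RewriteCond %{HTTP_HOST} ^(www\.)?OLD-DOMAIN\.com$ [NC]
RewriteRule (.*) http://www.DOMAIN.com/$1 [R=301,L]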

Another option is to let Google know the preferred version of a website via Google Search Console, for example.

Canonical tag, noindex-tag and robots.txt

There are several alternatives when dealing with internal duplicate content on one’s own website. The canonical tag is an important tool in this case: it references the duplicated subpage to the original page, and the duplicate is excluded from indexing. If you want to be absolutely sure that a subpage with duplicate content is not indexed, you can mark it with a noindex tag. In order to additionally exclude the duplicate content from crawling, the respective subpages can also be listed in the robots.txt file.
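
A minimal sketch of a canonical tag in the <head> of a duplicated subpage; the URLs are placeholders:

<!-- Point the duplicate to the original version -->
<link rel="canonical" href="http://www.DOMAIN.com/register_a/original-page/" />

To additionally exclude such a subpage from crawling, a corresponding robots.txt entry could look like this (the path is an assumed example):

User-agent: *
Disallow: /register_a/duplicate-page/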

hreflang tags on translated pages

Google is now able to identify translated pages well and can assign the content to an original page. In order to avoid duplicate content through translations or identical languages for different target markets, the hreflang tag can be used to indicate the region and language of individual URLs. This way, Google recognizes that translations of a page exist and that each URL has a certain regional orientation.


An example: a German online shop also offers its goods in the German-speaking part of Switzerland and in Austria. In this case, the target language is German, but the shop uses the corresponding country endings .at and .ch for the target countries. To avoid duplicate content, an hreflang annotation is placed in the header of the German version to refer to the variant for Switzerland.
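
A minimal sketch of such hreflang annotations in the <head> of the German version; the domains are placeholders:

<link rel="alternate" hreflang="de-de" href="http://www.DOMAIN.de/" />
<link rel="alternate" hreflang="de-ch" href="http://www.DOMAIN.ch/" />
<link rel="alternate" hreflang="de-at" href="http://www.DOMAIN.at/" />

Each country variant should carry the full set of annotations so that the references are reciprocal.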

rel=alternate with mobile subdomains

Mobile optimization can also produce duplicate content. This is especially true if the mobile website has its own subdomain. Duplicate content can then be avoided using the rel=alternate tag: it refers from the desktop version to the mobile version. Search engines then recognize that both versions belong together, and double indexing is prevented.
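
A minimal sketch, assuming the mobile version lives on a hypothetical m.DOMAIN.com subdomain: the desktop page points to the mobile URL via rel=alternate, and the mobile page points back to the desktop URL via a canonical tag.

<!-- In the <head> of the desktop page http://www.DOMAIN.com/page/ -->
<link rel="alternate" media="only screen and (max-width: 640px)" href="http://m.DOMAIN.com/page/" />

<!-- In the <head> of the corresponding mobile page http://m.DOMAIN.com/page/ -->
<link rel="canonical" href="http://www.DOMAIN.com/page/" />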

Prevention

In order to prevent internal duplicate content, it is advisable to plan the page hierarchy appropriately. This makes it possible to detect possible sources of duplicate content in advance. When creating products in online shops, preparations for an easy implementation of canonical tags should also be made. The following applies at the text level: the more individualized the text, the better it is for Google and the user, and the easier it is to avoid duplicate content.

Duplicate Content Checker

For an initial analysis, so-called Duplicate Content Checkers, such as those from Copyscape or Ryte, are available. These tools identify similar or even identical content on the web. Online shops in particular, which transmit their product data via CSV files to price comparison portals or sales platforms such as Amazon, are often affected by these problems. Matt Cutts has already expressed his opinion on this topic. [2]

References
