
Do’s and Don’ts – The correct way to handle duplicate content

Websites are said to have duplicate content when identical or very similar content is available on different URLs. Such duplicates do not offer any added value for search engines or website visitors.

This article will show you the various causes of duplicate content and why it is important to differentiate between the different types of duplicates. You will also receive tips on how to prevent or remove duplicate content.

What is the cause of duplicate content?

According to Google:

"Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar."
(Source: Google Search Console Help)

In other words, this is content that can be found on different URLs or on different domains.

The most common causes of duplicate content include:

  • Print versions of webpages

  • Online shop products that are provided via a product feed

  • Failure to use a standard domain structure (e.g., www.domain.tld vs. domain.tld)

  • Inconsistent URL structure

      • GET parameters

      • Lowercase and uppercase letters

      • Trailing slash

      • Session IDs

  • PDF versions

Duplicate content penalty

Conversations about duplicate content almost always turn to the so-called duplicate content penalty, the idea that Google actively penalizes a website if its content is detected on multiple URLs. This is what John Mueller, Webmaster Trends Analyst at Google, says about the penalty:

"There's no reason to penalize a website for having that & certainly at Google there's no duplicate content penalty when it comes to your own content."
Source

According to Google, duplicate content on the same domain is not a reason to penalize a website. However, you waste a lot of potential with such duplicates. Google always tries to provide the best possible search result for every search request. If the best result is available on different URLs, the Google algorithm attempts to identify the best URL. Ideally, this is the URL you actually want to rank. However, the algorithm can also settle for a totally wrong URL.

A large number of duplicates on a website leads to excessive use of crawl resources as the search engine tries to process the duplicates. In the worst case, this could result in delayed indexing of new content. You should therefore implement technical measures to avoid duplicate content on your website.


Figure 1: 40% of the domain’s content consists of duplicates that use up the crawl budget unnecessarily

On the other hand, duplicate content across domains is evaluated differently by Google.

A search engine cannot reliably tell whether content was duplicated intentionally in order to manipulate the search results.

"However, in some cases, content is deliberately duplicated across domains in an attempt to manipulate search engine rankings or win more traffic. Deceptive practices like this can result in a poor user experience, when a visitor sees substantially the same content repeated within a set of search results."
Source

If Google detects identical content on different domains, it tries to identify the original version and excludes the duplicates from the search results:


Figure 2: Google hides duplicate content from the search results

Google also warns of severe consequences if it suspects that cross-domain duplicates were created to deliberately manipulate the search results. In the worst case, this could mean general measures against the entire domain:

"... duplicate content may be shown with intent to manipulate our rankings and deceive our users, we'll also make appropriate adjustments in the indexing and ranking of the sites involved. As a result, the ranking of the site may suffer, or the site might be removed entirely from the Google index, in which case it will no longer appear in search results."
Source

How can you identify duplicate content?

Google makes a distinction between internal and external duplicate content. But how can you differentiate between the two?

Identify internal duplicate content

Ryte's Website Success enables you to very easily identify internal duplicates. Simply go to "Content" → "Duplicate Content" → "Duplicates". The report lists all duplicates that have been detected on the website as well as the number of URLs with the same content. Clicking on the magnifying glass in the "Duplicate Contents (Counter)" column shows you all URLs with duplicate content.


Figure 3: Identify internal duplicates using Ryte Website Success

Tip: Website Success also allows you to view and export all duplicates and affected URLs, which helps you derive appropriate measures for every duplicate on your website. Simply click on the gear icon in the table and choose the CSV Export option to extract the data.
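
If you want to double-check a handful of pages yourself, the basic idea of grouping URLs by identical content can be sketched in a few lines of Python. This is only a minimal illustration and not how Ryte Website Success works internally: the URLs and texts are hypothetical, and the sketch only finds exact duplicates after whitespace and case have been normalized, not "appreciably similar" content.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(text: str) -> str:
    """Hash the text after collapsing whitespace and lowercasing it."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(pages: dict[str, str]) -> list[list[str]]:
    """Return groups of URLs whose text has the same fingerprint."""
    groups = defaultdict(list)
    for url, text in pages.items():
        groups[fingerprint(text)].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


# Hypothetical crawl result: URL -> extracted main text of the page.
pages = {
    "https://www.example.com/shoes": "Red running shoes, size 42.",
    "https://www.example.com/shoes?print=1": "Red  running shoes,\nsize 42.",
    "https://www.example.com/about": "About our shop.",
}

duplicates = find_duplicates(pages)
print(duplicates)
```

Here the regular page and its print version end up in the same group, while the "about" page is left out because its text is unique.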

Identify external duplicate content

You must ask yourself three important questions in order to identify external duplicate content:

1. Who is the creator of the content?

2. Is this content also used in other sections of the website?

3. Are there any partnerships or similar groups that use this content?

1. Who is the creator of the content?

First, you need to find out where the content on your website originated from. Do you have your own editing team or was the content purchased? In particular, the product descriptions on online shops often originate from an automated data feed and should therefore be regarded as potential duplicate content. In such cases, it is advisable to create your own text for your most important products. However, if your inventory is always changing, it could be advisable to exclude the product pages from the search engine index and create optimized category and landing pages on which the products are listed.


Figure 4: Exact copies of the product description in numerous online shops

2. Is this content also used in other sections of the website?

Big or international companies often market their products through different channels and in different countries. They often use the same descriptive texts and slogans at different points on the website in order to convey a consistent brand message.

You should therefore look at the different areas of the company that market the same product. If the same language is used for different countries, you should use the hreflang tag. For instance, this can be used to tell search engines that the English text is meant for different countries.


Figure 5: Identical content used to market Microsoft Office 2016

3. Are there any partnerships or similar groups that use this content?

With good internal communication, anyone who markets their own products can restrict duplicate content to their own domains. However, this becomes difficult in the case of partnerships: many online shop systems offer all products, including the product descriptions, via a data feed. This makes it very easy for partners to generate duplicate content when integrating the products into their online shops.

Online shops should therefore offer a separate data feed for such partners. Instead of the standard feed from one's own CMS, it should contain individualized descriptive texts. Alternatively, the partners can agree to use a cross-domain canonical tag or noindex for the corresponding webpages.


Figure 6: The product feed of an online shop on ebay.com

External duplicate content can also arise without your knowledge. This often happens when other people copy your web content without asking you. http://www.copyscape.com/ is a very popular tool for identifying websites that use your content without your permission. If you are unable to contact the operators of such websites, you should request removal of the copied content in the Google Search Console via the DMCA Dashboard.


Figure 7: DMCA Dashboard in the Google Search Console

How should you deal with internal duplicate content?

Technical solution for duplicate content

Internal duplicate content wastes valuable potential. You should therefore not only reduce the number of duplicates on your website but also put the necessary technical measures in place.

There are different technical solutions with which you can avoid duplicate content. However, not all are suitable for solving the problem at its source. Therefore, you need to ask yourself the following:

1. Can I avoid duplicate content (DC) altogether, e.g., by avoiding GET parameters?

2. Can I use a 301 redirect that points to the original version?
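
To illustrate both questions, here is a rough Python sketch that maps URL variants to one preferred URL and tells you when a 301 redirect would be needed. The rules are assumptions for this example only (www host as the standard domain, lowercase paths, trailing slash on paths without a file extension, a hypothetical list of session parameters); your own rules may differ, and lowercasing paths is only safe if your server treats them case-insensitively.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters that never change the content of a page.
IGNORED_PARAMS = {"sessionid", "sid", "phpsessid"}


def preferred_url(url: str) -> str:
    """Map any URL variant to the single preferred URL (assumed rules)."""
    parts = urlsplit(url)

    # Standard domain: always use the www host.
    host = parts.netloc.lower()
    if not host.startswith("www."):
        host = "www." + host

    # Standard path: lowercase, trailing slash unless it looks like a file.
    path = parts.path.lower() or "/"
    last_segment = path.rsplit("/", 1)[-1]
    if not path.endswith("/") and "." not in last_segment:
        path += "/"

    # Drop session IDs, keep all other GET parameters.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    ]
    return urlunsplit((parts.scheme, host, path, urlencode(kept), ""))


def redirect_for(url: str):
    """Return (301, target) if the URL is a variant, otherwise None."""
    target = preferred_url(url)
    return (301, target) if target != url else None


print(redirect_for("http://domain.tld/Shoes?sid=abc123"))
# (301, 'http://www.domain.tld/shoes/')
print(redirect_for("http://www.domain.tld/shoes/"))
# None
```

The point of the design is that every variant leads to exactly one target, so duplicates are resolved at the source instead of being hidden from the index afterwards.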

Quick way to deal with duplicates

Only after you have implemented all possible measures based on these questions should you embark on the following solutions for removal of duplicate content. But be careful: "Fast solutions", especially for large-scale duplicate content, can quickly result in new problems.

Canonical tag: The canonical tag is a quick way to defuse duplicate content. You can add it to exactly the pages where you need it, but never use it to get rid of large-scale duplicate content. Search engines must still crawl each affected URL in order to see the tag. This can cost you a significant portion of crawl resources, especially if your website has a large number of affected URLs.

noindex: This meta robots tag prevents content from being indexed by Google. As with the canonical tag, you should not use noindex to eliminate large-scale duplicate content, since Google still has to crawl these URLs and ends up wasting valuable crawl resources.
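
Both the canonical tag and noindex live in the HTML of the page itself, which is why a crawler has to fetch the URL before it can act on them. The following Python sketch shows what reading these hints from a page could look like. The class name and the sample markup are made up for this example, and a real crawler does considerably more.

```python
from html.parser import HTMLParser


class IndexingHints(HTMLParser):
    """Collects the canonical URL and the meta robots directives of a page."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.robots = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "link" and (attributes.get("rel") or "").lower() == "canonical":
            self.canonical = attributes.get("href")
        elif tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # Directives are comma-separated, e.g. "noindex, follow".
            self.robots.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


# Hypothetical markup of a duplicate page.
html = """
<html><head>
<link rel="canonical" href="https://www.example.com/shoes/">
<meta name="robots" content="noindex, follow">
</head><body>...</body></html>
"""

hints = IndexingHints()
hints.feed(html)
print(hints.canonical)
print("noindex" in hints.robots)
```

The page has to be downloaded and parsed before either hint becomes visible, which is exactly the crawl cost described above.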

Robots.txt: Blocking the duplicates in robots.txt prevents search engines from accessing them, but it also prevents link juice from being passed on to the respective URLs.
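
As a small illustration, Python's standard library ships a robots.txt parser that lets you check which URLs a rule would block. The rule and the URLs below are assumptions for this sketch (a hypothetical /print/ directory holding the print versions). Note that this parser only does simple prefix matching and does not necessarily behave exactly like Google's crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks the print versions of all pages.
robots_txt = """\
User-agent: *
Disallow: /print/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

blocked = parser.can_fetch("*", "https://www.example.com/print/shoes")
allowed = parser.can_fetch("*", "https://www.example.com/shoes/")
print(blocked)  # False: crawlers may not fetch the print version
print(allowed)  # True: the regular page stays crawlable
```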

NoFollow: The nofollow attribute tells search engines not to follow a link to the URL. However, it does not prevent the URL from being indexed, since the duplicate can still be linked from other places, both internally and externally.

Special case: Identical content for different countries

Technically, a website that offers the same content in the same language for different countries still has duplicate content. This is often the case for German websites that are used in Germany, Switzerland, and Austria.

Google enables you to solve this problem with the hreflang attribute, which links each affected URL to all of its language and country versions. The hreflang attribute is a very powerful tool, so you should take great care to avoid errors when using it. Even simple errors can result in significant ranking drops in a specific country. Below are some of the common errors that you must avoid:

  • Mistakes when specifying the language and country → use the correct ISO code

  • Referencing non-existent or redirecting URLs → avoid 404s and redirects

  • Using conflicting hreflang attributes at different points on the website → only add the attribute once
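
The first of these errors can be caught before the tags ever reach your templates. The Python sketch below generates the hreflang link elements for a set of country versions and rejects values that do not look like "language" or "language-REGION". The URLs are hypothetical, and the pattern is a deliberate simplification: it only checks the format, so a value such as "en-UK" would pass even though the correct ISO country code for the United Kingdom is GB.

```python
import re

# Simplified format check: two-letter language, optional two-letter region,
# or the special value "x-default". It does not verify that the codes exist.
HREFLANG_PATTERN = re.compile(r"^(x-default|[a-z]{2}(-[A-Z]{2})?)$")


def hreflang_links(versions: dict[str, str]) -> list[str]:
    """Build one <link> element per language/country version."""
    for code in versions:
        if not HREFLANG_PATTERN.match(code):
            raise ValueError(f"invalid hreflang value: {code!r}")
    return [
        f'<link rel="alternate" hreflang="{code}" href="{url}">'
        for code, url in versions.items()
    ]


# Hypothetical German-language versions for Germany, Austria and Switzerland.
versions = {
    "de-DE": "https://www.example.com/de/",
    "de-AT": "https://www.example.com/at/",
    "de-CH": "https://www.example.com/ch/",
}

links = hreflang_links(versions)
for link in links:
    print(link)
```

Generating the same block from one central mapping and adding it to every version of the page also helps with the third error, because there is only a single place where the references are defined.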

When crawling your website, Ryte Website Success always checks how you used the hreflang attributes. Under "Multilingual Settings" → "Languages", you can check the languages that are referenced in a web document as its translation. This also helps you to easily identify pages that do not have any translation references.

Under "Multilingual settings" → "Status Codes", you should pay special attention to the 3xx and 4xx status codes. Always try and avoid referencing non-existent pages or pages that have redirects. Clicking on the respective status code shows you a table with the corresponding references.


Figure 9: Identify referenced redirects or faulty pages

Conclusion

Duplicate content is a very broad topic for which there are various solutions. To find the right solution, you first need to identify the type of duplicate content on your website. Whereas internal duplicate content will cost you valuable potential, external duplicate content can have dire consequences for your website.

Important rules to observe when dealing with duplicate content:

  • Where possible, use only permanent 301 redirects

  • Stick to a standard URL structure, e.g., all URLs end with a trailing slash or with .html

  • Define a standard domain – pick a specific domain variation

  • Use the hreflang tag for international websites

  • In case of collaborations, pay attention to how the content is used and offer different variants

  • Avoid recurring text blocks

In most cases, the perfect technical solution means a lot of work. Nonetheless, it is usually the most sustainable and scalable way to eliminate duplicate content.


Published on Aug 16, 2016 by Stephan Walcher