Websites are said to have duplicate content when identical or very similar content appears on different URLs. Such duplicates offer no added value to search engines or website visitors.
This article will show you the various causes of duplicate content and why it is important to differentiate between the different types of duplicates. You will also receive tips on how to prevent or remove duplicate content.
According to Google:
"Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar."
(Source: Google Search Console Help)
This therefore refers to content that can be found on different URLs or on different domains.
The most common causes of duplicate content include:

- URL variants created by GET parameters, session IDs, or tracking parameters
- The same page being reachable both with and without "www", or via both http and https
- Printer-friendly or mobile versions of a page on separate URLs
- Product descriptions taken unchanged from a manufacturer's data feed
The discussion about a duplicate content penalty is often an integral part of the conversation about duplicate content. The assumption is that Google actively penalizes a website if its content is detected on multiple URLs. This is what John Mueller, a Google Webmaster Trends Analyst, says about the supposed penalty:
"There's no reason to penalize a website for having that & certainly at Google there's no duplicate content penalty when it comes to your own content." Source
According to Google, duplicate content on the same domain is therefore not a reason for a penalty. However, you waste a lot of potential with such duplicates. Google always tries to provide the best possible result for every search query. If the best result is available on different URLs, the Google algorithm attempts to identify the best one. Ideally, this will be the URL you intend to rank; however, the algorithm can also settle on a completely wrong URL.
A large number of duplicates on a website leads to excessive use of crawl resources as the search engine tries to process them. In the worst case, this can delay the indexing of new content. You should therefore implement technical measures to avoid duplicate content on your website.
Figure 1: 40% of the domain’s content consists of duplicates that use up the crawl budget unnecessarily
On the other hand, duplicate content across domains is evaluated differently by Google.
A search engine is not able to tell whether content was duplicated intentionally in order to deliberately manipulate the search results.
"However, in some cases, content is deliberately duplicated across domains in an attempt to manipulate search engine rankings or win more traffic. Deceptive practices like this can result in a poor user experience, when a visitor sees substantially the same content repeated within a set of search results." Source
If Google detects identical content on different domains, it tries to identify the original version and excludes the duplicates from the search results:
Figure 2: Google hides duplicate content from the search results
Google then threatens to enforce severe penalties if it suspects that the cross-domain duplicates were created to deliberately manipulate the search results. In the worst case, this could mean general measures against the entire domain:
"... duplicate content may be shown with intent to manipulate our rankings and deceive our users, we'll also make appropriate adjustments in the indexing and ranking of the sites involved. As a result, the ranking of the site may suffer, or the site might be removed entirely from the Google index, in which case it will no longer appear in search results." Source
Google makes a distinction between internal and external duplicate content. But how can you differentiate between the two?
Ryte's Website Success enables you to very easily identify internal duplicates. Simply go to "Content" → "Duplicate Content" → "Duplicates". The report lists all duplicates that have been detected on the website as well as the number of URLs with the same content. Clicking on the magnifying glass in the "Duplicate Contents (Counter)" column shows you all URLs with duplicate content.
Figure 3: Identify internal duplicates using Ryte Website Success
Tip: The zoom view also allows you to inspect and export all duplicates and affected URLs, helping you derive appropriate measures for all the duplicates on your website. Simply click on the gear icon in the table and choose the CSV export option to extract the data.
You must ask yourself three important questions in order to identify external duplicate content:
1. Who is the creator of the content?
2. Is this content also used in other sections of the website?
3. Are there any partnerships or similar groups that use this content?
First, you need to find out where the content on your website originated. Do you have your own editorial team, or was the content purchased? Product descriptions in online shops, in particular, often originate from an automated data feed and should therefore be regarded as potential duplicate content. In such cases, it is advisable to write your own text for your most important products. However, if your inventory changes constantly, it can be advisable to exclude the product pages from the search engine index and instead create optimized category and landing pages on which the products are listed.
Figure 4: Exact copies of the product description in numerous online shops
Big or international companies often market their products through different channels and in different countries. They often use the same descriptive texts and slogans at different points on the website in order to convey a consistent brand message.
You should therefore look at the different areas of the company that market the same product. If the same language is used for different countries, you should use the hreflang tag. For instance, this can be used to tell search engines that the English text is meant for different countries.
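As a sketch, a product page offered in English for several countries could declare its alternatives in the `<head>` of every version as follows (the domain and paths are hypothetical placeholders):

```html
<!-- Placed in the <head> of every language/country version;
     each version must list all alternates, including itself -->
<link rel="alternate" hreflang="en-us" href="https://example.com/us/office/" />
<link rel="alternate" hreflang="en-gb" href="https://example.com/uk/office/" />
<link rel="alternate" hreflang="en-au" href="https://example.com/au/office/" />
<!-- Fallback for all other English-speaking visitors -->
<link rel="alternate" hreflang="en" href="https://example.com/office/" />
```

This tells Google that the pages are deliberate country variants rather than duplicates, so the right version can be shown in each country's search results.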
Figure 5: Identical content used to market Microsoft Office 2016
With good internal communication, anyone who markets only their own products can keep duplicate content restricted to their own domains. However, this becomes difficult in the case of cooperations: many online shop systems offer all products, including the product descriptions, via a data feed. This makes it very easy for partners to generate duplicate content when integrating the products into their own shops.
Online shops should therefore offer a separate data feed for such cooperations. Instead of the descriptions used in one's own shop, it should contain individualized descriptive texts. Alternatively, you can ask partners to use a cross-domain canonical tag or to set the corresponding pages to noindex.
Figure 6: The product feed of an online shop on ebay.com
Duplicate content can also appear without your knowledge. This often happens when third parties use your web content without your permission. http://www.copyscape.com/ is a very popular tool for identifying websites that use your content without permission. If you are unable to reach the operators of such websites, you can request removal of the copied content in the Google Search Console via the DMCA dashboard.
Figure 7: DMCA Dashboard in the Google Search Console
Internal duplicate content wastes valuable potential. You should therefore not only reduce the number of duplicates on your website, but also implement the necessary technical measures.
There are different technical solutions with which you can avoid duplicate content. However, not all are suitable for solving the problem at its source. Therefore, you need to ask yourself the following:
1. Can I avoid duplicate content altogether, e.g., by avoiding unnecessary GET parameters?
2. Can I use a 301 redirect that points to the original version?
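As a sketch of the second question, a 301 redirect can be set up on an Apache server via .htaccess; the paths and domain below are hypothetical examples:

```apacheconf
# .htaccess: permanently redirect duplicates to the original version
RewriteEngine On

# Redirect a printer-friendly duplicate to the canonical article
RewriteRule ^article/print/(.*)$ /article/$1 [R=301,L]

# Consolidate www and non-www versions (example.com is a placeholder)
RewriteCond %{HTTP_HOST} ^www\.example\.com$ [NC]
RewriteRule ^(.*)$ https://example.com/$1 [R=301,L]
```

A 301 redirect removes the duplicate URL entirely and passes its link juice on to the original, which is why it is preferable to the masking solutions below.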
Only after you have implemented all possible measures based on these questions should you embark on the following solutions for removal of duplicate content. But be careful: "Fast solutions", especially for large-scale duplicate content, can quickly result in new problems.
Canonical tag: The canonical tag is a quick way to deal with duplicate content. You can place the tag exactly where it is needed, but you should never use it to get rid of large-scale duplicate content. With canonical tags, search engines must still crawl the respective URLs in order to see the tag. If your website has a large number of affected URLs, this can cost you a significant portion of your crawl resources.
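For illustration, a filtered duplicate of a category page could point to the original like this (the domain and path are hypothetical):

```html
<!-- In the <head> of the duplicate URL, e.g. /shoes/?color=red -->
<link rel="canonical" href="https://example.com/shoes/" />
```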
noindex: This meta robots tag prevents content from being indexed by Google. As with the canonical tag, you should not use noindex to eliminate large-scale duplicate content, since Google still has to crawl these URLs and thus wastes valuable crawl resources.
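A minimal example of the tag, placed in the `<head>` of the page that should stay out of the index:

```html
<!-- Keep this page out of the index, but let crawlers follow its links -->
<meta name="robots" content="noindex, follow" />
```

The optional `follow` directive ensures that links on the excluded page are still followed, so link juice continues to flow through it.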
Robots.txt: Using the robots.txt prevents search engines from accessing the duplicates, but you also prevent the link juice from being passed on to the respective URLs.
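A short robots.txt sketch blocking typical duplicate URL patterns (the paths and parameter name are hypothetical examples):

```text
# robots.txt: block crawling of duplicate URL patterns
User-agent: *
Disallow: /print/
Disallow: /*?sessionid=
```

Note that blocked URLs can still appear in the index if they are linked externally; robots.txt only prevents crawling, not indexing.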
NoFollow: The nofollow attribute tells search engines not to follow a link. However, it does not prevent them from indexing the target URL, since the duplicate content can also be linked from other places, both internally and externally.
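For completeness, the attribute is set per link (the URL is a hypothetical placeholder):

```html
<!-- Hint that search engines should not follow this particular link -->
<a href="https://example.com/article/print/42/" rel="nofollow">Print version</a>
```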
Technically, a website that has the same content in the same language but in different countries still has duplicate content. This is often the case for German websites that are used in Germany, Switzerland, and Austria.
Google enables you to solve this problem with the hreflang attribute, which references all language/country versions from each affected URL. The hreflang attribute is a very powerful tool, and you should be very careful to avoid errors when using it; simple mistakes can result in significant ranking drops in a specific country. Common errors to avoid include:

- Incorrect language or country codes, e.g., "en-uk" instead of the ISO code "en-gb"
- Missing return links: every referenced version must link back to all other versions
- Missing self-references: each URL should also reference itself
- References to non-existent pages or pages that redirect
When crawling your website, Ryte Website Success always checks how you used the hreflang attributes. Under "Multilingual Settings" → "Languages", you can check the languages that are referenced in a web document as its translation. This also helps you to easily identify pages that do not have any translation references.
Under "Multilingual settings" → "Status Codes", you should pay special attention to the 3xx and 4xx status codes. Always try and avoid referencing non-existent pages or pages that have redirects. Clicking on the respective status code shows you a table with the corresponding references.
Figure 9: Identify referenced redirects or faulty pages
Duplicate content is a very broad topic for which there are various solutions. Identifying the perfect solution for you requires you to identify the type of duplicate content on your website. Whereas internal duplicate content will cost you valuable potential, external duplicate content can have dire consequences for your website.
Important rules to observe when dealing with duplicate content:

- First identify whether you are dealing with internal or external duplicates.
- Solve the problem at its source, e.g., by avoiding unnecessary GET parameters or using 301 redirects, before resorting to canonical tags or noindex.
- Be wary of quick fixes for large-scale duplicate content; they can waste crawl resources and create new problems.
In most cases, the perfect technical solution often means a lot of work. Nonetheless, this is often the most sustainable and scalable solution for eliminating duplicate content.
Identify duplicate content with Ryte for FREE
Published on 08/16/2016 by Stephan Walcher.
Stephan Walcher is an SEO specialist who has been active in the online marketing field since 2007. He has worked as an in-house SEO specialist for MSN and Bing, as head of SEO consulting at the Catbird Seat online marketing agency, as senior SEO manager at 1&1 Mail & Media GmbH, and later as Head of Product Management at Ryte. In January 2017, he joined One Advertising AG as Team Leader Travel SEO.