How to Find Duplicate Content on Your Website

Duplicate content can negatively affect search engine optimization. When two pages of a website are too similar, Google does not know which page is more important, and therefore does not know which page should be included in the index.

What is duplicate content and what are the causes?

Duplicate content is created when the same content of a page can be found on different URLs. There are many reasons why duplicate content can arise on websites. For example:

There are two versions of a URL because a print-friendly version exists.
There are two versions of a URL because one is served over http and one over https.
The same or very similar content appears in several categories (for example in online shops).
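
As a rough illustration of the first two causes, the following Python sketch groups URLs that differ only in their scheme or in a print parameter. The parameter name `print` and the example URLs are assumptions made for this sketch; your site may mark print versions differently.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical query parameter that switches a page to its print view.
IGNORED_PARAMS = {"print"}


def normalize_url(url: str) -> str:
    """Map URL variants (http/https, print view) to one comparable form."""
    parts = urlsplit(url)
    # Keep every query parameter except the ones we consider cosmetic.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    ]
    # Force the scheme to https and drop the fragment.
    return urlunsplit(("https", parts.netloc.lower(), parts.path, urlencode(query), ""))


urls = [
    "http://example.com/shoes",
    "https://example.com/shoes",
    "https://example.com/shoes?print=1",
    "https://example.com/about",
]

# Group the original URLs by their normalized form.
variants = defaultdict(list)
for url in urls:
    variants[normalize_url(url)].append(url)

for normalized, originals in variants.items():
    if len(originals) > 1:
        print(normalized, "has variants:", originals)
```

URLs that end up in the same group are only candidates for duplicate content. Whether they really serve the same content still has to be checked.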

If you have duplicate content on your site, you should make sure it’s eliminated. There are many ways of doing this, for example with a canonical tag. First, however, it’s important to identify where you have duplicate content on your website. In this article, we’ll show you how.

How to identify duplicate content

It’s easy to identify duplicate content with Ryte’s report. The report identifies duplicate content on your website using a fingerprint that is calculated from each page’s content. The fingerprint is based only on the content, not on the source code. Additionally, we remove all numbers before calculating the fingerprint, as a single number could otherwise change it. If a page displays changing metrics, for example “How fast was the site loaded?”, the fingerprint would be different each time. Excluding numbers prevents this.
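
The idea of a number-free fingerprint can be sketched in a few lines of Python. This is not Ryte’s actual implementation: the SHA-256 hash, the lowercasing, and the whitespace normalization are assumptions chosen for illustration.

```python
import hashlib
import re


def content_fingerprint(text: str) -> str:
    """Return a hash of the page text with all digits removed."""
    # Drop digits so that changing numbers (load times, counters) do not matter.
    without_numbers = re.sub(r"\d+", "", text)
    # Lowercase and collapse whitespace (an assumption, not part of the article).
    normalized = " ".join(without_numbers.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


page_a = "Welcome to our shop. Page loaded in 120 ms."
page_b = "Welcome to our shop. Page loaded in 87 ms."
page_c = "Welcome to our blog. Page loaded in 120 ms."

print(content_fingerprint(page_a) == content_fingerprint(page_b))  # True
print(content_fingerprint(page_a) == content_fingerprint(page_c))  # False
```

The first two texts differ only in the load time, so they receive the same fingerprint. The third differs in one word and therefore gets a different one.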

We then compare this fingerprint with the fingerprints of other pages. As soon as we find another URL with the same fingerprint, we inform our users about it in the “Duplicate Content Report”. We only compare indexable pages: pages that point to another page with a canonical tag, or that are excluded in a similar way (robots.txt block, noindex, …), are not included.
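
A minimal Python sketch of this comparison step might look as follows. The `pages` structure, the `indexable` flag, and the `fingerprint` helper are hypothetical stand-ins, not the report’s real data model.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(text: str) -> str:
    # Hypothetical stand-in for the content fingerprint: digits removed, then hashed.
    cleaned = " ".join(re.sub(r"\d+", "", text).lower().split())
    return hashlib.sha256(cleaned.encode("utf-8")).hexdigest()


def find_duplicates(pages: dict) -> list:
    """Group indexable URLs that share the same fingerprint."""
    groups = defaultdict(list)
    for url, page in pages.items():
        # Skip pages that are canonicalized elsewhere, blocked, or set to noindex.
        if not page["indexable"]:
            continue
        groups[fingerprint(page["text"])].append(url)
    # Only fingerprints shared by at least two URLs count as duplicates.
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


pages = {
    "https://example.com/shoes": {"text": "Red running shoes", "indexable": True},
    "https://example.com/shoes?print=1": {"text": "Red running shoes", "indexable": True},
    "https://example.com/shoes-old": {"text": "Red running shoes", "indexable": False},
    "https://example.com/about": {"text": "About us", "indexable": True},
}

print(find_duplicates(pages))
```

Here the two indexable shoe URLs are reported as one duplicate group, while the non-indexable `shoes-old` page is left out of the comparison.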

What is “Similar Content”?

After reading the definition of “Duplicate Content”, you’ll recognize that just one small variation in the text means that two pages are no longer classed as duplicate content. That’s why we also have a second report: “Similar Content”.
The goal is to detect very similar pages that differ, for example, in only 2-3 sentences and do not offer any added value. Another example would be product pages such as “Adidas Shoe Size 39” and “Adidas Shoe Size 40”: the only difference is the indicated size, so the second page offers no added value.
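
One common way to measure how similar two texts are is the Jaccard similarity of their word shingles (overlapping word sequences). The article does not say which method the “Similar Content” report uses, so the Python sketch below is only an illustration. The shingle size of 3 and the example texts are assumptions.

```python
def shingles(text: str, size: int = 3) -> set:
    """Return the set of overlapping word sequences of the given size."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard_similarity(a: str, b: str, size: int = 3) -> float:
    """Share of shingles that two texts have in common (0.0 to 1.0)."""
    set_a, set_b = shingles(a, size), shingles(b, size)
    return len(set_a & set_b) / len(set_a | set_b)


shoe_39 = "Adidas running shoe with cushioned sole and breathable mesh upper in size 39"
shoe_40 = "Adidas running shoe with cushioned sole and breathable mesh upper in size 40"
returns = "Our returns policy lets you send back any order within thirty days"

print(round(jaccard_similarity(shoe_39, shoe_40), 2))  # 0.83
print(jaccard_similarity(shoe_39, returns))            # 0.0
```

The two product texts share 10 of their 12 distinct shingles, so they score about 0.83, while the unrelated text scores 0.0. Where to draw the line between “similar” and “different” is a judgment call that depends on the site.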

This is how the report is laid out: the graph shows the number of similar pages (near duplicates) found per page.

Figure 1: Find duplicate content

If you click on the more detailed view (the magnifying glass next to the number in the list), you will see the pages with duplicate content.

Figure 2: Find similar content

If you are acquainted with Google’s patents, you’ll know that “Similar Content” algorithms are very important. They help search engines adjust their crawlers. If a site repeatedly displays similar content that does not provide any added value, search engines will prefer to invest their resources in domains where they are more likely to find valuable content.

If a page resembles a previously crawled page too closely, crawlers may ignore that page and the links on it.

How to solve the issue of duplicate content

All in all, it’s easy to solve the issue of duplicate content. After identifying your duplicate and similar URLs, you should first decide whether both URLs are really necessary: would it be enough to combine the content into one page? If both URLs are necessary, the easiest way to solve the issue is to add a canonical tag pointing to the most relevant page. A canonical tag shows Google which page is more important, and therefore which page should be indexed. You can find more ways of handling duplicate content in this article.
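
To check whether a page already carries a canonical tag, a short Python sketch using only the standard library could look like this. The example markup and the URL `https://example.com/adidas-shoe` are made up for illustration.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalFinder(HTMLParser):
    """Collects the href of the first <link rel="canonical"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only look at <link> tags, and stop after the first canonical found.
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")


def find_canonical(html: str) -> Optional[str]:
    """Return the canonical URL declared in the HTML, or None."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


# Hypothetical product page whose size variant points to the main product URL.
page = """
<html>
  <head>
    <title>Adidas Shoe Size 40</title>
    <link rel="canonical" href="https://example.com/adidas-shoe">
  </head>
  <body>...</body>
</html>
"""

print(find_canonical(page))  # https://example.com/adidas-shoe
```

The `<link rel="canonical" href="...">` element in the page’s `<head>` is the canonical tag itself. The surrounding code only reads it back so you can verify which URL each duplicate points to.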

Conclusion

Duplicate and similar content can negatively affect your website from an SEO point of view, as it is not clear to Google which URL is more important. As a result, you might find that the less important page ends up in the index. With Ryte’s report, you can easily identify your duplicate and similar content.

Published on Nov 12, 2018 by Olivia Willson

Olivia Willson

After studying at King’s College London, Olivia moved to Munich, where she joined the Ryte team till 2021. She was previously in charge of product marketing and CRO, and also helped out with SEO and content marketing. When she's not working, you can usually find her outside, either running around a track, or hiking up a mountain.