Shingle Algorithm

The Shingle algorithm measures how similar two texts are and can therefore be used to check whether a text is unique. Any text passage can be compared with any other and checked for duplicate content.

The algorithm

Step 1: Normalize text

The text must be available as plain text. Website content is embedded in HTML code, so all markup and formatting must be removed before the algorithm can be applied meaningfully. Optionally, filler words that can be used to pad a text artificially, such as “nevertheless,” can also be deleted.
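The normalization step could be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the names `normalize` and `FILLER_WORDS` and the filler-word list itself are assumptions made for this example, and lowercasing and dropping punctuation are additional choices that the description above does not prescribe.

```python
import re
from html.parser import HTMLParser

# Hypothetical filler-word list; a real list would be longer and language-specific.
FILLER_WORDS = {"nevertheless", "however", "actually"}


class _TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def normalize(html, drop_fillers=True):
    """Return the lowercase words of an HTML fragment, without markup."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    # Joining with a space keeps words from adjacent elements apart.
    text = " ".join(extractor.parts).lower()
    # \w+ keeps letters, digits and underscores; punctuation is dropped.
    words = re.findall(r"\w+", text)
    if drop_fillers:
        words = [word for word in words if word not in FILLER_WORDS]
    return words


page = "<p>This is <b>not</b> a creative text, nevertheless it suffices.</p>"
print(normalize(page))
```

The result is a plain list of words, which is the form the next step needs.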

Step 2: Divide text into shingles

Shingles are overlapping sequences of consecutive words from the text, each with the same fixed number of words. They overlap one another like the shingles on a roof. A short example with length 3, using the sentence “This is not a creative text, but totally suffices”:

Shingle 1 = This, is, not

Shingle 2 = is, not, a

Shingle 3 = not, a, creative

Shingle 4 = a, creative, text

...

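If the shingle length is too large, duplicates are overlooked, because a single changed word alters every shingle that contains it. If the length is too small, a text may quickly be evaluated as duplicate content, because short word sequences also occur in unrelated texts.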
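Dividing a text into shingles can be sketched in a few lines of Python. The function name `shingle_list` and the parameter `w` (the shingle length in words) are assumptions for this example, as is the simple lowercase word splitting.

```python
import re


def shingle_list(text, w=3):
    """Return all w-word shingles of the text, in order of appearance."""
    words = re.findall(r"\w+", text.lower())
    # A text with fewer than w words yields no shingles at all.
    return [tuple(words[i:i + w]) for i in range(len(words) - w + 1)]


sentence = "This is not a creative text, but totally suffices."
for number, shingle in enumerate(shingle_list(sentence), start=1):
    print(f"Shingle {number} = {', '.join(shingle)}")
```

A sentence of n words produces n - w + 1 shingles, so the nine-word example sentence yields seven shingles of length 3.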

Step 3: Compare shingles of different texts

A simple calculation is sufficient to determine how closely two texts match. Two quantities are determined: the number of shingles that occur in both texts (the intersection) and the number of distinct shingles that occur in at least one of the two texts (the union). The degree of matching is then calculated by dividing the size of the intersection by the size of the union.

If two identical texts are compared, the result is 1 and thus a 100% match. If the texts do not share a single shingle, the numerator is 0 and the result is 0%.
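The comparison can be sketched in Python using sets. The names `shingles` and `resemblance` are assumptions for this example, and returning 0 when neither text is long enough to produce a shingle is a choice made here, not part of the description above.

```python
import re


def shingles(text, w=3):
    """Return the set of distinct w-word shingles of the text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(text_a, text_b, w=3):
    """Shared shingles divided by all distinct shingles of both texts."""
    a, b = shingles(text_a, w), shingles(text_b, w)
    union = a | b
    if not union:
        # Assumption: texts too short to form a shingle count as no match.
        return 0.0
    return len(a & b) / len(union)


original = "This is not a creative text, but totally suffices."
variant = "This is not a creative text, but it is enough."

print(resemblance(original, original))  # identical texts give 1.0
print(resemblance(original, variant))   # 5 shared of 10 distinct shingles: 0.5
```

For the same pair of sentences, a shingle length of 1 gives a higher value (7 of 11) and a length of 5 a lower one (3 of 8), which illustrates how the chosen length shifts the result.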

Relevance to SEO

The uniqueness of a text is one criterion by which search engines evaluate websites. It is conceivable that Google uses the Shingle algorithm. A simpler way to recognize duplicate content is the PHP function similar_text(), which calculates the similarity of two strings.

Web Links

Syntactic Clustering of the Web