The Shingle algorithm can be used to recognize whether a text is unique. A text passage can be compared with any other and checked for duplicate content.
The input must be plain text. Website content is embedded in HTML code, so all markup and formatting must be removed before the algorithm can be applied meaningfully. It is also possible to delete fill words such as “nevertheless,” which can be used to pad a text artificially.
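The preparation step described above could look roughly like the following minimal sketch. It uses only Python's standard library; the function name html_to_words and the FILL_WORDS list are hypothetical and chosen purely for illustration, and a real fill-word list would be far longer.

```python
import re
from html.parser import HTMLParser

# Hypothetical fill-word list, for illustration only.
FILL_WORDS = {"nevertheless", "however", "actually"}


class _TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_words(html, fill_words=FILL_WORDS):
    """Strip markup, lowercase the text, and drop fill words."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with a space keeps words from adjacent elements apart.
    text = " ".join(parser.parts).lower()
    words = re.findall(r"\w+", text)
    return [w for w in words if w not in fill_words]


page = "<p>This is <b>not</b> a creative text, nevertheless it suffices.</p>"
print(html_to_words(page))
```

Lowercasing and dropping punctuation are additional assumptions of this sketch; they keep purely typographic differences from hiding a match.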
Shingles are overlapping word sequences of a fixed length, taken from the text. Each one overlaps the next, much like shingles on a roof. A short example with length 3, using the sentence “This is not a creative text, but totally suffices.”:
Shingle 1 = This, is, not
Shingle 2 = is, not, a
Shingle 3 = not, a, creative
Shingle 4 = a, creative, text
Shingle 5 = creative, text, but
Shingle 6 = text, but, totally
Shingle 7 = but, totally, suffices
The shingle length has to be chosen with care. If it is too long, duplicates are overlooked. If it is too short, a text may be flagged as duplicate content too readily.
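As a minimal sketch, shingles of a given length could be generated as follows. The function name shingles and the parameter k (the shingle length) are assumptions made for this illustration, as is the choice to lowercase the text and drop punctuation before splitting it into words.

```python
import re


def shingles(text, k=3):
    """Return the k-word shingles of text, in order of appearance."""
    words = re.findall(r"\w+", text.lower())
    # Each shingle starts one word after the previous one.
    # A text with fewer than k words yields an empty list.
    return [tuple(words[i:i + k]) for i in range(len(words) - k + 1)]


sentence = "This is not a creative text, but totally suffices."
for number, shingle in enumerate(shingles(sentence), start=1):
    print(f"Shingle {number} = {', '.join(shingle)}")
```

The example sentence has nine words, so a length of 3 yields seven shingles. Raising k makes each shingle more specific, while lowering it makes chance matches between unrelated texts more likely.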
A simple calculation is sufficient to determine how closely two texts match. First, two quantities are determined: the number of shingles that occur in both texts (the intersection) and the number of distinct shingles that occur in either text (the union). The first number is then divided by the second. The percentage is thus the number of matching shingles divided by the total number of distinct shingles.
If two exactly identical texts are compared, the result is 1 and thus a 100% match. If not a single shingle is identical, the numerator is 0, which gives a result of 0%.
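The calculation can be sketched in a few lines of Python using sets. The function names shingle_set and similarity are hypothetical, and returning 0 for texts too short to form a single shingle is an assumption of this sketch rather than part of the algorithm.

```python
import re


def shingle_set(text, k=3):
    """Return the set of distinct k-word shingles in text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def similarity(text_a, text_b, k=3):
    """Shared shingles divided by all distinct shingles of both texts."""
    a = shingle_set(text_a, k)
    b = shingle_set(text_b, k)
    union = a | b
    if not union:
        # Assumption: texts too short to form a shingle count as no match.
        return 0.0
    return len(a & b) / len(union)


original = "This is not a creative text"
variant = "This is not a boring text"
print(f"{similarity(original, original):.0%}")
print(f"{similarity(original, variant):.0%}")
```

In the second comparison, two of the six distinct shingles occur in both texts, so the result is one third.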
The uniqueness of a text is one criterion by which search engines evaluate websites. It is conceivable that Google uses the Shingle algorithm. A simpler way to recognize duplicate content is the PHP function similar_text(), which calculates the similarity of two strings.