Search Engine Optimization has been around for many years. So far, however, our focus has been primarily on rankings and backlinks. We have monitored and exchanged links, bought links, and built links ourselves. But for years now, the search engines have been trying to convince us of a slightly different approach: "You need good and relevant content that is especially valuable for your users." And so we started to write content.
First there was content with many keywords, then content with fewer keywords. And then so few keywords that the keyword density was always less than 4% :) But it would be naive to think that search engines like Google use a simple keyword density calculation. C'mon! Who needs keyword density if you can have machine learning? So the time has come to rethink and to think algorithmically – just as search engines do. Do you know which content currently ranks in the top search results, and why?
We of course all know that writing good content is key – but wouldn't an algorithmic approach be the icing on the cake? Wouldn't it be great to know which terms to use in order to rank better? Well, why don't we take a look at this approach.
But let's go back to the beginning and see what it actually means to rank, and how search engines look at web documents.
Let's write some code, put it on a website, link it to other pages or even to other domains, and make sure that search engines can crawl and index it. After a while (if you did everything right) you'll notice that the document has been indexed and can now be found in search engines.
There is a large number of important factors that affect your document's ranking. Rumor has it there are something like 400 on-page factors and about 200 off-page factors. Honestly, that's a huge number.
Important factors mainly are:
Webmasters can change all of these factors, because they have them at their disposal.
Great News for you: You control the quality of your documents by yourself.
Long ago we tried to change our content and manipulate the keyword density of a document.
Remember the tools that told you how many keywords were in a text and what percentage they made up? Guilty!
One of the big challenges in information retrieval is to interpret and evaluate the real meaning of website content. Machines like Google want to understand the text rather than just store keywords in a database or count them. And we should remember that all machine algorithms are based on mathematics! So the question is: how do you mathematically calculate the intent of a certain article? Engineers at Google have been researching this topic for a very long time – while some SEOs were stuck in the keyword density era and relied on metrics that are worthless from an information retrieval perspective.
“The field of information retrieval has come a long way in the last forty years, and has enabled easier and faster information discovery. In the early years there were many doubts raised regarding the simple statistical techniques used in the field. However, for the task of finding information, these statistical techniques have indeed proven to be the most effective ones so far.”
Amit Singhal, former senior VP & software engineer at Google
So smart SEOs started to be sceptical about keyword density and started asking questions like:
“How can I determine the term frequency within a certain information retrieval system?”
Calculating the term weight isn't a big deal. If you consult a book on information retrieval or read a university doctoral thesis, you will very often find this formula:
To explain this best we will split the term in two parts so that you can get the full understanding of the potential of this formula.
TF = (Number of times the keyword appears in the document) / (Total number of words in document)
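This first half of the formula can be sketched in a few lines of Python. It is a simplified illustration – real systems tokenize and normalize text far more carefully:

```python
def term_frequency(term, document):
    """Relative term frequency: occurrences of the term
    divided by the total number of words in the document."""
    words = document.lower().split()
    return words.count(term.lower()) / len(words)

# A toy document of 8 words, two of them "cupcake":
doc = "the cupcake store sells the best cupcake recipes"
print(term_frequency("cupcake", doc))  # → 0.25
```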
Assuming there were only a single document within an information retrieval system, the most unambiguous term signal could very well be retrieved from keyword density and a whole lot of stop words. But as we all know, there is a huge number of documents on the World Wide Web.
In order to determine a term’s weighting in our document within an Information Retrieval System, we need to put it into relation to other documents of the index:
The textbook inverse document frequency IDF(i) is the logarithm to base 10 of the quotient of the document corpus size N(D) and the number of documents F(i) that contain the term (i): IDF(i) = log10( N(D) / F(i) ).
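A minimal sketch of this second half of the formula, with a hypothetical corpus size and document count (the numbers are made up purely for illustration):

```python
import math

def inverse_document_frequency(n_docs_total, n_docs_with_term):
    """IDF(i) = log10(N(D) / F(i)), where N(D) is the size of the document
    corpus and F(i) the number of documents containing the term (i)."""
    return math.log10(n_docs_total / n_docs_with_term)

# Hypothetical: 10,000,000 documents in the corpus, 1,000 contain the term
print(inverse_document_frequency(10_000_000, 1_000))  # → 4.0
```

The rarer a term is across the corpus, the larger its IDF – which is exactly why frequent stop words contribute almost nothing.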
Let's imagine you're reading a document where the word "cupcake" appears 3 times.
Anyone who has read carefully so far will now wonder…
A reasonable question. To determine the document corpus N(D), we note that for every term (i) from our document there is a certain number of results in the search engine index:
Consequently, this needs to be done for every single term (i) within a document (j):
As soon as our document contains a term (i), our document “competes” with (n) search engine results, meaning documents that also contain this term.
Every keyword found in your content is measured via the TF*IDF formula, resulting in a metric score referred to as the term weight. This term weight is used by information retrieval systems to determine the most important terms of a document. One of the advantages of this metric:
You do not have to care about stop words anymore.
And because it's a math-based calculation, it is applicable to any language in the world.
Broadly speaking, it can be said that the document corpus N(D) is the sum of all search engine results for all terms within a document.
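Putting both parts together, here is a toy sketch that weights every term of one document against a tiny three-document corpus (a real index, of course, contains billions of documents, and production systems use more refined variants of the formula):

```python
import math

def tf_idf_weights(doc, corpus):
    """Weight each term of `doc` by TF * IDF against `corpus`
    (a list of documents that includes `doc` itself)."""
    words = doc.lower().split()
    n = len(corpus)
    weights = {}
    for term in set(words):
        tf = words.count(term) / len(words)                       # term frequency
        df = sum(1 for d in corpus if term in d.lower().split())  # document frequency
        weights[term] = tf * math.log10(n / df)                   # TF * IDF
    return weights

corpus = [
    "the cupcake recipes with vanilla frosting",
    "the best cupcake store in town",
    "the weather report for today",
]
w = tf_idf_weights(corpus[0], corpus)
print(sorted(w, key=w.get, reverse=True))
```

Note what happens automatically: "the" occurs in every document, so its IDF – and therefore its weight – is exactly zero, while "vanilla" and "frosting" score highest. No stop word list required.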
But this is far from where the calculation and methodology end. If you think further, you will find even more details that need to be taken into consideration: special topics such as languages, synonyms, proximity / anti-proximity, and much more.
But enough with the theoretical and mathematical stuff, let’s see how we can use it practically.
Let's act like a search engine. If we were a search engine, we would ask a freshly crawled document the following questions:
And then there is a document which claims to be about "real estate". We've analyzed the term weighting and discovered that this is true.
Our next question to the document should be: Are you SPAM? There are a whole lot of documents in the index dealing with the same topic. And you gotta admit – most of them are crap (just think of keyword stuffing and content spinning…).
Asking for the "level of spam" is therefore legitimate: anybody who steps out of line is out. We need to consult a tried and tested collective. It is an economically reasonable method for getting a pretty reliable statement.
We will therefore again consult Ryte and perform a term weighting analysis on the basis of the top 10 search engine results. Apparently, the search engine has rated this collective as “valid”. Let’s see how it’ll turn out:
We get a spam level that is term-specific, collectively defined, and rated as valid by search engines at the level of term weight. Documents whose term weights lie above this curve are therefore spam – or at least do not carry the most unmistakable signal.
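As an illustration only (this is not Ryte's actual method, and the weights below are invented), a crude "spam level" check could compare each term weight in your document against the mean and spread of that term's weights across the top-ranking results:

```python
from statistics import mean, stdev

def spam_terms(doc_weights, top_weights, k=2.0):
    """Flag terms whose weight in our document lies more than k standard
    deviations above the collective's mean weight for that term.
    An illustrative outlier threshold, nothing more."""
    flagged = []
    for term, w in doc_weights.items():
        collective = [tw.get(term, 0.0) for tw in top_weights]
        limit = mean(collective) + k * stdev(collective)
        if w > limit:
            flagged.append(term)
    return flagged

# Three (instead of ten) hypothetical top results, for brevity:
top = [
    {"insurance": 0.10, "policy": 0.04},
    {"insurance": 0.12, "policy": 0.06},
    {"insurance": 0.09, "policy": 0.05},
]
print(spam_terms({"insurance": 0.30, "policy": 0.05}, top))  # → ['insurance']
```

A document that hammers "insurance" three times harder than the valid collective steps out of line; a term that stays within the curve does not.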
For Ryte users: it does make sense to weaken terms within the document in order to stay below the spam level. This has often resulted in documents being found in the specific rankings (mostly for the first time).
Back on topic: You are a search engine. There is a document in front of you. It says: “This is all about high risk life insurance” and also proves it. You put all others that weren’t able to prove this back in the waiting line.
The document also proves that it behaves correctly, just like "any other document": no term appears excessively often or in disparity with the collective. So it doesn't seem to be spam.
You raise an eyebrow and question yourself:
Dear document, can you prove your most unmistakeable signal?
You can determine such self-proof terms (also known as "proof keywords") by applying a mathematical filter to the collective term weights of a defined number of outstanding ranking results.
We again consult Ryte for this and apply the following filter:
From our Top 10 search engine results, we now get terms which obviously and unmistakably illustrate what a document with the most distinct signal is all about.
Please note: when analyzing lower-ranking search engine results, you will notice that the proof keywords become more "unreasonable" and their term weights weaker and less meaningful.
That's why your document should contain at least some of these proof keywords with a meaningful term weight in order to provide this self-proof.
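A hypothetical version of such a filter might keep terms that carry a meaningful weight in nearly all of the analyzed top results. The thresholds below are invented for illustration and are not Ryte's:

```python
def proof_keywords(top_weights, min_share=0.8, min_weight=0.05):
    """Keep terms whose weight reaches min_weight in at least min_share
    of the analyzed top results. Both thresholds are illustrative."""
    terms = set().union(*(tw.keys() for tw in top_weights))
    n = len(top_weights)
    return sorted(
        term for term in terms
        if sum(1 for tw in top_weights if tw.get(term, 0.0) >= min_weight) / n >= min_share
    )

# Hypothetical term weights from three top-ranking documents:
top = [
    {"mortgage": 0.08, "garden": 0.06},
    {"mortgage": 0.07},
    {"mortgage": 0.09, "garden": 0.01},
]
print(proof_keywords(top))  # → ['mortgage']
```

"mortgage" carries a meaningful weight in every top result and survives the filter; "garden" appears strongly only once and drops out.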
Let’s recap quickly:
Terrific! Now there’s only one question left: Is there an admission ticket for the Top 10?
If a search engine user requests a relatively generic term, the search engine operator can often only guess what is actually interesting for that user.
This is why it definitely makes sense to provide a collection of documents on the first page which deal with the various different aspects of the requested topic. Some time ago, we put forward the following hypothesis: "If our perspective on a topic changes, often the whole terminology changes." Many of the subsequent (mass) tests provided one fascinating finding:
Topical diversity is measurable.
Let's take a look at this fascinating topic with Ryte. This is what the term weighting graph (usually a distinctly negative exponential function) of a very good document (this one ranks first) looks like:
With a little practice and considering a great number of documents, you can easily tell that on the first page there are particularly:
It's one of the most crucial admission tickets for the Top 10:
If your document is too similar, nothing new and/or too superficial – why should it show up on the first page?
Remember the document knocking at your door? If it complies with these four basic factors, you have already formed a neat ONPAGE basis.
Considering your content, watch out for
In the end, there’s only one thing left to say:
Adore your copywriter. Let him write the most awesome material and reward him royally.
Your online content is the most precious, self-determined asset you have at your disposal.
It is not what's written here that determines what you are reading – you yourself determine what you are reading. We surely are responsible for not presenting you with utter nonsense here, but I'm not responsible for what you take away from it.
This article deals with a single ranking signal.
Take it easy, man. You will still need links. But still try to be just a little bit better.