Ryte

Algorithmical Inspirations for Better Writing

or why Text Optimization with TF*IDF is the key to success.


Search Engine Optimization has been around for many years. So far, however, our focus has been primarily on rankings and backlinks. We have monitored and exchanged links, bought links and built links ourselves. But for years now, the search engines have been trying to convince us of a slightly different approach: “You need good and relevant content that is especially valuable for your users.” And so we started to write content.

First there was content with many keywords, then with fewer keywords. And then with so few keywords that the keyword density was always less than 4% :) But it would be naive to think that search engines like Google rely on a simple keyword density calculation. C’mon! Who needs keyword density if you can have machine learning? So the time has come to rethink and to think algorithmically – just as search engines do. Do you know who and which content currently ranks in the top search results, and why?

We of course all know that writing good content is key – but wouldn’t an algorithmical approach be the icing on the cake? Wouldn’t it be great to know which terms to use in order to rank better? Well, why don’t we take a look at this approach.

But let’s go back to the beginning and let’s see what it actually means to rank and how search engines consider the web documents.

Your document

Let’s write some code, implement it into a website, link it to other pages or even to other domains and make sure that search engines can crawl and index it. After a while (if you did everything right) you notice that the document has been indexed and can now be found in search engines.

Important Ranking Factors

There is a large number of important factors which affect your document’s ranking. Rumor has it that there are something like 400 on-page factors and about 200 off-page factors. Honestly, that’s a huge number.

The most important factors are:

  • Internal links
  • External links
  • Website structure
  • Website architecture
  • Technology
  • User behavior
  • and many more…

All these factors can be changed by webmasters, because they have these at their disposal.

Great news for you: you control the quality of your documents yourself.

Why the new approach with Term Frequency?

Long ago we tried to change our content and manipulate the keyword density of a document.

Remember the tools that told you how many keywords were in a text and what percentage they made up? Guilty!

One of the big challenges for information retrieval is to interpret and evaluate the real meaning of website content. Machines like Google want to understand the text rather than just store keywords in a database or count them. And we should remember that all machine algorithms are based on mathematics! So the question is how to calculate the intent of a certain article mathematically. Engineers at Google have been researching this topic for a very long time. Meanwhile, some SEOs were stuck in the keyword density era and relied on metrics that are worthless from the perspective of information retrieval.

“The field of information retrieval has come a long way in the last forty years, and has enabled easier and faster information discovery. In the early years there were many doubts raised regarding the simple statistical techniques used in the field. However, for the task of finding information, these statistical techniques have indeed proven to be the most effective ones so far.”
Amit Singhal, former senior VP & software engineer at Google

So smart SEOs became sceptical about keyword density and started to ask questions like:

“How can I determine the term frequency within a certain information retrieval system?”

Term weighting within a simple information retrieval system

Calculating the term weight isn’t a big deal. If you consult a book on information retrieval or read a doctoral thesis, you will very often find this formula:

[Figure: TF formula]

To explain this best, we will split the formula into two parts so that you can fully understand its potential.

  • TF: Term Frequency - this measures how frequently the term is used in a single document. The longer the document, the more likely it is that the term will be used more often, so the raw count is divided by the document length. This helps to determine whether your keyword is stuffed into the document or not.

    TF = (Number of times the keyword appears in the document) / (Total number of words in document)

  • IDF: Inverse Document Frequency - this measures the importance of the specific term for the relevancy of the document within the corpus. Stop words such as “is”, “of”, and “the” will no doubt appear a fair few times yet carry little importance, so the calculation scales down such common terms and scales up the rare ones.
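
The TF ratio from the list above can be sketched in a few lines of Python. This is a minimal illustration, not the way any search engine or tool actually tokenizes text: the function name `term_frequency` and the simple word-splitting rule are assumptions made for this example.

```python
import re
from collections import Counter


def term_frequency(text):
    """Return {word: occurrences / total number of words} for one document."""
    # Assumption: a naive tokenizer that lowercases the text and splits on
    # anything that is not a letter, digit or underscore.
    words = re.findall(r"\w+", text.lower())
    if not words:
        return {}
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}


tf = term_frequency("The cupcake shop sells a cupcake and a muffin")
print(tf["cupcake"])  # 2 occurrences out of 9 words
```

Because every count is divided by the same document length, the TF values of all words in a document add up to 1, which makes documents of different lengths comparable.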

And what about the Inverse Document Frequency (IDF)?

Assuming there were only a single document within an Information Retrieval System, the most unambiguous term signal could very well be derived from keyword density and a whole lot of stop words. But as we all know, there is a huge number of documents on the World Wide Web.

Inverse Document Frequency

In order to determine a term’s weighting in our document within an Information Retrieval System, we need to relate it to the other documents in the index:

    IDF(i) = log10( N(D) / F(i) )

The textbook Inverse Document Frequency IDF(i) is the base-10 logarithm of the quotient of the size of the document corpus N(D) and F(i), the number of documents containing the term (i).

See this simple example:

Let’s imagine you’re reading a document of 100 words in which the word “cupcake” appears 3 times.

  • TF (Term Frequency) of “cupcake” is (3 / 100) = 0.03.
  • There are 10 million documents and the word “cupcake” appears in 1,000 of these.
  • IDF (Inverse Document Frequency) is calculated as log10(10,000,000 / 1,000) = 4.
  • Therefore, the TF*IDF term importance is 0.03 x 4 = 0.12.
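
The cupcake arithmetic can be checked with a short Python sketch. The function names are chosen for this example only. The second call is a hypothetical variation that is not part of the example above: it assumes the same word appeared in 1,000,000 documents instead of 1,000, to show how a more common term loses weight.

```python
import math


def tf(term_count, total_words):
    # Share of the document's words taken up by the term.
    return term_count / total_words


def idf(corpus_size, docs_with_term):
    # Base-10 logarithm, as in the textbook definition.
    return math.log10(corpus_size / docs_with_term)


def tf_idf(term_count, total_words, corpus_size, docs_with_term):
    return tf(term_count, total_words) * idf(corpus_size, docs_with_term)


# Values from the cupcake example: 3 of 100 words, 1,000 of 10 million documents.
cupcake = tf_idf(3, 100, 10_000_000, 1_000)
print(round(cupcake, 2))  # 0.12

# Hypothetical variation: the term appears in 1,000,000 documents instead.
common = tf_idf(3, 100, 10_000_000, 1_000_000)
print(round(common, 2))  # 0.03
```

The term frequency is identical in both calls. Only the IDF changes, dropping from 4 to 1, and the term weight drops with it.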

Anyone who has read carefully so far will now wonder…

How do we work out the document corpus?

A reasonable question. For the determination of the document corpus N(D), we note that there is a certain number of results for every term (i) from our document in the search engine index:

[Figure: document corpus]

Consequently, this needs to be done for every single term (i) within a document (j):
As soon as our document contains a term (i), our document “competes” with (n) search engine results, meaning documents that also contain this term.

All keywords found in your content are measured via the TF*IDF formula, which results in a score referred to as the term weight. This term weight is used by information retrieval systems to determine the most important terms of a document. One of the advantages of this metric:

You do not have to care about stop words anymore.

And because it's a math-based calculation, it is applicable to any language in the world.
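
To see why stop words take care of themselves, here is a small sketch that computes the term weights for every word in a tiny, made-up corpus of three sentences. The corpus, the tokenizer and the function names are assumptions for illustration. A real index is of course vastly larger. A word that appears in every document gets an IDF of log10(1) = 0, so its term weight is 0 no matter how often it is used.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: naive tokenizer that splits on non-word characters.
    return re.findall(r"\w+", text.lower())


def term_weights(documents):
    """Return one {term: TF*IDF} dict per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for words in tokenized for term in set(words))
    weights = []
    for words in tokenized:
        counts = Counter(words)
        weights.append({
            term: (count / len(words)) * math.log10(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Made-up mini corpus, for illustration only.
corpus = [
    "the cupcake is the best cupcake of the bakery",
    "the price of the house is high",
    "the insurance covers the house",
]
weights = term_weights(corpus)
top = sorted(weights[0].items(), key=lambda item: item[1], reverse=True)
print(top[0][0])          # cupcake
print(weights[0]["the"])  # 0.0
```

No stop word list is involved: “the” is the most frequent word in the first sentence, yet it ends up with the lowest weight because it occurs in all three documents.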

Broadly speaking, it can be said that the document corpus N(D) is the sum of all search engine results for all terms within a document.
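
One way to read this definition is sketched below. The result counts are invented numbers, not real search engine data, and the function names are hypothetical. The sketch simply takes F(i) to be the number of results for term (i) and N(D) to be the sum of those counts over all terms of the document.

```python
import math

# Invented result counts per term of one document (illustration only).
result_counts = {
    "cupcake": 1_000,
    "recipe": 9_000,
    "the": 90_000,
}


def corpus_size(counts):
    # N(D): the sum of the search results for all terms within the document.
    return sum(counts.values())


def idf_from_results(counts):
    # IDF(i) = log10(N(D) / F(i)), with F(i) taken from the result counts.
    n_d = corpus_size(counts)
    return {term: math.log10(n_d / f_i) for term, f_i in counts.items()}


idf_values = idf_from_results(result_counts)
print(corpus_size(result_counts))  # 100000
print(idf_values["cupcake"])       # 2.0
```

Under this reading, a term that accounts for most of the results, like “the” here, ends up with an IDF close to zero.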

Think about it

This is far from where the calculation and methodology end. If you think further, you will find even more details to consider: special topics such as languages, synonyms, proximity / anti-proximity and much more also need careful attention.

But enough with the theoretical and mathematical stuff, let’s see how we can use it practically.

The practical Usage of Term frequency

Let’s act like a search engine. If we were a search engine, we would ask the following questions about a new document that we’ve crawled:

  • Is it spam?
  • How holistic is the document?
  • What topic is it about?
  • Is the document useful at all?

And then there is a document, which claims to be about “real estate”. We’ve analyzed the term weighting and discovered this is true.

[Figure: term weighting for real estate]

Our next question to the document should be: Are you SPAM? There are a whole lot of documents in the index dealing with the same topic. And you gotta admit – most of them are crap (just think of keyword stuffing and content spinning…).

Asking for the “level of spam” is therefore legit: anybody who steps out of line is out. To judge that, we need a tried and tested collective to compare against. It is an economical way to get a pretty reliable answer.

We will therefore again consult Ryte and perform a term weighting analysis on the basis of the top 10 search engine results. Apparently, the search engine has rated this collective as “valid”. Let’s see how it’ll turn out:

[Figure: term weighting analysis]

We get a spam level that is term-specific, collectively defined and rated as valid by search engines on the level of term weight. Documents whose term weights lie above this curve are therefore spam – or are not related to the most unmistakeable signal.

For Ryte users: It does make sense to weaken terms within the document in order to stay within the spam level. This has often resulted in documents showing up in the relevant rankings (mostly for the first time).
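
The idea of a collectively defined spam level can be sketched as follows. This is not how Ryte computes its curve, which is not specified in this article. As a stand-in, the sketch assumes that the ceiling for each term is the highest weight that any of the top-ranking documents gives it. All term weights and function names are made up for illustration.

```python
def spam_levels(top_documents):
    """Per-term ceiling: the highest weight any reference document gives the term."""
    levels = {}
    for weights in top_documents:
        for term, weight in weights.items():
            levels[term] = max(levels.get(term, 0.0), weight)
    return levels


def terms_above_spam_level(document, top_documents):
    """Return {term: (own weight, ceiling)} for terms that exceed the ceiling."""
    levels = spam_levels(top_documents)
    return {
        term: (weight, levels[term])
        for term, weight in document.items()
        # Terms unknown to the reference documents have no ceiling here.
        if term in levels and weight > levels[term]
    }


# Made-up term weights of two top-ranking documents and of our own document.
top_documents = [
    {"real": 0.10, "estate": 0.12, "agent": 0.04},
    {"real": 0.08, "estate": 0.09, "mortgage": 0.05},
]
my_document = {"real": 0.09, "estate": 0.20, "villa": 0.03}

print(terms_above_spam_level(my_document, top_documents))
# {'estate': (0.2, 0.12)}
```

In this toy example, “estate” would be the term to weaken, while “real” stays within the level set by the reference documents.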

Can you prove what it is all about?

Back on topic: You are a search engine. There is a document in front of you. It says: “This is all about high risk life insurance” and also proves it. You put all others that weren’t able to prove this back in the waiting line.

The document also proves that it behaves correctly, just like “any other document”. No term appears in an excessive or disproportionate amount. So it doesn’t seem to be spam.

You raise an eyebrow and ask:

Dear document, can you prove your most unmistakeable signal?

You can determine such self-proof terms (also known as “Proof Keywords”) by applying a mathematical filter to the collective term weights of a defined number of top-ranking results.

We again consult Ryte for this and apply the following filter:

[Figure: Proof Keywords filter]

From our Top 10 search engine results, we now get terms which obviously and unmistakably illustrate what the term with the most distinct signal is all about.

Please note: When analyzing lower-ranked search engine results, you will notice that proof keywords get more “unreasonable” and their term weight is weaker and less meaningful.

That’s why your document should contain at least some of these proof keywords with a meaningful term weight in order to prove itself.
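
A filter of this kind might look like the sketch below. The exact filter used by Ryte is not given in the text, so this is only one plausible interpretation: a term counts as a proof keyword if it is not the main keyword itself and if it appears in at least half of the top-ranking documents. The threshold `min_share`, the function name and all term weights are assumptions for this example.

```python
from collections import defaultdict


def proof_keywords(top_documents, main_terms, min_share=0.5):
    """Terms other than the main keyword shared by at least min_share of the documents."""
    seen = defaultdict(list)
    for weights in top_documents:
        for term, weight in weights.items():
            if term not in main_terms:
                seen[term].append(weight)
    needed = min_share * len(top_documents)
    # Mean weight over all documents; a document without the term counts as 0.
    candidates = {
        term: sum(term_weights) / len(top_documents)
        for term, term_weights in seen.items()
        if len(term_weights) >= needed
    }
    return sorted(candidates, key=candidates.get, reverse=True)


# Made-up term weights of three top-ranking documents.
top_documents = [
    {"insurance": 0.15, "coverage": 0.06, "dependants": 0.04},
    {"insurance": 0.12, "coverage": 0.05, "premium": 0.03},
    {"insurance": 0.14, "coverage": 0.04, "dependants": 0.05, "gear": 0.01},
]
keywords = proof_keywords(top_documents, main_terms={"insurance"})
print(keywords)  # ['coverage', 'dependants']

# Which proof keywords does our own (made-up) document still lack?
my_document = {"insurance": 0.13, "coverage": 0.05}
print([term for term in keywords if term not in my_document])  # ['dependants']
```

Terms that show up in only one of the reference documents, such as “gear”, are dropped by the threshold.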

How different are you?

Let’s recap quickly:

[Figure: suspicious document]

  • You are the search engine and you open the door. There’s a document standing there.
  • Document: “I’m all about xy.” You take a look at the most unmistakeable term signal.
  • Document: “I’m no spam. Never!” You check whether the terms of the document are within the spam level defined by the documents that already are in excellent ranking positions.
  • Document: “I’m all about high risk life insurance. And coverage. And insurance sum. And dependants.” Other documents at this point tell you something about “gear”, “rubber doll” and “kinder surprise”. This document indeed seems to deal with high risk life insurance.

Terrific! Now there’s only one question left: Is there an admission ticket for the Top 10?

If a user searches for a relatively generic term, the search engine operator can often only guess what that user is interested in.

This is why it definitely makes sense to provide a collection of documents on the first page which deal with the various aspects of the requested topic. Some time ago, we put forward the following hypothesis: “If our perspective on a topic changes, often the whole terminology changes.” Many of the subsequent (mass) tests provided one fascinating finding:

Topical diversity is measurable.

Let’s take a look at this fascinating topic with Ryte. This is what a term weighting graph (usually a distinct negative exponential function) of a very good document (this one ranks first) looks like:

[Figure: term weighting graph]

With a little practice and considering a great number of documents, you can easily tell that on the first page there are particularly:

  • Holistic
  • Substantial
  • Controversial and
  • Extremely profound documents

It’s one of the most crucial admission tickets for the Top 10:

If your document is too similar, nothing new and/or too superficial – why should it show up on the first page?
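
The article does not say how similarity between documents is measured. A common stand-in from information retrieval is the cosine similarity between two term weight vectors, sketched here under that assumption. A value near 1 means two documents use almost the same terminology with the same emphasis, while a value near 0 means they share hardly any weighted terms. The term weights are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two {term: weight} dicts."""
    shared = set(a) & set(b)
    dot = sum(a[term] * b[term] for term in shared)
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up term weights: a top-ranking document and two candidates.
ranking_doc = {"insurance": 0.14, "coverage": 0.05, "dependants": 0.04}
near_copy = {"insurance": 0.13, "coverage": 0.05, "dependants": 0.04}
new_angle = {"insurance": 0.12, "smoker": 0.06, "pilot": 0.05}

print(cosine_similarity(ranking_doc, near_copy))
print(cosine_similarity(ranking_doc, new_angle))
```

The near copy scores higher than the document with a new angle. Such a number only says how much the terminology overlaps. Whether the new angle is substantial is still for the writer to decide.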

How to make use of term weighting for Search Engine Optimization

Remember the document knocking at your door? If it complies with these four basic factors, you have already formed a neat ONPAGE basis.

[Figure: proved document]

When it comes to your content, aim for:

  • A clear, unmistakeable term signal
  • Complying with term weightings as found in “the best” documents
  • The self proof, as found in “the best” documents
  • A high degree of holism and rich content marked by dissonant, controversial and extremely detailed content.

In the end, there’s only one thing left to say:

Adore your copywriter. Let him write the most awesome materials and reward him royally.

Your online content is the most precious, self-determined asset you have at your disposal.

It is not what’s written here that determines what you are reading. You yourself determine what you are reading. We surely are responsible for not presenting you with utter nonsense here – but we are not responsible for what you take out of it.

This article deals with a single ranking signal.
Take it easy, man. You will still need links. But still try to be just a little bit better.

You want more? Try Ryte FREE right now.