Have you ever asked yourself this question? In this article, we explain the science behind our SEO A/B Testing feature, which we released to give you a data-driven approach to SEO. With SEO A/B testing, you can check whether changes to your website really had an impact on search performance.
A/B Testing in SEO works a little differently from standard A/B Testing, as you cannot tell Google to show different versions of the same page in the search results. Instead, we compare a prediction of how a KPI would have developed to the actual change and check whether there is a difference. There are two common ways of doing that:

1. Forecasting the KPI of the test group from its own historical data and comparing the forecast to the actual development after the change.
2. Comparing the test group to a similar control group of URLs that did not receive the change (a between-group test).
We decided to use the second approach: the first is highly dependent on the amount of historical data available for the test group and is strongly affected by temporal factors such as trend and seasonality. The between-group approach, on the other hand, does not require as much data (ideally, 30 days of data before the date of the change are available), but it does need a set of URLs that is similar to the test group. This control group must not exhibit the change applied to the test group and acts as a "baseline" for the KPI. For both groups (test and control), the average change is calculated and then compared. If, for example, the KPI increased by 10% in the control group, we would expect a 10% increase in the test group as well, provided the change had no influence on the KPI. Any difference between the two groups is interpreted as the effect of the change made to the test group of URLs. But how are the averages calculated for the different KPIs?
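The comparison logic described above can be sketched in a few lines of Python. The group names and numbers below are made up for illustration; this is not Ryte's actual implementation:

```python
# Minimal sketch of the between-group comparison: the control group's
# change is the baseline we would have expected for the test group too.

def relative_change(before: float, after: float) -> float:
    """Relative change of a KPI between two periods."""
    return (after - before) / before

# Hypothetical average clicks per day, before and after the change:
control_change = relative_change(before=200.0, after=220.0)  # +10%
test_change = relative_change(before=150.0, after=180.0)     # +20%

# The surplus over the control group's change is attributed to the
# change made to the test group of URLs.
estimated_effect = test_change - control_change
print(f"Estimated effect of the change: {estimated_effect:.0%}")
```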
For click-through rate (CTR) and position, the calculation is very simple. CTR is the total number of clicks divided by the total number of impressions for all pages in the segment.
For position, we take the impression-weighted average of the positions of all pages over all days.
The sample size in both of these cases is the total number of impressions.
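These two averages can be sketched as follows, assuming hypothetical per-page totals (not real Ryte data):

```python
import numpy as np

# Hypothetical per-page totals over the test period:
clicks = np.array([120, 45, 300])
impressions = np.array([4000, 1500, 9000])
positions = np.array([3.2, 7.5, 2.1])   # average position per page

# CTR of the segment: total clicks divided by total impressions.
ctr = clicks.sum() / impressions.sum()

# Position of the segment: impression-weighted average of page positions.
position = np.average(positions, weights=impressions)

# The sample size for both KPIs is the total number of impressions.
n = impressions.sum()
print(f"CTR = {ctr:.4f}, position = {position:.2f}, n = {n}")
```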
The standard deviation for the CTR KPI is calculated as the standard deviation of a binomial proportion:

σ_CTR = √(CTR · (1 − CTR))
For position, the standard deviation is calculated as the impression-weighted standard deviation around the weighted mean:

σ_position = √( Σᵢ wᵢ · (xᵢ − x̄)² / Σᵢ wᵢ )

where xᵢ is the position of page i, wᵢ its number of impressions, and x̄ the impression-weighted average position.
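Assuming the binomial standard deviation for CTR and the impression-weighted standard deviation for position (both standard choices for these KPIs; the exact formulas used in production may differ), the calculation could look like this:

```python
import numpy as np

# Same hypothetical per-page totals as before:
clicks = np.array([120, 45, 300])
impressions = np.array([4000, 1500, 9000])
positions = np.array([3.2, 7.5, 2.1])

ctr = clicks.sum() / impressions.sum()
# CTR is a per-impression success rate, so its spread is binomial:
std_ctr = np.sqrt(ctr * (1 - ctr))

# Weighted standard deviation around the impression-weighted mean position:
mean_pos = np.average(positions, weights=impressions)
std_pos = np.sqrt(np.average((positions - mean_pos) ** 2, weights=impressions))
print(f"std(CTR) = {std_ctr:.4f}, std(position) = {std_pos:.4f}")
```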
For impressions and clicks, we take every page in a segment and use the sum of impressions/clicks per day as datapoints. For example, if we have 10 pages in a group and 7 days of data for each page, we get 10 * 7 = 70 datapoints. The average (and the standard deviation) can then be calculated over all these datapoints.
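This datapoint construction can be sketched as follows; the daily impression counts are simulated, not real data:

```python
import numpy as np

# Illustrative: 10 pages x 7 days of daily impression counts per page.
rng = np.random.default_rng(0)
daily_impressions = rng.poisson(lam=500, size=(10, 7))

# Every page-day sum is one datapoint: 10 * 7 = 70 datapoints.
datapoints = daily_impressions.ravel()
mean = datapoints.mean()
std = datapoints.std(ddof=1)   # sample standard deviation
print(f"{datapoints.size} datapoints, mean = {mean:.1f}, std = {std:.1f}")
```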
We are using Welch's t-test for our setting, a common statistical test for cases in which the standard deviations of the distributions, the sample sizes of the two groups, or both are unequal. When predicting the necessary sample size, we aim for a statistical power of 80% and a significance level of 5% (i.e. 95% confidence). We assume a minimum detectable effect of 10%, so changes smaller than a 10% increase cannot be detected with this approach. We chose this threshold so that the necessary sample size does not grow too large. Furthermore, smaller changes might have been caused by other effects or randomness: the bigger the effect, the less likely it was caused by something else.
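A sample-size calculation with these parameters might look like the following normal-approximation sketch. The baseline CTR and the exact formula are illustrative assumptions, not necessarily Ryte's implementation:

```python
from scipy.stats import norm

alpha, power = 0.05, 0.80      # significance level and statistical power
baseline_ctr = 0.03            # hypothetical baseline CTR
mde = 0.10                     # minimum detectable effect (relative, 10%)
delta = baseline_ctr * mde     # absolute difference we want to detect

# Standard normal-approximation formula for two equally sized groups:
z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(power)
variance = baseline_ctr * (1 - baseline_ctr)   # binomial variance
n_per_group = 2 * variance * (z_alpha + z_beta) ** 2 / delta ** 2
print(f"Impressions needed per group: {n_per_group:,.0f}")
```

Note how quickly the required sample size grows as the minimum detectable effect shrinks: halving the MDE roughly quadruples the necessary number of impressions, which is why a 10% threshold keeps the test practical.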
If the real increase is higher than 10%, the real increase is used as the effect size when calculating the necessary sample size. Once we have collected enough data, we use Welch's t-test to calculate the p-value. The test is significant if the p-value is smaller than 5%, which corresponds to our 95% confidence level.
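Using SciPy, a Welch's t-test on simulated daily click totals for two URL groups could look like this; `equal_var=False` is what distinguishes Welch's test from the standard Student's t-test:

```python
import numpy as np
from scipy.stats import ttest_ind

# Simulated daily click totals for two groups (10 pages * 7 days each);
# the test group has both a different mean and a different spread.
rng = np.random.default_rng(42)
control = rng.normal(loc=100, scale=15, size=70)
test = rng.normal(loc=115, scale=20, size=70)

# equal_var=False selects Welch's t-test, which does not assume
# equal variances or equal sample sizes in the two groups.
t_stat, p_value = ttest_ind(test, control, equal_var=False)
significant = p_value < 0.05
print(f"p-value = {p_value:.4f}, significant: {significant}")
```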
In a nutshell, our A/B Testing feature works by predicting how your KPIs would have performed if no change had been made. These predictions are based on a set of URLs, the so-called control group, which should be as similar as possible to the test group. Results are then derived from the difference between the two groups. The advantage of this approach is that there is no need for several months of data, and effects such as seasonality do not distort the results, as they are present in the control group as well.
Published on 05/31/2021 by Korbinian Schmidhuber.
Who writes here
Korbi started working at Ryte as a Data Scientist in 2017 while studying Computational Linguistics. He has been working with Ryte since the company began doing Data Science and has therefore been involved in many large projects. In his free time, he loves to explore Japanese culture and music (especially rock and metal), or to work on improving his Japanese language skills.