Nearly three months ago, to celebrate ‘Halloween’ in the best SEO way, I decided to start sharing some day-to-day SEO “horror” stories with the hashtag “#SEOhorrorStories”.
The hashtag "#SEOhorrorStories" soon started trending as hundreds of SEOs around the world began to share their own stories, not only in English but also in Spanish, German and French. Even Matt Cutts and the official Google Analytics account took part; you can see a little more of that in this post I wrote later about the participation.
However, I found it curious that so many of these stories, although they described "extreme" cases of SEO issues and some were clearly humorous, were characterised as being:
This is why I would like to go through the scenarios behind some of the SEO horror stories that were shared and discuss how they can be prevented, so that we can avoid them as much as possible. Not a bad resolution for SEO in 2016:
Blocking a site from search engines can happen in completely different scenarios: either on purpose, to try to avoid overloading the server:
Or by mistake at the robots.txt level, when a website is launched to production without removing the block it had before launch:
This can largely be avoided by using a monitoring system such as the one offered by Ryte with its “Monitoring” functionality, which does not just send alerts via email when the robots.txt is updated, but also when the server is down.
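If you want a simple check of your own alongside a monitoring tool, here is a minimal sketch in Python. It only uses the standard library `urllib.robotparser`; the function name `blocks_everything` and the sample rules are my own assumptions for illustration, and a real check would first download the live robots.txt.

```python
from urllib.robotparser import RobotFileParser


def blocks_everything(robots_txt: str, user_agent: str = "Googlebot") -> bool:
    """Return True if this robots.txt content disallows the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")


# Hypothetical robots.txt contents: a forgotten pre-launch block and a normal file.
forgotten_block = "User-agent: *\nDisallow: /"
normal_file = "User-agent: *\nDisallow: /private/"

print(blocks_everything(forgotten_block))
print(blocks_everything(normal_file))
```

Run on a schedule against the production robots.txt, a check like this could raise an alert as soon as the home page becomes blocked.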
This issue usually happens after a redesign or migration, when people often forget to change the robots meta tag so that the site can be indexed again; however, it can also occur at any time when the website is updated:
It is essential to monitor the changes of the critical elements of the site code, as well as the content of the website pages with tools like Versionista, OnWebChange or NetWatcher, which will alert us and allow us to see the changes that are generated in them.
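As a complement to those tools, this is a minimal sketch of the kind of check that can be automated, assuming you already have the HTML of the page as a string. The names `RobotsMetaParser` and `has_noindex` are hypothetical; the sketch only looks at the robots meta tag and does not cover an `X-Robots-Tag` HTTP header.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> style tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            content = attr.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]


def has_noindex(html: str) -> bool:
    """Return True if the page asks search engines not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    # "none" is shorthand for "noindex, nofollow".
    return any(d in ("noindex", "none") for d in parser.directives)


print(has_noindex('<meta name="robots" content="noindex, nofollow">'))
print(has_noindex('<meta name="robots" content="index, follow">'))
```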
One of the first questions to ask when starting an SEO process is where the pre-production or test environment is located, not only to know whether changes can be validated before launching them (which is ideal), but also to ensure that it is not accessible to search engines.
The test environment should be blocked not only through the robots.txt, but also by requiring a password or by making it reachable only through a specific DNS.
As configurations can be changed by accident, in addition to monitoring the robots.txt, I configure alerts with Google Alerts for a “site:” search of the test or pre-production environment, which is usually on a subdomain. This way I am notified if any of its content gets indexed:
It is very common to find websites that are not redirecting or canonicalising to their original version, which is usually identified at the time of the initial SEO audit, when you check whether there is an internal content duplication issue on the website.
However, the reality is that site changes and continuous updates can cause this problem at any time.
SEO crawlers will facilitate this validation by offering a report directly listing pages with duplicate content:
It is therefore essential to schedule frequent and continuous validations, so that changes we have not been alerted about do not generate new content duplication issues.
This can occur, for example, when migrating from http to https, without you even being notified:
Or when someone publishes the same content on multiple pages without letting you know:
To tackle this, you can schedule frequent crawls – a function offered by most SEO crawlers such as OnPage Zoom – so that once the SEO process has begun it continuously checks for internal duplicate content that could be generated along the way:
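To give an idea of what such a check does, here is a minimal sketch in Python that groups URLs whose extracted text is identical once whitespace and case are normalised. It assumes the crawl has already produced a mapping of URL to page text; `content_fingerprint`, `find_duplicates` and the example URLs are hypothetical, and it only catches exact duplicates, not near-duplicates.

```python
import hashlib
import re
from collections import defaultdict


def content_fingerprint(text: str) -> str:
    """Hash the text after collapsing whitespace and lowercasing it."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(pages: dict) -> list:
    """Return groups of URLs that share exactly the same normalised content."""
    groups = defaultdict(list)
    for url, text in pages.items():
        groups[content_fingerprint(text)].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


# Hypothetical crawl result: the http and https versions serve the same page.
crawl = {
    "http://www.example.com/shoes/": "Red shoes   for sale",
    "https://www.example.com/shoes/": "red shoes for sale",
    "https://www.example.com/bags/": "Leather bags",
}
print(find_duplicates(crawl))
```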
This is one of the scenarios that generates duplicate content, but it is so common, and is a problem in itself with other consequences too, that it deserves a separate point. It can be summarised in these two tweets:
And it is not just about forgetting to implement the 301 redirects from the old URLs to the new ones, pointing each page to its new version. It is also about forgetting to update the links and XML sitemaps, to register the new site in Google Search Console or to notify the change of address there, among other steps that Google itself specifies in this best practices guide for migration.
This can be avoided by following an action plan that starts before launching, when planning the website migration, and not just at the time of the launch and after it.
Besides Google’s best practices, here are a couple of resources and guides for an SEO-friendly website migration to avoid losing all your organic traffic and rankings:
As with blocking or erroneous deindexation, problems with redirects are common when redesigning websites, and not just because the old URLs are not redirected to the new ones, but also because the redirects are not implemented in the right way.
For example, massively redirecting mobile users (and search engine robots) to the desktop version of the site:
Redirecting to error pages:
Creating redirect chains, not just with permanent 301 redirects but also with temporary 302 redirects, which do not transfer the popularity of the old page to the new one:
Similar to the above problems, these types of redirect issues are usually identified and solved during the technical audit at the beginning of the SEO process, using a crawler such as, in this case, OnPage, that directly shows us the type of redirects and the URLs to which they are redirecting:
Additionally, in the "Crawl Errors" report of Google Search Console we can identify non-relevant redirects, which are marked as "Soft 404" errors. Pages that redirect to error pages appear in the "Not found" section, those that redirect to blocked pages appear under "Access Denied", and those that redirect using a 302 appear in "URLs not followed".
What’s most important is to not forget to check this regularly with re-scans and frequent validations, and to not only solve the incidents that have been identified but also figure out what causes the redirect issues.
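To make these redirect problems concrete, here is a minimal sketch in Python of how they could be flagged. It assumes a crawler has already recorded, for each URL, its status code and redirect target; `trace_redirects`, `redirect_issues`, the issue labels and the example paths are hypothetical names of my own.

```python
REDIRECT_CODES = {301, 302, 303, 307, 308}
TEMPORARY_CODES = {302, 307}


def trace_redirects(start, responses, max_hops=10):
    """Follow redirects from `start` and return the (url, status) hops."""
    hops, seen, url = [], set(), start
    while url in responses and url not in seen and len(hops) < max_hops:
        seen.add(url)
        status, target = responses[url]
        hops.append((url, status))
        if status not in REDIRECT_CODES or target is None:
            break
        url = target
    return hops


def redirect_issues(start, responses):
    """Label the problems found when following redirects from `start`."""
    statuses = [status for _, status in trace_redirects(start, responses)]
    redirects = [s for s in statuses if s in REDIRECT_CODES]
    issues = []
    if len(redirects) > 1:
        issues.append("chain")
    if any(s in TEMPORARY_CODES for s in redirects):
        issues.append("temporary_redirect")
    if redirects and statuses[-1] >= 400:
        issues.append("ends_in_error")
    return issues


# Hypothetical crawl data: url -> (status code, redirect target or None).
crawl = {
    "/old": (301, "/interim"),
    "/interim": (302, "/new"),
    "/new": (200, None),
}
print(redirect_issues("/old", crawl))
```

Keeping the crawl data separate from the analysis makes it easy to run the same check again after every scheduled crawl.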
Something similar happens with erroneous canonicalizations, when canonical tags point to pages that are not their original versions, which can happen in many different scenarios:
When the canonical tags of all the website's pages point to the same URL, for example, the home page:
When IP numbers are included instead of the site domain name:
When canonical tags are kept in the pages of a newly launched site pointing to their preproduction URLs:
Or when an international version of a website is launched on a new ccTLD that canonicalizes to the initial gTLD:
These problems with canonical tags are usually also identified during the technical audit with an SEO crawler, by verifying which pages are canonicalized to others (those that are not pointing to themselves) and which URLs they point to, in order to check whether these really are their original versions, whether they generate errors, whether they point to another website or to other URLs that are not relevant, etc.:
Likewise, recurring scheduled scans should be enabled so they are regularly validated.
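The scenarios above can also be checked with a small script. This is a minimal sketch in Python, assuming you have the URL and HTML of each page; `CanonicalParser`, `canonical_issues` and the issue labels are hypothetical names, and the sketch compares host, path and query only (not the scheme).

```python
import ipaddress
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class CanonicalParser(HTMLParser):
    """Keep the href of the first <link rel="canonical"> found."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rels and self.canonical is None:
            self.canonical = attr.get("href")


def canonical_issues(page_url, html):
    """Return labels describing how the canonical differs from the page URL."""
    parser = CanonicalParser()
    parser.feed(html)
    if not parser.canonical:
        return ["missing"]
    page = urlsplit(page_url)
    target = urlsplit(urljoin(page_url, parser.canonical))
    issues = []
    host = target.hostname or ""
    try:
        ipaddress.ip_address(host)  # raises ValueError for a domain name
        issues.append("ip_address_host")
    except ValueError:
        pass
    if host != (page.hostname or ""):
        issues.append("different_host")
    elif (target.path or "/", target.query) != (page.path or "/", page.query):
        issues.append("points_to_other_page")
    return issues


# Hypothetical page whose canonical tag points to the home page.
html = '<link rel="canonical" href="https://www.example.com/">'
print(canonical_issues("https://www.example.com/shoes/", html))
```

A "different_host" result is not always an error, but it is exactly what you would see with canonicals left pointing to pre-production or to the initial gTLD.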
Hiding content as a cloaking technique is a "classic" (of what is not recommended, of course, as it goes against Google guidelines). However, it is still used, sometimes on purpose and sometimes not, to show different content to search engines than to users:
As fundamental as it might seem, it is essential to always check how the page content is indexed (especially at the beginning of an SEO process with a new website) and what is shown in the search engine cache itself, in the text version:
It also helps to use the “Fetch as Google” functionality of Google Search Console to verify potential differences in the content of the site:
Once the process has started, monitoring the content and HTML changes of the pages, as discussed above in section 2, will alert us to potential changes in the code or content of the pages, in this case those that hide or "cloak" it.
Making use of a CDN can really help improve the speed of a website. However, without proper configuration you risk either having copies of the website's content indexed, since these are usually served from subdomains set up for the CDN, or entirely blocking the crawling of those subdomains, including access to files distributed through them that should be accessible to search engines, such as images, JS & CSS. If those files are blocked, the pages will not be rendered correctly.
The easiest way to avoid this problem is to use the SEO-focused settings that CDNs usually offer and enable the inclusion of a canonical header in the files that are “duplicated” across the CDN's different subdomains, so that they point to their original URLs.
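For reference, this is roughly what such a canonical header looks like. The sketch below, in Python, builds a `Link` header with `rel="canonical"` for a file served from a CDN subdomain, assuming the original file lives at the same path on the main host; the function name `canonical_header_for` and the hostnames are hypothetical, and in practice the CDN generates this header for you.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_header_for(cdn_url: str, origin_host: str) -> str:
    """Build a canonical Link header pointing a CDN copy to its original URL."""
    parts = urlsplit(cdn_url)
    # Assumption: the original file has the same path and query on the origin host.
    origin = urlunsplit((parts.scheme, origin_host, parts.path, parts.query, ""))
    return f'Link: <{origin}>; rel="canonical"'


print(canonical_header_for("https://cdn1.example.com/img/logo.png", "www.example.com"))
```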
That should be enough, but additionally, if you want to prevent the crawling of the content on these subdomains and only leave it enabled for static files such as images, JS & CSS (which are the ones usually served through CDNs), you can also configure a custom robots.txt for them.
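As an illustration, here is a minimal sketch in Python of a custom robots.txt for a CDN subdomain, together with a quick check of what it allows. The directory names `/images/`, `/js/` and `/css/` are assumptions; adapt them to wherever your static files actually live. Note that Python's `urllib.robotparser` applies the first matching rule, so the `Allow` lines come before the general `Disallow`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a CDN subdomain: static files only.
CDN_ROBOTS_TXT = """\
User-agent: *
Allow: /images/
Allow: /js/
Allow: /css/
Disallow: /
"""


def is_crawlable(robots_txt: str, path: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the robots.txt content lets the user agent fetch the path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


print(is_crawlable(CDN_ROBOTS_TXT, "/images/logo.png"))
print(is_crawlable(CDN_ROBOTS_TXT, "/products/red-shoes.html"))
```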
To verify if the Googlebot can successfully access the content and files from your website you can make use of the functionality of “Fetch as Google” on the Google Search Console.
To avoid this situation, it is recommended to verify the correct crawling and indexing of all the critical elements and areas of your website, such as navigation and content; they should always be implemented directly in the HTML and not rely on scripts.
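One simple way to check this, sketched here in Python, is to look at the raw HTML as it is delivered (without executing any scripts) and confirm that the links you consider critical are already there. `LinkCollector`, `links_missing_from_html` and the example paths are hypothetical names for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag present in the raw HTML."""

    def __init__(self):
        super().__init__()
        self.hrefs = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.add(href)


def links_missing_from_html(raw_html, expected_links):
    """Return the expected links that do not appear as plain <a href> tags."""
    collector = LinkCollector()
    collector.feed(raw_html)
    return sorted(set(expected_links) - collector.hrefs)


# Hypothetical page: one navigation link is in the HTML, the other is added by a script.
raw = '<nav><a href="/shoes/">Shoes</a></nav><script>addLink("/bags/")</script>'
print(links_missing_from_html(raw, ["/shoes/", "/bags/"]))
```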
Yes, they can be very funny if they do not happen to you, but it is definitely best to avoid them. Here's to a 2016 without SEO horror stories!
Published on 01/28/2016 by Aleyda Solis.
Aleyda Solis is a well known International SEO Consultant, helping businesses to grow their organic search visibility, traffic and Web ROI with her company, Orainti. Aleyda specializes in International, technical and multi-device SEO projects and has more than 8 years of experience in search engine optimization. In addition to consulting, she is a frequent speaker at online marketing conferences worldwide.