
Small File With A Large Soul – htaccess and SEO

Configuring an htaccess file can seem a tedious task, especially if you’re not familiar with it. And in all honesty, if you’re not familiar with it you should tread carefully, as you can bring a whole website down quite easily by entering a wrong rule or even a single character out of place. Not trying to scare, just stating a fact. In this article, we shall focus on how an SEO can benefit from it in a reasonably safe manner. After all, it is quite a handy little file for many things, and I’ll walk you through a few common use cases for it.

What is an htaccess file?

An oversimplified description of the hypertext access file is that it holds a subset of Apache web server configuration rules that apply at the directory level. A single website can have several of these, one in each directory, but in this article, we will concentrate on the one in the webroot. You can do more advanced stuff by knowing this, but using the webroot file is usually more than enough for most.

In other words, you can override or add to Apache’s global configuration with htaccess. Common use cases are, for example, URL redirection, URL shortening, access control (for different web pages and files) or customized error responses. You can do a whole lot more with the other configuration rules as well. If you’re looking for more advanced cases, I'm planning on writing a more detailed guide on my blog. Having the file in the webroot means we can control rules within a single domain without changing rules on any other domains on the same server. Starting to see the benefits already?

Even if it might look like a programming language to many, the file consists of Apache directives, and the patterns inside the rewrite rules are written as PCRE (Perl Compatible Regular Expressions). Don’t worry, you don’t need to learn new languages; in this context, it is just the pattern format that the rewrite rules use. Just letting you know in case you want to look for more information around it, and acronyms are just always fun. The web is full of detailed and explained examples, so only minor modifications should get you going.

The file also has nothing to do with the app that hosts your site, consider it as a separate file from the likes of WordPress, Magento, Prestashop, etc. It is tied to the web server itself and has nothing to do with the CMS or eCom platform you are using. The same rules apply regardless of the system running your site. The only thing that might change is the directory structure when redirecting or blocking access within the rules. So it doesn’t matter if your site is built in PHP, Ruby or Javascript. The functionality of the hypertext access file stays the same.

As you might’ve figured out, .htaccess is mainly used by Apache web servers. If you are using nginx (which is another highly popular web server), things are different: nginx does not read .htaccess files at all, so the same goals have to be set up in nginx’s own configuration with a different syntax. So the first thing to do is to find out which web server is running your site before going forward.

How does htaccess work?

So the filename is .htaccess and by default, it needs to be exactly that to work. Actually, the name is htaccess and the period or full stop in the front of the filename means it is a hidden file. This is a common way to hide files in UNIX/LINUX environments. So you need to know how to see hidden files in your system for .htaccess to be shown.

The file itself is a simple text file. Use a proper editor when editing or creating it on your local machine, for example, Notepad++, Atom or Sublime Text. I prefer to save it in UTF-8 character encoding because I’m old school: years ago charsets were kind of a hassle, so since UTF-8 came along I’m sticking with it. If you edit it online through cPanel, it usually has the correct settings by default. If you need to transfer it with FTP, make sure the transfer is done in ASCII mode so the line endings survive the trip (SFTP typically transfers the file as-is). I’m not going into details with configurations of FTP software and such; I assume you know your way around those already.

It seems a lot of small things, but after a while, you gain a routine around this and usually, the settings need to be set once per software. It is just important to know these because wrong settings can screw up your shiny new file and it won’t work properly when set on the server. With .htaccess devil is indeed in the details.

So, now we have a clean file. Let’s start taking advantage of it. If you have content in the file already (which is probably due to your current website software like WordPress and such), don’t worry we’ll go through the basics next.

Structure of the file content

I'll take the basic WordPress htaccess as an example to go through to get you familiarised with the syntax. You’ve probably seen this many times and wondered what is happening here. So let’s split it open!

# BEGIN WordPress
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</IfModule>
# END WordPress

Firstly, you can see that the # character starts a comment. A comment covers a single line and ends at the line break. Htaccess does not support comment blocks, so if you need to comment across multiple lines, each comment line has to start with #.
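As a small illustration of the commenting rules, here is a sketch with a made-up redirect (the /old-page and /new-page paths are assumptions, not part of the WordPress block). One extra detail worth knowing: Apache does not allow a comment on the same line as a directive, so comments always get a line of their own.

```apache
# Each comment line starts with its own # character.
# There is no block comment syntax, so the # is repeated per line.
RewriteEngine On

# Hypothetical rule: send /old-page to /new-page with a 301.
RewriteRule ^old-page/?$ /new-page/ [R=301,L]
```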

RewriteEngine On

Here we make sure Apache’s rewrite engine is on for this website. The engine belongs to a separate module (mod_rewrite) that is not part of the Apache core. It can be switched on at the server level in the global configuration file httpd.conf or, as we do here, per directory through htaccess. If it is already set at the server level, we do not necessarily need to repeat it here.

RewriteBase /

Here we set the base URL path that later rewrite rules build on. It matters when a rule uses a relative path as its target, because the relative path is resolved against this base. This can become slightly harder to understand if you have several htaccess files in several directories in play, but let’s assume you have only one in the webroot. The / character, in this case, means the document root where the website files are located. Usually, it means the same as https://www.example.com/ when translated into the domain URL.
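To make the effect of the base a bit more concrete, here is a minimal sketch. It assumes a hypothetical .htaccess file that lives in a /blog/ directory and a made-up archive.php script; neither comes from the WordPress example.

```apache
# Assumption: this .htaccess file is located in /blog/ under the document root.
RewriteEngine On
RewriteBase /blog/

# The target "archive.php" is a relative path, so it is resolved
# against RewriteBase: /blog/archive is served by /blog/archive.php.
RewriteRule ^archive/?$ archive.php [L]
```

With a target that starts with a slash, such as /index.php in the WordPress block, the base does not come into play.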

RewriteRule ^index\.php$ - [L]

Rewrite rules are where the actual rewriting magic happens. You can have a virtually unlimited number of them, but remember that they are processed on the fly. This means the rules are checked on every request, and a single pageview usually triggers several requests. You can have quite a few rules before this has any effect in the real world, especially if your web server uses SSD drives (and I have no idea why it would not). But as a general rule, it is a good idea to keep the file as slim as possible.

The syntax of the rewrite rules is as follows:

RewriteRule Pattern Target/Substitution [Flag1,Flag2,Flag3]

Flags are optional modifiers that change the behavior of the rule. Several flags are separated by commas without spaces. For example, we can set a cookie with a name and a value when a rule matches by using the CO flag, we can make some pages return a 403 Forbidden error code with the F flag, and we can set the MIME type of certain files with the T flag.
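The three flags mentioned above could look roughly like this in practice. This is only a sketch: the paths, the cookie name seen_promo and the domain example.com are placeholders I made up for illustration.

```apache
RewriteEngine On

# F: answer any request under private/ with 403 Forbidden.
RewriteRule ^private/ - [F]

# T: serve files ending in .log with the text/plain MIME type.
RewriteRule \.log$ - [T=text/plain]

# CO: set a cookie named "seen_promo" with the value "1" for example.com,
# valid for 1440 minutes (one day) on the path /.
RewriteRule ^promo/ - [CO=seen_promo:1:example.com:1440:/]

# Several flags on one rule: a permanent redirect (R=301),
# case-insensitive matching (NC), and stop processing (L).
RewriteRule ^old-shop/?$ /shop/ [R=301,NC,L]
```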

This can get very technical very fast, but this was just to show how versatile the rule is. Let’s get back to the rule at hand.

So here we search for the index.php file and nothing else with ^index\.php$. This is a regular expression, a format that is very powerful for matching all kinds of string patterns. Regular expressions are also useful in Google Analytics, although Google supports only a limited version of them there. So it might be a good idea to get at least slightly familiar with them.

Ok, let’s go through the pattern string. ^ stands for the beginning of a string. A dot normally means “any character”, so if you want to search for a literal dot, you need to escape it with the \ character. And $ marks the end of a string. I hope this makes some sense. In short, ^index\.php$ has only one match, which is “index.php”, and that is it. It is an exact match search, so to speak.
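A few more patterns in the same spirit, as a hedged sketch (the guides/ and resources/ paths are invented for the example). Note that in an .htaccess file the pattern is compared against the path without its leading slash, which is why the patterns start with ^index and not ^/index.

```apache
RewriteEngine On

# ^index\.php$ : exactly "index.php" and nothing else.
# The "-" target means "leave the URL as it is".
RewriteRule ^index\.php$ - [L]

# ^guides/ : any path that starts with "guides/".
# (.*)$ captures the rest of the path, and $1 reuses it in the target.
RewriteRule ^guides/(.*)$ /resources/$1 [R=301,L]

# \.pdf$ : any path that ends with ".pdf" (the dot is escaped).
RewriteRule \.pdf$ - [F]
```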

Then, as we can see, our target or substitution is nothing but -. This means there is no substitution and the URL is left as it is. And then, at the end, there is the flag [L], which tells Apache to stop processing further rules.

Ok, so we looked for the index.php file and set a rule to do nothing. What was the point? The point was to prevent later rules from rewriting index.php. We don’t want that, so it is kind of a safeguard for index.php. We, in fact, want index.php to handle our permalinks in WordPress. In other words, if a request comes to index.php, we don’t want any later rules to act on it, and we let the CMS or whatever do its thing.

As you can see, in htaccess order matters. You need to have the rules in a certain order to achieve the goals you have. Keep that in mind as we move forward.

RewriteCond %{REQUEST_FILENAME} !-f

Rewrite conditions are a way to restrict the types of requests that the rewrite rule directly AFTER them will affect. They apply only to that one rule. For example, you can have certain redirects set for only a certain set of IP addresses.
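The IP address example could be sketched like this. The 203.0.113.x range is a reserved documentation range standing in for your own network, and the paths are hypothetical.

```apache
RewriteEngine On

# Assumption: 203.0.113.x stands in for your office network.
# The condition applies only to the RewriteRule directly below it.
RewriteCond %{REMOTE_ADDR} ^203\.0\.113\.
RewriteRule ^beta/?$ /beta-preview/ [R=302,L]

# This rule has no condition of its own, so it applies to every visitor.
RewriteRule ^old-page/?$ /new-page/ [R=301,L]
```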

The syntax of the rewrite conditions is as follows:

RewriteCond Teststring Condition [Flag1,Flag2,Flag3]

When more than one condition is set, they all have to match before the rule that follows is applied (unless you chain them with the OR flag). In this context, the test string is usually a server variable, like %{REQUEST_FILENAME} here. REQUEST_FILENAME contains the full local filesystem path of the file or script that matches the request, and it is set by the server.

The condition, in this case, is !-f. -f literally means “is regular file”, so it checks if the test string is a valid file. But the ! in front negates this, so it actually checks if the test string is NOT a valid file. Makes sense?

Ok, I know this might be a lot. Let’s recap. So the condition

RewriteCond %{REQUEST_FILENAME} !-f

checks if the path a visitor is requesting does not map to an existing file on the server. If it does not, we move forward. This is a way to catch the rewritten paths, because permalinks aren’t real files or directories. We need to have them processed or the server just returns 404 errors all around. This is how we catch them and route them to the CMS etc. for further processing.

Still awake? Nice, let’s go to the next one.

RewriteCond %{REQUEST_FILENAME} !-d

Ok, here we set the same condition as we did before for files, but now we do it for directories. Because of course, they are not the same thing.

So now we have two conditions set, which check that the requested path is not an existing file AND not an existing directory. Only if both match do we process the next rule.

RewriteRule . /index.php [L]

So this is the rule we process when the conditions are met. In other words, when someone is trying to access something on our website that does not exist as a file or directory, aka a permalink in WordPress lingo. Here the pattern is a simple dot, which matches any single character except a line break. Since every requested path with at least one character in it contains such a character, it basically matches everything. And now the target is just /index.php, which is the index.php file in the webroot that we protected before from being rewritten further. And again, [L] stops the processing of any more rules. This is usually why your rules might not work if you put them below this or in between the conditions. Order matters here, folks.
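To show what the ordering means in practice, here is a hedged sketch with a made-up redirect placed above the WordPress block, so it gets its turn before the catch-all rule. The /old-post and /new-post paths are assumptions for illustration.

```apache
# Custom rules go ABOVE the WordPress block so they are checked first.
RewriteEngine On
RewriteRule ^old-post/?$ /new-post/ [R=301,L]

# BEGIN WordPress
# ... the WordPress rules walked through above stay here unchanged ...
# END WordPress
```

Keeping your own rules outside the BEGIN and END markers is also the safer choice, since WordPress can rewrite whatever sits between them when it regenerates its rules.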

So what did we actually set the server to do here? Our end goal is to let WordPress handle the permalinks on its end. For it to do that, we rewrite every request that doesn’t point to an actual file or directory to /index.php and make sure that when a request comes to index.php, it stops there. That is it. Then WordPress takes charge and does its magic.

What can you use .htaccess files for?

I’ll list a few use cases so you’ll have a more robust idea of what the small file is capable of beyond URL redirection, URL shortening, access control or customized error responses. Let’s keep the relevancy in SEO though.

  • Simple authorization and authentication: Handy for example when making sure a staging site won’t flow into Google index ahead of time. Sure you should noindex your staging sites, but indexing a staging site would suck so bad it might be a good idea to make sure it won't with simple authorization.

  • URL rewriting: Yep, so-called “pretty URLs” use .htaccess as well in many cases. Or at least it enables the CMS to handle all of it.

  • Directory listing: We can control how the server will react when no specific web page is set. Let’s say we have a bunch of pdfs we don’t want to be listed. We can control if the listing is shown or not through htaccess.

  • HTTPS & HSTS: Implementation of both HTTPS and HSTS on Apache servers is largely dependent on correct URL rewriting & header information mentioned in .htaccess file. Any incorrect syntax in the file while deploying HTTPS or HSTS leads to a failure in implementation.

  • Error messaging: Errors happen no matter what we do, but it is nice to give visitors themed messaging of what is going on. Through htaccess, we can give just that.

  • Redirects: We SEOs love our 301s don’t we? We can set them here as well.

  • Blocking: If we have for example a nasty bot roaming on our sites, we can block it here by IP address. We can also block traffic coming from certain sites with a referrer. Fun fun fun.
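To give the list above some shape, here are rough sketches of several of these use cases. Treat them as independent snippets to pick from, not as one file to paste in as-is. Every path, domain and IP address is a placeholder, and your hosting needs to allow these directives in .htaccess for them to work.

```apache
# --- Password-protect a staging site (needs an existing .htpasswd file) ---
AuthType Basic
AuthName "Staging"
AuthUserFile /full/server/path/to/.htpasswd
Require valid-user

# --- Turn off directory listings ---
Options -Indexes

# --- Force HTTPS, then send an HSTS header ---
RewriteEngine On
RewriteCond %{HTTPS} off
RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]

# Browsers only honor this header on HTTPS responses.
<IfModule mod_headers.c>
Header always set Strict-Transport-Security "max-age=31536000"
</IfModule>

# --- Custom 404 page ---
ErrorDocument 404 /404.html

# --- A single 301 redirect ---
Redirect 301 /old-page/ https://www.example.com/new-page/

# --- Block one IP address and one referring site ---
RewriteCond %{REMOTE_ADDR} ^203\.0\.113\.42$ [OR]
RewriteCond %{HTTP_REFERER} spam-site\.example [NC]
RewriteRule ^ - [F]
```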

And even more… as you can see this small file has some character, badum-tsss.

I would love to go through every use case with examples, but this article would become boring as hell so let’s move on. You can always google examples or ask for more. The idea is to give ideas on how to benefit from this little file as an SEO.

Not everything is wise to handle through .htaccess, though, because it does have a few disadvantages:

  • Performance loss: Less of an issue after SSD drives got popular but good to know. For each HTTP request made to the server, there is a file-system access which is the bottleneck in most situations, and the web server checks the rules on every access on each directory as well. In other words, the disk is used every time a resource call is made at least once which can add up quite quickly to hefty disk usage. There are ways to optimize this though. And unless the file starts to be physically big or there are many of them the performance loss is insignificant in most cases.

  • Security: You need to have a proper setup in place or security could be compromised. Always make sure your hosting is top-notch on these matters.

  • Syntax: Like I said before, with .htaccess the devil is in the details. In this context, it means that even a single wrong character can break your site or parts of it. Misspellings can easily lead to 500 Internal Server Error responses, and that is never fun. You can always take the whole file offline by renaming it to, for example, _htaccess or htaccess.old. That disables everything set in the .htaccess file and in many cases brings the site back online.

So as you can imagine this small file has more firepower than we can handle. But the versatility and numerous use cases of it can be an asset you may need to use one day to ease up your own tasks. Be bold, test it out and remember that if you bring the site down, you can reset everything by renaming the file. If you already have a working file on your webroot it is a VERY good idea to copy the working version before you go in and do your magic. That way you can easily revert all the changes you just made. Like I mentioned before I'm also planning on covering this topic in more detail on my blog later on so if you have read this far, you might be interested to check it out as well.

Finally, I want to thank my friend Aarne Salminen for helping me clarify/advise on technical aspects of this article.

Published on Jan 7, 2020 by Suganthan Mohanadasan