Date Posted: Oct 07 2016

Tags: SEO, Usability, Apache, Semantics

Clean URL Rewrites Using Apache

This article covers how to implement Clean URLs (also known as Semantic URLs, RESTful URLs, User-Friendly URLs and Search Engine-Friendly URLs) using Apache Web Server, currently the most popular web server platform worldwide.

Not using Apache web server? No problem: NGINX users can check out my article Clean URL Rewrites Using NGINX.

A Clean URL is a URL that does not contain query strings or parameters. This makes the URL easier to read and more understandable to users.

Clean URLs can help your pages rank in many search engines, and there are many other reasons why having Clean URLs can be important to your website. Check out our article Are your URLs Letting Your Website Down for more information.

THE PROBLEM

Take a look at this URL used by a dynamically generated web page:


http://www.exampleshop.co.uk/products.php?id=7

The example shows a common URL structure that you will often see output by a PHP-driven CMS (Content Management System). The URL links to a dynamic PHP web page called products.php and passes it the parameter ‘id’ with a value of 7. A dynamic web page is a web page that is constructed using server-side scripting, for example PHP. Server-side scripting allows you to use URL parameters, such as ‘id=7’ in the example, to determine how a dynamic web page is assembled.

Functionally this URL is perfect; it links to the dynamic products page, and passes the variable ‘id’ with the value of 7 to the webserver to dynamically generate the content for the product associated with ID value of 7.

This is great, but the URL looks a bit archaic and is difficult to remember; it contains no relevant keywords for search engines, and it doesn’t describe the content of the webpage very clearly.

The webpage in the example is actually a shop page for buying organic apples, but there is no way of telling this from the current URL.

Here is an example of a more ideal URL:


http://www.exampleshop.co.uk/buy/organic-fruit/apples

Our example now contains multiple keywords for search crawlers and a clear description of the contents of the webpage.

But how is the webserver supposed to handle this Clean URL? There is no reference to the dynamic page ‘products.php’ or the variable ‘id’ of the product for the webserver to pull the correct dynamic content from.

THE SOLUTION

The easiest solution is to use a rewrite engine to modify the appearance of the URL, a technique most commonly known as URL Rewriting or URL Manipulation.
Fortunately, this solution requires only minimal restructuring or renaming of folders, dynamic files or variables. URL Rewriting is very flexible and fast to implement.

WHAT IS URL REWRITING?

URL Rewriting is the technique used for URL mapping or routing in a web application. By using URL Rewriting we can provide the information our webserver requires to interpret our Clean URL in our previous example.

APACHE HTTP SERVER

Released in 1995, Apache HTTP Server, colloquially known as Apache, has remained the most used webserver software worldwide for over 20 years. It was also the first web server software to serve more than 100 million websites.
Apache is available for both Windows and UNIX server operating systems.

INITIATING THE REWRITE ENGINE

To begin we need to first initiate the rewrite engine.

  • Navigate to your webserver’s root directory.
    The root directory is normally the location where you place your website’s public files. The location of this directory may vary depending on your webserver’s platform; refer to your platform’s documentation or contact your webhost if you are unsure where this directory is located.
  • In your root folder create the file ‘.htaccess’ (this file must be named exactly ‘.htaccess’, not ‘filename.htaccess’ or ‘.htaccess.txt’). If the file already exists, do not create a new one; modify the existing file instead. Open your ‘.htaccess’ file with a text editor (such as Sublime Text).
  • To initiate the rewrite engine add the following line:

RewriteEngine on
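
As a minimal sketch, the same directive can be wrapped in an IfModule block so that the rules are skipped, rather than causing a server error, if the mod_rewrite module is not loaded. This assumes your host allows rewrite directives in ‘.htaccess’ files (an AllowOverride setting that includes FileInfo); whether it does depends on your server configuration.

```apache
# .htaccess in the website's root directory
<IfModule mod_rewrite.c>
    # Switch the rewrite engine on; rewrite rules go below this line
    RewriteEngine On
</IfModule>
```

The guard is optional. Without it, a missing module shows up immediately as a server error, which can be easier to diagnose than rules that silently do nothing.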

BASIC URL REWRITING

Now that we have initiated our rewrite engine, let’s try creating a basic rewrite rule.

We have for example the following URL:


http://www.example.co.uk/1212aJlmo.html

But we want our users to instead be able to access this page via this URL:


http://www.example.co.uk/photoshop-tutorials

We can use the basic rewrite rule below to accomplish this:


RewriteEngine on
RewriteRule ^photoshop-tutorials?$ 1212aJlmo.html [NC,L]

Beneath RewriteEngine on we have added a new line:


RewriteRule ^photoshop-tutorials?$ 1212aJlmo.html [NC,L]

Now let’s break this line of code down and take a closer look at how it works.

  • RewriteRule – Tells Apache that the following refers to one single Rewrite Rule.
  • ^photoshop-tutorials?$ – The ‘pattern’ that the webserver will look for in the URL. If it is found, the webserver will swap the pattern for the following substitution.
  • 1212aJlmo.html – The ‘substitution’. The webserver will swap the pattern for the substitution if the pattern is found in the URL.
  • ^, ? and $ – These are Regular Expression (also known as Rational Expression) characters. A regular expression is a sequence of characters that defines a search pattern, and is mainly used in pattern matching and string matching. The pattern is treated as a regular expression by default. In our example pattern we are using three regular expression characters:
    • ^ represents the beginning of a string.
    • $ represents the end of a string.
    • ? makes the preceding character optional (zero or one occurrence). In our example it follows the final ‘s’, so the pattern matches both ‘photoshop-tutorial’ and ‘photoshop-tutorials’. When placed directly after a quantifier such as * or +, the ? instead acts as the non-greedy modifier.
  • [NC,L] – These are known as flags. Flags are added to the end of the rewrite rule, enclosed in square brackets and separated by commas with no spaces, and tell Apache how to interpret the rule. In this example the NC (no case) flag lets Apache know that the rewrite rule is case insensitive. The L (last) flag tells Apache not to execute any further rules if the current rule applies. There are many more flags to choose from beyond these two examples; we will look at them in more detail later in the article.
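
The breakdown above can be sketched as a complete ‘.htaccess’ file. The second, commented-out rule is an assumed alternative rather than part of the original example: it requires the final ‘s’ but tolerates an optional trailing slash.

```apache
RewriteEngine On

# "s?" makes the final "s" optional, so both
# /photoshop-tutorial and /photoshop-tutorials are served by 1212aJlmo.html
RewriteRule ^photoshop-tutorials?$ 1212aJlmo.html [NC,L]

# Assumed alternative: exact name, with or without a trailing slash
# RewriteRule ^photoshop-tutorials/?$ 1212aJlmo.html [NC,L]
```

Note that in an ‘.htaccess’ file the pattern is matched against the path without its leading slash, which is why the pattern starts with ^photoshop rather than ^/photoshop.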

Consequently when a user now inputs the URL:


http://www.example.co.uk/photoshop-tutorials

Apache server will display the following page, without the user knowing any different:


http://www.example.co.uk/1212aJlmo.html

We have now grasped the basic technique of rewriting a single URL to a different URL.

However, our dynamic page can be requested with many different variable values, and using this technique would mean a lot of work duplicating rewrite rules for every value.

In the next section we will cover more advanced patterns that can solve this.

DYNAMIC REWRITING USING BACK REFERENCES

If we go back to our original problem:


http://www.exampleshop.co.uk/products.php?id=7

We have the variable ‘id=7’ in this URL; but overall we have 150 products, each with a different ID.

We want the URLs to look like the following example:


http://www.exampleshop.co.uk/product/1/
http://www.exampleshop.co.uk/product/2/
http://www.exampleshop.co.uk/product/3/
http://www.exampleshop.co.uk/product/4/
http://www.exampleshop.co.uk/product/5/
http://www.exampleshop.co.uk/product/6/
http://www.exampleshop.co.uk/product/7/
# etc..

It would take a long time to write individual rewrite rules for all of the possible URLs.

By using the following ‘pattern’ and ‘substitution’ we can save time and also avoid pages of duplicate code:


RewriteRule ^product/([0-9]+)/?$ products.php?id=$1 [NC,L]

Now let’s break this line of code down and take a look at how it works:

  • ([0-9]+) – There are two key points to note on this part of the pattern.
    • Take a look at the contents of our parentheses: [0-9]+. This is a regular expression. In a regular expression the square brackets [] mean match any one of the characters they contain. For example: if we were to use [1A] the regular expression would match either the character 1 or the character A.
      Here the square brackets [] contain a range of characters: 0-9, which indicates all digits between and including 0 and 9. The + symbol is a regular expression special character that has the special meaning of “match one or more of the preceding”. In the example pattern we have placed the + after our range [0-9] to detect one or more characters within our range; without the + the pattern will only match a single digit in our range, for example: 1 or 5 but not 11 or 15.
    • The parentheses () in a regular expression define a capturing group, whose matched text can be reused through a backreference. The $1 in our substitution links to this group. For example, if the following URL was input:
      
      http://www.exampleshop.co.uk/product/127
      
      
      127 would be matched to our range in the backreference, resulting in the following substitution:
      
      http://www.exampleshop.co.uk/products.php?id=127
      
      
      There can be multiple backreferences in a pattern, for example:
      
      RewriteRule ^product/([0-9]+)/([0-9]+)/?$ products.php?id=$1&cost=$2 [NC,L]
      
      
      Backreferences are numbered in the order they appear. In the example above there are two backreference groups in the pattern, the first group linking to $1 in the substitution and the second group linking to $2. If we were to add a third group it would be numbered $3 and so on.
  • $1 – Is located in our substitution and links to our first backreference, which is located in the parentheses in our pattern: ([0-9]+).
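
As a hedged sketch of the rule above in use, the variant below also adds the QSA flag (covered in the flags list later in this article). This is an assumption about what you may want, not a requirement: without QSA, any query string the visitor supplies is discarded when the substitution contains its own query string.

```apache
RewriteEngine On

# /product/7/            is served by products.php?id=7
# /product/7/?ref=email  is served by products.php?id=7&ref=email
#                        (QSA appends the visitor's original query string)
RewriteRule ^product/([0-9]+)/?$ products.php?id=$1 [NC,L,QSA]
```

The ‘ref’ parameter is purely illustrative; any query string on the incoming request is carried over in the same way.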

SOLVING THE PROBLEM USING REGULAR EXPRESSIONS

The scope of what you can do with regular expressions is so large that it really deserves its own article. For now, we will focus only on the regular expression we require for the following problem:


http://www.exampleshop.co.uk/products.php?id=7

to


http://www.exampleshop.co.uk/buy/organic-fruit/apples

We could successfully use the basic rewrite technique from the start of the article to create one single rule to rewrite this URL. However, in this situation we will assume, as in the previous section, that there are multiple products. We do not want to create individual rewrite rules for every product as this would be very time consuming.

The problem here, which can’t be easily solved with Apache Rewrite Rules, is finding the name of the product related to ‘id=7’ in our database.

This can be done with a RewriteMap of type prg (External Rewriting Program), but I highly recommend against doing so, as this technique can cause countless problems such as buffering issues, random results being returned and many more undesirables.

The problem will need to be resolved through the database and backend code. My recommended solution would be to add a new column to your database table labelled something similar to ‘product_name’ which will contain the name of the product.

Your URL would then become, for example:


http://www.exampleshop.co.uk/products.php?product_name=apples

The work required to make this change is minimal. Even if your website’s backend code relies heavily on the ‘id’ variable from the URL, you can still obtain this variable easily by querying the database at the start of your code to assign a variable for the ID where ‘product_name’ is equal to ‘apples’.

In the previous section we created a dynamic rewrite rule using back references and a regular expression in our pattern to detect multiple digits. We are now going to create a similar rule using back references with a different regular expression pattern for use with our new ‘product_name’ variable.


RewriteRule ^buy/organic-fruit/([a-z]+)/?$ products.php?product_name=$1 [NC,L]

Previously our range was [0-9]. Because our new variable product_name contains letters instead of numerals, we have changed the range to [a-z] to detect all lower case letters between a and z. Because the rule also uses the NC flag, uppercase letters will match as well.

What if we want our range to also detect numerals, uppercase letters and even hyphens for product names with a hyphen separator such as ‘apple-juice’?

You can simply add more characters to the range like so: [a-zA-Z0-9-]. Notice that the hyphen has been added to the end of the range; it is placed at the end so that it is treated literally rather than as a range separator.
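
Put together, a rule using the extended range might look like the sketch below. It reuses the ‘product_name’ variable from earlier in this section.

```apache
RewriteEngine On

# /buy/organic-fruit/apple-juice  is served by products.php?product_name=apple-juice
# The trailing hyphen inside [] is a literal hyphen, not a range separator
RewriteRule ^buy/organic-fruit/([a-zA-Z0-9-]+)/?$ products.php?product_name=$1 [NC,L]
```

Keeping the range restricted to letters, digits and hyphens also means that unexpected characters in the URL simply fail to match the rule, rather than being passed through to products.php.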

SPECIAL CHARACTERS TO BE AWARE OF WHEN USING REGULAR EXPRESSIONS

There are certain characters known as Special Characters that we need to be aware of when using regular expressions as they have special meanings.

A good example of a regular expression special character is the period ‘.’ character.

It is quite common for a period to be used in a pattern that includes a file, for example:


RewriteRule ^index.html/?$ index.php [NC,L]

This rewrite rule will work in substituting index.html for index.php.
The problem is that it will also work for substituting index^html, indexshtml, index1html for index.php.

This occurs because the period is a special character in a regular expression with the special meaning: “Any character”. Therefore, the rewrite rule will work with any character that substitutes the position of the period.

To use a period as a Literal Character (i.e. without its special meaning) in a regular expression you need to ‘escape’ the period with a preceding backslash: ‘\.’.
For example:


RewriteRule ^index\.html/?$ index.php [NC,L]

Note that we do not need to escape the period used in the substitution, as it is only the pattern that is treated as a regular expression in the rewrite rule.

Regular Expression Special Characters include:

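  • * (zero or more of the preceding)
  • . (any character)
  • + (one or more of the preceding)
  • {} (minimum to maximum quantifier)
  • ? (zero or one of the preceding, or the non-greedy modifier when placed after another quantifier)
  • ! (negative modifier; in Apache this is a prefix that negates a whole rewrite pattern or condition rather than a regular expression character)
  • ^ (start of a string, or ‘negative’ if used at the start of a range)
  • $ (end of a string)
  • [] (match any of contents)
  • - (range when used between square brackets)
  • () (capturing group, used for backreferences)
  • | (or)
  • \ (the escape character)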

The escape character (backslash) itself also needs to be escaped when used as a literal character: ‘\\’. Depending on the programming language and parser, you may need to use four backslashes ‘\\\\’ instead of two, because the programming language may also be using the backslash as an escape character.
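
To illustrate a few of the special characters listed above working together, here is a small sketch. The script name ‘archive.php’ and its ‘section’ and ‘year’ parameters are hypothetical and exist only for this example.

```apache
RewriteEngine On

# (news|blog)  the | matches either "news" or "blog" and captures it as $1
# [0-9]{4}     the {} quantifier requires exactly four digits, captured as $2
# /?           the ? makes the trailing slash optional
# /blog/2016/  is served by archive.php?section=blog&year=2016
RewriteRule ^(news|blog)/([0-9]{4})/?$ archive.php?section=$1&year=$2 [NC,L]
```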

FLAGS

Throughout this article we have used the flags [NC,L] in our examples to notify Apache that our rewrite rules are case insensitive, and that no following rules should be applied when the current rewrite rule is in use.

However, these are not the only flags at our disposal when creating rewrite rules. The following flags can also be used in your rules to notify Apache of other information:

  • C (chained with next rule)
  • CO=cookie (set specified cookie, replace cookie with value)
  • E=var:variable (set environment variable var, replace variable with value)
  • F (forbidden, sends a 403 code in header)
  • G (gone, sends a 410 code in header)
  • H=handler (set handler, replace handler with value)
  • L (last, stop processing further rules)
  • N (next, restart rule processing from the first rule)
  • NC (case insensitive)
  • NE (do not escape special URL characters in output)
  • NS (ignore this rule if this is a subrequest)
  • P (proxy, pass the request to another server via mod_proxy)
  • PT (pass through, hand the result on to other URL mapping such as Alias)
  • R (redirect is temporary, sends 302 code in header)
  • R=301 (redirect is permanent, sends 301 code in header)
  • QSA (append the original query string to the substituted URL)
  • S=x (skip next x rules, replace x with value)
  • T=mime-type (force a specified mime type, replace mime-type with value)
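
A few of these flags in use, as a rough sketch. The file and folder names (‘old-page.html’, ‘new-page’, ‘discontinued’) are made up for illustration.

```apache
RewriteEngine On

# R=301: tell the browser the page has moved permanently to /new-page
RewriteRule ^old-page\.html$ /new-page [R=301,L]

# F: refuse any request ending in .ini with a 403 response
# (a substitution of "-" means "do not change the URL")
RewriteRule \.ini$ - [F]

# G: answer anything under /discontinued/ with a 410 Gone response
RewriteRule ^discontinued/ - [G]
```

Unlike the internal rewrites used elsewhere in this article, a rule with the R flag is visible to the visitor: the address in their browser changes to the new URL.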

CONDITIONS / EXCEPTION CASES

Rewrite Rules can also be preceded by a Rewrite Condition. A Rewrite Condition is a conditional query that must be matched for the subsequent Rewrite Rule to be applied.

Many websites do not want other websites to be able to hot link to their hosted images. This is a prime example of a scenario where a Rewrite Condition could be used.


RewriteEngine On
RewriteCond %{HTTP_REFERER} !^(.*)?codesmite\.com/ [NC]
RewriteRule .*\.(jpe?g|gif|bmp|png)$ http://www.codesmite.com/no-hot-linking.gif [L]

The example above uses the server variable HTTP_REFERER, a variable set by the server that contains the URL of the referring page.
If the HTTP_REFERER variable does not contain the string codesmite.com/ the condition will match.
The subsequent Rewrite Rule will then be evaluated.
If the requested URL ends with any of the following file extensions: jpg, jpeg, gif, bmp or png, the Rewrite Rule pattern will match, and the request will be redirected to a picture named no-hot-linking.gif containing the message "Please do not hot link to images on our website".
Because of the Rewrite Condition, the Rewrite Rule will only be applied if the HTTP_REFERER server variable does not contain codesmite.com. This means images viewed on codesmite.com will not be affected.
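
A slightly more defensive variant is sketched below. It is an assumption about what a production setup might want rather than part of the original example: it lets through requests that send no referrer at all, and it excludes the placeholder image itself so that a redirected request cannot be redirected again. Consecutive Rewrite Conditions must all match for the rule to apply.

```apache
RewriteEngine On

# Let through requests with an empty Referer (direct visits, some privacy tools)
RewriteCond %{HTTP_REFERER} !^$
# Let through requests coming from our own pages
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?codesmite\.com/ [NC]
# Never rewrite the placeholder itself, to avoid a redirect loop
RewriteCond %{REQUEST_URI} !/no-hot-linking\.gif$ [NC]
RewriteRule \.(jpe?g|gif|bmp|png)$ http://www.codesmite.com/no-hot-linking.gif [NC,L]
```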

Rewrite Conditions can also be used to test a string without the use of a Regular Expression or pattern. The following list contains Exceptions or Special Cases that can be used to test a Rewrite Condition string:

  • <pattern (is test string less than pattern)
  • >pattern (is test string greater than pattern)
  • =pattern (is test string equal to pattern)
  • -d (is test string a valid directory)
  • -f (is test string a valid file)
  • -s (is test string a valid file with a size greater than zero)
  • -l (is test string a symbolic link)
  • -F (is test string a valid file, and accessible via a sub-request)
  • -U (is test string a valid URL, and accessible via a sub-request)

The example below would test if the string is a valid file:


RewriteCond %{REQUEST_FILENAME} -f
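
These tests are often negated with a leading ! so that a rewrite only happens when no real file or directory exists at the requested path. A minimal sketch, reusing the product rule from earlier in this article:

```apache
RewriteEngine On

# Only rewrite when the request does not point at an existing file...
RewriteCond %{REQUEST_FILENAME} !-f
# ...and does not point at an existing directory
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^buy/organic-fruit/([a-z0-9-]+)/?$ products.php?product_name=$1 [NC,L]
```

With these conditions in place, real files such as stylesheets and images continue to be served directly, even if their paths happen to match the pattern.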

SUMMARY

If you also use NGINX Web Server, make sure to check out my article on how to create Clean URL Rewrites Using NGINX.

If you require any help creating a specific rewrite rule or require any further information, please feel free to post your question in the comments.

BY CODIN

Owner of Codesmite.com

Codin is a self-taught web developer based in London, UK.
Over the years he has dedicated a lot of time to helping new developers, becoming a well-known moderator at Team Treehouse.