Scatter/Gather thoughts

by Johan Petersson

Blocking bad bots without mod_rewrite

Jeremy Zawodny is trying to cope with his comment spammer problems by blocking empty referrers from POSTing comments. In one of the comments Kasia Trapszo takes it a step further by requiring a good referrer (one matching the site). Both are using the Apache extension mod_rewrite.
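For context, the technique under discussion looks roughly like this (a sketch of the idea, not the exact rules from the linked posts): forbid any POST request that arrives without a Referer header.

# Sketch only: forbid POSTs with an empty Referer header
RewriteEngine On
RewriteCond %{REQUEST_METHOD} ^POST$
RewriteCond %{HTTP_REFERER}   ^$
RewriteRule .* - [F]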

The Apache URL Rewriting Engine seems to grow more popular each year. It certainly is a powerful, handy, and cool tool. In some circumstances I even think it's the best one for the task at hand. But using mod_rewrite is difficult for non-programmers, and the syntax is rather cryptic even for programmers. That means it's easy to make mistakes, and if used extensively, the resulting mess of rewrite recipes can be hard to maintain.

Many shared hosting companies don't offer mod_rewrite functionality to their customers. Often this is simply because the mod_rewrite module is not compiled by default; you need to explicitly enable it when building Apache. In other cases, mod_rewrite support is omitted despite customer requests. One reason is that rewriting mistakes can affect the whole web server, e.g. by introducing infinite loops.

There are also concerns about performance. A mod_rewrite recipe will require matching strings against several regular expressions. Considering the alternatives, this per-request regex matching is fairly efficient and performance will typically not suffer. However, rewriting isn't free and you shouldn't do it unnecessarily.

I frequently see people using mod_rewrite where simpler and easier alternatives exist, so I'll show some examples of what's possible with just common Apache modules. In the Apache 1.3 URL rewriting guide, there's an example of canonicalizing hostnames (slightly edited for clarity):

# Redirect http://example.com/ to http://www.example.com/
RewriteEngine On
RewriteCond %{HTTP_HOST}   !^www\.example\.com [NC]
RewriteCond %{HTTP_HOST}   !^$
RewriteRule ^/(.*)         http://www.example.com/$1 [L,R]

If you're stuck on a shared hosting account with a virtual host configured for both example.com and www.example.com, this might be a reasonable way to make all requests use the same host name. The preferred way (the configuration a competent hosting company will use) is to configure a separate virtual host for non-canonical host names:

<VirtualHost 192.0.34.166:80>
ServerName        example.com
RedirectPermanent / http://www.example.com/
</VirtualHost>

This needs to be done in the server configuration file rather than in a .htaccess file, but it's much easier to understand as well as more efficient: no per-request regex matching is needed and there's zero overhead for the common case. The rewrite recipe also behaves subtly differently; it needs to be changed if the site uses a non-standard port, for example.

For the simplest redirects, there's no reason to use anything other than the Redirect directives. When you just need some regular expression matching and replacement in the URL path, prefer RedirectMatch over rewriting tricks:
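For instance, a one-off moved page needs nothing more than a plain Redirect (the paths here are made up for illustration):

# Simple prefix-based redirect: no regular expressions involved
Redirect permanent /old-news http://www.example.com/news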

RedirectMatch (.*)\.(gif|jpe?g|png)$ http://images.example.com$1.$2

Perhaps the most gratuitous use of mod_rewrite I have ever seen was to block requests based on the client IP address:

RewriteCond %{REMOTE_ADDR} ^205\.209\.177\.
RewriteRule .* - [F]

...which is an incredibly complex and inefficient way to say:

Deny from 205.209.177

The Allow and Deny directives also take subnet restrictions using CIDR notation and are quite capable of dealing with most access restrictions, provided you set environment variables based on request characteristics:

# Known mail harvesters
SetEnvIf User-Agent EmailCollector BAD_BOT
SetEnvIf User-Agent CherryPicker   BAD_BOT

# Code Red and Nimda
SetEnvIf Request_URI ^/default\.ida BAD_BOT=worm
SetEnvIf Request_URI root\.exe      BAD_BOT=worm

# Referrer spam
SetEnvIfNoCase Referer ^http://(www\.)?xopy\.com  BAD_BOT=spammer
SetEnvIfNoCase Referer ^http://(www\.)?aizzo\.com BAD_BOT=spammer

# Bad bot, no cookie!
Order Allow,Deny
Allow from all
Deny from env=BAD_BOT

This functionality can be combined with <Limit> blocks to restrict only certain methods (like POST). You can get fancier still and match against previously set environment variables, for basic conditional processing:
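As a sketch of that combination (reusing the BAD_BOT variable from the example above), a <Limit> block confines the denial to POST requests, so flagged clients can still read the site but not submit comments:

# Flagged clients may GET but not POST
<Limit POST>
Order Allow,Deny
Allow from all
Deny from env=BAD_BOT
</Limit>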

SetEnvIf BAD_BOT spammer NOLOG

Having such environment variables also allows you to ignore the corresponding log entries, or put them in a separate log file:

CustomLog /var/httpd/logs/access.log combined env=!NOLOG
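A companion line (with a hypothetical file name) sends the ignored entries to a log of their own instead of discarding them entirely:

# Log the flagged requests separately rather than dropping them
CustomLog /var/httpd/logs/badbots.log combined env=NOLOG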

Setting the environment variables still incurs per-request regex matching, and there's always the risk of matching too much with the regular expressions. However, I find the environment variables much less error-prone and easier to maintain than mod_rewrite recipes. By all means use mod_rewrite when it's needed (and available), but it's not always the easiest way, and you can get pretty far without it.

9 January, 2005