Unicode development under Apache

One of my current projects is to port an application to Japanese. The first port is always the hardest1, so I’ve learned a few things in the process. I’m going to accumulate a few of my successes in this blog category. The first and most significant is that the way encodings work in HTTP/HTML is weird!

Take a peek at this slide from a talk by Sam Ruby, which shows an example HTML page with conflicting metadata. When there are conflicting directives indicating which encoding to use for the document, can you guess which one wins? You may be surprised to learn that the encoding specified in the HTTP Content-Type has precedence over the encoding declared in the HTML file! That is to say, if your HTML document claims

    <meta http-equiv="Content-Type"
              content="text/html; charset=Shift-JIS" />

and Apache says

    Content-Type: text/html; charset=ISO-8859-1

then Apache wins and your Japanese page will be rendered as Latin-1 in the browser and will likely be garbled. Apache’s out of the box configuration often includes a default encoding2 which may or may not be right.

There are two solutions to this problem:

  1. Make Apache ignore encoding
  2. Use exactly one encoding everywhere and always

The latter is good practice, but the former is easier. To make Apache ignore encoding, search your httpd.conf file for any AddDefaultCharset lines and removing them.

In our project, we chose the other route, making the obvious choice to use UTF-8 everywhere. We added this line to Apache:

    AddDefaultCharset UTF-8

and these lines to all HTML and XHTML files, respectively:

    <meta http-equiv="Content-Type"
           content="text/html; charset=utf-8" >
    <meta http-equiv="Content-Type" 
           content="application/xhtml+xml; charset=utf-8" />

Then, the major remaining hurdle was to ensure that all of our development tools actually read and write UTF-8. That will be the subject of a future post.

1 I’ve found this to be universally true for language, hardware, OS, API and other types of porting.

2 Two data points:

  1. Our main webserver runs RedHat, whose Apache had AddDefaultCharset ISO-8859-1
  2. The default Apache configuration under Mac OS X does not include a default character set. Good job Apple!

Ideas and tools to improve programming throughput.