Unicode development under Apache

One of my current projects is to port an application to Japanese. The first port is always the hardest[1], so I’ve learned a few things in the process. I’m going to accumulate a few of my successes in this blog category. The first and most significant is that the way encodings work in HTTP/HTML is weird!

Take a peek at this slide from a talk by Sam Ruby, which shows an example HTML page with conflicting metadata. When there are conflicting directives indicating which encoding to use for the document, can you guess which one wins? You may be surprised to learn that the encoding specified in the HTTP Content-Type has precedence over the encoding declared in the HTML file! That is to say, if your HTML document claims

    <meta http-equiv="Content-Type"
              content="text/html; charset=Shift-JIS" />

and Apache says

    Content-Type: text/html; charset=ISO-8859-1

then Apache wins: your Japanese page will be decoded as Latin-1 in the browser and will likely be garbled. Apache’s out-of-the-box configuration often includes a default encoding[2], which may or may not be right.
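
To see the garbling outside a browser, the mismatch can be reproduced in a few lines of Python. This is only a sketch: the sample text is an arbitrary choice, and the codec names are Python's spellings of the two encodings above.

```python
# Sketch: the bytes on the wire are Shift_JIS, but the HTTP header
# claims ISO-8859-1, so the browser decodes them with the wrong table.
text = "日本語"                            # sample text (an assumption)
wire_bytes = text.encode("shift_jis")      # what the file really contains
garbled = wire_bytes.decode("iso-8859-1")  # what the header tells the browser to do

print(repr(garbled))                       # mojibake, not the original text

# Latin-1 assigns a character to every byte value, so no error is raised
# and the damage is silent. The bytes themselves are intact: decoding
# them with the correct codec recovers the text.
recovered = garbled.encode("iso-8859-1").decode("shift_jis")
print(recovered == text)
```

Because the wrong decoding never fails, nothing in the server logs hints at the problem; it only shows up on screen.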

There are two solutions to this problem:

  1. Make Apache ignore encoding
  2. Use exactly one encoding everywhere and always

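The latter is good practice, but the former is easier. To make Apache ignore encoding, search your httpd.conf file for any AddDefaultCharset lines and remove them (or set AddDefaultCharset Off). With no charset in the HTTP header, the browser falls back to the encoding declared in the HTML file.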
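
The search itself can be scripted. The following is a minimal sketch, not a full Apache config parser: the function name `find_default_charsets` and the sample configuration are made up for illustration, and it does not follow Include directives or look inside .htaccess files.

```python
import re

# Matches an active (uncommented) AddDefaultCharset directive.
# Apache directive names are case-insensitive.
DIRECTIVE = re.compile(r"^\s*AddDefaultCharset\s+(\S+)", re.IGNORECASE)

def find_default_charsets(conf_text):
    """Return (line_number, value) for each active AddDefaultCharset line."""
    hits = []
    for number, line in enumerate(conf_text.splitlines(), start=1):
        match = DIRECTIVE.match(line)
        if match:
            hits.append((number, match.group(1)))
    return hits

# Hypothetical configuration text, standing in for a real httpd.conf.
sample_conf = """\
ServerName example.test
# AddDefaultCharset UTF-8
AddDefaultCharset ISO-8859-1
"""
print(find_default_charsets(sample_conf))  # [(3, 'ISO-8859-1')]
```

For a real server, read the file contents with `open(path).read()` first; where httpd.conf lives depends on the installation.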

In our project, we chose the other route, making the obvious choice to use UTF-8 everywhere. We added this line to Apache’s httpd.conf:

    AddDefaultCharset UTF-8

and these lines to all HTML and XHTML files, respectively:

    <meta http-equiv="Content-Type"
          content="text/html; charset=utf-8">
    <meta http-equiv="Content-Type"
          content="application/xhtml+xml; charset=utf-8" />
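
One way to confirm that the two declarations agree is to extract the charset from each and compare them. The sketch below uses only the Python standard library; the names `charset_of`, `MetaFinder` and `declared_charsets` are hypothetical helpers, and the header value is passed in as a string rather than fetched from a live server.

```python
from html.parser import HTMLParser
import re

# Pulls the charset parameter out of a Content-Type value.
CHARSET = re.compile(r"charset\s*=\s*\"?([^\s;\"]+)", re.IGNORECASE)

def charset_of(content_type):
    """Return the lowercased charset in a Content-Type value, or None."""
    match = CHARSET.search(content_type or "")
    return match.group(1).lower() if match else None

class MetaFinder(HTMLParser):
    """Remembers the content of the first http-equiv Content-Type meta tag."""
    def __init__(self):
        super().__init__()
        self.content_type = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if (tag == "meta" and self.content_type is None
                and (attrs.get("http-equiv") or "").lower() == "content-type"):
            self.content_type = attrs.get("content")

def declared_charsets(header_value, html):
    """Return (charset from HTTP header, charset from meta tag)."""
    finder = MetaFinder()
    finder.feed(html)
    return charset_of(header_value), charset_of(finder.content_type)

page = '<meta http-equiv="Content-Type"\n content="text/html; charset=utf-8">'
print(declared_charsets("text/html; charset=UTF-8", page))  # ('utf-8', 'utf-8')
```

If the two values differ, the header value is the one the browser will use, so that is the side to fix first.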

Then, the major remaining hurdle was to ensure that all of our development tools actually read and write UTF-8. That will be the subject of a future post.

[1] I’ve found this to be universally true for language, hardware, OS, API and other types of porting.

[2] Two data points:

  1. Our main webserver runs Red Hat, whose Apache configuration had AddDefaultCharset ISO-8859-1.
  2. The default Apache configuration under Mac OS X does not include a default character set. Good job Apple!

One thought on “Unicode development under Apache”

  1. Two questions/thoughts:

    1. Can’t you put the content encoding in an .htaccess file? What I mean is: That would appear to be an option between “ignore all coding” and “use one coding everywhere”.

    See http://www.w3.org/International/questions/qa-htaccess-charset

    2. Does the <meta http-equiv> tag normally change the HTTP header info sent by Apache? I’m not as familiar with that as I should be, I guess. But, it seems like this might be a browser content-negotiation issue. (As in, Apache’s HTTP Headers say one thing, the HTML file headers say something else.)
