Category Archives: I18N

Explorations into Internationalization (I18N), Localization (L10N) and Unicode.

Working with UTF-8 on OSX

One of my recent projects has been to port a Clotho application to Japanese. We took this opportunity to convert the code base fully to Unicode. The other obvious choice would have been to port to Shift-JIS, a popular encoding for Japanese characters. However, we decided that this latter choice was not forward-looking enough. Choosing Unicode, specifically the UTF-8 encoding, will allow us to port to other languages more easily.

This article collects what I've learned while configuring my Mac to default to UTF-8 wherever possible. Most Mac OS X apps default to ISO-8859-1 (aka Latin-1) in the English locale that I use. Fortunately, most of the common utilities have preferences to change that default, although the prefs are different for each app.

My Unix command line setup is still based on the default “C” locale (aka ASCII). In a future post I hope to explore switching that to UTF-8 as well.
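For reference, a minimal sketch of what that switch might look like, assuming your system provides an en_US.UTF-8 locale (verify with `locale -a` first):

```shell
# Hypothetical additions to ~/.profile -- assumes the en_US.UTF-8
# locale exists on your system; check `locale -a` before using it.
export LANG=en_US.UTF-8
export LC_CTYPE=en_US.UTF-8
```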

General

Under OSX, Safari controls the general encoding prefs for the whole system.

  • open Safari Preferences
  • select Appearance tab
  • change Default Encoding to “Unicode (UTF-8)”

Firefox

  • open Firefox Preferences
  • select General tab
  • click Languages button
  • change Default Character Encoding to “Unicode (UTF-8)”

TextEdit

  • open TextEdit Preferences
  • change the Open pulldown from “Automatic” to “Unicode (UTF-8)”
  • change the Save pulldown from “Automatic” to “Unicode (UTF-8)”

When in doubt, open files with Cmd-O instead of double-clicking them.

BBEdit

Warning! BBEdit 6.5 has a bug in which files opened via drag-and-drop or double-click ignore the Open settings. When working with non-ASCII files, use Cmd-O. I have not tested any BBEdit versions newer than 6.5.

  • open BBEdit Preferences
  • select Text Files: Opening
  • check “Translate Line Breaks”
  • set “Interpret File Contents as:” to “UTF-8”
  • check “Warn of Malformed UTF-8 Files”
  • select Text Files: Saving
  • set “Default Line Breaks:” to “Unix”

Terminal

  • open Terminal Window Settings
  • select the Display options
  • check “Wide glyphs for Japanese/Chinese/etc.”
  • check “Wide glyphs count as 2 columns”
  • change “Character Set Encoding:” to “Unicode (UTF-8)”
  • click Use Settings as Defaults

Flash

(this is for Flash MX 2004)

  • open Flash Preferences
  • select the ActionScript tab
  • change Open/Import to “UTF-8”
  • change Save/Export to “UTF-8”

Warning! Flash does NOT obey this preference for files created outside of Flash! Flash only interprets files as Unicode if they have a Byte-Order Mark (BOM). However, BOMs are a pain in plain text files since they break some editors.

As a workaround we recommend putting all non-ASCII dynamic text in external XML files instead of directly in the ActionScript.
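If you do need to feed Flash an external UTF-8 text file, another option is to prepend the BOM yourself. A sketch, with made-up filenames for illustration:

```shell
# Placeholder ActionScript file (made up for illustration).
printf 'trace("hello");\n' > script.as

# Prepend a UTF-8 BOM (bytes EF BB BF, written here in octal, which
# plain sh printf supports) so Flash recognizes the file as Unicode.
printf '\357\273\277' | cat - script.as > script_bom.as

# Verify: the first three bytes should now be ef bb bf.
head -c 3 script_bom.as | od -An -tx1
```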

Emacs

Upgrade to v21.3. This is newer than the version that is supplied by Apple for either 10.3 or 10.4. Fink provides this version (only in unstable as of this writing).

Add the following to your ~/.emacs file:

(setq locale-coding-system 'utf-8)
(set-terminal-coding-system 'utf-8)
(set-keyboard-coding-system 'utf-8)
(set-selection-coding-system 'utf-8)
(prefer-coding-system 'utf-8)

Unicode development under Apache

One of my current projects is to port an application to Japanese. The first port is always the hardest,[1] so I've learned a few things in the process. I'm going to accumulate a few of my successes in this blog category. The first and most significant is that the way encodings work in HTTP/HTML is weird!

Take a peek at this slide from a talk by Sam Ruby, which shows an example HTML page with conflicting metadata. When there are conflicting directives indicating which encoding to use for the document, can you guess which one wins? You may be surprised to learn that the encoding specified in the HTTP Content-Type has precedence over the encoding declared in the HTML file! That is to say, if your HTML document claims

    <meta http-equiv="Content-Type"
              content="text/html; charset=Shift-JIS" />

and Apache says

    Content-Type: text/html; charset=ISO-8859-1

then Apache wins and your Japanese page will be rendered as Latin-1 in the browser and will likely be garbled. Apache's out-of-the-box configuration often includes a default encoding,[2] which may or may not be right.
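You can see which charset your server actually declares by inspecting the response headers, e.g. with curl. A sketch, assuming a server is running and using a placeholder URL:

```shell
# -s: quiet, -I: fetch headers only. Replace the URL with your own
# server; the charset parameter shown here is the one that wins.
curl -sI http://localhost/ | grep -i '^content-type'
```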

There are two solutions to this problem:

  1. Make Apache ignore encoding
  2. Use exactly one encoding everywhere and always

The latter is good practice, but the former is easier. To make Apache ignore encoding, search your httpd.conf file for any AddDefaultCharset lines and remove them.
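A quick way to locate those lines, demonstrated here on a made-up sample file since the real config path varies by installation (e.g. /etc/httpd/httpd.conf or /usr/local/apache2/conf/httpd.conf):

```shell
# Sample httpd.conf fragment (made up) to show what to look for.
printf 'AddDefaultCharset ISO-8859-1\n' > httpd.conf.sample

# List any AddDefaultCharset directives, with line numbers; point
# this at your real httpd.conf.
grep -n 'AddDefaultCharset' httpd.conf.sample
```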

In our project, we chose the other route, making the obvious choice to use UTF-8 everywhere. We added this line to Apache:

    AddDefaultCharset UTF-8

and these lines to all HTML and XHTML files, respectively:

    <meta http-equiv="Content-Type"
           content="text/html; charset=utf-8" >
    <meta http-equiv="Content-Type" 
           content="application/xhtml+xml; charset=utf-8" />

Then, the major remaining hurdle was to ensure that all of our development tools actually read and write UTF-8. That will be the subject of a future post.


[1] I’ve found this to be universally true for language, hardware, OS, API and other types of porting.

[2] Two data points:

  1. Our main webserver runs RedHat, whose Apache had AddDefaultCharset ISO-8859-1
  2. The default Apache configuration under Mac OS X does not include a default character set. Good job Apple!