Working with UTF-8 on OSX

Wednesday, June 1st, 2005

One of my recent projects has been to port a Clotho application to Japanese. We chose this opportunity to convert the code base fully to Unicode. The other obvious choice would have been to port to Shift-JIS, a popular encoding for Japanese characters. However, we decided that this latter choice was not forward-looking enough. Choosing Unicode, specifically the UTF-8 encoding, will allow us to port to other languages more easily.
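To make the trade-off concrete, here is a minimal Python sketch (not part of the original port; the sample strings are arbitrary). It encodes the same Japanese text both ways, then tries a Korean string, which Shift-JIS cannot represent.

```python
# Illustrative only: the sample strings are arbitrary and not taken
# from the Clotho application.
japanese = "日本語"
korean = "한국어"

# Both encodings can represent Japanese text.
sjis_bytes = japanese.encode("shift_jis")
utf8_bytes = japanese.encode("utf-8")
print(len(sjis_bytes), len(utf8_bytes))

# UTF-8 can also encode other scripts; Shift-JIS has no Hangul.
korean_utf8 = korean.encode("utf-8")
try:
    korean.encode("shift_jis")
    sjis_handles_korean = True
except UnicodeEncodeError:
    sjis_handles_korean = False
print(sjis_handles_korean)
```

Shift-JIS is more compact for Japanese, but it stops working as soon as text from a script outside its repertoire shows up.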

This article collects what I learned while configuring my Mac to default to UTF-8 wherever possible. Most Mac OS X apps default to ISO-8859-1 (aka Latin-1) in the English locale that I use. Fortunately, most of the common utilities have a preference to change that default, although the preference is different in each app.

My Unix command line setup is still based on the default “C” locale (aka ASCII). In a future post I hope to explore switching that to UTF-8 as well.

General

Under OSX, Safari controls the general encoding prefs for the whole system.

  • open Safari Preferences
  • select Appearance tab
  • change Default Encoding to “Unicode (UTF-8)”

Firefox

  • open Firefox Preferences
  • select General tab
  • click Languages button
  • change Default Character Encoding to “Unicode (UTF-8)”

TextEdit

  • open TextEdit Preferences
  • change the Open pulldown from “Automatic” to “Unicode (UTF-8)”
  • change the Save pulldown from “Automatic” to “Unicode (UTF-8)”

When in doubt, open files with Cmd-O instead of double-clicking them.

BBEdit

Warning! BBEdit 6.5 has a bug: files opened via drag or double-click ignore the Open settings. When working with non-ASCII files, use Cmd-O. I have not tested any BBEdit versions newer than v6.5.

  • open BBEdit Preferences
  • select Text Files: Opening
  • check “Translate Line Breaks”
  • set “Interpret File Contents as:” to “UTF-8”
  • check “Warn of Malformed UTF-8 Files”
  • select Text Files: Saving
  • set “Default Line Breaks:” to “Unix”
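For files that never pass through BBEdit, a similar malformed-UTF-8 check can be scripted. The following Python sketch is only an approximation of the idea, not how BBEdit implements its warning, and the helper name is hypothetical.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# "café" saved as Latin-1 contains a lone 0xE9 byte, which is not valid UTF-8.
latin1_bytes = "café".encode("latin-1")
utf8_bytes = "café".encode("utf-8")
print(is_valid_utf8(latin1_bytes), is_valid_utf8(utf8_bytes))
```

Running a check like this over a source tree is one way to spot files that are still in Latin-1 after a conversion.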

Terminal

  • open Terminal Window Settings
  • select the Display options
  • check “Wide glyphs for Japanese/Chinese/etc.”
  • check “Wide glyphs count as 2 columns”
  • change “Character Set Encoding:” to “Unicode (UTF-8)”
  • click Use Settings as Defaults

Flash

(this is for Flash MX 2004)

  • open Flash Preferences
  • select the ActionScript tab
  • change Open/Import to “UTF-8”
  • change Save/Export to “UTF-8”

Warning! Flash does NOT obey this preference for files created outside of Flash! Flash only interprets files as Unicode if they have a Byte-Order Mark (BOM). However, BOMs are a pain in plain text files since they break some editors.

As a workaround we recommend putting all non-ASCII dynamic text in external XML files instead of directly in the ActionScript.
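If you do need to check for the BOM, or add or remove it, the UTF-8 BOM is just the three bytes EF BB BF at the start of the file. Here is a minimal Python sketch with hypothetical helper names, working on the file's raw bytes:

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """True if the data starts with the UTF-8 byte-order mark (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def add_utf8_bom(data: bytes) -> bytes:
    """Prepend the BOM unless it is already present."""
    return data if has_utf8_bom(data) else codecs.BOM_UTF8 + data


def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading BOM, e.g. for editors that choke on it."""
    return data[len(codecs.BOM_UTF8):] if has_utf8_bom(data) else data


source = "var label = \"日本語\";".encode("utf-8")
with_bom = add_utf8_bom(source)
print(has_utf8_bom(source), has_utf8_bom(with_bom))
```

To use these on a file, read and write it in binary mode so that no other encoding conversion happens along the way.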

Emacs

Upgrade to v21.3. This is newer than the version that is supplied by Apple for either 10.3 or 10.4. Fink provides this version (only in unstable as of this writing).

Add the following to your ~/.emacs file:

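;; Treat locale-dependent strings (system messages, time strings) as UTF-8
(setq locale-coding-system 'utf-8)
;; Send UTF-8 to the terminal and read UTF-8 from the keyboard
(set-terminal-coding-system 'utf-8)
(set-keyboard-coding-system 'utf-8)
;; Use UTF-8 for text exchanged through the selection/clipboard
(set-selection-coding-system 'utf-8)
;; Prefer UTF-8 when detecting or choosing a file's coding system
(prefer-coding-system 'utf-8)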