When caches go bad

In general, caching is a good thing. Creating a temporary local copy of some data can dramatically speed up repeated access to that data, especially if it’s hard to recreate or slow to access. However, if the local copy falls out of sync with the real data it can sometimes be disastrous. Most of the hard work in caching is finding ways to ensure that the cached copy is fresh. This is called ensuring “cache coherency”.

For example, web browsers ask web servers whether the remote content is newer than the cached copy. If yes, the cache is discarded and a new copy is fetched. If not, the cache is used. This saves time because the simple “Is it fresh?” question is only a few bytes instead of the full size of the content. IE 5 had a painful misfeature in that it didn’t send the “Is it fresh?” question for subsidiary files, like .css, .js, .jpg, etc. files. So, even if the main .html page was up to date, sometimes IE used the wrong version of the supporting data.
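The server side of that exchange can be sketched in a few lines of Python. This is a model of the conditional-GET decision, not any real server’s code; the function name `freshness_status` is made up for illustration:

```python
from email.utils import parsedate_to_datetime

def freshness_status(if_modified_since, last_modified):
    """Return the HTTP status a server sends for a conditional GET.

    if_modified_since -- the client's If-Modified-Since header value
    last_modified     -- the resource's actual modification time
    """
    client_time = parsedate_to_datetime(if_modified_since)
    server_time = parsedate_to_datetime(last_modified)
    if server_time <= client_time:
        return 304  # Not Modified: the cached copy is fresh, reuse it
    return 200      # Modified: send the full content again

# The "Is it fresh?" question costs a few header bytes, not the whole file.
print(freshness_status("Sat, 01 Jan 2005 10:00:00 GMT",
                       "Sat, 01 Jan 2005 09:00:00 GMT"))  # 304
print(freshness_status("Sat, 01 Jan 2005 10:00:00 GMT",
                       "Sat, 01 Jan 2005 11:00:00 GMT"))  # 200
```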

Flash caches incorrectly

Last week I discovered a colossal caching problem with Flash MX 2004 (the authoring software, not the Flash Player that is plugged into your browser). When switching to MX 2004, most users (me included) notice that compile time is dramatically improved. One of the many reasons is that Flash caches object files behind the scenes. That is, if you have a class defined in a .as file, Flash compiles it to a .aso file and stores that object file deep in a prefs directory. On subsequent compilations, if the .as file is older than the .aso file, the latter is reused, possibly saving several seconds.

Well, Macromedia screwed it up. By making the test be “Is it older?” they opened themselves up to a cache coherency problem. Consider this scenario:

Say, Alice and Bob are working on the same CVS project. Alice checks out some code one morning and starts programming. She makes some changes to Foo.as. Meanwhile, Bob speedily makes a critical bug fix to Bar.as and commits his change. An hour later, Alice finishes her Foo.as work and compiles her .fla. It fails because of the bug in Bar.as. Bob tells her, “I fixed that bug. Just do a cvs update and recompile.” She does, but the bug persists.

Why? It’s all about timestamps. Here’s what happened:

  • Alice checked out Bar.as with timestamp of 10:00 AM
  • Bob edited and checked in Bar.as with timestamp of 10:30 AM
  • Alice compiled at 11:00 AM, creating a Bar.aso file with an 11:00 AM timestamp
  • Alice cvs updated at 11:30 AM, getting Bar.as with a 10:30 AM timestamp
  • Alice recompiled her .fla, but Bar.aso is “newer” than Bar.as, so Flash ignored Bar.as and used the buggy Bar.aso instead.

So, Alice is stuck with an out-of-date .aso file that Flash won’t fix.
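The failure mode is easy to demonstrate with a few lines of Python. This is a simulation of the “is it older?” test, not Flash’s actual code; `flash_would_reuse` is a hypothetical name:

```python
import os
import tempfile

def flash_would_reuse(as_path, aso_path):
    # The flawed test: reuse the object if the source is not newer.
    return os.path.getmtime(as_path) <= os.path.getmtime(aso_path)

d = tempfile.mkdtemp()
src = os.path.join(d, "Bar.as")
obj = os.path.join(d, "Bar.aso")
open(src, "w").close()
open(obj, "w").close()

# Alice compiles (stamping Bar.aso at "11:00"), then cvs update delivers
# Bar.as carrying Bob's fix -- but with a "10:30" timestamp.
os.utime(obj, (1000000, 1000000))   # Bar.aso stamped "11:00 AM"
os.utime(src, (998200, 998200))     # Bar.as stamped "10:30 AM"

print(flash_would_reuse(src, obj))  # True -- the stale, buggy .aso is reused
```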


There are two easy workarounds to this problem, but both of them require manual intervention.

  1. Touch the .as file

    By editing the .as files, you make them newer again and Flash rebuilds the .aso files. Instead of manually editing, the Unix “touch” command comes in handy. Here’s a one-line command to touch all of your project .as files to ensure they are the newest:

    find . -name '*.as' | xargs touch

    If you do a lot of ActionScript work, though, this is a weak workaround because you may have updated .as files somewhere else in your classpath.

  2. Blow away the .aso files

    Like in the browser world, the answer here is “Clear your cache!” By deleting the .aso files you can force Flash to rebuild them all, at the expense of compile time. The trick is that you need to know where they are. Here’s a command that works on Mac OS X only (all one line):

    find ~/Library/Application*Support/Macromedia/Flash*MX*2004/en/Conf*/Classes \
        -name '*.aso' | perl -lpe 'unlink $_'

You have to remember to run one of these every time you have a case where you might introduce an older-but-changed .as file (often from a source repository like CVS or from a network drive).
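Both workarounds translate directly to Python for platforms without find, xargs, or perl. This is a sketch; the function names are made up, and you must point `clear_aso_cache` at whatever cache directory your platform actually uses:

```python
import os

def touch_as_files(root):
    """Workaround 1: bump every .as mtime so Flash rebuilds the .aso files."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.endswith(".as"):
                os.utime(os.path.join(dirpath, name), None)  # None = now

def clear_aso_cache(cache_root):
    """Workaround 2: delete the cached .aso object files outright."""
    removed = []
    for dirpath, _dirs, files in os.walk(cache_root):
        for name in files:
            if name.endswith(".aso"):
                path = os.path.join(dirpath, name)
                os.unlink(path)
                removed.append(path)
    return removed
```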

What Macromedia should do

How can Macromedia fix this problem? Very simply! This problem was solved by caching C compilers. There are two popular ways to ensure synchronicity between source and object:

  1. Make the object have the EXACT same timestamp as the source
  2. Cache some representation of the source with the object

In the former case, the algorithm for cache testing is this:

if  (timestamp(object file) == timestamp(source file))
    reuse object file
else
    recompile source file
    set timestamp(object file) = timestamp(source file)

In any POSIX environment, this timestamp setting can be accomplished with the stat() and utime() functions.
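In Python, which wraps those same calls, the whole algorithm fits in a few lines. This is a sketch of the fix, not anything Macromedia ships:

```python
import os

def cache_is_fresh(source, obj):
    """Cache test: the object must carry the source's EXACT timestamp."""
    return (os.path.exists(obj)
            and os.stat(source).st_mtime == os.stat(obj).st_mtime)

def stamp_object_like_source(source, obj):
    """After compiling, copy the source's timestamps onto the object,
    using stat() and utime() as in any POSIX environment."""
    st = os.stat(source)
    os.utime(obj, (st.st_atime, st.st_mtime))
```

With this scheme, a cvs update that delivers an older-but-changed source file changes the timestamp, so the stale object is correctly rejected.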

This exactitude of timestamps can be problematic when the source and object files are on different drives. For example, if one is on a network drive and the other on a local drive, and the two computers disagree on the current time, the timestamps may not match1.

A better solution is to keep a representation of the source along with the object. Copying the whole source is possible, but wasteful. Instead, a hash of the source can be computed and stored. Then, the next time you want to compile, recompute the hash (much faster than recompiling) and compare. If it matches, use the object; otherwise recompile. The fantastic ccache project uses MD4 hashes of C source code (after macro interpolation) to detect cache hits and misses.
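A minimal sketch of the hash-the-source approach in Python (ccache uses MD4; hashlib’s MD5 stands in here, and the `.hash` sidecar-file convention is an assumption of this sketch):

```python
import hashlib
import os

def source_hash(path):
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def needs_recompile(source, obj):
    """Compare the stored hash against the source's current hash."""
    hash_file = obj + ".hash"
    current = source_hash(source)
    if os.path.exists(obj) and os.path.exists(hash_file):
        with open(hash_file) as f:
            if f.read() == current:
                return False  # cache hit: timestamps are irrelevant
    return True

def record_compile(source, obj):
    """After a successful compile, store the source's hash beside the object."""
    with open(obj + ".hash", "w") as f:
        f.write(source_hash(source))
```

Because the test compares content rather than timestamps, Alice’s scenario above would have been caught: the updated Bar.as hashes differently no matter what its mtime says.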


This is an easy mistake to make. For the vast majority of users, Macromedia’s caching solution is a pure win. It’s only in cases where timestamps do not increase monotonically that problems appear.

The best advice for programmers trying to implement a caching solution is be paranoid and consider your failure cases first. If you cannot judge your failure cases, or cannot afford any failure, you shouldn’t cache at all. In cases like compilation where a correct solution is paramount, correct-but-slow data is always better than fast-but-crappy data. This is even more true in financial settings. Imagine a scenario where ATMs cached data about account balances. All a gang of crooks would need to do is hit two ATMs at the same time. The same can go for e-voting machines without paper trails. In some cases, it’s better to just not cache because you can spend so much developer effort ensuring that the cache is good that you lose time making the process good in general.

1 This is a problem I’ve experienced with SMB. Generally, I try to allow a two-second timestamp window. If the machines are both NTP synched, then they should always agree to that level of precision, one hopes.

Working with UTF-8 on OSX

One of my recent projects has been to port a Clotho application to Japanese. We chose this opportunity to convert the code base fully to Unicode. The other obvious choice would have been to port to Shift-JIS, a popular encoding for Japanese characters. However, we decided that this latter choice was not forward-looking enough. Choosing Unicode, specifically the UTF-8 encoding, will allow us to port to other languages more easily.

This article is an accumulation of what I’ve learned while configuring my Mac to default to UTF-8 wherever possible. Most Mac OS X apps default to ISO-8859-1 (aka Latin-1) in the English locale that I use. Fortunately, most of the common utilities have preferences to change that default, although the prefs are different for each app.

My Unix command line setup is still based on the default “C” locale (aka ASCII). In a future post I hope to explore switching that to UTF-8 as well.

Safari
Under OSX, Safari controls the general encoding prefs for the whole system.

  • open Safari Preferences
  • select Appearance tab
  • change Default Encoding to “Unicode (UTF-8)”

Firefox
  • open Firefox Preferences
  • select General tab
  • click Languages button
  • change Default Character Encoding to “Unicode (UTF-8)”

TextEdit
  • open TextEdit Preferences
  • change the Open pulldown from “Automatic” to “Unicode (UTF-8)”
  • change the Save pulldown from “Automatic” to “Unicode (UTF-8)”

When in doubt, open files with Cmd-O instead of double-clicking them.

BBEdit
Warning! BBEdit 6.5 has a bug where files opened via drag or double-click ignore the Open settings. When working with non-ASCII files, use Cmd-O. I have not tested any BBEdit versions newer than v6.5.

  • open BBEdit Preferences
  • select Text Files: Opening
  • check “Translate Line Breaks”
  • set “Interpret File Contents as:” to “UTF-8”
  • check “Warn of Malformed UTF-8 Files”
  • select Text Files: Saving
  • set “Default Line Breaks:” to “Unix”

Terminal
  • open Terminal Window Settings
  • select the Display options
  • check “Wide glyphs for Japanese/Chinese/etc.”
  • check “Wide glyphs count as 2 columns”
  • change “Character Set Encoding:” to “Unicode (UTF-8)”
  • click Use Settings as Defaults

Flash
(this is for Flash MX 2004)

  • open Flash Preferences
  • select the ActionScript tab
  • change Open/Import to “UTF-8”
  • change Save/Export to “UTF-8”

Warning! Flash does NOT obey this preference for files created outside of Flash! Flash only interprets files as Unicode if they have a Byte-Order Mark (BOM). However, BOMs are a pain in plain text files since they break some editors.

As a workaround we recommend putting all non-ASCII dynamic text in external XML files instead of directly in the ActionScript.
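If external XML is not an option, a script can prepend the BOM that Flash looks for. This is a sketch built on the BOM behavior described above; test it against your own Flash version and editors before relying on it:

```python
import codecs

def add_utf8_bom(path):
    """Prepend a UTF-8 byte-order mark if the file lacks one (idempotent)."""
    with open(path, "rb") as f:
        data = f.read()
    if not data.startswith(codecs.BOM_UTF8):
        with open(path, "wb") as f:
            f.write(codecs.BOM_UTF8 + data)
```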

Emacs
Upgrade to v21.3. This is newer than the version that is supplied by Apple for either 10.3 or 10.4. Fink provides this version (only in unstable as of this writing).

Add the following to your ~/.emacs file:

(setq locale-coding-system 'utf-8)
(set-terminal-coding-system 'utf-8)
(set-keyboard-coding-system 'utf-8)
(set-selection-coding-system 'utf-8)
(prefer-coding-system 'utf-8)

Advanced Installer

One of my work projects is a Windows-only media conversion application. As happens often in the software world, I had to hurry up and learn about installers. So, in late summer 2004 I googled and found Advanced Installer to be highly ranked. I tried the demo, as well as looking into a couple of other installer builders. I was fairly impressed, so we ended up buying the Pro version of Advanced Installer and we’ve been using it since then.

While the company is based in California, the programmers are all in Romania. Nonetheless, I’ve had excellent responses to the couple of support requests I’ve submitted. In both cases, I reported bugs and they were fixed within 3 days and 1 day, respectively. Sweet.

The project description file is all XML, so it’s possible to hand edit, but it’s kind of crufty, so I don’t recommend that. The most basic activities are supported via the command line (updating the version number and building an installer), and they promise that more command line actions will be available in the future.



About ten years ago, Yahoo appeared on the net. It was so very cool because it was a central place where you could find anything you were looking for. By being popular itself, Yahoo decided what on the web was cool and what was not. Aside from the original NCSA list of all web pages (!), it was the first portal site I ever saw. Since then, portals have risen and fallen (e.g. Lycos, Netscape, Altavista). Today, Google released a beta of its new portalized home page. It’s nice that they have plans to let you uber-customize it, but right now it’s quite ho-hum.

The problem with portals is that they decide what’s important and what’s not. If your interests coincide with the portal authors, then you are well served. If you are not a perfect match, you will have to sift through much chaff to reach the kernel of information you seek. Some portals have tried to offer some degree of personalization. For example, Slashdot offers pre-defined portals for broad niches (e.g. apple.slashdot.org) as well as the ability to filter out unwanted topics and authors (e.g. JonKatz). That goes a long way towards satisfying the needs of the tech geeks, but it still relies on the much-maligned Slashdot editors to make the initial content selections from which readers may filter.

To get a full serving of content, the typical avid reader would hit his or her 10+ sites on a regular basis to keep up on the latest.


Having to hit more than 10 sites just to see if they have anything new is tedious. Tabbed browsing and bookmarked tab groups do help cut down the overhead, but it’s still overhead. Why not put the computer to work to do some of this search work for us?

Enter aggregation.

Aggregation tools build your own personal portal for you by periodically querying the sources that interest you and pulling the prime content into a digestible form. The basic tools to permit this have been around for a while, but only in the last year has it really come together. The important bit was that the content producers out there had to standardize on a computer-readable structure for their offerings. Thus, we now see RSS, RDF, Atom, and OPML links all over the place, collectively called feeds. These are XML-based file formats to collect abstracts of web-based content. The small files are fast to download and easy to process, so it’s simple to write software to read them. This simplicity created an explosion of parsers, which then snowballed into a further explosion of feed offerings.
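Reading one of these feeds really is that simple. A few lines of Python with the standard XML parser pull the headlines out of an RSS 2.0 document (the sample feed here is made up for illustration):

```python
import xml.etree.ElementTree as ET

SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example Feed</title>
  <item><title>First story</title><link>http://example.com/1</link></item>
  <item><title>Second story</title><link>http://example.com/2</link></item>
</channel></rss>"""

def headlines(rss_text):
    """Return the title of every <item> in an RSS 2.0 feed."""
    root = ET.fromstring(rss_text)
    return [item.findtext("title") for item in root.iter("item")]

print(headlines(SAMPLE_RSS))  # ['First story', 'Second story']
```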

The next step is to accumulate all of these feeds (and perhaps sort and filter them along the way) into a body of information that the user can absorb. This aggregation process is the real energy saver. Popular tools include client-side solutions like NetNewsWire, Sage (a Firefox extension) and now Safari RSS. On the server side, aggregators that produce concise HTML include PlanetPlanet (my personal favorite), Bloglines and Radio Userland.

In my personal experience, aggregation has let me absorb about four times as many news stories as before in about half the time. Much of this time saving arises from:

  1. not having to visit all of those pages,
  2. skipping the ads (and the associated download time)
  3. some content filtering, and
  4. knowing when to stop.

This last point is critical for me. All I have to remember is what was the youngest story I read last session. When I see that story again, I know that I’m all caught up. Without aggregation, I have to remember what was the last story I read for every site. That’s a lot of mental energy. Aggregation makes it more like an email inbox — one stream of information — and that’s cool.
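That “stop at the last story I read” rule is trivial to automate. A sketch, assuming stories arrive newest-first as aggregators show them; real aggregators key on GUIDs or dates:

```python
def unread(stories, last_seen):
    """Return stories newer than the last one read, newest first.

    stories   -- list of story ids, newest first
    last_seen -- id of the youngest story read last session, or None
    """
    fresh = []
    for story in stories:
        if story == last_seen:
            break  # all caught up: everything after this was already read
        fresh.append(story)
    return fresh

print(unread(["c", "b", "a"], "b"))  # ['c']
```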

What are our computers for if not to act as agents for our interests?


Aggregation will only become more significant in the next couple of years. Google is aggregating newspaper sites all over the world. Many open-source developer groups rely on Planet feeds, like PlanetMozilla, to get a quick read on what’s current. As more sources provide feeds, and the feedparsers get their remaining bugs worked out (e.g. encoding differences, xhtml vs. html, etc.), and the authors get better at self-filtering (I do not want to see vacation photos from the Mozilla developers!) then aggregates can come into their prime. By leveraging primary sources, the news can be even fresher than from portals.

The major remaining obstacles are discovery and trust. Discovering new sources is time consuming, because you generally read a lot of material that is of low interest. But relying on others to discover sources for you just leads back to the portal days where you lack control. Like a financial portfolio, I posit that diversity is the key to a good aggregate today. My personal reading list is a mix of primary sources (like Tom Tongue), intermediate aggregates (the Planet sites) and rehashed news moderated by an editor (like MozillaZine). Primary sources are good because you get the news soonest and have the greatest control over what you read. Edited content is good because someone who (presumably) has a brain has read every item before you did and culled the worst of the trolls out. Intermediate aggregates are somewhere in between: they include authors that are usually interesting.

Social networking can help with both the discovery and trust issues. A promising future direction for aggregation tools is the sharing of moderation between friends. For example, I’ve had some success with DaringFireball (recommended to me by Peter Erwin, IIRC), but not enough to add it to my daily rotation. I would be thrilled to have an automated way for friends to send me a “Best of…” for that feed. Future aggregation clients will allow the user to flag the best/worst stories and republish that list for their friends to see. That re-publication would be subject to normal aggregate filtering, of course, so you could just cherry-pick from common interests with friends.


Aggregates are the new portals, offering users more control than ever over the way they take in information. I wonder what comes next?

Unicode development under Apache

One of my current projects is to port an application to Japanese. The first port is always the hardest1, so I’ve learned a few things in the process. I’m going to accumulate a few of my successes in this blog category. The first and most significant is that the way encodings work in HTTP/HTML is weird!

Take a peek at this slide from a talk by Sam Ruby, which shows an example HTML page with conflicting metadata. When there are conflicting directives indicating which encoding to use for the document, can you guess which one wins? You may be surprised to learn that the encoding specified in the HTTP Content-Type has precedence over the encoding declared in the HTML file! That is to say, if your HTML document claims

    <meta http-equiv="Content-Type"
              content="text/html; charset=Shift-JIS" />

and Apache says

    Content-Type: text/html; charset=ISO-8859-1

then Apache wins and your Japanese page will be rendered as Latin-1 in the browser and will likely be garbled. Apache’s out of the box configuration often includes a default encoding2 which may or may not be right.
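The precedence rule can be stated as a tiny function. This is a model of browser behavior for clarity, not any browser’s actual source:

```python
def effective_charset(http_content_type, meta_charset=None,
                      default="ISO-8859-1"):
    """The charset in the HTTP header beats the HTML <meta> declaration."""
    if http_content_type and "charset=" in http_content_type:
        return http_content_type.split("charset=")[1].strip()
    if meta_charset:
        return meta_charset
    return default

# Apache's header wins: the page renders as Latin-1, garbling the Japanese.
print(effective_charset("text/html; charset=ISO-8859-1", "Shift-JIS"))
```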

There are two solutions to this problem:

  1. Make Apache ignore encoding
  2. Use exactly one encoding everywhere and always

The latter is good practice, but the former is easier. To make Apache ignore encoding, search your httpd.conf file for any AddDefaultCharset lines and remove them.

In our project, we chose the other route, making the obvious choice to use UTF-8 everywhere. We added this line to Apache:

    AddDefaultCharset UTF-8

and these lines to all HTML and XHTML files, respectively:

    <meta http-equiv="Content-Type"
           content="text/html; charset=utf-8" >
    <meta http-equiv="Content-Type" 
           content="application/xhtml+xml; charset=utf-8" />

Then, the major remaining hurdle was to ensure that all of our development tools actually read and write UTF-8. That will be the subject of a future post.

1 I’ve found this to be universally true for language, hardware, OS, API and other types of porting.

2 Two data points:

  1. Our main webserver runs RedHat, whose Apache had AddDefaultCharset ISO-8859-1
  2. The default Apache configuration under Mac OS X does not include a default character set. Good job Apple!