Category Archives: General

When caches go bad

In general, caching is a good thing. Creating a temporary local copy of some data can dramatically speed up repeated access to that data, especially if it’s hard to recreate or slow to access. However, if the local copy falls out of sync with the real data, the results can sometimes be disastrous. Most of the hard work in caching goes into making sure the cached copy stays fresh, a property called “cache coherency”.

For example, a web browser asks the web server whether the remote content is newer than its cached copy. If it is, the cached copy is thrown away and a fresh one is fetched; if not, the cached copy is used. This saves time because the simple “Is it fresh?” question is only a few bytes instead of the full size of the content. IE 5 had a painful misfeature: it didn’t send the “Is it fresh?” question for subsidiary files like .css, .js, and .jpg files. So, even if the main .html page was up to date, sometimes IE used the wrong version of the supporting data.
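
To make that concrete, here is a minimal sketch of the conditional request a browser makes, written in Python; the URL and the date are placeholders, not anything from a real site:

# A browser-style freshness check: ask the server "Is it fresh?" with an
# If-Modified-Since header. The URL and date below are made-up examples.
import urllib.request
import urllib.error

req = urllib.request.Request("http://example.com/style.css")
req.add_header("If-Modified-Since", "Sat, 01 Jan 2005 00:00:00 GMT")

try:
    with urllib.request.urlopen(req) as resp:
        data = resp.read()        # 200 OK: the cached copy is stale, use this new one
        print("fetched %d bytes" % len(data))
except urllib.error.HTTPError as err:
    if err.code == 304:           # 304 Not Modified: the cached copy is still fresh
        print("cache is still good")
    else:
        raise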

Flash caches incorrectly

Last week I discovered a colossal caching problem with Flash MX 2004 (the authoring software, not the Flash Player that is plugged into your browser). When switching to MX 2004, most users (me included) notice that compile time is dramatically improved. One of the many reasons is that Flash caches object files behind the scenes. That is, if you have a class defined in a .as file, Flash compiles it to a .aso file and stores that object file deep in a prefs directory. On subsequent compilations, if the .as file is older than the .aso file, the latter is reused, possibly saving several seconds.

Well, Macromedia screwed it up. By making the test “Is the .as file older than the .aso file?” they opened themselves up to a cache coherency problem. Consider this scenario:

Say Alice and Bob are working on the same CVS project. Alice checks out some code one morning and starts programming. She makes some changes to Foo.as. Meanwhile, Bob speedily makes a critical bug fix to Bar.as and commits his change. An hour later, Alice finishes her Foo.as work and compiles her .fla. It fails because of the bug in Bar.as. Bob tells her, “I fixed that bug. Just do a cvs update and recompile.” She does, but the bug persists.

Why? It’s all about timestamps. Here’s what happened:

  • Alice checked out Bar.as with a timestamp of 10:00 AM
  • Bob edited and checked in Bar.as with a timestamp of 10:30 AM
  • Alice compiled at 11:00 AM, creating a Bar.aso file with an 11:00 AM timestamp
  • Alice cvs updated at 11:30 AM, getting Bar.as with a 10:30 AM timestamp
  • Alice recompiled her .fla, but Bar.aso is “newer” than Bar.as, so Flash ignored Bar.as and used the buggy Bar.aso instead.

So, Alice is stuck with an out-of-date .aso file that Flash won’t fix.
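
To see why Flash keeps the stale copy, here is a rough sketch of the kind of comparison it appears to be making; the paths and check are placeholders, not Flash’s actual cache internals:

# The apparent rule: reuse the cached object if it is NEWER than the source.
# Paths here are placeholders; this is not Flash's real cache layout.
import os

def reuse_cached_object(as_path, aso_path):
    return (os.path.exists(aso_path) and
            os.path.getmtime(aso_path) > os.path.getmtime(as_path))

# Alice's timeline: her cvs update gives Bar.as a 10:30 AM mtime, but her
# earlier compile left Bar.aso with an 11:00 AM mtime. 11:00 > 10:30, so the
# stale Bar.aso "wins" even though Bar.as has changed.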

Workarounds

There are two easy workarounds to this problem, but both of them require manual intervention.

  1. Touch the .as files

    By editing the .as files, you make them newer again and Flash rebuilds the .aso files. Instead of editing by hand, the Unix “touch” command comes in handy. Here’s a one-line command to touch all of your project .as files so they are newer than their cached .aso counterparts:

    find . -name '*.as' | xargs touch
    

    If you do a lot of ActionScript work, though, this is a weak workaround because you may have updated .as files somewhere else in your classpath.

  2. Blow away the .aso files

    As in the browser world, the answer here is “Clear your cache!” By deleting the .aso files you can force Flash to rebuild them all, at the expense of compile time. The trick is that you need to know where they are. Here’s a command that works on Mac OS X only (one logical line, wrapped here with a backslash):

    find ~/Library/Application*Support/Macromedia/Flash*MX*2004/en/Conf*/Classes \
        -name '*.aso' | perl -lpe 'unlink $_'
    

You have to remember to run one of these every time you have a case where you might introduce an older-but-changed .as file (often from a source repository like CVS or from a network drive).

What Macromedia should do

How can Macromedia fix this problem? Very simply! Caching C compilers solved this problem long ago. There are two popular ways to keep source and object in sync:

  1. Make the object have the EXACT same timestamp as the source
  2. Cache some representation of the source with the object

In the former case, the algorithm for cache testing is this:

if  (timestamp(object file) == timestamp(source file))
    reuse object file
else
    recompile source file
    set timestamp(object file) = timestamp(source file)

In any POSIX environment, this timestamp setting can be accomplished with the stat() and utime() functions.
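
For example, here is a minimal sketch of that approach in Python, whose os.stat() and os.utime() wrap the same POSIX calls; compile_fn and the file paths are placeholders:

# Option 1: give the object file the source's EXACT mtime, then test for
# equality instead of "newer than". compile_fn and the paths are placeholders.
import os

def compile_if_needed(src, obj, compile_fn):
    src_stat = os.stat(src)
    if os.path.exists(obj) and os.path.getmtime(obj) == src_stat.st_mtime:
        return obj                       # exact match: reuse the cached object
    compile_fn(src, obj)                 # stale or missing: rebuild it...
    os.utime(obj, (src_stat.st_atime, src_stat.st_mtime))  # ...and copy the source's timestamps onto it
    return obj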

This exactitude of timestamps can be problematic when the source and object files are on different drives. For example, if one is on a network drive and the other on a local drive, and the two computers disagree on the current time, the timestamps may not match.[1]

A better solution is to keep a representation of the source along with the object. Copying the whole source is possible, but wasteful. Instead, a hash of the source can be computed and stored. Then, the next time you want to compile, recompute the hash (much faster than recompiling) and compare. If it matches, use the object; otherwise recompile. The fantastic ccache project uses MD4 hashes of C source code (after preprocessing) to detect cache hits and misses.
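
Here is a rough sketch of that idea, storing the hash in a side file next to the object; the layout is made up, and ccache’s real scheme is more involved:

# Option 2: store a hash of the source next to the object and compare hashes
# instead of timestamps. The ".hash" side-file layout here is made up.
import hashlib
import os

def source_hash(src):
    with open(src, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def compile_if_needed(src, obj, compile_fn):
    hash_file = obj + ".hash"
    current = source_hash(src)
    if os.path.exists(obj) and os.path.exists(hash_file):
        with open(hash_file) as f:
            if f.read() == current:
                return obj               # source unchanged since the object was built
    compile_fn(src, obj)                 # hash mismatch or no cache: rebuild
    with open(hash_file, "w") as f:
        f.write(current)
    return obj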

Conclusions

This is an easy mistake to make. For the vast majority of users, Macromedia’s caching solution is a pure win. It’s only when timestamps do not increase monotonically that you run into problems.

The best advice for programmers trying to implement a caching solution is to be paranoid and to consider your failure cases first. If you cannot judge your failure cases, or cannot afford any failure, you shouldn’t cache at all. In cases like compilation where correctness is paramount, correct-but-slow data is always better than fast-but-crappy data. This is even more true in financial settings. Imagine a scenario where ATMs cached data about account balances: all a gang of crooks would need to do is hit two ATMs at the same time. The same goes for e-voting machines without paper trails. In some cases, it’s better not to cache at all, because you can spend so much developer effort ensuring that the cache is good that you lose the time you would have spent making the process good in general.


[1] This is a problem I’ve experienced with SMB. Generally, I try to allow a two-second timestamp window. If the machines are both NTP-synced, then they should always agree to that level of precision, one hopes.

Aggregation

Then

About ten years ago, Yahoo appeared on the net. It was so very cool because it was a central place where you could find anything you were looking for. By virtue of its own popularity, Yahoo got to decide what on the web was cool and what was not. Aside from the original NCSA list of all web pages (!), it was the first portal site I ever saw. Since then, portals have risen and fallen (e.g. Lycos, Netscape, Altavista). Today, Google released a beta of its new portalized home page. It’s nice that they have plans to let you uber-customize it, but right now it’s quite ho-hum.

The problem with portals is that they decide what’s important and what’s not. If your interests coincide with the portal authors’, then you are well served. If you are not a perfect match, you will have to sift through much chaff to reach the kernel of information you seek. Some portals have tried to offer some degree of personalization. For example, Slashdot offers pre-defined portals for broad niches (e.g. apple.slashdot.org) as well as the ability to filter out unwanted topics and authors (e.g. JonKatz). That goes a long way towards satisfying the needs of the tech geeks, but it still relies on the much-maligned Slashdot editors to make the initial content selections from which readers may filter.

To get a full serving of content, the typical avid reader would hit his or her 10+ sites on a regular basis to keep up on the latest.

Now

Having to hit more than 10 sites just to see if they have anything new is tedious. Tabbed browsing and bookmarked tab groups do help cut down the overhead, but it’s still overhead. Why not put the computer to work to do some of this search work for us?

Enter aggregation.

Aggregation tools build your own personal portal for you by periodically querying the sources that interest you and pulling the prime content into a digestible form. The basic tools to permit this have been around for a while, but only in the last year has it really come together. The important bit was that the content producers out there had to standardize on a computer-readable structure for their offerings. Thus, we now see RSS, RDF, Atom, and OPML links all over the place, collectively called feeds. These are XML-based file formats that collect abstracts of web-based content. The small files are fast to download and easy to process, so it’s simple to write software to read them. This simplicity created an explosion of parsers, which then snowballed into a further explosion of feed offerings.

The next step is to accumulate all of these feeds (and perhaps sort and filter them along the way) into a body of information that the user can absorb. This aggregation process is the real energy saver. Popular tools include client-side solutions like NetNewsWire, Sage (a Firefox extension) and now Safari RSS. On the server side, aggregators that produce concise HTML include PlanetPlanet (my personal favorite), Bloglines and Radio Userland.
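
As a rough sketch of what these tools do under the hood, here is a tiny aggregator using the third-party feedparser module; the feed URLs are placeholders:

# A toy aggregator: fetch a few feeds, merge their entries, and list the
# newest first. The feed URLs are placeholders; requires the third-party
# "feedparser" module.
import feedparser

feeds = [
    "http://example.com/planet/rss20.xml",
    "http://example.org/weblog/atom.xml",
]

entries = []
for url in feeds:
    parsed = feedparser.parse(url)       # handles RSS, RDF, and Atom alike
    entries.extend(parsed.entries)

# Sort by publication date where the feed provides one, newest first.
entries.sort(key=lambda e: e.get("published_parsed") or e.get("updated_parsed") or (),
             reverse=True)

for entry in entries[:20]:
    print(entry.get("title", "(untitled)"), "-", entry.get("link", ""))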

In my personal experience, aggregation has let me absorb about four times as many news stories as before in about half the time. Much of this time saving arises from:

  1. not having to visit all of those pages,
  2. skipping the ads (and the associated download time),
  3. some content filtering, and
  4. knowing when to stop.

This last point is critical for me. All I have to remember is the youngest story I read last session. When I see that story again, I know that I’m all caught up. Without aggregation, I have to remember the last story I read on every site. That’s a lot of mental energy. Aggregation makes it more like an email inbox — one stream of information — and that’s cool.

What are our computers for if not to act as agents for our interests?

Future

Aggregation will only become more significant in the next couple of years. Google is aggregating newspaper sites all over the world. Many open-source developer groups rely on Planet feeds, like PlanetMozilla, to get a quick read on what’s current. As more sources provide feeds, the feed parsers get their remaining bugs worked out (e.g. encoding differences, xhtml vs. html, etc.), and the authors get better at self-filtering (I do not want to see vacation photos from the Mozilla developers!), aggregates will come into their prime. By leveraging primary sources, the news can be even fresher than from portals.

The major remaining obstacles are discovery and trust. Discovering new sources is time consuming, because you generally read a lot of material that is of low interest. But relying on others to discover sources for you just leads back to the portal days, where you lack control. As with a financial portfolio, I posit that diversity is the key to a good aggregate today. My personal reading list is a mix of primary sources (like Tom Tongue), intermediate aggregates (the Planet sites), and rehashed news moderated by an editor (like MozillaZine). Primary sources are good because you get the news soonest and have the greatest control over what you read. Edited content is good because someone who (presumably) has a brain has read every item before you did and culled the worst of the trolls out. Intermediate aggregates are somewhere in between: they include authors that are usually interesting.

Social networking can help with both the discovery and trust issues. A promising future direction for aggregation tools is the sharing of moderation between friends. For example, I’ve had some success with DaringFireball (recommended to me by Peter Erwin, IIRC), but not enough to add it to my daily rotation. I would be thrilled to have an automated way for friends to send me a “Best of…” for that feed. Future aggregation clients will allow the user to flag the best/worst stories and republish that list for their friends to see. That re-publication would be subject to normal aggregate filtering, of course, so you could just cherry-pick from common interests with friends.

Conclusion

Aggregates are the new portals, offering users more control than ever before over the way they take in information. I wonder what comes next?