Who Peed in the Object Pool?

12 June 2013

In the past few weeks, I'd read via Twitter an awesome rationalization for immutability, which invokes a now-deeply-held bugaboo [Source]:

GOTO was evil because we asked, "how did I get to this point of execution?" Mutability leaves us with, "how did I get to this state?"

I've been [lazily] evangelizing here, to work colleagues, around the block, ad nauseum, about the benefits of modern functional languages, not least of which being a ground-up approach to embracing immutability (by default, if not by decree). This recently bit me in the ass with some code I'd naïvely written a few years ago.

Not uncommonly, we have certain classes of data--metadata, in this case--that are extremely read-heavy, and only updated on occasion; maybe months, potentially years. In this particular case, it's metadata about financial instruments; alternate identifiers, full names, which exchange it's listed on, that sort of thing. As one might expect, this is retrieved by almost every user-facing request in some capacity; if it doesn't scale, nothing scales. The retrieval system itself is something of hybrid Strategy pattern, where each concrete strategy looks roughly like:

string GetCacheKey(Tkey key)
Tval Get(string key)
void SetOther(Tkey key, Tval value)

The basic premise here is that each request walks a list of implementations, ordered by cost-of-access from least to greatest, and that any given strategy/implementation, if successful, has the ability to promote its result to a faster cache layer, typically with a shorter TTL. The first two implementations of this are basically reading directly from memory: the first is in-process, the second is out-of-process (e.g. external cache server[s]). A network hop, especially in EC2, is not free, and again, this data doesn't change much. If we have to go out to the cache server, let's at least promote it to in-process cache, so that we have faster access next time.

Fine, great. Early in the project, I'd added copy-on-set logic to the in-process cache, so that if a request retrieved an object from slow storage and promoted, the caller can freely modify it for the lifetime of the request, while leaving a pristine copy in cache.

lulz

Like "knowing," that was half the battle, and that suited perfectly fine for a long time, because this metadata wasn't mutated by convention. Even the error case that eventually cropped up wasn't caught by any tests. At some point, some additional entitlement restrictions were added, requiring that if a user did not have particular licensing documentation in place, a few fields would need to be scrubbed before being returned.

So who peed in the object pool?

So yeah, that doesn't cut it. The thing is, while we had unit tests with pretty good coverage, no one, not test accounts, not gulp live users, caught it for a long time, because not many people had the proper licenses. When it was--painfully, and after over too long a period--discovered, it became clear that what was going down was that:

At some point, a user retrieves a value warm in cache
The user did not have the proper license, and the fields were scrubbed
All subsequent reads, until the key expired, had those fields scrubbed

So there's your culprit. The pool-pisser. And the immediate solution there was pretty simple: adding copy-on-read semantics to the in-process cache. I guess. Begrudgingly. The way forward? Oh, I don't pretend to know. But making things immmutable, at least by default, seems like a good start. Or copy-on-write, which would be transparent to the user, and carry the safety of immutability, but (on first blush) it seems like adding that kind of complexity to all your precious POJOs and POCOs and POROs would be overkill.

Seems.