Who Moved My Data Cheese?
15 May 2013
Recently, during what should probably be a more broadly practiced routine of wearing other people's hats for a week, I delved into some pretty spotty code (not written by the person whose hat I was wearing; just some legacy code). Unfortunately, this code also depends on the kindness of strangers (well, vendors) not to change the format of the data it's parsing (some of which may be very unclean). Upon figuring out what it was doing, and where it was going wrong, I mentioned on an email thread that the retry logic was somewhat deficient. A bit later, this was met with something along the lines of:
Well, yes, that's a good start--obviously, there are better generalized patterns for doing this, but that's out of scope for now. The broader question is: what kinds of failure modes are there, and how am I handling them? These can roughly be summed up into four modes, only one of which is actually handled correctly in the above scenario:
1. Ephemeral processing issues
2. Persistent processing issues
3. Ephemeral data issues
4. Persistent data issues
Let's break those down.
Ephemeral processing issues
Example: transaction becomes a database deadlock victim. Perhaps this should be spoiler tagged, but basically: this is the only failure mode actually corrected by the above changeset. What the changeset itself is doing is trying something, failing, and immediately (that's the key word) trying again. There are definitely subclasses of problems that can be addressed this way, but there are many others not covered by this, which I'll get to.
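In code, the changeset as described amounts to something like this. This is a hypothetical sketch, not the actual code: `process_with_retry` and `make_flaky` are invented stand-ins, with the flaky worker simulating a transient failure like being picked as a deadlock victim.

```python
def process_with_retry(process, record, attempts=2):
    # Try, fail, and immediately try again -- bounded, so a
    # persistent failure still surfaces instead of looping forever.
    last_error = None
    for _ in range(attempts):
        try:
            return process(record)
        except Exception as e:
            last_error = e
    raise last_error

def make_flaky(fail_times=1):
    # Hypothetical unit of work: fails the first `fail_times` calls,
    # simulating a transient error such as a deadlock victim.
    state = {"calls": 0}
    def process(record):
        state["calls"] += 1
        if state["calls"] <= fail_times:
            raise RuntimeError("deadlock victim")
        return ("ok", record)
    return process
```

A worker that fails once succeeds on the immediate retry; one that keeps failing exhausts its attempts and re-raises, which is exactly why this pattern only covers the ephemeral-processing mode.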
Alright, what can I do? Well, test, for one. You can unit test, but depending on the subclass, these sorts of failures can be hard to mock and fairly low-level. Some other subclasses you can fake with known-bad data, but there are unknown unknowns, and it's hard to fake a deadlock. You can use stubs, but this seems really, really contrived, and of dubious value. Your best bets are being very pessimistic, building a great method for logging/analyzing errors, and knowing your stack.
Persistent processing issues
Example: your code is terrible. This is the simplest class of problems to understand, but it's not fixed with the changeset, and wouldn't be fixed with an infinite loop around a try/catch block. I'm not going to beat this dead horse, because:
Alright, what can I do? This is also the simplest class of problems to solve. Persistently-failing processes on a known set of data should be cataloged, understood, corrected, and met with regression tests. Persistently-failing processes on an unknown set of data should be met with exasperation and scotch.
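Cataloging known-bad inputs pairs naturally with pinning the fix in a regression test. A sketch, where `parse_quantity` and the cases are invented for illustration:

```python
# Hypothetical catalog: inputs that once broke the parser, each paired
# with the output the corrected parser must now produce.
REGRESSION_CASES = [
    ("1,234", 1234),   # thousands separators used to crash parsing
    ("  42 ", 42),     # stray whitespace
    ("", None),        # empty field should be treated as missing
]

def parse_quantity(raw):
    # The corrected parser (illustrative only).
    raw = raw.strip().replace(",", "")
    return int(raw) if raw else None

def test_regressions():
    for raw, expected in REGRESSION_CASES:
        assert parse_quantity(raw) == expected, (raw, expected)
```

Every new persistently-failing input gets understood, fixed, and appended to the catalog, so the failure can never silently come back.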
Ephemeral data issues
Example: sorry that your data is bad, but it'll be good, we promise! This is actually the motivating example behind this post, and the fact that it isn't covered by the changeset is the motivation for the post itself. What was at issue is that some of the data was hyperlinked to other documents, which were not available at the time of processing, so processing broke. I'm not familiar with the underlying data set or its [lack of] atomicity, but this can be generalized to any data set that is eventually consistent. In this case, there's nothing intrinsically wrong with the concept of simply retrying, but the chronology is almost certainly wrong: you have little control over when the retry happens (and you're likely to get the timing wrong if you try, depending on the semantics of the API/data source).
Alright, what can I do? Again, known unknowns and estimated unknowns can be tested, regressed against, etc. What's really at issue here is your retry framework, not your retry logic. If you can't process it now, reprocessing it immediately may not help, and you're not going to be able to synchronously solve the problem. Maybe. Discerning this class from the first requires familiarity with the data set. A good solution to this problem will involve persisting the job and inserting it into a system of work retry queues. Depending on the scenario, retries may happen at regular intervals, with some sort of exponential backoff, etc.
Persistent data issues
Example: the data is, and will likely remain, in a format that's either currently not useful, or permanently not useful. This, too, should be met with scotch. This is arguably the worst case because, like #2, it will never work through retries alone, but unlike #2, it may be out of your control entirely. Sometimes, this can be solved with better parsing or processing, which may be a significant engineering investment. Sometimes, it just can't be solved at all.
Alright, what can I do? Find a nice Islay. In most cases, the best you can do is discern this case from the others, log it, and accept that the data won't be parsed. Reattempting in any fashion is just a waste of resources.
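Discerning this case in code usually means classifying the failure, logging it, and moving on rather than re-enqueueing. A sketch with an invented `PermanentDataError` and a toy parser standing in for the real classification logic:

```python
import logging

logger = logging.getLogger("ingest")  # hypothetical logger name

class PermanentDataError(Exception):
    # The input is in a shape we will never be able to parse.
    pass

def try_parse(raw):
    # Illustrative classifier: some inputs are simply the wrong shape.
    if not raw.strip().isdigit():
        raise PermanentDataError(raw)
    return int(raw)

def ingest(records):
    results, rejected = [], []
    for raw in records:
        try:
            results.append(try_parse(raw))
        except PermanentDataError:
            # Log and move on: no retry policy will fix this record.
            logger.warning("unparseable record, skipping: %r", raw)
            rejected.append(raw)
    return results, rejected
```

The rejected pile gets reviewed by a human (over the aforementioned Islay), not by a retry loop.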