Definitely Maybe

08 March 2014

Over the past few months, I've spent most of my time settling into my new (well, now not-so-new) role, moving from backend Web architecture in .NET to building out data warehousing and analysis systems. It's quite a difference, not least of which is going from the B2B world to the massive online B2-Web-at-large world--though most "full-stack engineer" work can be somewhat agnostic to the differences, if you're careful about your architecture.

I've also been thinking along the way about blogging in general. In the intervening months since my last post, I've tried to rethink what purpose I'm trying to serve, and looked at examples of bloggers at the extremes of style, while still staying within the bounds of tech. Notable examples include Ayende's insane month of June 2007, in which he wrote 150 blog posts (more than I'm sure I tweet most months), and Steve Yegge, who wrote what I'm sure is the longest review of Borderlands 2 in existence--somewhat uncharacteristic, since he's one of my go-to reads for tech interviewers and interviewees--but hasn't written anything publicly for about 15 months. I'm not sure where I'll eventually fall over time (though I lean, as a reader, toward the Yegge model), but I have settled on some topics and themes for the forthcoming months, attempting to be both less and more pretentious.

Oh, and before I forget, relating to working at Lumos, I'd like to just put it on the Web here that R is the PHP of data analysis. I've mentioned the comparison to friends, and it's like watching them achieve a lesser form of Illumination in front of me. That shit looks like its API has been cobbled together over decades by non-programmers, and I'm sure that's somewhere close to the truth.

Probabilistic data structures

So one direction I've decided to go is to delve into describing, motivating, and implementing probabilistic data structures. It's somewhat self-serving, and in the case of Bloom filters, quite a lot has already been said, though I suspect even those have not yet been exhaustively evangelized. My basic model for this is Kyle Kingsbury's fantastically excellent Jepsen series, exploring partition tolerance and recovery models for modern distributed databases. If you aren't already reading along, it comes highly, highly recommended. Similarly, his series on Clojure from the ground up is phenomenal, and not just for its lucidity or technical merit; here's paragraph 2:

Science, technology, engineering, and mathematics are deeply rewarding fields, yet few women enter STEM as a career path. Still more are discouraged by a culture which repeatedly asserts that women lack the analytic aptitude for writing software, that they are not driven enough to be successful scientists, that it’s not cool to pursue a passion for structural engineering. Those few with the talent, encouragement, and persistence to break in to science and tech are discouraged by persistent sexism in practice: the old boy’s club of tenure, being passed over for promotions, isolation from peers, and flat-out assault. This landscape sucks. I want to help change it.

How amazing is this guy? Anyway, much as he named his series 'Jepsen' after Carly Rae Jepsen's "Call Me Maybe", my project's working title is going to be 'Gallagher', after the Gallagher brothers of Oasis, whose debut album was "Definitely Maybe"--which seems an apt name for a walkthrough of probabilistic data structures.

It's certainly going to take a while to read, ingest, synthesize, and write all of this, but I imagine the order will go something like:

  • Bloom filters
  • HyperLogLog
    • A 'near-optimal' cardinality estimation structure whose inner workings I'm completely unfamiliar with at present, but which I understand to be basically made of magic.
  • Count-Min Sketch
    • Can be thought of, simplistically, as a kind of generalization of Bloom filters: it uses a two-dimensional array instead of a one-dimensional one, and you're counting instead of setting bits, but the why and how are fairly similar (see the rough sketch just after this list).
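
To give a flavor of that last one, here's a back-of-the-napkin Count-Min Sketch in Clojure. This is a sketch of the idea only--the depth/width parameters and the cheap per-row hash are placeholders, not what I'd actually publish in the series:

;; a depth x width grid of counters, one row per hash function; increment
;; bumps one counter per row, estimate takes the minimum of the touched cells
(def depth 4)   ; number of hash functions (rows)
(def width 64)  ; counters per row

(defn- cell [row item]
  ;; cheap per-row "hash": perturb the item's hash with the row index
  (mod (hash [row item]) width))

(defn make-sketch []
  (vec (repeat depth (vec (repeat width 0)))))

(defn increment [sketch item]
  (reduce (fn [s row] (update-in s [row (cell row item)] inc))
          sketch
          (range depth)))

(defn estimate [sketch item]
  (apply min (map (fn [row] (get-in sketch [row (cell row item)]))
                  (range depth))))

;; (-> (make-sketch) (increment :a) (increment :a) (estimate :a)) ;=> 2

Collisions can only inflate a count, never deflate it, which is the same one-sided error story as a Bloom filter.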

Modular synthesizers

This is something of a shorter topic, but in my personal life, I've been diving headfirst into modular synths for the past few months. It can get expensive, so I'm limiting myself to a reasonable budget, and building some units, like this random sequence generator, from kits. I feel like this is nerdy enough to get lumped into the same bucket of posts, and I think I'll find some interesting insights along the way to share. This also means I've spent far too much time on MuffWiggler.

I recently stumbled upon this article via a former neighbor's Twitter feed; it's about the generativity of programming languages, and how that relates (or doesn't) to modern copyright law. It's a great article by a student who understands technology, (some aspects of) the philosophy of technology, and communication; hope for the future, n'all that. That said, it's framed more as a question than a roadmap, and it tends to get lost in some of the technical weeds while not venturing far enough into others. I'm starting to do that myself here, but for an article that deep-dives into C compilation, notable non-mentions include Prolog, progenitor of 'version 1' in the article, and Clang, a now-ubiquitous-thanks-to-Apple C compiler frontend, which adds yet more generative complexity to the same examples.

So let's look at that first example, which caught my eye:

alice = one_of(1,2,3)
bob = one_of(1,2,3)
claire = one_of(1,2,3)
forbid(alice = bob  or  bob = claire  or  alice = claire)
forbid(bob = 3)
forbid(alice – claire = 1  or  claire – alice = 1)
forbid(claire < bob)
display(alice, bob, claire)

which is accompanied by the statement: "As it turns out, almost no mainstream programming language can easily express a program that looks like the first version." Good thing I don't like mainstream programming languages. As mentioned, we can start with the fact that Prolog can do this, and more than likely informs the example (though I'm fairly certain this is at least some level of pseudocode; I didn't brew install swi-prolog to check). It's a cool thing to note, though: Prolog is basically a notation for an inference engine.
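
Just to scratch the itch before getting to the more interesting version: here's the most literal, least clever reading of that example in plain Clojure, enumerating the whole candidate space with a for comprehension and filtering. The names mirror the pseudocode; solutions is mine.

(def solutions
  ;; enumerate every assignment of 1-3 to alice, bob, and claire, and keep only
  ;; the tuples that survive each forbid from the pseudocode above
  (for [alice  [1 2 3]
        bob    [1 2 3]
        claire [1 2 3]
        :when (not (or (= alice bob) (= bob claire) (= alice claire)))
        :when (not= bob 3)
        :when (not (or (= (- alice claire) 1) (= (- claire alice) 1)))
        :when (not (< claire bob))]
    [alice bob claire]))

;; solutions ;=> ([1 2 3])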

I've got another blog post broaching the topic, but I've recently moved on to Lumosity, where we're using Cascalog. You can hopefully see the portmanteau coming a mile away. However, it actually refers not to Prolog at large, but to Datalog, a decades-old subset of Prolog concerned strictly with inference over data and their relationships. The other half of the portmanteau is Cascading, which I'm not even going to bother linking, because it's not at issue here, and Get On With It, besides. In sum, Cascalog is a variant of a subset of Prolog, except that it runs on Hadoop, so you can reason over datasets unimaginable in 1977 (much less extant then).

Anyway, after reading the article, being newish to Cascalog, and reading the quoted statement above as a challenge, my eyes lit up, and I had to write this trivial motivating example in the declarative version:

(<- 
  [?alice ?bob ?claire]
  ((one_of [1 2 3]) ?alice)
  ((one_of [1 2 3]) ?bob)
  ((one_of [1 2 3]) ?claire)
  (cross-join)

  (forbid ?alice = ?bob) (forbid ?bob = ?claire) (forbid ?alice = ?claire)
  (forbid ?bob = 3)
  (forbid ?alice - ?claire = 1) (forbid ?claire - ?alice = 1)
  (forbid ?claire < ?bob)))

There are some differences, to be sure. The first line projects the values of the tuples out (which is what the last line does in the first example). The sixth line is a bit thornier. I'm not entirely sure yet (70%, say) of the technical reasoning, but because the primitives of Cascading don't, and weren't intended to, map one-to-one with those of Datalog, you can't just arbitrarily introduce streams of values. Cascading was ultimately built for ETL, not for reasoning over arbitrary data stores, so it lends itself better either to transforming a single stream of tuples (including tuple generation, like a flatmap), or to joining data that has some relation to the main stream of tuples (e.g. an inner or outer join on a lookup table). We really want the cross product of three tuple streams (or generators; in this case, the mystical one_of from the original example), so we use cross-join.

That said, using Cascalog 1.10.2, the current stable version in Clojars, I wasn't able to get it working. While Cascading most certainly is available with enterprise support, the DSLs, written[ish] and maintained by Twitter, are, well, not. So things can get weird if you walk into the weeds, as with cross-join: it's something the docs say you shouldn't need very often in a Hadoop workflow, and I'm not surprised it may have stopped working at some point. Fortunately, we're using Leiningen, so the fix was as easy as bumping down the version of Cascalog and re-running lein deps. Boom, works. I also introduced a filter operation forbid-sub to encapsulate subtraction (the usage here reads like: "operand1 minus operand2 must not equal operand3"). There's probably a cleaner way to handle that, but Cascalog can get funny with Clojure-native operations on vars in a query, in ways I'm not entirely familiar with yet; compared with building these out as discrete map/reduce tasks in Java, it's worth some pain.
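
For the curious, forbid-sub is roughly the following--reconstructed here from memory, so treat the gist linked below as authoritative:

(use 'cascalog.api)

;; a Cascalog filter op: a tuple survives only when op1 - op2 does NOT equal
;; the forbidden value, i.e. "operand1 minus operand2 must not equal operand3"
(deffilterop forbid-sub [op1 op2 forbidden]
  (not= (- op1 op2) forbidden))

;; used in the query as, e.g., (forbid-sub ?alice ?claire 1)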

Edit: Actually, in retrospect, there was a clean[ish] immediate solution here. Clojure supports "arity overloading", which basically means you're free to provide multiple implementations of a function based on the number (though not the type) of arguments. If all we're worried about is mimicking the interface informally laid out by the example, adding a five-argument implementation taking op1, operator, op2, comparator, comp2 will suffice.
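
In plain Clojure terms (setting the Cascalog op macro aside for a second), that looks something like this; forbid? and its bodies here are purely illustrative:

;; one function, two implementations, selected by argument count
(defn forbid?
  ;; three args: "a minus b must not equal c"
  ([a b c]
     (not= (- a b) c))
  ;; five args, mimicking the pseudocode's op1 operator op2 comparator comp2
  ([op1 operator op2 comparator comp2]
     (not (comparator (operator op1 op2) comp2))))

;; (forbid? 3 1 1)     ;=> true  (3 - 1 is 2, which isn't 1)
;; (forbid? 3 - 1 = 1) ;=> true  (same check, with the operators spelled out)
;; (forbid? 2 - 1 = 1) ;=> false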

For the interested, the full gist is here.

†: Years ago as a recent grad, I naïvely asked a startup CEO over a phone interview why he'd choose to start the company in Silicon Valley, and not New York or elsewhere. Granted, there are some very, very smart people in New York, but the culture just doesn't hum with the kind of intellectual cross-pollination that goes on here. On the best days, it yields ideas that will change the world for the better; increasingly, just white men wearing Google Glass.

‡: By here, of course, I mean the Bay Area in general. In choosing to live [and love] in Oakland, I've basically resigned myself to never working outside of SF or Oakland for the foreseeable future.

In the past few weeks, I read via Twitter an awesome rationale for immutability, which invokes a now-deeply-held bugaboo [Source]:

GOTO was evil because we asked, "how did I get to this point of execution?" Mutability leaves us with, "how did I get to this state?"

I've been [lazily] evangelizing here, to work colleagues, around the block, ad nauseam, about the benefits of modern functional languages, not least of which is a ground-up approach to embracing immutability (by default, if not by decree). The lack thereof recently bit me in the ass with some code I'd naïvely written a few years ago.

Not uncommonly, we have certain classes of data--metadata, in this case--that are extremely read-heavy and only updated on occasion; think months, potentially years, between changes. In this particular case, it's metadata about financial instruments: alternate identifiers, full names, which exchange an instrument is listed on, that sort of thing. As one might expect, this is retrieved by almost every user-facing request in some capacity; if it doesn't scale, nothing scales. The retrieval system itself is something of a hybrid Strategy pattern, where each concrete strategy looks roughly like:

string GetCacheKey(Tkey key)
Tval Get(string key)
void SetOther(Tkey key, Tval value)

The basic premise here is that each request walks a list of implementations, ordered by cost-of-access from least to greatest, and that any given strategy/implementation, if successful, has the ability to promote its result to a faster cache layer, typically with a shorter TTL. The first two implementations of this are basically reading directly from memory: the first is in-process, the second is out-of-process (e.g. external cache server[s]). A network hop, especially in EC2, is not free, and again, this data doesn't change much. If we have to go out to the cache server, let's at least promote it to in-process cache, so that we have faster access next time.
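
If it helps to picture the walk itself, here's a minimal sketch of the idea (in Clojure, to keep with the rest of this post), with made-up layer maps of :get and :put functions ordered from cheapest to most expensive:

;; walk the layers in order; on a hit, promote the value into every faster
;; layer we already tried, then return it
(defn lookup [layers k]
  (loop [tried [] remaining layers]
    (when-let [[layer & more] (seq remaining)]
      (if-let [v ((:get layer) k)]
        (do (doseq [faster tried] ((:put faster) k v))
            v)
        (recur (conj tried layer) more)))))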

Fine, great. Early in the project, I'd added copy-on-set logic to the in-process cache, so that if a request retrieved an object from slow storage and promoted it, the caller could freely modify it for the lifetime of the request while leaving a pristine copy in cache.

lulz

Like "knowing," that was half the battle, and that suited perfectly fine for a long time, because this metadata wasn't mutated by convention. Even the error case that eventually cropped up wasn't caught by any tests. At some point, some additional entitlement restrictions were added, requiring that if a user did not have particular licensing documentation in place, a few fields would need to be scrubbed before being returned.

So who peed in the object pool?

So yeah, that doesn't cut it. The thing is, while we had unit tests with pretty good coverage, no one--not test accounts, not (gulp) live users--caught it for a long time, because not many people had the proper licenses. When it was discovered--painfully, and after far too long--it became clear that what was going down was this:

  • At some point, a user retrieved a value that was warm in cache
  • That user did not have the proper license, so the fields were scrubbed
  • All subsequent reads, until the key expired, saw those fields scrubbed

So there's your culprit. The pool-pisser. And the immediate solution was pretty simple: adding copy-on-read semantics to the in-process cache. I guess. Begrudgingly. The way forward? Oh, I don't pretend to know. But making things immutable, at least by default, seems like a good start. Or copy-on-write, which would be transparent to the caller and carry the safety of immutability, but (at first blush) it seems like adding that kind of complexity to all your precious POJOs and POCOs and POROs would be overkill.

Seems.
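
For the record, this is what "immutable by default" buys you, in Clojure terms: scrubbing produces a new map, and whatever's sitting in cache stays pristine. The keys here are made up for illustration.

(def cached-instrument
  {:symbol "XYZ" :exchange "NYSE" :restricted-field "sekrit"})

;; scrubbing returns a *new* map; the cached original is untouched
(defn scrub [instrument]
  (dissoc instrument :restricted-field))

;; (scrub cached-instrument) ;=> {:symbol "XYZ", :exchange "NYSE"}
;; cached-instrument         ;=> still has :restricted-field; no one peed in this pool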

Recently, during what should probably be a more broadly practiced routine of wearing other people's hats for a week, I delved into some pretty spotty code (not written by the person whose hat I was wearing; just some legacy code). Unfortunately, this code also depends on the kindness of strangers (well, vendors) not to change the format of the data it's parsing (some of which may be very unclean). Upon figuring out what it was doing, and where it was going wrong, I mentioned on an email thread that the retry logic was somewhat deficient. A bit later, this was met with something along the lines of:

try {
    doSomething();
}
catch {
    doSomething();
}

Well, yes, that's a good start--obviously, there are better generalized patterns for doing this, but that's out of scope for now. The broader question is: what kinds of failure modes are there, and how am I handling them? They can roughly be grouped into four, only one of which is actually handled correctly by the change above:

1. Ephemeral processing issues
2. Persistent processing issues
3. Ephemeral data issues
4. Persistent data issues

Let's break those down.


Ephemeral processing issues

Example: a transaction becomes a database deadlock victim. Perhaps this should be spoiler-tagged, but basically: this is the only failure mode actually corrected by the changeset above. What the changeset does is try something, fail, and immediately [the key word] try again. There are definitely subclasses of problems that can be addressed this way, but there are many others not covered by it, which I'll get to.
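
For what it's worth, a slightly more general version of that try/once-more pattern looks like the following (sketched in Clojure; the real code in question is not Clojure, and do-something here is hypothetical): retry a thunk up to n times, rethrowing the last failure.

(defn with-retries [n f]
  (loop [attempt 1]
    (let [result (try
                   {:value (f)}
                   (catch Exception e
                     ;; swallow and retry unless we're out of attempts
                     (if (< attempt n) ::retry (throw e))))]
      (if (= result ::retry)
        (recur (inc attempt))
        (:value result)))))

;; (with-retries 3 #(do-something))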

Alright, what can I do? Well, test, for one. You can unit test, but depending on the subclass, these sorts of things can be hard to mock, and fairly low-level. Some other subclasses you can fake with known-bad data, but there are unknown-unknowns, and it's hard to fake deadlocks. You can use stubs, but this seems really, really contrived, and of dubious value. Your best bets are being very pessimistic, building a great method for logging/analyzing errors, and knowing your stack.

Persistent processing issues

Example: your code is terrible. This is the simplest class of problems to understand, but it's not fixed with the changeset, and wouldn't be fixed with an infinite loop around a try/catch block. I'm not going to beat this dead horse.

Alright, what can I do? This is also the simplest class of problems to solve. Persistently-failing processes on a known set of data should be cataloged, understood, corrected, and met with regression tests. Persistently-failing processes on an unknown set of data should be met with exasperation and scotch.

Ephemeral data issues

Example: sorry that your data is bad, but it'll be good, we promise! This is the example that motivated this post, and the fact that it isn't covered by the changeset is the reason for the post itself. What was at issue is that some of the data hyperlinked to other documents that were not available at the time of processing, which broke the processing. I'm not familiar with the underlying data set or its [lack of] atomicity, but this can be generalized to any data set that is eventually consistent. In this case, there's nothing intrinsically wrong with the concept of simply retrying, but the chronology is almost certainly wrong: you have little control over when the retry happens (and are likely to get it wrong if you try, depending on the semantics of the API/data source).

Alright, what can I do? Again, known unknowns and estimated unknowns can be tested, regressed against, etc. What's really at issue here is your retry framework, not your retry logic. If you can't process the data now, reprocessing it immediately may not help, and you're not going to be able to solve the problem synchronously. Maybe. Discerning this class from the first requires familiarity with the data set. A good solution will involve persisting the job and inserting it into a system of retry queues. Depending on the scenario, retries may happen at regular intervals, with some sort of exponential backoff, etc.
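
Concretely, the shape I have in mind is something like this (Clojure again; the queue itself and enqueue! are hand-waved, and the job shape is made up): compute when the job should next be attempted and persist it back onto the queue, rather than retrying inline.

;; exponential backoff with a cap: 1s, 2s, 4s, ... up to 10 minutes
(defn backoff-ms [attempt]
  (min (* 1000 (Math/pow 2 attempt)) (* 10 60 1000)))

;; bump the attempt count and push the job back onto the queue, stamped with
;; the earliest time it should be retried
(defn reschedule! [enqueue! job]
  (let [attempt (inc (:attempt job 0))]
    (enqueue! (assoc job
                :attempt attempt
                :not-before (+ (System/currentTimeMillis)
                               (long (backoff-ms attempt)))))))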

Persistent data issues

Example: the data is, and will likely remain, in a format that's either currently not useful or permanently not useful. This, too, should be met with scotch. It's arguably the worst case, because like #2, it will never be fixed through retries alone, but unlike #2, it may be entirely out of your control. Sometimes the fix involves better parsing or processing, which may be a significant engineering investment. Sometimes it just can't be fixed.

Alright, what can I do? Find a nice Islay. In most cases, the best you can do is discern this case from the others, log it, and accept that the data won't be parsed. Reattempting in any fashion is just a waste of resources.

A friend and I were discussing this blog post by Joe Duffy, about type classes and C#. It's deep enough that it took a second reading to really gel, but we'd discussed it in person after my first pass, in which I'd mostly skimmed the samples and just read the commentary. Obviously, this is not verbatim (nor did I spell out the symbols):

Me:

I've been meaning to read 'Learn You a Haskell for Great Good' for a while now, but I know literally zero Haskell. Function signature syntax like "isin :: Eq a => a -> [a] -> Bool" just seems unnecessarily...Perlish in its use of symbols.

Him:

Yeah, all function type defs in Haskell are in the form of, err...you know, when a function taking multiple arguments is represented as successive single-argument functions?

Me:

Ah, right; currying. That makes so much more sense, in retrospect.

At the time, I shrugged off its forced use in Haskell, chalking it up to intentional esotericism. Some 15 hours later, I was in the shower, having a Doc Brown moment: hey, Curry is the actual man's friggin' last name.
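
And since I had to see it spelled out for myself: Clojure doesn't curry automatically the way Haskell does, but partial application gets you the same shape. A signature like a -> [a] -> Bool really reads as a function of one argument returning a function of one argument; isin below is my own throwaway definition.

;; does x appear in the collection xs?
(defn isin [x xs]
  (boolean (some #(= x %) xs)))

;; fix the first argument; what's left is the "[a] -> Bool" half
(def three? (partial isin 3))

;; (three? [1 2 3]) ;=> true
;; (three? [4 5 6]) ;=> false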

Also, about the title.