m4rc.io

Definitely Maybe

marc@badfriend.org (Marc Bollinger) — Sat, 08 Mar 2014 00:00:00 -0800

Over the past few months, I've spent spent most of my time settling into my new (well, now not-so-new) role from working in backend Web architecture in .NET to building out data warehousing and analysis systems. It's quite a difference, not least of which going from the B2B world to the massive online B2-Web-at-large world--though most "full-stack engineer" work can be somewhat agnostic to the differences, if you're careful about your architecture.

I've also been thinking along the way about blogging in general. During the intervening months since my last post, I've tried to rethink what purpose I'm trying to serve, as well as looking at examples of bloggers at the extremes of style, while still staying within the bounds of tech. Notable examples include Ayende's insane month of June, 2007, in which he wrote 150 blog posts (more than I'm sure I tweet most months), and Steve Yegge, who wrote the longest review of Borderlands 2 I'm sure is in existence--this is somewhat uncharacteristic; he's one of my go-to reads for tech interviewers and interviewees--but hasn't written anything publicly for about 15 months. I'm not sure where I'll eventually fall over time (though I lean, as a reader, toward the Yegge model), but I have settled on some topics and themes for the forthcoming months, attempting to be both less and more pretentious.

Oh, and before I forget, relating to working at Lumos, I'd like to just put it on the Web here that R is the PHP of data analysis. I've mentioned the comparison to friends, and it's like watching them achieve a lesser form of Illumination in front of me. That shit looks like its API has been cobbled together over decades by non-programmers, and I'm sure that's somewhere close to the truth.

Probabilistic data structures

So one direction that I've decided to go is to delve into describing, providing motivation for, and implementing probabilistic data structures. It's somewhat self-serving, and in the case of Bloom filters, there already has been quite a lot said, though I suspect even those have not yet been exhausitively evangelized. My basic model for this is Kyle Kingsbury's fantasically excellent Jepsen series on exploring partition tolerance and recovery models for modern distributed databases. If you aren't already reading along, it comes highly, highly recommended. Similarly, his series on Clojure from the ground up is phenomenal, and not just for its lucidity or technical merit; here's paragraph 2:

Science, technology, engineering, and mathematics are deeply rewarding fields, yet few women enter STEM as a career path. Still more are discouraged by a culture which repeatedly asserts that women lack the analytic aptitude for writing software, that they are not driven enough to be successful scientists, that it’s not cool to pursue a passion for structural engineering. Those few with the talent, encouragement, and persistence to break in to science and tech are discouraged by persistent sexism in practice: the old boy’s club of tenure, being passed over for promotions, isolation from peers, and flat-out assault. This landscape sucks. I want to help change it.

How amazing is this guy? Anyway, similar to naming his series 'Jepsen' after Carly Rae Jepsen's "Call Me, Maybe", my project's working title is going to be 'Gallagher', after the Gallagher brothers of Oasis, whose debue albume was "Definitely Maybe", which seems an apt name for a walkthrough of probabilistic data structures.

This is certainly going to take awhile to read, ingest, synthesize, and write all of this, but I imagine the order will go something like:

Bloom filters
- A lot examples already exist of this, and some excellent ones even go into mathematical detail, proving their efficacy, but these are so prevalent in large-scale systems, often to speed tasks that are not, intrinsically, probabilistic in nature, that I think it's worth [re]visiting.
HyperLogLog
- Is a 'near-optimal' cardinality estimation structure of which I'm completely unfamiliar of the inner workings of at present, but understand that it is basically made of magic.
Count-Min Sketch
- Can be thought of simplistically as a kind of generalization of bloom filters. It uses two-dimensional arrays, instead of one-dimensional, and you're counting, instead of setting bits, but the why and how are fairly similar.

Modular synthesizers

This is something of a shorter topic, but in my personal life, I've been diving headfirst into modular synths for the past few months. It can get expensive, so I'm limiting myself to a reasonable budget, and building some units, like this random sequence generator from kits. I feel like this is nerdy enough to get lumped into the same bucket of posts, and I think I'll find some interesting insight along the way to share. This also means I've spent far too much time on MuffWiggler.

Where We're Going, We Don't Need Control Flow

marc@badfriend.org (Marc Bollinger) — Mon, 02 Sep 2013 00:00:00 -0700

I recently stumbled upon this article from a former neighbor's Twitter feed_†, which is about the generativity of programming languages, and how that relates (or doesn't) to modern copyright law. It's a great article by a student who understands technology, (some aspects of) philosophy of technology, and communication; hope for the future, n'all that. That said, it's formed more as a question than a roadmap, and it tends to get both lost in some of the technical weeds, and not far out enough into others. I'm starting to do that myself here, but for its deep-dive-into-C-compilation-for-a-short-article, notable non-mentions include Prolog, progenitor of 'version 1' in the article, and Clang, a now-ubiquitous-thanks-to-Apple C language compiler frontend, which adds yet more generative complexity to the same examples given.

So let's look at that first example, which caught my eye:

alice = one_of(1,2,3)
bob = one_of(1,2,3)
claire = one_of(1,2,3)
forbid(alice = bob  or  bob = claire  or  alice = claire)
forbid(bob = 3)
forbid(alice – claire = 1  or  claire – alice = 1)
forbid(claire < bob)
display(alice, bob, claire)

which is accompanied by the statement: "As it turns out, almost no mainstream programming language can easily express a program that looks like the first version." Good thing I don't like mainstream programming languages. So as mentioned, we can start with the fact that Prolog can do this, and more than likely informs the example (though I'm fairly certain this is at least some level of pseudocode, without brew install swi-prolog). But it's cool to note; Prolog is basically a notation for an inference engine.

I've got another blog post broaching the topic, but I've recently moved on to Lumosity, where we're using Cascalog. You can hopefully see the portmanteau coming a mile away. However, it actually refers not to Prolog at large, but Datalog, a super-early internal DSL of Prolog, which strictly refers to the inference of data and their relationships. The other half of the portmanteau is Cascading, which I'm not even going to bother linking, because it's not at issue here, and Get On With It, besides. In sum, Cascalog is a variant of a subset of Prolog, except that it runs on Hadoop, so you can reason over datasets unimaginable in 1977 (much less extant).

Anyway, after reading the article, being newish to Cascalog, and reading the quoted statement above as a challenge, my eyes lit up, and I had to write this trivial motivating example in the declarative version:

(<- 
  [?alice ?bob ?claire]
  ((one_of [1 2 3]) ?alice)
  ((one_of [1 2 3]) ?bob)
  ((one_of [1 2 3]) ?claire)
  (cross-join)

  (forbid ?alice = ?bob) (forbid ?bob = ?claire) (forbid ?alice = ?claire)
  (forbid ?bob = 3)
  (forbid ?alice - ?claire = 1) (forbid ?claire - ?alice = 1)
  (forbid ?claire < ?bob)))

There are some differences, to be sure. The first line is the projecting the values of the tuples out (which is the last line in the first example). The sixth line is a bit thornier. I'm not entirely sure yet (70%, say) of the technical reasoning, but because the primitives of Cascading don't, and weren't intended to, map one-to-one with that of Datalog, you can't just arbitrarily introduce streams of values. Cascading was ultimately built for ETLs, not reasoning over arbitrary data stores, and so it lends itself better to either transforming a single stream of tuples (including tuple generation, like a flatmap), or joining data that has some relation to the main stream of tuples (e.g. an inner or outer join on a lookup table). We really want the cross product of three tuple streams (or generators, in this case, the mystical one_of in the original example), and so we can use cross-join.

That said, using Cascalog 1.10.2, the current stable in Clojars, I wasn't able to get it working. While Cascading most certainly is available for enterprise support, the DSLs, written[ish] and maintained by Twitter, are, well, not. So things can get weird, if you walk into the weeds, like in cross-join--it's simply something that that docs say you shouldn't need in a Hadoop workflow very often and I'm not surprised it may have suddenly stopped working. Fortunately, we're using Leiningen, so this was as easy as bumping down the version of Cascalog, and re-running lein deps. Boom, works. I also introduced a filter operation forbid-sub to encapsulate subtraction (the usage here read like: "operand1 minus operand2 must not equal operand3"). There's probably a cleaner way to handle that, but Cascalog can get funny with Clojure-native operations on vars in a query, in a way that I'm not entirely familiar with yet, and fine, you have to build these out to discrete map/reduce tasks in Java, it's worth some pain.

Edit: Actually, in retrospect, there was a clean[ish] immediate solution here. Clojure supports "arity overloading", which basically means you're free to provide multiple implementations of a function, based on the number (though not type) of arguments. If we're worried about simply mimicking the interface informally laid out by the example, simply adding a five-argument implementation with op1, operator, op2, comparator, comp2 will suffice.

For the interested, the full gist is here.

†: Years ago as a recent grad, I naïvely asked a startup CEO over a phone interview why he'd choose to start the company in Silicon Valley, and not New York or elsewhere. Granted, there are some very, very smart people in New York, but the culture just doesn't hum with the kind of intellectual cross-pollination that goes on here_‡. On the best days, it yields ideas that will change the world for the better; increasingly, just white men wearing Google Glass.

‡: By here, of course, I mean the Bay Area in general. In choosing to live [and love] in Oakland, I've basically resigned myself to never working outside of SF or Oakland for the foreseeable future.

Who Peed in the Object Pool?

marc@badfriend.org (Marc Bollinger) — Wed, 12 Jun 2013 00:00:00 -0700

In the past few weeks, I'd read via Twitter an awesome rationalization for immutability, which invokes a now-deeply-held bugaboo [Source]:

GOTO was evil because we asked, "how did I get to this point of execution?" Mutability leaves us with, "how did I get to this state?"

I've been [lazily] evangelizing here, to work colleagues, around the block, ad nauseum, about the benefits of modern functional languages, not least of which being a ground-up approach to embracing immutability (by default, if not by decree). This recently bit me in the ass with some code I'd naïvely written a few years ago.

Not uncommonly, we have certain classes of data--metadata, in this case--that are extremely read-heavy, and only updated on occasion; maybe months, potentially years. In this particular case, it's metadata about financial instruments; alternate identifiers, full names, which exchange it's listed on, that sort of thing. As one might expect, this is retrieved by almost every user-facing request in some capacity; if it doesn't scale, nothing scales. The retrieval system itself is something of hybrid Strategy pattern, where each concrete strategy looks roughly like:

string GetCacheKey(Tkey key)
Tval Get(string key)
void SetOther(Tkey key, Tval value)

The basic premise here is that each request walks a list of implementations, ordered by cost-of-access from least to greatest, and that any given strategy/implementation, if successful, has the ability to promote its result to a faster cache layer, typically with a shorter TTL. The first two implementations of this are basically reading directly from memory: the first is in-process, the second is out-of-process (e.g. external cache server[s]). A network hop, especially in EC2, is not free, and again, this data doesn't change much. If we have to go out to the cache server, let's at least promote it to in-process cache, so that we have faster access next time.

Fine, great. Early in the project, I'd added copy-on-set logic to the in-process cache, so that if a request retrieved an object from slow storage and promoted, the caller can freely modify it for the lifetime of the request, while leaving a pristine copy in cache.

lulz

Like "knowing," that was half the battle, and that suited perfectly fine for a long time, because this metadata wasn't mutated by convention. Even the error case that eventually cropped up wasn't caught by any tests. At some point, some additional entitlement restrictions were added, requiring that if a user did not have particular licensing documentation in place, a few fields would need to be scrubbed before being returned.

So who peed in the object pool?

So yeah, that doesn't cut it. The thing is, while we had unit tests with pretty good coverage, no one, not test accounts, not gulp live users, caught it for a long time, because not many people had the proper licenses. When it was--painfully, and after over too long a period--discovered, it became clear that what was going down was that:

At some point, a user retrieves a value warm in cache
The user did not have the proper license, and the fields were scrubbed
All subsequent reads, until the key expired, had those fields scrubbed

So there's your culprit. The pool-pisser. And the immediate solution there was pretty simple: adding copy-on-read semantics to the in-process cache. I guess. Begrudgingly. The way forward? Oh, I don't pretend to know. But making things immmutable, at least by default, seems like a good start. Or copy-on-write, which would be transparent to the user, and carry the safety of immutability, but (on first blush) it seems like adding that kind of complexity to all your precious POJOs and POCOs and POROs would be overkill.

Seems.

Who Moved My Data Cheese?

marc@badfriend.org (Marc Bollinger) — Wed, 15 May 2013 00:00:00 -0700

Recently, during what should probably be a more broadly-practiced routine of wearing other people's hats for a week, I delved into some pretty spotty code (not the person whose hat I was wearing, just some legacy code). Unfortunately, this code also depends on the kindness of strangers (well, vendors) to not change the format of the data it's parsing (some of which may be very unclean). Upon figuring out what it was doing, and were it was going wrong, I made mention on an email thread that the retry logic is somewhat deficient. A bit later, this was met with something along the lines of:

try {
    doSomething();
}
catch {
    doSomething();
}

Well, yes, that's a good start--obviously, there are better generalized patterns for doing this, but that's out of scope for now. The broader question is: what kind of failure modes are there, and how am I handling them? These can roughly be summed up into four modes, only one of which is actually handled correctly in the above scenario:

1. Ephemeral processing issues

2. Persistent processing issues

3. Ephemeral data issues

4. Persistent data issues

Let's break those down.

Ephemeral processing issues

Example: transaction becomes a database deadlock victim. Perhaps this should be spoiler tagged, but basically: this is the only failure mode actually corrected by the above changeset. What the changeset itself is doing is trying something, failing, and immediately [key] trying again. There are definitely subclasses of problems that can be addressed this way, but there are many others not covered by this, which I'll get to.

Alright, what can I do? Well, test, for one. You can unit test, but depending on the subclass, these sorts of things can be hard to mock, and fairly low-level. Some other subclasses you can fake with known-bad data, but there are unknown-unknowns, and it's hard to fake deadlocks. You can use stubs, but this seems really, really contrived, and of dubious value. Your best bets are being very pessimistic, building a great method for logging/analyzing errors, and knowing your stack.

Persistent processing issues

Example: your code is terrible. This is the simplest class of problems to understand, but it's not fixed with the changeset, and wouldn't be fixed with an infinite loop around a try/catch block. I'm not going to beat this dead horse, because:

Alright, what can I do? This is also the simplest class of problems to solve. Persistently-failing processes on a known set of data should be cataloged, understood, corrected, and met with regression tests. Persistently-failing processes on an unknown set of data should be met with exasperation and scotch.

Ephemeral data issues

Example: sorry that your data is bad, but it'll be good, we promise! This is actually the motivating example behind this post, and the fact that it isn't covered by the changeset is the motivation for the post itself. What was at issue is that some of the data was hyperlinked to other documents, which were not available at the time of processing, yielding broken processing. I'm not familiar with the underlying data set or its [lack of] atomicity, but this can be generalized to any data set that is eventually consistent. In this case, there's nothing intrinsically wrong with the concept of simply retrying, but the chronology is almost certainly wrong: you have little control over when the retry happens (and are likely going to be wrong if you try, depending on the semantics of the api/data source)

Alright, what can I do? Again, known unkowns and estimated unknowns can be tested, regressed against, etc. What's really at issue here is your retry framework, not your retry logic. If you can't process it now, reprocessing it immediately may not help, and you're not going to be able to synchronously solve the problem. Maybe. Discerning this class from the first requires familiarity with the data set. A good solution to this problem will involve persisting the job, and inserting it in a system of work retry queues. Depending on the scenario, this may be done at regular intervals, with some sort of exponential backoff, etc.

Pesistent data issues

Example: the data is, and will likely remain, in a format that's either currently not useful, or permanently not useful. This, too, should be met with scotch. This is arguably the worst case, because like #2, this will never work through retries alone, but unlike #2, may be out of your control entirely. Sometimes, this involves better parsing or processing, which may be a significant engineering investment. Sometimes, it just can't.

Alright, what can I do? Find a nice Islay. In most cases, the best scenario is to be able to discern this case from the others, and simply log and accept that this won't be able to be parsed. Reattempting in some fashion is just a waste of resources.

We Call That 'Schönfinkelling' Where I Come From

marc@badfriend.org (Marc Bollinger) — Sun, 10 Mar 2013 00:00:00 -0800

A friend and I were discussing this blog post by Joe Duffy, about type classes and C#. It's deep enough that it took a second reading to really gel, but we'd discussed in person after the first time, in which I'd mostly skimmed the samples, and just read the commentary. Obviously, this is not verbatim (nor did I spell out the symbols):

Me:

I've been meaning to read 'Learn You a Haskell for Great Good' for a while now, but I know literally zero Haskell. Function signature syntax like "isin :: Eq a => a -> [a] -> Bool" just seems unnecessarily...Perlish in its use of symbols.

Him:

Yeah, all function type defs in Haskell are in the form of, err...you know, when a function taking multiple arguments is represented as successive single-argument functions?

Me:

Ah, right; currying. That makes so much more sense, in retrospect.

At the time, I shrugged off its forced use in Haskell, chalking it up to intentional esotericism. Some 15 hours later I was in the shower, having a Doc Brown moment. Hey! That is the actual man's friggin' last name.

Also, about the title.

Out, damned spot! Out, I say! [Pt 1]

marc@badfriend.org (Marc Bollinger) — Sat, 02 Mar 2013 00:00:00 -0800

Another post that starts its life as an email response to a coworker:

What's our spot instance strategy?

We don't really have one. Nor, I suspect, do most people. If you've somehow managed to end up here, you're most likely at least passingly familiar with EC2 spot instances, a concept that sounds cooler in theory than it is, but whose practicality can reach peak awesome, if you build your systems correctly, which I'll get into.

At first glance, it seems very interesting because, hey, it's raw computing power, sold on the open market! This sounds like the future, man. About a year ago, I was working closely with a product manager at a commodities exchange on a project involving all manner of derivatives, and in discussing cloud computing at the time, it was hard not to joke about crazy, exotic leveraging schemes and complex derivative instruments, based on computing. After all, the use of the term "spot" here is borrowed directly from a related concept in financial markets.

But you'll immediately notice some differences in the market behavior though, looking at the pricing history charts available. There's no Brownian-style motion (you can also check out how Brownian motion is applied to markets, but there aren't any gifs). Pricing over time tends to resemble a step function. There are some ways in which the EC2 spot market differs from a real, actual Market which play out in odd ways.

So how does the spot market differ from, say, an equities market (think NASDAQ)? This list may lean somewhat on financial jargon, but I'll disambiguate immediately after:

Amazon only quotes one side of the book
Market participants do not have full depth-of-book insight (or any insight)
Spot instance prices are necessarily chained to the on-demand (and reserved) price

Amazon only quotes one side of the book

Typical financial markets have a concept known as a market maker: an entity from whom you can both buy and sell a thing. One way to conceptualize this would be a currency exchange booth at an international airport; they have different prices for each direction of conversion, and pocket the difference, hoping to make enough money on the spread to offset the market volaltility (jay kay, they always do, especially at airports). This differs greatly from what Amazon does, because you can't buy a spot contract, wait for some upward price pressure, and unload your position for a profit--at the cost of system throughput, obviously; we are still talking about computers, believe it or not. There's nothing inherently wrong with this model, but it's not particularly market-like. That said, while I'm not a financier by training or trade, I'm pretty certain that in the absence of the other bullet points (among other things), even if you AMZN did quote both sides of the book, it would make working with spot instances a hellish, unlivable nightmare, unless you really, truly could not care less about spot availability.

Market participants do not have full depth-of-book insight (or any insight)

This is also a pretty big departure from real, adult markets. You may not actually need full depth-of-book access to trade securities. Maybe you got a hot tip from an AdWords ad in Gmail. Maybe you're timing the market. Maybe you're actually an idiot. Who knows! Point being, even free-ass Yahoo! Finance will give you top-of-book right now, and what any respectable algorithm will be doing is taking into account just what the stakes are on either side of the book.

To de-jargonate a bit, going back to bullet #1, "the book" in finance refers to the order book. Orders get placed into the exchange for something, and bids lower than the lowest offer, or offers higher than lowest bid, get stacked with each other, along with their respective volumes. Real market participants pay Cash Money to have access to exactly what the full state of the market looks like, because it drives the direction of the price. If people are looking to unload AAPL (which they have for some weeks now), this will be reflected in downward price pressure, as they'll quickly exhaust the existing Best Bids, and people will replace them with even lower bids, ad nauseum. Basically: just knowing the intentions of others in the market (at least, as much as they're willing to divulge; this too is a strategy) affects the market.

None of this nonsense exists in the spot world, and as a result, everyone just guesses how high everyone else is willing to go, and [hopefully] sets their price appropriately to how much risk they can tolerate, again in terms of system throughput. Here we are, talking about stupid computers again.

Spot instance prices are necessarily chained to the on-demand (and reserved) price

This is really the largest departure, in my opinion. For however much people are prattling on about the price of gold or Google, there is no externally-mandated value for these things. By comparison, it would make no sense to start making spot requests in a hot market, and live in constant fear of your instances being yanked, when you could just start up some plain ol' on-demand instances. This creates some odd effects, the most noticeable of which being that the market will always tend back to a very cheap equilibrium. Between the inherent risk and the fact that there's a fixed price for the underlying good, unless the standard, on-demand EC2 instances start seeing high rates of startup requests being denied due to capacity problems, it would make no sense for spots to ever stray. You'll never actually see that play out though, because that would involve both an incredible lack of capacity planning on Amazon's part, but also an overt move to not draw capacity for spot from the same pool as on-demand instances. That would not only be insane business--not to mention cutting off your nose to spite your face--but it would run counter to their [sensible] goal of consistency. You get a hefty discount just by paying upfront for reserved instnaces, allowing them to better plan capacity as it is.

There are certainly other deviations, but these are the ones that really pop out to me; put together, you get...not a market. I suspect this is why for the first few years, you'd see the term 'market' pop up in the AWS marketing, product descriptions, news articles, but its use has pretty much been scrubbed, save for some leftover Google ghosts. I didn't initially intend for this two be a two-part post, and I already have pictures to look at for the second half, but it was just too engrossing to give up the armchair markets discussion. I'll get into the actual computers in the next post.

Manufacturing Consent

marc@badfriend.org (Marc Bollinger) — Tue, 26 Feb 2013 00:00:00 -0800

From a coworker:

Sounds good to me. I cannot wait to see those emails go away. :)

Magically, for the past two weeks, we saw a fairly high percentage (nearing 50%) of our EC2 instances start up with misconfigured local drives. And obviously, this makes the system complain a lot. Getting a chance to dive a little deeper over the weekend, I was able to make it complain a lot more.

The quote above refers to a perfectly rational approach to solving the problem, but it did occur to me that this is basically Propaganda 101: manufacture an unendurable scenario, propose to the public a terrible--but manageable--solution, and let them come to their own conclusion. Step 4, profit.

Fist Proust

marc@badfriend.org (Marc Bollinger) — Mon, 11 Feb 2013 00:00:00 -0800

À la recherche d'un bon début.