Definitely Maybe

08 March 2014

Over the past few months, I've spent most of my time settling into my new (well, now not-so-new) role, moving from backend Web architecture in .NET to building out data warehousing and analysis systems. It's quite a difference, not least in going from the B2B world to the massive online B2-Web-at-large world--though most "full-stack engineer" work can be somewhat agnostic to the differences, if you're careful about your architecture.

I've also been thinking along the way about blogging in general. In the months since my last post, I've tried to rethink what purpose I'm trying to serve, and I've looked at examples of bloggers at the extremes of style, while still staying within the bounds of tech. Notable examples include Ayende's insane month of June 2007, in which he wrote 150 blog posts (more than I'm sure I tweet most months), and Steve Yegge, who wrote the longest review of Borderlands 2 I'm sure is in existence--this is somewhat uncharacteristic; he's one of my go-to reads for tech interviewers and interviewees--but hasn't written anything publicly for about 15 months. I'm not sure where I'll eventually fall over time (though I lean, as a reader, toward the Yegge model), but I have settled on some topics and themes for the forthcoming months, attempting to be both less and more pretentious.

Oh, and before I forget, relating to working at Lumos, I'd like to just put it on the Web here that R is the PHP of data analysis. I've mentioned the comparison to friends, and it's like watching them achieve a lesser form of Illumination in front of me. That shit looks like its API has been cobbled together over decades by non-programmers, and I'm sure that's somewhere close to the truth.

Probabilistic data structures

So one direction that I've decided to go is to delve into describing, providing motivation for, and implementing probabilistic data structures. It's somewhat self-serving, and in the case of Bloom filters, quite a lot has already been said, though I suspect even those have not yet been exhaustively evangelized. My basic model for this is Kyle Kingsbury's fantastically excellent Jepsen series on exploring partition tolerance and recovery models for modern distributed databases. If you aren't already reading along, it comes highly, highly recommended. Similarly, his series on Clojure from the ground up is phenomenal, and not just for its lucidity or technical merit; here's paragraph 2:

Science, technology, engineering, and mathematics are deeply rewarding fields, yet few women enter STEM as a career path. Still more are discouraged by a culture which repeatedly asserts that women lack the analytic aptitude for writing software, that they are not driven enough to be successful scientists, that it’s not cool to pursue a passion for structural engineering. Those few with the talent, encouragement, and persistence to break in to science and tech are discouraged by persistent sexism in practice: the old boy’s club of tenure, being passed over for promotions, isolation from peers, and flat-out assault. This landscape sucks. I want to help change it.

How amazing is this guy? Anyway, similar to naming his series 'Jepsen' after Carly Rae Jepsen's "Call Me Maybe", my project's working title is going to be 'Gallagher', after the Gallagher brothers of Oasis, whose debut album was "Definitely Maybe", which seems an apt name for a walkthrough of probabilistic data structures.

It's certainly going to take a while to read, ingest, synthesize, and write all of this, but I imagine the order will go something like:

Bloom filters
- A lot of examples of these already exist, and some excellent ones even go into mathematical detail, proving their efficacy, but they are so prevalent in large-scale systems, often to speed up tasks that are not intrinsically probabilistic in nature, that I think it's worth [re]visiting.
HyperLogLog
- Is a 'near-optimal' cardinality estimation structure whose inner workings I'm completely unfamiliar with at present, though I understand that it is basically made of magic.
Count-Min Sketch
- Can be thought of simplistically as a kind of generalization of Bloom filters. It uses a two-dimensional array instead of a one-dimensional one, and you're incrementing counters instead of setting bits, but the why and how are fairly similar.
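
To make the Bloom filter and Count-Min Sketch comparison a bit more concrete, here's a rough Python sketch of the two side by side. It's a toy, not a reference implementation: the class names, the `_indexes` helper, the default sizes, and the trick of salting SHA-256 with a seed to get several hash functions are all my own assumptions for illustration, and a real implementation would pick its sizes from a target error rate.

```python
import hashlib


def _indexes(item: str, count: int, width: int) -> list[int]:
    """Hypothetical helper: derive `count` positions in [0, width) from salted SHA-256."""
    positions = []
    for seed in range(count):
        digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
        positions.append(int.from_bytes(digest[:8], "big") % width)
    return positions


class BloomFilter:
    """A single row of bits; every hash sets a bit in that same row."""

    def __init__(self, width: int = 1024, hashes: int = 4) -> None:
        self.width = width
        self.hashes = hashes
        self.bits = [False] * width

    def add(self, item: str) -> None:
        for i in _indexes(item, self.hashes, self.width):
            self.bits[i] = True

    def might_contain(self, item: str) -> bool:
        # False means definitely absent; True only means "probably present".
        return all(self.bits[i] for i in _indexes(item, self.hashes, self.width))


class CountMinSketch:
    """One row of counters per hash; each hash increments a cell in its own row."""

    def __init__(self, width: int = 1024, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def add(self, item: str, count: int = 1) -> None:
        for row, i in enumerate(_indexes(item, self.depth, self.width)):
            self.rows[row][i] += count

    def estimate(self, item: str) -> int:
        # Collisions can only inflate a counter, so the minimum across rows
        # is the tightest available estimate; it never undercounts.
        return min(
            self.rows[row][i]
            for row, i in enumerate(_indexes(item, self.depth, self.width))
        )
```

The shape is the same in both: hash the item a few ways, touch a few cells. The Bloom filter answers "have I maybe seen this?" and can give false positives but never false negatives; the sketch answers "about how many times?" and can overcount but never undercount.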

Modular synthesizers

This is something of a shorter topic, but in my personal life, I've been diving headfirst into modular synths for the past few months. It can get expensive, so I'm limiting myself to a reasonable budget and building some units, like this random sequence generator, from kits. I feel like this is nerdy enough to get lumped into the same bucket of posts, and I think I'll find some interesting insights to share along the way. This also means I've spent far too much time on MuffWiggler.