
Big Data Hackery

Today, as I sat down to throw some lunch down my throat, I saw #myFirstCar go by on Twitter. Naturally I made a jokey post about it on Facebook, and on the train ride home it occurred to me I should probably expand on it, because a lot of people might not understand the inherent dangers of playing such an innocuous-seeming little game. If nothing else you could consider this to be a very abbreviated "Introduction to Big Data" seminar, so I'll at least save you a couple hundred dollars and an afternoon of your time.

The Post

Basically, this was my post:

Seen the #MyFirstCar hashtag go by on twitter a few times. How to have some fun:
1. Twitter streaming API to Hadoop to Elasticsearch
2. Kibana => Visualize => Data Table, Lucene query: { "terms": { "hashtag": [ 'myfirstcar','momsmaidenname','mypet','myfirstschool','childhoodstreet', 'favoritecolor' ], "tag_count": 3 } }
3. ...

... Oh, that's not what people mean by "big data hacking"? My bad.

Pithy and kind of super-duper technical, I know, so let's walk through it.

Twitter Streaming API

This is exactly as it sounds: A feed you can request from Twitter that will basically stream you tweets matching certain criteria. In our case, for the example above, we'd be interested in #myfirstcar, #momsmaidenname, #mypet, #myfirstschool, #childhoodstreet, #favouritecolour...

... Hey wait. Those kinda sound familiar right? Like the last time you signed up for something? Well, let's keep going: It only gets more fun from here.
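To make the streaming bit concrete, here's a hedged little sketch in Python of the "pick out the hashtags we care about" step. It assumes the field names from the classic v1.1 tweet JSON (entities.hashtags and friends); if your payloads look different, adjust accordingly:

```python
# Sketch: pull the tracked hashtags out of one raw streaming-API payload.
# Field names follow the classic v1.1 tweet JSON ("entities.hashtags",
# "user.screen_name"); treat them as assumptions on a newer API.

TRACKED = {"myfirstcar", "momsmaidenname", "mypet", "myfirstschool",
           "childhoodstreet", "favouritecolour"}

def tracked_hashtags(tweet: dict) -> set:
    """Return the lowercased tracked hashtags present in one tweet."""
    tags = {h["text"].lower()
            for h in tweet.get("entities", {}).get("hashtags", [])}
    return tags & TRACKED

sample = {
    "user": {"screen_name": "myhandle"},
    "text": "Man you guys are lucky, I had a Model T. #myfirstcar",
    "entities": {"hashtags": [{"text": "MyFirstCar"}]},
}
print(tracked_hashtags(sample))  # {'myfirstcar'}
```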

Apache Hadoop

In the simplest and most vulgar terms, Hadoop is designed to comb through a metric fuckton (we're talking terabytes) of data, have its way with it, then dump it somewhere else. It's of use to us because the tweet data in the feed contains far, far, far more information than we actually need, so we're going to use it to perform an operation called "Map/Reduce". This is the term the hepcats use for taking a bunch of data and picking out the little bits they want.

Basically, think of it as a sausage machine: You just keep throwing shit in the top and, when you turn the handle, it comes out the other side all neatly packaged.

Ed: In reality you'd need a lot more than this because you'd have to stitch the streaming API onto Hadoop's MapReduce ... Okay, fine, all you'd need are Flume, twitter4j, and a couple dozen lines of Java, but if you want to be a dick about things you can do even less work by giving Amazon a few pennies and feeding a Kinesis Data Stream into EMR. [Middle finger emoji]
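If you've never seen Map/Reduce, here's the sausage machine in miniature: plain Python standing in for Hadoop, so don't mistake it for actual MapReduce code. Map emits (user, hashtag) pairs for the tags we track; reduce groups by user and counts distinct tags:

```python
# Toy illustration of the Map/Reduce idea -- NOT real Hadoop code.
# Map: each tweet emits (username, hashtag) pairs for tracked tags.
# Reduce: group by user and count the distinct tracked tags they've used.

from collections import defaultdict

TRACKED = {"myfirstcar", "momsmaidenname", "mypet", "myfirstschool",
           "childhoodstreet", "favouritecolour"}

def map_phase(tweets):
    for t in tweets:
        user = t["user"]["screen_name"]
        for h in t["entities"]["hashtags"]:
            tag = h["text"].lower()
            if tag in TRACKED:
                yield user, tag

def reduce_phase(pairs):
    by_user = defaultdict(set)
    for user, tag in pairs:
        by_user[user].add(tag)
    return {user: len(tags) for user, tags in by_user.items()}

tweets = [
    {"user": {"screen_name": "theirhandle"},
     "entities": {"hashtags": [{"text": "MyFirstCar"}]}},
    {"user": {"screen_name": "theirhandle"},
     "entities": {"hashtags": [{"text": "momsmaidenname"},
                               {"text": "milton"}]}},
]
print(reduce_phase(map_phase(tweets)))  # {'theirhandle': 2}
```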


Elasticsearch

In the beginning, computers stored their data in databases. Databases are generally relational in that they contain several tables with rigid structures, all of which are connected somehow, usually by an auto-generated numeric ID. Elasticsearch is not a database; it's more like a drawer in a filing cabinet. Well, actually it's more like a filing cabinet where data is sharded across all the drawers so, no matter how full it gets, it'll still be able to crunch through it faster than a flushing toilet.

What we're going to have it do is store all of the interesting data that Hadoop pulls out of the tweets. Bare minimum we'd keep username, timestamp, hashtags (because you never know), location (if they had it on), and the actual tweet itself.

Ed: In the cloud, you'd just tell EMR to send whatever to the Elasticsearch Service. Rolling your own you'd need Elasticsearch for Hadoop.
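To give you an idea of what "bare minimum" looks like once Hadoop has had its way with a tweet, here's a hypothetical trim-it-down function. The output field names are my own invention, nothing mandates them, and the input again assumes the old v1.1 tweet JSON:

```python
# Hypothetical shape of a trimmed document before indexing. Output field
# names are an arbitrary choice; input fields assume v1.1 tweet JSON.

def to_document(tweet: dict) -> dict:
    return {
        "username": tweet["user"]["screen_name"],
        "timestamp": tweet["created_at"],
        "hashtags": [h["text"].lower()
                     for h in tweet["entities"]["hashtags"]],
        # "coordinates" is null unless the user had location on
        "location": (tweet.get("coordinates") or {}).get("coordinates"),
        "text": tweet["text"],
    }

raw = {
    "user": {"screen_name": "myhandle"},
    "created_at": "Wed Oct 10 20:19:24 +0000 2018",
    "entities": {"hashtags": [{"text": "MyFirstCar"}]},
    "coordinates": None,
    "text": "Man you guys are lucky, I had a Model T. #myfirstcar",
}
print(to_document(raw)["hashtags"])  # ['myfirstcar']
```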


Kibana

The same company that makes Elasticsearch also makes a visualization tool called Kibana. When you hear a collection of buzzwords resembling "big data driven business intelligence dashboard", what they mean is they're using Kibana to make a bunch of pretty graphs out of data stored in Elasticsearch. (Or they're using something like PowerBI [no link haha], but that's made by Microsoft and, well: This is X. X thinks Microsoft is the answer. Microsoft is never the answer. Don't be like X.)

The things separated by arrows in this step are literally what you would click on in its interface. First, we tell it we want to "Visualize" data. Next, we tell it the format in which we want it to do so (this is where we'd pick the type: data table, graph, heat map, single number, gauge, etc). The bit after the comma? How's about some stepception:

Apache Lucene

Under the hood, Elasticsearch is built on an extremely badass full-text search engine called Lucene. Think of it as those dumb terminals the library had: You can either have it blast through everything to find stuff that vaguely resembles a phrase (kind of like your own Google, it even returns relevancy scores) or be extremely specific (Literally, Author:This Person AND Subject:Hacking).

The stuff in brackets is a JSON blob (the squiggly brackets mean "I'm giving you a bunch of this-equals-thats" and the square brackets are lists of things). Rather than a "query" (the Google-like bit), we're telling it we want "terms" (the author bit) and what we're looking for is anybody with tweets matching at least 3 of these hashtags.
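For the pedants: one shape Elasticsearch's query DSL would actually accept for "at least 3 of these hashtags" is a bool query with minimum_should_match, sketched here as a Python dict (the field name is whatever you called it when indexing). Note it matches per tweet, not per user, so rolling hits up per user takes more work than this one query:

```python
# A query body the Elasticsearch query DSL would accept: a bool query where
# at least 3 of the should clauses must match on a single document.
# "hashtags" is assumed to be the field name chosen at indexing time.

TRACKED = ["myfirstcar", "momsmaidenname", "mypet",
           "myfirstschool", "childhoodstreet", "favouritecolour"]

query = {
    "query": {
        "bool": {
            "should": [{"term": {"hashtags": tag}} for tag in TRACKED],
            "minimum_should_match": 3,
        }
    }
}
```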

Back at it

So, literally what I'm staring at now is an automatically updating table where the first column is a Twitter username, the columns after are the text of every tweet, basically:

@myhandle     No way, it's all about green! #teamgreen #favouritecolour
              Man you guys are lucky, I had a Model T. #myfirstcar
              All the cool kids went to Robert Baldwin #myfirstschool
@theirhandle  Nobody ever had as much fun as the #satokcrew #milton #childhoodstreet
              My old dog CW was the absolute shit #myfirstpet #dog
              Everybody feared my F150. #myfirstcar
              You weren't cool unless you were a Howard #momsmaidenname

It might not be formatted pretty, but we're not convincing a C-level executive to greenlight our latest harebrained scheme, we're pwning people's shit. This is fine.

Ed: I tend to use Kibana as nothing more than a web interface for Elasticsearch. My personal preference is Grafana because Kibana can only talk to Elasticsearch and my shit's all over the place.

Ed2: No, that query will not return the data as nicely tabular as I describe. As with the MacGyver episodes involving explosives, there is a bunch of shit missing around it. To make it even work you'd need to add another section to create an — Heyyyyy, I see what you did there. Tricky sumbitch.

How it works

First, I'm going to go technical: As the Twitter streaming API feeds tweets to Flume via twitter4j, it writes them somewhere Hadoop can get at them. When Hadoop realizes they're there, it goes through the MapReduce operation you coded and feeds the result into Elasticsearch. Once they're there, you can use Kibana's time range selector to ensure you're getting "fresh" information or just watch for anything new.

Now, the easy way: The Twitter streaming API throws shit into the top of the sausage maker that is Hadoop. As you turn the handle and cut them off, they fall into the deep freeze that is Elasticsearch. Kibana is the chef: it demands Elasticsearch give it an Oktoberfest and two Kielbasa, then cooks them and serves them up on the plate that is your web browser.

You can even take a modified version of that query and have it send you an email whenever a new hit shows up.
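The alerting bit boils down to "remember who you've already seen, alert on anyone new." A toy sketch of that diffing logic follows; the actual Elasticsearch polling and the actual email sending are left out on purpose:

```python
# Sketch of the alert-on-new-hits idea: each poll, compare fresh hits
# against what we've already seen and report only the new usernames.
# Wiring this to a real query and a real mailer is left as an exercise.

def new_hits(seen: set, current_hits: list) -> list:
    """Return usernames from current_hits we haven't alerted on yet."""
    fresh = [u for u in current_hits if u not in seen]
    seen.update(fresh)
    return fresh

seen = set()
print(new_hits(seen, ["@myhandle", "@theirhandle"]))  # ['@myhandle', '@theirhandle']
print(new_hits(seen, ["@myhandle", "@newvictim"]))    # ['@newvictim']
```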

Staying Safe

I openly admit that some of the examples above are factual, but the ones that are will never correspond to anything I've actually used as a security answer. The rest? Well, they're simply misleading bullshit because, c'mon, who's to say I didn't grow up on Satok?

In fact, if you search for the #myfirstcar hashtag, you'll see people posting pictures of Power Wheels and tricked-out tricycles and big wheels and shit. That's how you have fun playing the game while not providing any actually useful information when you inevitably show up in that Kibana table.

Ed: And no, this isn't an optimal setup. It literally came to me in the 5 minutes it took to write that Facebook post. The crux of it is: stay safe, because there are people like me out there, and they do not draw the line at this just being an intellectual exercise.

Ed2: There's a chance I misunderstood the term "big data hacking".