27 August 2014

At Metamarkets, we run a lambda architecture composed of Kafka, Storm, Hadoop, and Druid to power interactive historical and real-time analysis of event streams. Around the office, we tongue-in-cheekily call the setup the "RAD stack." It stands for "Real-time Analytics Data stack," you see. As of this writing, the system handles some tens of billions of new events per day.

In our version of the architecture, Kafka acts as the origin data source for both real-time and batch pipelines. In the real-time pipelines, Storm topologies read events from Kafka, process them, and stream them into Druid. In the batch pipelines, all events are copied from Kafka to S3 and are then processed and loaded into Druid by a Hadoop flow that applies the same logic as the Storm topology. We use a high-level abstraction to generate both the Hadoop and Storm jobs, although it's also possible to write them independently.
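As a rough illustration of the real-time path, here's a minimal sketch of a Storm topology that reads raw events from Kafka and hands them to a processing bolt, written against the Storm 0.9-era backtype.storm and storm-kafka APIs. The topic name, ZooKeeper host, parallelism hints, and bolt class are placeholders rather than our production configuration, and the final step that streams processed rows into Druid (for example via a library such as Tranquility) is omitted.

```java
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;

public class EventTopology {

  // Placeholder processing bolt: in the real pipeline this is where events
  // are parsed, filtered, and enriched; here it simply forwards the raw line.
  public static class ProcessEventBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
      String raw = tuple.getString(0);
      collector.emit(new Values(raw));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("event"));
    }
  }

  public static void main(String[] args) throws Exception {
    // Kafka spout: read the raw event stream. Topic, ZooKeeper host, and the
    // offset path are illustrative values, not our production settings.
    SpoutConfig spoutConfig = new SpoutConfig(
        new ZkHosts("zk1.example.com:2181"), "events", "/kafka-storm", "event-topology");
    spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("kafka", new KafkaSpout(spoutConfig), 4);
    builder.setBolt("process", new ProcessEventBolt(), 8).shuffleGrouping("kafka");
    // A final bolt (omitted here) would stream the processed rows into
    // Druid's real-time ingestion.

    StormSubmitter.submitTopology("event-topology", new Config(), builder.createTopology());
  }
}
```

The per-event logic itself lives behind the shared abstraction mentioned above, which is what keeps the Storm topology and the Hadoop batch flow applying the same transformations.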

Druid natively supports both real-time and batch ingestion, so we didn't need to implement anything extra to perform aggregations or merge real-time and batch results. Once loaded into Druid, batch-ingested data automatically replaces the data that was loaded in real-time for the same interval, providing a unified view.
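Concretely, the batch job's ingestion spec declares which time intervals it covers; when the resulting segments are published, Druid's segment versioning makes them shadow the real-time segments for those intervals. The fragment below is only illustrative: the field names follow Druid's documented uniform granularitySpec (and have shifted slightly across Druid versions), and the interval is a placeholder.

```json
"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "hour",
  "intervals": ["2014-08-26T00:00:00Z/2014-08-27T00:00:00Z"]
}
```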

If you're interested, you can read a full post about the pipeline here: https://metamarkets.com/2014/building-a-data-pipeline-that-handles-billions-of-events-in-real-time/. We also have a talk from Strata 2014 and a corresponding slide deck online.


