A few weeks ago I mentioned that Yahoo! Labs was working on something called S4 for real-time data analysis. Yesterday they released an 8-page paper with a detailed description of how and why they built it. The abstract from the paper is quoted below.
It's interesting to note that the authors compared S4 with MapReduce and explained that MapReduce is too heavily optimized for batch processing to be a good place for real-time computation. They also made an architectural decision not to build a single system that does both offline (batch) processing and real-time processing, since they feared such a system would end up being good at neither.
S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data. Keyed data events are routed with affinity to Processing Elements (PEs), which consume the events and do one or both of the following: (1) emit one or more events which may be consumed by other PEs, (2) publish results. The architecture resembles the Actors model, providing semantics of encapsulation and location transparency, thus allowing applications to be massively concurrent while exposing a simple programming interface to application developers. In this paper, we outline the S4 architecture in detail, describe various applications, including real-life deployments. Our design is primarily driven by large scale applications for data mining and machine learning in a production environment. We show that the S4 design is surprisingly flexible and lends itself to run in large clusters built with commodity hardware.
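To make the abstract's idea of keyed events and Processing Elements a little more concrete, here is a toy single-process sketch in Python. It is not S4's API: the names `WordCountPE`, `TotalPE`, `run`, and the `(stream, key, value)` event tuple are all made up for illustration, and there is no distribution or fault tolerance here. It only shows the routing idea: each distinct key gets its own PE instance, and a PE can emit new events, publish a result, or both.

```python
from collections import deque


class WordCountPE:
    """Hypothetical PE: one instance exists per distinct word (the key)."""

    def __init__(self, key, published):
        self.key = key
        self.count = 0
        self.published = published  # shared dict standing in for "publish results"

    def process(self, value):
        self.count += value
        self.published[self.key] = self.count  # (2) publish a result
        return [("total", "all", value)]       # (1) emit an event for another PE


class TotalPE:
    """Hypothetical PE: a single instance (key "all") that sums every word event."""

    def __init__(self, key, published):
        self.key = key
        self.total = 0
        self.published = published

    def process(self, value):
        self.total += value
        self.published["__total__"] = self.total
        return []  # emits nothing further


def run(events, pe_types):
    """Route each (stream, key, value) event to the PE instance owning that key."""
    published = {}
    instances = {}  # (stream, key) -> PE instance, created lazily on first event
    queue = deque(events)
    while queue:
        stream, key, value = queue.popleft()
        pe = instances.get((stream, key))
        if pe is None:
            pe = pe_types[stream](key, published)
            instances[(stream, key)] = pe
        queue.extend(pe.process(value))  # emitted events go back into the flow
    return published


PE_TYPES = {"word": WordCountPE, "total": TotalPE}
words = "to be or not to be".split()
results = run([("word", w, 1) for w in words], PE_TYPES)
print(results)
```

Because events with the same key always reach the same PE instance, each PE can keep its state (here a counter) without locks, which is roughly the encapsulation the abstract attributes to the Actors model. In a real cluster the lookup in `instances` would be replaced by routing the event to whichever node owns that key.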