Real-Time MapReduce using S4
While trying to figure out how to do real-time log analysis in my own organization, I realized that most map-reduce frameworks are designed to run as batch jobs in a time-delayed manner rather than answer instantaneously, the way a SQL query against a MySQL database does. Some frameworks are bucking the trend. Yahoo! Labs recently announced that their “Advertising Sciences” group has built a general-purpose, real-time, distributed, fault-tolerant, scalable, event-driven, expandable platform called “S4”, which allows programmers to easily implement applications for processing continuous, unbounded streams of data.
S4 clusters are built using low-cost commoditized hardware, and leverage many technologies from Yahoo!’s Hadoop project. S4 is written in Java and uses the Spring Framework to build a software component architecture. Over a dozen pluggable modules have been created so far.
Why do we need a real-time map-reduce framework?
Applications such as personalization, user feedback, malicious traffic detection, and real-time search require both very fast response and scalability. In S4 we abstract the input data as streams of key-value pairs that arrive asynchronously and are dispatched intelligently to processing nodes that produce data sets of output key-value pairs. In search, for example, the output data sets are made available to the serving system before a user executes her next search query. We use this rapid feedback to adapt the search models based on user intent.
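To make the key-value stream idea a little more concrete, here is a minimal sketch in plain Java. It is not S4's actual API: the class `KeyedStreamSketch`, its `Node` type, and the `dispatch` and `nodeIndexFor` methods are all hypothetical names invented for illustration, and simple hash partitioning stands in for whatever "intelligent" dispatch S4 really performs. Each event is a (key, value) pair, every event with the same key lands on the same in-memory node, and the node emits an updated (key, running total) pair as soon as the event arrives instead of waiting for a batch to finish.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a single-process stand-in for a cluster of processing nodes.
public class KeyedStreamSketch {

    /** One processing node: keeps running totals for the keys routed to it. */
    static final class Node {
        final Map<String, Integer> counts = new HashMap<>();

        // Consume one input (key, value) event and emit the updated (key, total) pair.
        Map.Entry<String, Integer> onEvent(String key, int value) {
            int total = counts.merge(key, value, Integer::sum);
            return Map.entry(key, total);
        }
    }

    private final List<Node> nodes = new ArrayList<>();

    KeyedStreamSketch(int nodeCount) {
        for (int i = 0; i < nodeCount; i++) {
            nodes.add(new Node());
        }
    }

    // Assumption: plain hash partitioning, so the same key always maps to the same node.
    int nodeIndexFor(String key) {
        return Math.floorMod(key.hashCode(), nodes.size());
    }

    // Route one event to its node and return the output pair immediately.
    Map.Entry<String, Integer> dispatch(String key, int value) {
        return nodes.get(nodeIndexFor(key)).onEvent(key, value);
    }

    public static void main(String[] args) {
        KeyedStreamSketch cluster = new KeyedStreamSketch(3);
        // A made-up stream of search queries, each counted as one event.
        String[] queries = {"shoes", "books", "shoes", "shoes", "books"};
        for (String query : queries) {
            Map.Entry<String, Integer> out = cluster.dispatch(query, 1);
            System.out.println(out.getKey() + " -> " + out.getValue());
        }
    }
}
```

The point of the sketch is the contrast with a batch job: the count for a query is available right after each event, which is what lets a serving system see fresh output before the user's next search. A real deployment would also need the distribution and fault tolerance mentioned above, which this toy leaves out entirely.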
Read more: Original post from Yahoo! Labs