Real-time analytics with MapReduce and in-memory computing
In many operational systems, live, fast-changing data sets churn through active, ongoing operations. Combining in-memory computing with data-parallel analysis (such as MapReduce) on a cluster of commodity servers allows these systems to continuously track and analyze the data, extract important patterns, and generate immediate feedback that steers the system's behavior. Together, these technologies make it possible to build an in-memory model of the active entities within an operational system, one that tracks changes to the corresponding real-world entities and analyzes them in parallel. Such a model provides a natural basis for correlating incoming events, enriching them with relevant historical information, and structuring a parallel analysis of aggregate behavior.
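As a minimal sketch of this pattern (not the talk's actual platform; the names `Entity`, `ingest`, and `analyze` are invented for illustration), each incoming event is routed to an in-memory object for its entity, and a map-style pass then analyzes all entities in parallel:

```python
# Hedged illustration: an in-memory model that correlates incoming events by
# entity id and runs a data-parallel aggregate analysis over all entities.
from concurrent.futures import ThreadPoolExecutor

class Entity:
    """Hypothetical stand-in for one real-world entity tracked in memory."""
    def __init__(self, entity_id):
        self.entity_id = entity_id
        self.history = {}          # enrichment data, loaded from secondary storage
        self.events = []           # time-ordered incoming events

    def update(self, event):
        self.events.append(event)  # correlate: route the event to its entity

model = {}                         # the in-memory model, keyed by entity id

def ingest(event):
    entity = model.setdefault(event["id"], Entity(event["id"]))
    entity.update(event)

def analyze(entity):
    return len(entity.events)      # placeholder per-entity metric

for ev in [{"id": "a"}, {"id": "b"}, {"id": "a"}]:
    ingest(ev)

# Data-parallel "map" across all in-memory entities, then an aggregate "reduce":
with ThreadPoolExecutor() as pool:
    per_entity = list(pool.map(analyze, model.values()))
total_events = sum(per_entity)
```

In a real deployment the model would be partitioned across the cluster's memory and the map step would run where each partition lives; the thread pool here only stands in for that distribution.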
For example, clickstream data from a population of online shoppers can update an in-memory model of individual shoppers. Using an object-oriented approach, each shopper is represented by an in-memory object that holds a dynamic, time-ordered collection of clickstream events along with preferences and historical shopping patterns retrieved from secondary storage. This object-oriented view makes it easy to correlate incoming events, and it provides the basis for a data-parallel analysis of shopping activity, both to generate immediate, personalized recommendations for active shoppers and to spot emerging trends (such as identifying the most popular items or gauging the response to a sale).
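One way to sketch the "most popular items" trend analysis, under the assumption of a Python model with an invented `Shopper` class, is a MapReduce-style pass: map each shopper's clickstream to per-item counts, then reduce the counts into a global tally.

```python
# Illustrative sketch, not production code: in-memory shopper objects plus a
# map/reduce pass over them to find the most popular items.
from collections import Counter
from functools import reduce

class Shopper:
    def __init__(self, shopper_id):
        self.shopper_id = shopper_id
        self.preferences = []   # would be loaded from secondary storage
        self.clicks = []        # time-ordered clickstream events

    def record_click(self, item):
        self.clicks.append(item)

shoppers = {}
for sid, item in [("s1", "shoes"), ("s2", "hat"), ("s1", "hat"), ("s3", "hat")]:
    shoppers.setdefault(sid, Shopper(sid)).record_click(item)

# Map: per-shopper item counts.  Reduce: merge them into a global tally.
mapped = (Counter(s.clicks) for s in shoppers.values())
totals = reduce(lambda a, b: a + b, mapped, Counter())
top = totals.most_common(1)   # -> [('hat', 3)]
```

Because the map step touches each shopper independently, it parallelizes naturally across however many servers hold the in-memory model.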
This talk will describe the use of in-memory models to obtain operational intelligence in several scenarios, including financial services, e-commerce, and cable-based media. It will show both how the model is constructed and how a data-parallel analysis can be implemented to provide immediate feedback. Performance results from a simulation of 10 million live cable-TV set-top boxes will illustrate how an in-memory model was used to correlate and enrich 25,000 events per second and complete a parallel analysis every 10 seconds on a cluster of commodity servers.
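The correlate-and-enrich step from that simulation can be hedged into a small sketch. Assume (purely for illustration; the field names and metadata are invented) that each event carries a set-top box id, that per-box state lives in memory, and that static subscriber metadata is available for enrichment:

```python
# Hedged sketch of correlating and enriching one set-top box event.
box_metadata = {"box42": {"zip": "98101"}}   # static data from secondary storage
box_state = {}                               # in-memory model, keyed by box id

def correlate_and_enrich(event):
    # Correlate: fold the event into this box's running in-memory state.
    state = box_state.setdefault(event["box"], {"channel_changes": 0})
    state["channel_changes"] += 1
    # Enrich: attach relevant historical/static information to the event.
    event["zip"] = box_metadata.get(event["box"], {}).get("zip")
    return event

enriched = correlate_and_enrich({"box": "box42", "channel": 7})
```

Because each event touches only its own box's state, this per-event work shards cleanly by box id, which is what lets a commodity cluster sustain tens of thousands of events per second while a separate parallel pass analyzes the accumulated state on a fixed cadence.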
The talk will also examine the simplifications this approach offers over directly analyzing incoming event streams from an operational system with complex event processing or Storm. Lastly, it will explain the key requirements an in-memory model places on the computing platform, in particular real-time updating of individual objects and high availability, and compare these requirements to the design goals of stream processing in Spark.