Big Data Spain

15th ~ 16th OCT 2015 MADRID, SPAIN #BDS15




THE 4th EDITION OF BIG DATA SPAIN IN OCT 2015 WAS A RESOUNDING SUCCESS.

REAL-TIME ANOMALY DETECTION WITH CASSANDRA, SPARK ML AND AKKA

Friday 16th

from 17:15 to 18:00

Room 25


Technical

Banks are innovating. The purpose of this innovation is to transform bank services into meaningful and frictionless customer experiences. A key element in achieving that ambitious goal is providing well-tailored, reactive APIs as the building blocks for greater and smoother customer journeys and experiences. For these APIs to work, internal processes have to evolve as well, from batch processing to real-time event processing.


In this talk, after a brief introduction to the streaming computing landscape, we describe a RESTful API called “Coral”, meant for designing and deploying customized and flexible data flows as a web service. The user can compose data flows for a number of data streaming goals, such as on-the-fly data clustering and classification, streaming analytics, per-event predictive analysis, and real-time recommenders. Once the events are processed, Coral passes the resulting analysis on as actionable events for alerting, messaging, or further processing by other systems. Coral is a flexible and generic event processing platform that transforms streaming data into actionable events via a RESTful API. Data flows are defined via the web API by connecting together basic stream processing elements named “coral actors”. The Coral framework manages those coral actors on a distributed and scalable architecture.
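As a rough illustration of composing a data flow from small processing elements, the sketch below builds a linear flow from a JSON definition. It does not use Coral's actual API or schema: the JSON layout, the actor type names (`threshold`, `tag`), and the helper `build_flow` are all assumptions made for this example, and plain Python functions stand in for the actors.

```python
import json

# Hypothetical flow definition. This is NOT Coral's real JSON schema.
FLOW_JSON = """
{
  "actors": [
    {"name": "check", "type": "threshold", "params": {"field": "amount", "limit": 1000}},
    {"name": "alert", "type": "tag", "params": {"label": "high_amount"}}
  ],
  "links": [["check", "alert"]]
}
"""

def make_threshold(params):
    """Build a step that passes an event on only if its field exceeds the limit."""
    field, limit = params["field"], params["limit"]
    def step(event):
        return event if event[field] > limit else None
    return step

def make_tag(params):
    """Build a step that marks an event with an alert label."""
    label = params["label"]
    def step(event):
        return {**event, "alert": label}
    return step

# Registry of the (hypothetical) element types a flow may reference.
ACTOR_TYPES = {"threshold": make_threshold, "tag": make_tag}

def build_flow(definition):
    """Turn a JSON flow definition into a function that processes one event."""
    spec = json.loads(definition)
    steps = {a["name"]: ACTOR_TYPES[a["type"]](a["params"]) for a in spec["actors"]}
    next_of = dict(spec["links"])            # source actor -> destination actor
    order = [spec["actors"][0]["name"]]      # assume the first actor is the entry point
    while order[-1] in next_of:
        order.append(next_of[order[-1]])

    def run(event):
        for name in order:
            event = steps[name](event)
            if event is None:                # a step dropped the event
                return None
        return event
    return run

flow = build_flow(FLOW_JSON)
```

Calling `flow({"amount": 5000})` returns the event with an added `alert` field, while `flow({"amount": 50})` returns `None` because the threshold step drops it. A real deployment would run each element as a distributed actor instead of an in-process function.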

Streaming and real-time data processing and analytics are key elements of an improved customer experience. In this way, you can get the most targeted processing for your domain (marketing customization, personalized recommenders, fraud detection, real-time security alerting, etc.). This streaming “data flow” model implies processing customers’ events as soon as they enter via web APIs. This approach borrows a lot from distributed “data flow” concepts developed for processor architectures back in the 1980s. The “Coral” stream processing engine is generic, built on top of world-class libraries such as Akka and Spark, and fully exposed via a RESTful web API.

In this talk we detail how to combine Akka, Cassandra, and Spark to provide a real-time, streaming anomaly detection system. In the system presented, Spark runs the data analytics pipeline for anomaly detection. By running Spark on the latest events and data, we make sure that the model is always up to date and that the number of false positives is kept low, even under changing trends and conditions. Our machine learning pipeline uses Spark decision tree ensembles and k-means clustering. Data events are collected in Cassandra and extracted to Spark to perform the machine learning analytics.
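To make the clustering side of this concrete, here is a minimal sketch of k-means used for anomaly detection: events far from every cluster centroid are flagged. It is plain Python standing in for Spark's k-means, not the pipeline from the talk. The sample points, the distance threshold, and the function names (`kmeans`, `nearest`, `is_anomaly`) are assumptions chosen for illustration, and the decision tree ensembles are not covered.

```python
import math

def nearest(point, centroids):
    """Return (index, distance) of the centroid closest to the point."""
    dists = [math.dist(point, c) for c in centroids]
    i = min(range(len(centroids)), key=dists.__getitem__)
    return i, dists[i]

def kmeans(points, k, iterations=10):
    """Plain k-means with a deterministic start: the first k points."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)[0]].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def is_anomaly(point, centroids, threshold):
    """Flag a point whose distance to the nearest centroid exceeds the threshold."""
    return nearest(point, centroids)[1] > threshold

# Illustrative "recent events" forming two groups of normal behaviour.
recent_events = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (0.8, 1.2), (9.2, 8.8), (8.8, 9.2)]
centroids = kmeans(recent_events, k=2)
```

With these sample points, `is_anomaly((5.0, 5.0), centroids, 2.0)` is true because the point lies far from both groups, while `is_anomaly((1.1, 0.9), centroids, 2.0)` is false. Retraining on the latest events moves the centroids, which is how a regularly refreshed model can keep false positives low as behaviour drifts.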

Once the model is trained in Spark, the model’s parameters are stored back in a model table in Cassandra. These parameters are then accessed by our Streaming Event Processing Layer, implemented in Akka. The Akka layer then scores thousands of events per second according to the latest model provided by Spark. Spark and Akka communicate with each other using Cassandra as a low-latency data store.
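The handoff between training and scoring can be sketched as follows. An in-memory dictionary stands in for the Cassandra model table, and plain functions stand in for the Spark trainer and the Akka scoring layer. The table layout, the integer version numbers, and the names `publish_model`, `load_latest_model`, and `score_event` are assumptions for this example, not the schema or code used in the talk.

```python
import json
import math

# Stand-in for the Cassandra model table: version number -> serialized parameters.
model_table = {}

def publish_model(version, centroids, threshold):
    """Trainer side: store the parameters of a freshly trained model."""
    model_table[version] = json.dumps({"centroids": centroids, "threshold": threshold})

def load_latest_model():
    """Scoring side: fetch the parameters with the highest version number."""
    version = max(model_table)
    return version, json.loads(model_table[version])

def score_event(event, model):
    """Score one event by its distance to the nearest centroid of the model."""
    distance = min(math.dist(event, c) for c in model["centroids"])
    return {"event": event, "distance": distance, "anomaly": distance > model["threshold"]}

# An older model knows one cluster; a retrained model has learned a second one.
publish_model(1, [[1.0, 1.0]], 2.0)
old_model = load_latest_model()[1]
publish_model(2, [[1.0, 1.0], [9.0, 9.0]], 2.0)
version, model = load_latest_model()

event = [9.1, 9.0]
old_result = score_event(event, old_model)
new_result = score_event(event, model)
```

Here the same event is flagged as an anomaly by the older model but not by the retrained one, which illustrates why the scoring layer should always read the most recent parameters. Because the two sides share only the table, the trainer and the scorer can be deployed and scaled independently.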

By doing so, we make sure that every element of this solution is resilient and distributed. Spark runs micro-batches to keep the model up to date, while Akka detects new anomalies by using the latest Spark-generated model. Data and models are both persisted in Cassandra.


Natalino Busa

ING, Data Architect