Data mining solution by integrating Spark and Cassandra

15:00 ~ 15:45

Sala 5

Spark and Cassandra are good tools on their own, but combining these technologies yields an excellent and efficient data mining solution.

Spark is a new cluster-computing framework that can run applications up to 40× faster than Hadoop by keeping data in memory, and can be used interactively to query large datasets with sub-second latency.
Spark provides a new storage primitive called Resilient Distributed Datasets (RDDs). RDDs let users store data in memory across queries and provide fault tolerance without requiring replication, by tracking how to recompute lost data starting from the base data on disk. This lets RDDs be read and written up to 40× faster than typical distributed file systems, which translates directly into faster applications.
Besides making cluster applications fast, Spark also aims to make them easier to write, through a concise language-integrated programming interface in Scala, a popular functional language for the JVM.
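
The recompute-instead-of-replicate idea behind RDDs can be illustrated with a small toy model. The sketch below is plain Python and is not Spark's API: `ToyRDD`, `lose_cache` and `recomputations` are hypothetical names invented for this illustration, and unlike Spark it caches every result automatically. It only assumes what the description above states: each dataset remembers how it was derived, so a lost in-memory copy can be rebuilt from its parent data.

```python
from typing import Callable, Iterable, List, Optional


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's API).

    It keeps the recipe used to derive its data (its lineage), so the
    in-memory copy can be rebuilt if it is lost.
    """

    def __init__(self, compute: Callable[[], Iterable]):
        self._compute = compute              # lineage: how to rebuild the data
        self._cache: Optional[List] = None   # in-memory copy, may be lost
        self.recomputations = 0              # how many times the recipe ran

    def map(self, fn: Callable) -> "ToyRDD":
        # The child only records the transformation; nothing runs yet.
        return ToyRDD(lambda: (fn(x) for x in self.collect()))

    def filter(self, pred: Callable) -> "ToyRDD":
        return ToyRDD(lambda: (x for x in self.collect() if pred(x)))

    def collect(self) -> List:
        # Serve from memory when possible, otherwise replay the lineage.
        if self._cache is None:
            self._cache = list(self._compute())
            self.recomputations += 1
        return self._cache

    def lose_cache(self) -> None:
        # Simulates losing the in-memory copy, e.g. when a node fails.
        self._cache = None


# Base data standing in for a file on disk (made-up values).
numbers = ToyRDD(lambda: [1, 2, 3, 4, 5, 6])
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

first = evens_squared.collect()    # computed once, then kept in memory
second = evens_squared.collect()   # served from memory, no recomputation
evens_squared.lose_cache()         # the in-memory copy disappears
third = evens_squared.collect()    # rebuilt by replaying the lineage
```

Repeated queries hit the in-memory copy, and a lost copy costs one recomputation rather than requiring a second replica to be kept around.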

Spark's main drawback is its reliance on HDFS (or, even worse, HBase) to build applications and to provide a distributed data store for the RDDs. All that HDFS provides is a filesystem, and everything you drop there is a file that will be read line by line.
Cassandra brings together the distributed system technologies from Dynamo and the data model from Google's BigTable. Like Dynamo, Cassandra is eventually consistent. Like BigTable, Cassandra provides a ColumnFamily-based data model richer than typical key/value systems. In addition, unlike HDFS, C* is based on a P2P model without a single point of failure. For these reasons, C* is one of the most popular NoSQL databases outside the Hadoop ecosystem. One of its handicaps, however, is that the schema has to be modeled around the queries that will be executed, because C* is oriented to searching by key.
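
To make the "model the schema on the queries" point concrete, here is a minimal sketch in plain Python. It does not use Cassandra or any driver: the dictionaries stand in for tables partitioned by a key, and the event data, field names and function names are all made up for the illustration. It shows why a lookup by the chosen key is direct, while a question the schema was not designed for needs either a full scan or a second table keyed differently.

```python
from collections import defaultdict

# Made-up sample events, purely illustrative.
events = [
    {"user": "ana", "day": "2014-01-10", "action": "login"},
    {"user": "luis", "day": "2014-01-10", "action": "search"},
    {"user": "ana", "day": "2014-01-11", "action": "purchase"},
]

# Two hypothetical "tables" holding the same data under different keys,
# one per query we want to answer by key.
events_by_user = defaultdict(list)
events_by_day = defaultdict(list)
for event in events:
    events_by_user[event["user"]].append(event)
    events_by_day[event["day"]].append(event)


def actions_for_user(user):
    # Query the schema was designed for: a direct lookup by key.
    return [e["action"] for e in events_by_user.get(user, [])]


def actions_on_day_scan(day):
    # Query the user-keyed table was NOT designed for: every partition
    # has to be visited and filtered.
    return [
        e["action"]
        for rows in events_by_user.values()
        for e in rows
        if e["day"] == day
    ]


def actions_on_day(day):
    # Same question answered by key, at the cost of a second table.
    return [e["action"] for e in events_by_day.get(day, [])]


ana_actions = actions_for_user("ana")
scan_result = actions_on_day_scan("2014-01-10")
keyed_result = actions_on_day("2014-01-10")
```

Both day queries return the same rows; the difference is that the scan touches all partitions, which is the kind of work a cluster-computing engine is suited to.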

Integrating C* and Spark gives us a system that combines the best of both worlds. The goal of this integration is to obtain a better result than running Spark over HDFS, because Cassandra's philosophy is much closer to the RDD philosophy than HDFS's is. With Cassandra, the aim is a system that mines all the information stored in C* much more efficiently than if that information were stored in HDFS. Cassandra data storage and Spark data mining power: an unrivalled mix.

Speakers

Alvaro Agea Herradón

Luca Rosellini
