Raúl Castro Fernández

Computer Science PhD student, Imperial College

Technical

Stateless dataflows MapReduce Spark

Dataflows are an omnipresent abstraction across big data technologies because they represent programs in a way that is easy to parallelize. Dataflow models such as those of Spark or MapReduce are stateless, which makes fault tolerance easier to achieve, a crucial property when running at large scale. However, these stateless dataflow models have a negative impact on the programming models they expose, which must adapt to the stateless nature of the underlying platforms. With the “democratization of data”, users with different skills want answers from their big datasets, but some lack the skills required to write programs adapted to these specific frameworks. A familiar programming model becomes crucial to open up the value of big data to a broader set of users.

Stateful Dataflow Graphs (SDGs) are a new dataflow representation that introduces state explicitly in the dataflow. For example, a program can succinctly represent a distributed machine learning model, or a matrix used to recommend movies to users. This explicit state has strong implications for the programming models exposed to users: with explicit state, it becomes possible to write programs in languages such as Java, R or Julia as if they were meant to be executed on a single machine.
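To make the idea of explicit state concrete, here is a minimal sketch of a movie recommender written as ordinary single-machine code. It is an illustration only, written in Python for brevity: the class name `Recommender`, its two fields and the scoring rule are assumptions made for this example, not the code or API of any SDG system.

```python
from collections import defaultdict


class Recommender:
    """Single-machine sketch: mutable state lives in ordinary fields.

    All names here are hypothetical and chosen for illustration.
    """

    def __init__(self):
        # State element 1: ratings per user, user -> {item: rating}
        self.user_item = defaultdict(dict)
        # State element 2: co-occurrence counts, item -> {item: count}
        self.co_occ = defaultdict(lambda: defaultdict(int))

    def add_rating(self, user, item, rating):
        seen = self.user_item[user]
        if item not in seen:
            # First rating of this item by this user: update co-occurrences
            for other in seen:
                self.co_occ[item][other] += 1
                self.co_occ[other][item] += 1
        seen[item] = rating

    def get_rec(self, user):
        rated = self.user_item.get(user, {})
        scores = defaultdict(float)
        for item, rating in rated.items():
            for other, count in self.co_occ.get(item, {}).items():
                if other not in rated:
                    scores[other] += count * rating
        # Highest score first; ties broken by item name
        return sorted(scores, key=lambda i: (-scores[i], i))


r = Recommender()
r.add_rating("ann", "A", 5)
r.add_rating("ann", "B", 3)
r.add_rating("bob", "A", 4)
print(r.get_rec("bob"))
```

Nothing in this program mentions partitions, workers or messages. The two fields are the state that a stateful dataflow would have to represent explicitly and distribute.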

The key idea behind SDGs is that they resemble the CFG (Control-Dataflow-Graph) of imperative-style programming languages. We exploit this similarity to build techniques that statically analyze the program code to find state accesses and opportunities to “split” computation, with just a little help from users. Once an SDG has been created, it can harness the distributed power of clusters to execute programs with high throughput and low latency.
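As a rough illustration of what "finding state accesses" can mean, the sketch below uses Python's standard `ast` module to list which fields of `self` each method of a class touches. This is a toy stand-in and not the analysis used to build SDGs: the function name `state_accesses` and the sample `PROGRAM` are assumptions made for this example.

```python
import ast

# Hypothetical input program, kept as a string so it can be parsed
PROGRAM = '''
class Recommender:
    def add_rating(self, user, item, rating):
        self.user_item[user][item] = rating
        self.co_occ[item] = self.co_occ.get(item, 0) + 1

    def get_rec(self, user):
        return sorted(self.user_item.get(user, {}))
'''


def state_accesses(source):
    """Map each method name to the set of `self.<field>` names it touches.

    Simplification: any attribute reached directly through `self` counts
    as state, so calls to helper methods would be reported as well.
    """
    result = {}
    for cls in ast.walk(ast.parse(source)):
        if not isinstance(cls, ast.ClassDef):
            continue
        for fn in cls.body:
            if not isinstance(fn, ast.FunctionDef):
                continue
            fields = set()
            for node in ast.walk(fn):
                if (isinstance(node, ast.Attribute)
                        and isinstance(node.value, ast.Name)
                        and node.value.id == "self"):
                    fields.add(node.attr)
            result[fn.name] = fields
    return result


for method, fields in sorted(state_accesses(PROGRAM).items()):
    print(method, sorted(fields))
```

A table like this hints at where computation could be split: in the sample, `get_rec` only needs `user_item`, while `add_rating` touches both fields. A real analysis would also have to distinguish reads from writes and follow control flow, which this sketch does not attempt.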

The talk finishes with reflections on this new programming model, its virtues and limitations, and a short discussion of the long-running pursuit of “the Big Data Language”.
