Stateless dataflows: MapReduce, Spark
Stateful Dataflow Graphs (SDGs) are a new dataflow representation that makes state explicit in the dataflow. For example, a program can succinctly represent a distributed machine learning model, or a matrix used to recommend movies to users. Explicit state has strong implications for the programming models exposed to users: it becomes possible to write programs in languages such as Java, R or Julia as if they were meant to run on a single machine.
The key idea behind SDGs is that they resemble the CFG (Control-Dataflow-Graph) of a program written in an imperative-style language. We exploit this similarity to build techniques that statically analyze program code to find state accesses and opportunities to “split” computation, with just a little help from users. Once an SDG has been created, it can harness the distributed power of a cluster to execute the program with high throughput and low latency.
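To make the idea of explicit state and "splitting" concrete, here is a minimal Python sketch. It is not the SDG system or its API: the class names `Recommender` and `PartitionedRecommender`, the method names, and the choice of hashing on the user key are all assumptions made for illustration. The first class is written as if for a single machine, with its state held in an ordinary field. The second shows, by hand, the kind of split that an analysis of the state accesses could justify: every access is keyed by user, so the state can be partitioned by user without changing the results.

```python
import zlib
from collections import defaultdict


class Recommender:
    """Single-machine style program with explicit mutable state (illustrative)."""

    def __init__(self):
        # State element: user -> {item: rating}
        self.ratings = defaultdict(dict)

    def add_rating(self, user, item, rating):
        # State access keyed by `user`: this is what makes a split possible.
        self.ratings[user][item] = rating

    def rated_items(self, user):
        # Read-only access to the same keyed state.
        return sorted(self.ratings.get(user, {}))


class PartitionedRecommender:
    """Hypothetical distributed view: the state is split by user key."""

    def __init__(self, num_partitions):
        # Each partition stands in for the state held by one worker.
        self.partitions = [Recommender() for _ in range(num_partitions)]

    def _route(self, user):
        # Deterministic hash so a user always maps to the same partition.
        index = zlib.crc32(user.encode("utf-8")) % len(self.partitions)
        return self.partitions[index]

    def add_rating(self, user, item, rating):
        self._route(user).add_rating(user, item, rating)

    def rated_items(self, user):
        return self._route(user).rated_items(user)


if __name__ == "__main__":
    events = [("alice", "m1", 5), ("bob", "m2", 3), ("alice", "m3", 4)]
    single, split = Recommender(), PartitionedRecommender(num_partitions=4)
    for user, item, rating in events:
        single.add_rating(user, item, rating)
        split.add_rating(user, item, rating)
    # Both versions answer queries identically.
    print(single.rated_items("alice"), split.rated_items("alice"))
```

In this sketch the partitioning is written by hand. The point of the approach described above is that a user writes only something like the first class, and the split is derived from the program with a small amount of user guidance.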
The talk finishes with reflections on this new programming model, its virtues and its limitations, and a short discussion of the now long-running pursuit of “the Big Data Language”.