Spark optional: big data analysis and parallelization with Python and R – a data scientist’s perspective.

Wednesday 14th

15:35 | 16:15

Theatre 25



The tools for big data have matured over the years, and Spark has gained wide adoption. However, there are use cases where your data is too large for a single machine and you may not have experience with Hadoop, MapReduce, Spark or similar tools. This talk explores approaches for dealing with “medium”-sized datasets from the perspective of a data scientist, a data analyst, or whoever is doing the analysis. We discuss Dask as an option for Python and compare it to Spark, then cover several packages and approaches for R, including sparklyr as a means of using Spark. Examples running on Azure will be presented.