Spark optional: big data analysis and parallelization with Python


Marck Vaisman

Microsoft

Spark optional: big data analysis and parallelization with Python and R – a data scientist’s perspective.

Wednesday 14th

15:35 - 16:15

Theatre 25

Technology

Description:

The tools for big data have matured over the years, and Spark has gained wide adoption. However, there are use cases where your data is too big for a single machine and you may not have experience with Hadoop, MapReduce, Spark or similar tools. This talk explores approaches for dealing with “medium”-sized datasets from the perspective of a data scientist, a data analyst, or whoever is doing the analysis. We discuss Dask as an option for Python and compare it to Spark, and we cover several packages and approaches for R, including sparklyr as a means of using Spark. Examples running on Azure will be presented.
