Big Data Spain conference official event website

15:15 ~ 16:30

Sala 8

Carlos Gil Bellosta

Founder, Datanalytics

Workshop

Workshop R and Hadoop | Big Data Spain

HIGHLY RECOMMENDED: BRING YOUR LAPTOP WITH THE VIRTUAL MACHINE INSTALLED AND TESTED AS DESCRIBED BELOW. THE DOWNLOAD IS 3 GB!

CAVEAT: Downloading 3 GB over the venue's Wi-Fi will simply NOT be feasible. Please do not try to download the virtual machine on the premises.

This workshop is a hands-on introduction to the statistical and graphical analysis of big data using two exciting technologies: R and Hadoop.

Data analysis is a major goal in the big data roadmap. However, not all statistical analysis techniques are directly amenable to big data environments, and all of them require tweaks: data abundance comes at the expense of scarcity in terms of statistical weaponry. Even so, important and relevant business problems can still be addressed in big data environments by making clever use of Hadoop's parallel architecture.

The workshop will illustrate a number of data modelling techniques that help us extend our small data capabilities to the world of big data: sampling, resampling, parallelization where possible, etc. We will combine the functional architecture of R and its statistical analysis prowess in small data environments with the MapReduce technique embedded in Hadoop to tackle large data analysis problems. Particular attention will be paid to the ubiquitous --but non-scalable-- logistic regression technique and its big data alternatives.

As an integral part of data analysis, the workshop will also pay attention to the graphical representation of data. In particular, it will show how Hadoop extends the ability of R to produce insightful graphical representations of phenomena to large scale data sets.

Take away points:

* Not all statistical techniques are amenable to large data environments. However, many of them are, or their limitations can be circumvented.
* The functional architecture of R fits nicely into Hadoop's MapReduce paradigm, allowing small data statistical techniques to be extended to large data environments.

SOFTWARE REQUIREMENTS

* VirtualBox: https://www.virtualbox.org/wiki/Downloads
* ssh client (PuTTY on Windows): http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
* VM: http://datanalytics.com/uploads/hortonworks_sandbox_rstudio.zip

HARDWARE REQUIREMENTS

* 4 GB RAM minimum; 8 GB or more recommended
* 64-bit computer and OS
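
The requirements above can be checked from a terminal before downloading anything. The following is only a rough sketch, assuming a Linux or macOS host (it does not apply to Windows); the helper names ram_verdict, arch_verdict and ram_gb are made up for this illustration, and the rounding is deliberately lenient because the OS usually reports slightly less RAM than the nominal figure.

```shell
#!/bin/sh
# Sketch: compare this machine against the hardware requirements above.
# Assumes Linux (/proc/meminfo) or macOS (sysctl hw.memsize); not for Windows.

# Classify an amount of RAM (whole GB) against the 4 GB / 8 GB thresholds.
ram_verdict() {
  if [ "$1" -ge 8 ]; then
    echo "recommended"
  elif [ "$1" -ge 4 ]; then
    echo "minimum"
  else
    echo "insufficient"
  fi
}

# Classify an architecture string as reported by `uname -m`.
arch_verdict() {
  case "$1" in
    x86_64|amd64) echo "64-bit x86" ;;
    *) echo "unrecognised, check manually" ;;
  esac
}

# Installed RAM in GB, rounded to the nearest integer; prints 0 if unknown.
ram_gb() {
  if [ -r /proc/meminfo ]; then
    # MemTotal is reported in kB
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 + 0.5 }' /proc/meminfo
  else
    # hw.memsize is reported in bytes (macOS)
    sysctl -n hw.memsize 2>/dev/null |
      awk '{ printf "%d\n", $1 / 1024 / 1024 / 1024 + 0.5 } END { if (NR == 0) print 0 }'
  fi
}

echo "Architecture: $(arch_verdict "$(uname -m)")"
echo "RAM: $(ram_gb) GB ($(ram_verdict "$(ram_gb)"))"
```

A verdict of "minimum" means the VM should start, but the host may become sluggish while it runs.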

INSTRUCTIONS (PRE-WORKSHOP)

* Download VirtualBox (see link above) for your OS and install it.
* Download the VM (see link above) and unzip it.
* Open VirtualBox, choose Machine > Add and select the unzipped file.
* Start the VM (and check that it does start).
* In case you find problems:
  - See the "known issues" below.
  - Search for the error message online and try the suggested fixes.
  - Drop a line to cgb@datanalytics.com reporting your issue.

VM ACCESS

ssh access:
  ssh -oPort=2222 rhadoop@localhost   # password: rhadoop
web access:
  rstudio: http://localhost:8787 (user/password: rhadoop/rhadoop)
  hadoop job tracker: http://localhost:50030
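
Once the VM has booted, a quick way to confirm that the web interfaces are reachable is to probe the two URLs listed above. This is a minimal sketch, assuming a host with a POSIX shell and curl installed; check_url is a hypothetical helper written for this example, and "up" only means that something answered on the port, not that the service is fully ready.

```shell
#!/bin/sh
# Sketch: check that the VM's web interfaces answer on the forwarded ports.
# URLs are taken from the access notes above; curl is assumed to be installed.

RSTUDIO_URL="http://localhost:8787"
JOBTRACKER_URL="http://localhost:50030"

# Print "up" if the URL answers within 5 seconds, "down" otherwise.
check_url() {
  if curl --silent --output /dev/null --max-time 5 "$1"; then
    echo "up"
  else
    echo "down"
  fi
}

echo "rstudio:            $(check_url "$RSTUDIO_URL")"
echo "hadoop job tracker: $(check_url "$JOBTRACKER_URL")"
```

If either line reports "down" a minute or two after the VM has started, see the known issues and the troubleshooting steps in the pre-workshop instructions.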

KNOWN ISSUES

Your virtual machine may fail to start (VMR* errors) if hardware virtualization is not enabled in the BIOS. How to enable it depends on the machine.
