HIGHLY RECOMMENDED: BRING YOUR LAPTOP WITH THE VIRTUAL MACHINE (A 3 GB DOWNLOAD) INSTALLED AND TESTED AS DESCRIBED BELOW!
CAVEAT: Downloading 3 GB via the Wi-Fi facilities at the venue will simply NOT be feasible. Please do not try to download the Virtual Machine on the premises.
This workshop is a hands-on introduction to the statistical and graphical analysis of big data using two exciting technologies: R and Hadoop.
Data analysis is a major goal in the big data roadmap. However, not all statistical analysis techniques are directly amenable to big data environments, and all of them require tweaks: data abundance comes at the price of a scarcer statistical toolkit. Still, important and relevant business problems can be addressed in big data environments by making clever use of Hadoop's parallel architecture.
The workshop will illustrate a number of data modelling techniques that help us extend our small data capabilities to the world of big data: sampling, resampling, parallelization where possible, etc. We will combine the functional architecture of R and its statistical analysis prowess in small data environments with the MapReduce technique embedded in Hadoop to tackle large data analysis problems. Particular attention will be paid to the ubiquitous (but non-scalable) logistic regression technique and its big data alternatives.
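To make the sampling idea concrete, here is a minimal sketch in Python (not the R/Hadoop code used in the workshop) of reservoir sampling, one standard way to draw a fixed-size uniform sample from a data set too large to hold in memory. The function name `reservoir_sample`, the sample size and the toy `records` stream are assumptions made for illustration only.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Return k items drawn uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1),
            # replacing a uniformly chosen slot.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

# Hypothetical "large" data source, read one record at a time.
records = (x * x for x in range(100_000))
sample = reservoir_sample(records, k=1000, seed=42)
print(len(sample))
```

The sample is small enough to be analysed with ordinary small data tools, at the cost of sampling error that has to be weighed against the size of the effect being studied.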
As an integral part of data analysis, the workshop will also pay attention to the graphical representation of data. In particular, it will show how Hadoop extends the ability of R to produce insightful graphical representations of phenomena to large-scale data sets.
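One common way to plot data that does not fit in memory is to aggregate it into bins first and plot only the bin counts. The following Python sketch (again an illustration, not the workshop's R code) computes histogram counts chunk by chunk and merges them; the bin width, the toy `chunks` and the function names are all hypothetical.

```python
from collections import Counter
from functools import reduce

BIN_WIDTH = 10.0  # hypothetical bin width

def map_chunk(chunk):
    """Map step: count the values of one chunk per bin index."""
    return Counter(int(x // BIN_WIDTH) for x in chunk)

def merge(a, b):
    """Reduce step: add two partial bin counts."""
    return a + b

# Each inner list stands for a block of data held on a different node.
chunks = [[1.0, 12.5, 14.0], [3.2, 27.9], [11.1, 29.0, 5.5]]
histogram = reduce(merge, map(map_chunk, chunks), Counter())

# Only the (small) table of bin counts needs to reach the plotting code.
for b in sorted(histogram):
    print(f"[{b * BIN_WIDTH:g}, {(b + 1) * BIN_WIDTH:g}): {'#' * histogram[b]}")
```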
Take-away points:
- Not all statistical techniques are amenable to large data environments. However, many of them are, and the limitations of others can often be circumvented.
- The functional architecture of R fits nicely into Hadoop’s MapReduce paradigm, allowing small data statistical techniques to be extended to large data environments.
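The fit between a functional style and MapReduce can be sketched in a few lines. The Python example below (an analogy to R's functional tools, not the workshop's own code) computes a mean and a variance by mapping each chunk to a small summary and reducing the summaries; the names `summarize` and `combine` and the toy data are assumptions.

```python
from functools import reduce

def summarize(chunk):
    """Map step: reduce one chunk to (count, sum, sum of squares)."""
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))

def combine(a, b):
    """Reduce step: add two partial summaries element-wise."""
    return tuple(x + y for x, y in zip(a, b))

# Each inner list stands for a block of data processed independently.
chunks = [[2.0, 4.0], [4.0, 4.0, 5.0], [5.0, 7.0, 9.0]]
n, total, total_sq = reduce(combine, map(summarize, chunks))

mean = total / n
variance = total_sq / n - mean ** 2  # population variance
print(mean, variance)
```

Because `combine` is associative, the partial summaries can be merged in any order and on any node. Note that the sum-of-squares formula can lose precision on real data, so this is a sketch of the pattern rather than a numerically robust recipe.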
ssh: PuTTY on Windows: http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
* 4 GB RAM minimum; 8 GB or more recommended
* 64-bit computer / OS
* Download VirtualBox (see link above) for your OS and install it.
* Download the VM (see link above) and unzip it.
* Open VirtualBox, then choose Machine > Add and select the unzipped file.
* Start the VM (and check that it does start).
* In case you find problems:
  - See the "known issues" below.
  - Google the error and solve it.
  - Drop a line to firstname.lastname@example.org reporting your issue.
ssh access: ssh -oPort=2222 rhadoop@localhost # password: rhadoop
hadoop job tracker:
Your virtual machine may fail to start (VMR* errors) if virtualization is not enabled in the BIOS (details are machine dependent).