
Keynote | Technical

Relational is the new Big Data

Thursday 16th | 13:40 - 14:10 | Theatre 20

One-liner summary:

Relational databases were the persistence system of choice for decades, until Web 2.0 in the 2000s produced volumes of data so large that processing them required distributed systems running in parallel. A new type of database (NoSQL) was adopted to solve this problem in different ways. We are now seeing the pendulum swing back, with some relational databases evolving into systems that can easily be made distributed while keeping their versatility, structural simplicity and easy infrastructure maintenance. We will showcase how Citus can be combined with PostgreSQL to work with distributed data.

Keywords defining the session:

- Relational

- Distributed

- Postgres Citus

In this talk we will start by explaining how relational databases (RDBMSs) were created in the 70s, replacing earlier hierarchical systems such as IMS, and how they developed over the next 40 years as the reference persistence systems of that period. In the 2000s, the arrival of Web 2.0 brought an explosion in the volume of data to persist, analyse and process, which came to be called Big Data. The number of users was now in the hundreds of millions or billions, the amount of data stored grew to terabytes and petabytes, and the persisted data had to be queried and analysed within seconds to enable real-time experiences for the user.
Scaling relational databases horizontally while keeping transactions ACID proved to be very hard. We will talk about some of the problems faced when sharding relational data across different machines, such as deadlocks, two-phase commits and parallelization. To solve these problems, a few new paradigms emerged, resulting in new ways to persist and process big volumes of data: what came to be called NoSQL databases. We will describe some of these paradigms, like Google's Bigtable or LinkedIn's Voldemort key-value store. We will also cover the main types of NoSQL databases, classified according to the CAP theorem, and the main differences between them.
At this point companies rushed to explore these new systems, using them to build highly performant and scalable products that met their Big Data needs. Some of these companies had millions or billions of users and dedicated whole teams to architecting and maintaining NoSQL databases and other distributed technologies like Hadoop and Spark. But these systems can be very complex to set up and maintain. The human and technical resources required to keep them running properly are not trivial, which can be a problem for the many companies that cannot dedicate this level of resources just to keeping their data pipeline up and running.
Another drawback of NoSQL databases is that they are very specific in the way they function and the problem they solve. So even if cloud providers now offer implementations of NoSQL databases that they run, monitor and scale, the developers writing applications on those databases have to understand exactly how they will be configured and set up, and a change in the model or in the way the application works (even a change in the structure of the queries) might require starting from scratch.
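As a rough illustration of why sharding relational data across machines is hard, the sketch below routes rows to shards by hashing their key and then uses a simplified two-phase commit to move a balance between two rows that may live on different shards. It is an in-memory toy under stated assumptions, not how Citus or PostgreSQL implement distributed transactions: the names `Shard`, `shard_for` and `transfer` are hypothetical, and real systems also have to handle node crashes, timeouts and durable logging between the two phases.

```python
import hashlib


class Shard:
    """Hypothetical in-memory node holding one slice of the rows."""

    def __init__(self, name):
        self.name = name
        self.rows = {}       # committed data: key -> balance
        self.pending = None  # writes staged by prepare(), awaiting commit

    def prepare(self, writes):
        # Phase 1: vote "no" if another transaction is staged here or
        # if any write would drive a balance below zero.
        if self.pending is not None:
            return False
        for key, delta in writes.items():
            if self.rows.get(key, 0) + delta < 0:
                return False
        self.pending = writes
        return True

    def commit(self):
        # Phase 2: apply the staged writes.
        for key, delta in self.pending.items():
            self.rows[key] = self.rows.get(key, 0) + delta
        self.pending = None

    def abort(self):
        # Discard staged writes without touching committed data.
        self.pending = None


def shard_for(key, shards):
    # A stable hash, so the same key is always routed to the same shard.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]


def transfer(shards, src, dst, amount):
    # Group the two writes by the shard that owns each key.
    writes = {}
    for key, delta in ((src, -amount), (dst, amount)):
        shard = shard_for(key, shards)
        per_shard = writes.setdefault(shard.name, (shard, {}))[1]
        per_shard[key] = per_shard.get(key, 0) + delta

    # Phase 1: every participating shard must vote "yes".
    prepared = []
    for shard, shard_writes in writes.values():
        if shard.prepare(shard_writes):
            prepared.append(shard)
        else:
            for participant in prepared:
                participant.abort()
            return False

    # Phase 2: all voted "yes", so commit everywhere.
    for shard in prepared:
        shard.commit()
    return True


shards = [Shard("a"), Shard("b"), Shard("c")]
shard_for("alice", shards).rows["alice"] = 100
ok = transfer(shards, "alice", "bob", 30)
```

Even in this toy version, a single logical transaction needs a coordinator, a vote from every shard involved and a rollback path, which hints at the coordination cost that a single-node relational database never has to pay.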