bigdataspain.org

THANK YOU FOR AN AMAZING CONFERENCE!

The 3rd edition of Big Data Spain, held in November 2014, was a resounding success.
Our attendees, speakers, partners and friends turned Big Data Spain into one of the largest events in Europe about Hadoop, Spark, NoSQL and cloud technologies.

ToroDB: a new NoSQL database that replaces MongoDB

NoSQL | English
In recent years, NoSQL databases have gained a lot of traction. Most of them have been designed and written from scratch. Building on the principles of schema-less design and high scalability, they offer an approach distinct from that of relational databases. But rather than reusing what the industry has learned in the last three decades of database development, most of these databases reinvent the wheel and design their data storage layers, one of the toughest parts of building a database, from scratch. Our work presents a database system that instead uses relational databases as the storage layer (well known, durable, scalable and fast, despite what many would say) and as the foundation on which to build a schema-less, document-oriented, scalable database. This project is named ToroDB, and it will be published as open-source software by BDS'14. It will effectively be the very first general-purpose database ever built in Spain.

Document databases store documents, which are essentially hierarchical, nested structures of sets of key-value pairs. Current state-of-the-art approaches to storing them in relational databases are limited to keeping each document in some form of binary serialization (such as a blob, or PostgreSQL's hstore or jsonb). Our research produced a set of algorithms that transform a document into a set of document parts, each of which can be stored in relational tables, leveraging the power of relational databases. This includes creating tables dynamically, when needed, so that each table's structure matches that of the information to be stored.
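The idea of splitting a document into parts and creating tables on demand can be illustrated with a small sketch. This is not ToroDB's actual algorithm, which the abstract does not spell out; it is a minimal in-memory illustration under stated assumptions. The names `split_document` and `PartStore` are hypothetical, a "table" is just a Python list keyed by the part's structure, and arrays are treated as plain values for simplicity.

```python
def split_document(doc, path=""):
    """Yield (path, flat_part) pairs, one per nested sub-document.

    Assumption: only dicts count as nested structure; lists are kept as values.
    """
    scalars = {k: v for k, v in doc.items() if not isinstance(v, dict)}
    yield path, scalars
    for key, value in doc.items():
        if isinstance(value, dict):
            child_path = f"{path}.{key}" if path else key
            yield from split_document(value, child_path)


class PartStore:
    """Hypothetical in-memory stand-in for a relational backend."""

    def __init__(self):
        # (path, ((column, type_name), ...)) -> list of row tuples
        self.tables = {}

    def insert(self, doc_id, doc):
        for path, part in split_document(doc):
            # The structure of a part is its sorted (key, type) pairs.
            columns = tuple(sorted((k, type(v).__name__) for k, v in part.items()))
            # "Create the table" the first time this structure is seen.
            rows = self.tables.setdefault((path, columns), [])
            rows.append((doc_id, *(part[name] for name, _ in columns)))


store = PartStore()
store.insert(1, {"name": "Ann", "address": {"city": "Madrid", "zip": "28001"}})
store.insert(2, {"name": "Bob", "address": {"city": "Sevilla", "zip": "41001"}})
store.insert(3, {"name": "Eve", "age": 30})
```

In this sketch the first two documents share their tables, while the third, which has a different root structure, triggers the creation of a new one. A query that only touches `address` fields would need to read only the `address` table.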

The advantages of this approach are profound. No engineering effort is required to build the storage subsystem, which must handle durability, isolation and concurrency, all of which are tough properties to implement. Even more importantly, there are very significant performance advantages, in both query time and storage footprint.

Query time improves because queries that target subsets of the documents (which is most queries) need to address only a subset of the data, since it is partitioned into tables, rather than reading the whole database. Storage savings come from not repeating the schema in every document: many documents share the same schema ("structure"), yet each of them has to repeat it. Our benchmarks show that JSON documents stored in ToroDB require 29% to 68% of the storage needed for the same data in a MongoDB database. This means significantly less I/O, significantly lower cost, and greater (vertical) scalability.
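The effect of not repeating the schema can be seen with a rough, back-of-the-envelope sketch. It compares the size of JSON text in which every document repeats its keys against a layout that stores the key names once and then only the values. The function names are hypothetical, all documents are assumed to share the same flat keys, and the numbers it produces are JSON text sizes, not the on-disk sizes of ToroDB or MongoDB, so they should not be read as a reproduction of the benchmark figures above.

```python
import json


def _size(value):
    """Size in bytes of the compact JSON encoding of a value."""
    return len(json.dumps(value, separators=(",", ":")).encode("utf-8"))


def per_document_size(docs):
    """Every document carries its own keys (schema repeated each time)."""
    return sum(_size(doc) for doc in docs)


def shared_schema_size(docs):
    """Keys stored once as a header, then one row of values per document.

    Assumption: all documents have exactly the same keys.
    """
    keys = sorted(docs[0])
    return _size(keys) + sum(_size([doc[k] for k in keys]) for doc in docs)


docs = [{"user_id": i, "country": "ES", "active": True} for i in range(1000)]
ratio = shared_schema_size(docs) / per_document_size(docs)
print(f"shared-schema layout uses {ratio:.0%} of the per-document size")
```

The longer the key names are relative to the values, the larger the saving in this toy model.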

This presentation aims to show how the internal algorithms of this open-source software, ToroDB, work: how JSON documents are split into tables, and how this is more efficient in terms of both query time and storage savings. It also explains why current document-oriented databases fail to maximize performance for Big Data requirements; ToroDB includes a mechanism for storing parts of the documents in columnar format to improve aggregate-type queries, obtaining impressive performance benefits. Finally, it shows how all of this can be done in a way that is compatible with existing systems: ToroDB includes a layer that natively speaks the MongoDB protocol, making it a drop-in replacement for MongoDB installations that runs on top of existing relational databases.
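Why a columnar layout helps aggregate-type queries can be shown with a small sketch. It is a generic illustration of row-to-column pivoting, not a description of ToroDB's mechanism, and the helper name `to_columns` and the sample data are made up for the example.

```python
def to_columns(rows):
    """Pivot a list of same-structure dicts into one list per key."""
    columns = {key: [] for key in rows[0]}
    for row in rows:
        for key in columns:
            columns[key].append(row[key])
    return columns


orders = [
    {"city": "Madrid", "amount": 10},
    {"city": "Sevilla", "amount": 25},
    {"city": "Madrid", "amount": 7},
]
columns = to_columns(orders)

# An aggregate over one field reads a single contiguous list
# instead of visiting every field of every document.
total_amount = sum(columns["amount"])
```

In a real storage engine the benefit comes from reading only the column involved in the aggregate from disk; the sketch only mirrors that access pattern in memory.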

