from 12:40 to 13:20
Processing large datasets comes with peculiarities driven by cost and time constraints. In this talk, I will share lessons learned from operating petabyte-scale Hadoop data processing systems. We will discuss data model design, development process, and operational support decisions, taken in their historical context and evaluated in retrospect.
(1) How to build a reliable Continuous Integration/Continuous Delivery pipeline?
(2) How to design a user behavior data store in a way that supports the agile workflow of a development team and facilitates collaboration?
(3) How to build data infrastructure and ensure on-time delivery of computation results in a multi-tenant environment?
A large data corpus significantly influences the answers to these questions.
The most widely known approach, with separate environments for development, quality assurance, and production, works far from perfectly at this scale. Managing multiple Data Lakes and keeping them synchronized significantly slows down development and infrastructure evolution.
The natural approach of storing user history in a NoSQL key-value store does not perform well either. Maintaining mutable data structures in HBase or Cassandra consumes a large share of cluster resources and slows down not only writes but also reads. These databases tend to produce random disk access patterns that make multi-tenancy hard to achieve.
Reliable and flexible data processing requires a different approach. Large datasets represent system state, which is hard and expensive to manage in distributed systems. An architecture following functional principles is the easiest to maintain.
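As a minimal illustration of the functional principle (the names and data shapes below are my own, not taken from the talk): instead of mutating per-user records in place, each batch of events is appended as an immutable partition, and user state is derived by a pure fold over the log, so it can be deterministically recomputed at any time.

```python
from dataclasses import dataclass
from functools import reduce

# Hypothetical event type -- illustrative only, not from the talk.
@dataclass(frozen=True)
class Event:
    user_id: str
    action: str

# Immutable "partitions": each batch is appended, never modified in place.
log = [
    [Event("u1", "view"), Event("u2", "click")],  # batch 1
    [Event("u1", "click")],                       # batch 2
]

def apply_event(state: dict, e: Event) -> dict:
    """Pure transition: returns a new state, leaves the input untouched."""
    counts = dict(state.get(e.user_id, {}))
    counts[e.action] = counts.get(e.action, 0) + 1
    return {**state, e.user_id: counts}

# Derived user state is a deterministic fold over the immutable log;
# recomputing it from scratch always yields the same result.
state = reduce(apply_event, (e for batch in log for e in batch), {})
```

The same shape maps onto Hadoop-era tooling: append-only partitions on HDFS plus deterministic batch transformations, rather than in-place updates to a key-value store.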
Solution Architect, EPAM Systems Ltd