November 7th and 8th 2013

Kinépolis Madrid, Spain

Speakers

We are lining up some of the most relevant industry leaders in Big Data for keynote sessions. The deadline for submitting proposals has passed, so we will soon reveal the definitive list of speakers.

A Hadoop-based ETL platform for feed consolidation

GFT has built an ETL accelerator platform on Hadoop for a large international investment bank. This ETL layer is used to consolidate, enrich, and validate financial data gathered from different source systems in the bank, and to make it available to the bank's Accounting Layer in a homogeneous format.

The goal of the presentation is to explain the architecture, key design decisions, main issues found, and how we solved them.

The architecture was based on the following principles:

- New mappings/transformations must be easy to develop.
- Time-to-market for new feeds should be kept to a minimum. These feeds can potentially have different delivery mechanisms and data formats.
- Scalability: the application should scale horizontally in order to cope with future volumes.
- No vendor lock-in: avoid, as far as possible, tools or frameworks that cannot easily be replaced by alternatives.

With these principles in mind we developed a Hadoop-based architecture with XML as the underlying format. Data transformations are performed with XSLT. By using these standards (XML and XSLT) we made sure there is no proprietary third-party data format but, at the same time, we are able to use tools such as Altova MapForce to ease the development of the mappings/transformations.
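As a minimal illustration of this approach (not the actual GFT code), the sketch below shows how a feed-specific XSLT stylesheet could be applied to each XML record inside a Hadoop mapper using the standard javax.xml.transform API. The class name, the "feed.xslt.path" configuration key, and the counter name are hypothetical, and it assumes an input format that delivers one XML document per record.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Hypothetical mapper that applies a feed-specific XSLT stylesheet to each
 * incoming XML record and emits the transformed, canonical XML.
 */
public class XsltTransformMapper extends Mapper<LongWritable, Text, Text, Text> {

    private Transformer transformer;

    @Override
    protected void setup(Context context) throws IOException {
        try {
            // Illustrative config key; stylesheets would typically be
            // shipped to the task nodes via the distributed cache.
            String xsltPath = context.getConfiguration().get("feed.xslt.path");
            TransformerFactory factory = TransformerFactory.newInstance();
            transformer = factory.newTransformer(new StreamSource(xsltPath));
        } catch (TransformerConfigurationException e) {
            throw new IOException("Failed to compile XSLT stylesheet", e);
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            StringWriter out = new StringWriter();
            transformer.transform(
                    new StreamSource(new StringReader(value.toString())),
                    new StreamResult(out));
            context.write(new Text(key.toString()), new Text(out.toString()));
        } catch (TransformerException e) {
            // Count bad records instead of failing the whole job;
            // error handling is discussed in more detail below.
            context.getCounter("ETL", "TRANSFORM_ERRORS").increment(1);
        }
    }
}
```

Because the transformation logic lives entirely in the stylesheet, adding a new feed means authoring a new XSLT mapping (for example in Altova MapForce) rather than writing and redeploying Java code.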

The orchestration is based on a combination of Oozie and Tibco BusinessWorks. The former is used for internal orchestration (within Hadoop) and the latter for external orchestration (communication with external file servers, JMS, etc.).
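As an illustration of how the two layers might meet, the sketch below submits the internal Oozie workflow through the Oozie Java client API, as an external orchestrator could once a feed file has landed. The server URL, HDFS application path, and workflow properties are placeholders, not the bank's actual configuration.

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.OozieClientException;

public class FeedWorkflowLauncher {
    public static void main(String[] args) throws OozieClientException {
        // Placeholder Oozie server URL.
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        // Workflow definition deployed in HDFS, plus per-run parameters.
        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode/apps/etl/feed-workflow");
        conf.setProperty("feedName", "trades");
        conf.setProperty("businessDate", "2013-11-07");

        String jobId = oozie.run(conf);
        System.out.println("Submitted workflow: " + jobId);
    }
}
```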

As part of the presentation we will also cover the main issues we found during development and productionalization of the application, namely:

- Multitenancy in the Hadoop cluster, given that it is shared across projects, with special focus on both storage and processing capacity planning.
- Error handling and failure recovery (see the sketch after this list).
- Logging and monitoring.
- Data security (encryption, access authorization...).
- Deployment: application and runtime data folder structure, and deployment automation.
- How to provide an easy access to intermediate data to production support (for investigation purposes).
- Disaster recovery.
- Handling incoming real-time data.
- Handling reference data.
- The whole testing cycle: from unit testing to user acceptance testing.
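One pattern relevant to both the error-handling and the intermediate-data bullets above is routing records that fail validation to a named side output, so the job keeps running and production support can inspect the rejects directly in HDFS. The sketch below uses Hadoop's MultipleOutputs for this; the class, counter, and output names are illustrative assumptions, not the actual GFT implementation.

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

/**
 * Illustrative mapper: invalid records go to a named "errors" output
 * instead of failing the job. The driver must register the output with
 * MultipleOutputs.addNamedOutput(job, "errors", TextOutputFormat.class,
 * Text.class, Text.class).
 */
public class ValidatingMapper extends Mapper<Text, Text, Text, Text> {

    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        if (isValid(value)) {
            context.write(key, value);
        } else {
            // Rejects land in errors-m-* files inside the job output
            // directory, where support staff can inspect them.
            mos.write("errors", key, value);
            context.getCounter("ETL", "VALIDATION_ERRORS").increment(1);
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        mos.close();
    }

    private boolean isValid(Text record) {
        // Placeholder for real validation, e.g. XSD schema checks.
        return record.getLength() > 0;
    }
}
```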

The last topic of the presentation is a description of the alternatives currently available on the market for ETL projects, such as Informatica PowerCenter Big Data Edition, Pentaho, or Talend, and how they compare to the proposed architecture.

In summary, we will cover technologies such as Hadoop (HDFS and MapReduce), Oozie, Flume, Hive, Tibco BusinessWorks, and XML/XSLT. But we will also cover the whole development lifecycle: from architecture definition to go-live.

Takeaway Points:
- This proposal describes a full architecture for one of the most common use cases in banks: data consolidation.
- It also provides solutions for many key issues found during the development and productionalization of a Hadoop-based architecture.