The first international conference in Spain about Big Data, with leading experts in data mining, data cleansing, distributed storage, cloud computing, sharing, data analysis and visualisation. Big Data is a technological challenge and a business opportunity. The conference Big Data Spain 2013 will introduce Big Data to developers and business managers in Madrid.
November 7th, 9:00AM
November 8th, 5:00PM
A Hadoop-based ETL platform for feed consolidation
GFT has built an ETL accelerator platform on Hadoop for a large international investment bank. This ETL layer consolidates, enriches and validates financial data gathered from the bank's different source systems, and makes it available to the bank's Accounting Layer in a homogeneous format.
The goal of the presentation is to explain the architecture, key design decisions, main issues found, and how we solved them.
The architecture was based on the following principles:
- New mappings/transformations must be easy to develop.
- New feeds time-to-market should be reduced to the minimum. These feeds can potentially have different delivery mechanisms and data formats.
- Scalability: the application should scale horizontally in order to cope with future volumes.
- No vendor lock-in: avoid as much as possible using tools or frameworks which cannot be easily replaced by alternative ones.
With these principles in mind we developed a Hadoop-based architecture with XML as the underlying format. Data transformations are performed with XSLT. By using these standards (XML and XSLT) we avoid any proprietary data format while, at the same time, remaining able to use tools such as Altova MapForce to ease the development of the mappings/transformations.
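To illustrate the approach, a mapping from a hypothetical source feed into a target accounting format could be sketched as the XSLT below. All element and attribute names here are invented for illustration; the actual bank mappings are confidential and far richer.

```xml
<!-- Sketch only: feed/trade and accountingEntries/entry are illustrative names -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/feed">
    <accountingEntries>
      <xsl:for-each select="trade">
        <entry>
          <!-- Rename source fields into the homogeneous target format -->
          <id><xsl:value-of select="@tradeId"/></id>
          <amount currency="{currency}">
            <xsl:value-of select="notional"/>
          </amount>
        </entry>
      </xsl:for-each>
    </accountingEntries>
  </xsl:template>
</xsl:stylesheet>
```

Because the mapping is plain XSLT, it can be authored graphically in a tool such as Altova MapForce and executed unchanged inside the Hadoop jobs.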
The orchestration is based on a combination of both Oozie and Tibco BusinessWorks. The former is used for internal orchestration (within Hadoop) and the latter for external orchestration (communication with external file servers, JMS, etc.).
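The internal orchestration can be pictured as a minimal Oozie workflow like the one below. Action names, properties and the job details are illustrative assumptions, not the production definition.

```xml
<!-- Minimal sketch of an Oozie workflow; names and properties are illustrative -->
<workflow-app name="feed-etl" xmlns="uri:oozie:workflow:0.4">
  <start to="transform"/>
  <action name="transform">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <!-- MapReduce job that applies the XSLT mappings to the incoming feed -->
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Feed transformation failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

External events (a file arriving on a file server, a JMS message) would be picked up by Tibco BusinessWorks, which then triggers a workflow of this kind inside the cluster.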
As part of the presentation we will also cover the main issues we found while developing the application and moving it to production, namely:
- Multi-tenancy in the Hadoop cluster, which is shared across projects, with special focus on storage and processing capacity planning.
- Error handling and failure recovery.
- Logging and monitoring.
- Data security (encryption, access authorization...).
- Deployment: application and runtime data folder structure, and deployment automation.
- How to give production support easy access to intermediate data for investigation purposes.
- Disaster recovery.
- Handling incoming real-time data.
- Handling reference data.
- The whole testing cycle: from unit testing to user acceptance testing.
The last topic of the presentation is a review of the alternatives currently available in the market for ETL projects, such as Informatica PowerCenter Big Data Edition, Pentaho and Talend, and how they compare to the proposed architecture.
In summary, we will cover technologies such as Hadoop (HDFS and MapReduce), Oozie, Flume, Hive, Tibco BusinessWorks, and XML/XSLT. But we will also cover the whole development lifecycle: from architecture definition to go-live.
Takeaway Points: - This proposal describes a full architecture for one of the most common use cases in banks: data consolidation.
It also provides solutions for many key issues found while developing a Hadoop-based architecture and moving it to production.