
Multiplatform Spark solution for Graph datasources

Thursday 17th

from 14:50 to 15:30

Theatre 19

Keynote

One of the top banks in Europe needed a system that would provide better performance, scale almost linearly as the volume of information to be analyzed grows, and allow the processes currently running on the Host to be moved to a Big Data infrastructure. Over the course of a year we have worked on a system that gives users greater agility, flexibility and simplicity when viewing information during profiling, and that can now analyze the structure of profile data. It is a powerful way to run online queries against a graph database, integrated with Apache Spark and several graph libraries. Basically, we get all the necessary information through Cypher queries that are sent to a Neo4j database.

Using the latest Big Data technologies, such as Spark DataFrames, HDFS, Stratio Intelligence and Stratio Crossdata, we have developed a solution that can obtain critical information from multiple datasources, such as text files or graph databases. This is possible thanks to a set of distributed processes over the datasource whose main objective is the dynamic generation of Spark DataFrames with a general schema that fits every kind of data structure stored in the graph database. The process is a simple, straightforward solution to two problems: translating a graph database with many differently structured entities into a form a graph library can use, and querying a massive database without timeouts.
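The two ideas in this paragraph, bounded queries and a single general schema, can be illustrated with a minimal plain-Python sketch. It is not the implementation presented in the talk: the function names, the three-column schema (`id`, `label`, `properties`) and the use of `SKIP`/`LIMIT` paging are all assumptions made for illustration, and no Neo4j connection is opened.

```python
from typing import Any, Dict, Iterator


def paged_node_queries(total: int, page_size: int) -> Iterator[str]:
    """Yield Cypher queries that each read one bounded slice of nodes.

    Paging is an assumed strategy here: keeping every query small is one
    common way to avoid timeouts on a very large database.
    """
    for skip in range(0, total, page_size):
        yield (
            "MATCH (n) "
            "RETURN id(n) AS id, labels(n) AS labels, properties(n) AS props "
            f"ORDER BY id(n) SKIP {skip} LIMIT {page_size}"
        )


def to_general_row(record: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten one query record into a hypothetical general schema.

    Every entity, whatever its label or properties, ends up with the same
    three columns, so rows of different entity types can share one table.
    """
    labels = record.get("labels") or ["Unknown"]
    props = record.get("props") or {}
    return {
        "id": str(record["id"]),
        "label": labels[0],
        # Values are stringified so differently typed properties fit one column type.
        "properties": {key: str(value) for key, value in sorted(props.items())},
    }


if __name__ == "__main__":
    for query in paged_node_queries(total=5, page_size=2):
        print(query)
    print(to_general_row({"id": 7, "labels": ["Person"], "props": {"name": "Ana"}}))
```

Storing properties as a string-to-string map is only one possible design; it trades type information for a schema that never changes when a new entity type appears.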

The result of this solution is a pair of Spark DataFrames that represent the vertices and edges of our datasource. The schema used in both DataFrames is the key that lets them be integrated with graph libraries like GraphX or GraphFrames, some of the most widely used distributed graph technologies in Big Data. Users with technical knowledge of Big Data can take advantage of this to filter or query the data inside the graph and use the results to generate dashboards or Data Science reports from massive data.
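As a rough illustration of what such a pair of tables might look like, the sketch below builds vertex and edge rows in plain Python. The column names `id`, `src` and `dst` follow the convention GraphFrames expects for its vertex and edge DataFrames; the `label` and `relationship` columns, the function name and the choice to drop dangling edges are assumptions for this example, not details taken from the talk.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]


def build_graph_rows(
    nodes: List[Tuple[str, str]],
    relationships: List[Tuple[str, str, str]],
) -> Tuple[List[Row], List[Row]]:
    """Turn (id, label) nodes and (src, dst, type) relationships into rows.

    Vertices get an "id" column and edges get "src" and "dst" columns.
    Edges whose endpoints are missing from the vertex list are dropped,
    which is an assumed policy for this sketch.
    """
    vertices = [{"id": node_id, "label": label} for node_id, label in nodes]
    known_ids = {vertex["id"] for vertex in vertices}
    edges = [
        {"src": src, "dst": dst, "relationship": rel_type}
        for src, dst, rel_type in relationships
        if src in known_ids and dst in known_ids
    ]
    return vertices, edges


if __name__ == "__main__":
    vertices, edges = build_graph_rows(
        nodes=[("1", "Person"), ("2", "Account")],
        relationships=[("1", "2", "OWNS"), ("1", "99", "OWNS")],
    )
    print(vertices)
    print(edges)
```

In an environment with PySpark and the graphframes package installed, rows like these could be turned into DataFrames with `spark.createDataFrame` and passed to `GraphFrame(vertices, edges)`; that step is left out here so the sketch runs without a cluster.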

At this event we will show how to extract data from a graph database like Neo4j and obtain well-suited graph DataFrames through a Data Science tool like Stratio Intelligence. We will also present an example of how we have applied machine learning over a Neo4j database and, finally, a code snippet showing how to query and filter our datasource, making use of powerful graph algorithms like Connected Components or PageRank to obtain valuable information through distributed tasks.
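To give a sense of what the two algorithms named above compute, here is a small single-machine sketch in plain Python. It is not the snippet shown in the talk and it is not distributed: connected components are found with a union-find structure, and PageRank with a fixed number of power iterations. The function names, the damping factor of 0.85 and the iteration count are illustrative assumptions.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def connected_components(vertices: List[str], edges: List[Edge]) -> Dict[str, str]:
    """Map each vertex to the smallest vertex id in its component (union-find)."""
    parent = {v: v for v in vertices}

    def find(v: str) -> str:
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for src, dst in edges:
        a, b = find(src), find(dst)
        if a != b:
            # The smaller id stays root, so it ends up as the component label.
            parent[max(a, b)] = min(a, b)
    return {v: find(v) for v in vertices}


def pagerank(
    vertices: List[str],
    edges: List[Edge],
    damping: float = 0.85,
    iterations: int = 20,
) -> Dict[str, float]:
    """Power-iteration PageRank; rank of vertices with no out-links is spread evenly."""
    n = len(vertices)
    out_links: Dict[str, List[str]] = {v: [] for v in vertices}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        dangling = sum(rank[v] for v in vertices if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in vertices}
        for v in vertices:
            for target in out_links[v]:
                new_rank[target] += damping * rank[v] / len(out_links[v])
        rank = new_rank
    return rank


if __name__ == "__main__":
    vs = ["a", "b", "c", "d"]
    es = [("a", "b"), ("b", "c"), ("c", "a")]
    print(connected_components(vs, es))
    print(pagerank(vs, es))
```

GraphFrames offers distributed versions of both algorithms that operate directly on vertex and edge DataFrames; the sketch only shows the underlying idea on a toy graph.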

Multiplatform Spark solution for Graph datasources by Javier Domínguez, from Big Data Spain

Javier Domínguez

Stratio, Data Science team