TALKREDUCE() - Technical | Big Data Spain

Importance of collaboration among Programmers and Data Scientists

Business

The rapid rise of distributed architectures such as Spark is making it possible to build new, advanced decision-making applications based on real-time data processing (Streaming Analytics). Developing these applications brings in a "new" role (or an old one, depending on the point of view): the data scientist.

Their main role, though not the only one, is to produce the model that extracts knowledge from the existing data. On the other side is the solution developer or development team, who integrates the resulting model or models into the existing software or into the new software being built. Collaboration between the two is vital to deliver the new decision-making tool on time and on budget. This is where the first problem appears: which language to use. Many data scientists use languages such as Matlab, Mathematica, R or, in the best case, Python, whereas enterprise development today is done in Java, .Net, JavaScript, C++ or, to a much lesser extent, Python. The programming language is therefore the first communication barrier, and it grows when legacy solutions are involved, many of them written in Java, which forces the languages to be integrated through tools such as Web Services or other more "exotic" methods (who has not built an integration by calling an executable and generating a file?). Another current barrier is the subject matter itself: one side talks about Neural Network models, Support Vector Machines and Language Analysis models, while the other talks about Data Structures, Design Patterns and Object Orientation.

Solution
It is important to see how these barriers can be torn down. One solution proposed by some authors is to create the Super Data Scientist, that is, a person who knows Mathematics, Artificial Intelligence, Distributed Systems and Software Development. We propose an alternative: finding a way to collaborate that keeps each person within their own area of expertise. The first point to solve is the language. Thanks to Spark, data scientists can develop their solutions in Java/Scala, languages in which companies have extensive experience and confidence. In addition, to flatten the learning curve, the system can be given a graphical layer for building models out of transformations (as in SAS or Rapidminer), but under an Open Source model. It is also possible to offer a layer that generates services from those transformations and so obtain a service-oriented architecture, one that is not an SOA (Service-Oriented Architecture) but an AOA (Algorithm Oriented Architecture). Finally, an Open Source tool that can generate both would let the Data Scientist and the Application Developer collaborate.

Rafael del Hoyo

Itainnova, Lecturer in Artificial Intelligence

Jorge Vea Murguía

Technological Institute of Aragon, Big Data Specialist

Predicting failures on complex machines

Business

Complex machines, e.g. trains or wind turbines, require very solid maintenance procedures. Anticipating the wear of a part or the failure of a system allows sensible maintenance scheduling and the prevention of catastrophic failures. The race towards efficiency has spread sensors that collect huge amounts of data about the current state of the different components of these machines. Collecting and storing this data can be considered a solvable problem. However, all that data is of no use by itself. Optimal maintenance derives from decisions, which derive from information, which in turn has to be extracted from that big lake of data.

Therefore, NEM Solutions offers its clients not only knowledge and consulting services on the machines they build or manage, but also software tools capable of extracting information from data and assisting in decision making. Working with so many complex machines means Big Data. Building such a client-facing Big Data application poses several challenges: knowing the data, understanding the client's needs, and developing a solution that squeezes information out of the data in an effective, intelligent and usable way. It is technically challenging.

This talk lays out the scenario in which our company is fully immersed: besides data monitoring and shiny graphics, we need a deeper layer of computational intelligence. The goal is to predict malfunctions and performance issues in more than 20K complex machines several weeks or even months ahead of catastrophic failures. We solve this by turning the path that data samples follow into an Apache Storm topology, persisting to HBase and using Kafka as a decoupling tool. The costly computation is split into small, semantically sensible pieces, from which we build a highly complex topology. Executing such a topology is not trivial: managing thousands of nodes that exchange millions of messages is error prone, computationally expensive and leaves much room for improvement. This presentation shows how we translated this huge computation problem into a scalable and efficient component of our Big Data solution, building an efficient tool with Apache Storm, Kafka, HBase and Redis. The final result is an application in which the client can dig deep and see that actual machine learning processes are running and giving valuable output in the form of: "The rotor will fail in a couple of weeks, let’s plan the maintenance schedule accordingly."
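The idea of splitting a costly computation into small, chained pieces can be illustrated with a minimal sketch in plain Python. This is not the Storm API and not the speaker's implementation: the step names (`parse`, `rolling_mean`, `alerts`), the `machine_id,temperature` input format, the window size and the fixed threshold are all assumptions made for illustration, and the threshold rule merely stands in for the machine learning models the talk refers to.

```python
from collections import deque

def parse(lines):
    """Step 1: turn raw 'machine_id,temperature' lines into typed tuples."""
    for line in lines:
        machine_id, temperature = line.split(",")
        yield machine_id, float(temperature)

def rolling_mean(samples, window=3):
    """Step 2: per-machine mean over the last `window` samples."""
    history = {}
    for machine_id, value in samples:
        buf = history.setdefault(machine_id, deque(maxlen=window))
        buf.append(value)
        yield machine_id, sum(buf) / len(buf)

def alerts(averages, threshold=80.0):
    """Step 3: keep only readings whose rolling mean exceeds the threshold."""
    for machine_id, avg in averages:
        if avg > threshold:
            yield machine_id, avg

# Hypothetical sensor readings for two machines.
raw = ["t1,70", "t1,85", "t1,95", "t2,60", "t1,99"]

# Each step consumes the previous step's output, one sample at a time.
result = list(alerts(rolling_mean(parse(raw))))
for machine_id, avg in result:
    print(f"{machine_id}: rolling mean {avg:.1f} above threshold")
```

Each step only knows its input and output, so a step can be tested, replaced or scaled on its own. In a distributed setting, the in-process generator chaining shown here would be replaced by messages passed between nodes.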

Ion Marqués

NEM Solutions, Data Scientist

Supporting Data Analytics with Spark Hierarchies

Technical

What do a file system, the organization of a country into states, counties and cities, and the assembly process of car parts have in common? The answer is simple: hierarchies. Modelling data as hierarchies is an intrinsic requirement of data analysis, as it makes it easy to perform complex computations and aggregations on different levels or dimensions of the data.
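As a small illustration of aggregating on different levels of a hierarchy, the sketch below rolls leaf values up to every ancestor level. It is plain Python rather than Spark, and the sample data, the `(path, value)` record layout and the `roll_up` helper are hypothetical, chosen only to show the idea.

```python
from collections import defaultdict

# Hypothetical records: (path from root to leaf, measured value).
sales = [
    (("Spain", "Aragon", "Zaragoza"), 120),
    (("Spain", "Aragon", "Huesca"), 30),
    (("Spain", "Madrid", "Madrid"), 200),
    (("France", "Occitanie", "Toulouse"), 80),
]

def roll_up(records):
    """Add each leaf value to every prefix (ancestor level) of its path."""
    totals = defaultdict(int)
    for path, value in records:
        for depth in range(1, len(path) + 1):
            totals[path[:depth]] += value
    return dict(totals)

totals = roll_up(sales)
print(totals[("Spain",)])                      # country level
print(totals[("Spain", "Aragon")])             # region level
print(totals[("Spain", "Aragon", "Huesca")])   # city level
```

Keying the totals by path prefix means one pass over the data yields the aggregate for every level at once, at the cost of storing one entry per node in the hierarchy.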