Cloud Process History: Shaking your Data into the Cloud

1. Business transformation

Cloud Process History (CPH) is a Software as a Service (SaaS) product in CEPSA’s digital catalogue, designed using the latest cloud technology for the capture, storage, visualization and advanced real-time analysis of relevant information from industrial facilities and corporate data. Main features:

 

a) Ingestion, processing and storage
• Ingestion, enrichment and storage, with unlimited capacity, of event typologies and/or devices that emit information with disparate structure and content

 

b) Dynamic configuration and enrichment
• Parameterized configuration via Web of event typologies, process units and locations
• Product management web (sensors, hierarchies, sites…)
• Dynamic collection of process metadata, sensor locations and units
• Configurable and unlimited data retention capacity

 

c) Monitoring, Data Quality and Real Time Visualization
Real-time monitoring, time series and dashboard display, data governance and quality:
• Customized real-time event display panels (time series)
• System and infrastructure health monitor
• Automatic system alarms and notifications
• Data quality rules and alarming engine

 

d) Advanced Analytics and DataLabs
Connection to analytical platforms oriented to data scientists; creation of scalable laboratory environments for high-volume information; WorkSpaces for business analysts and standard users.
• Connectors and REST API services to advanced analytics platforms such as RapidMiner
• High-performance working environment for data exploration and machine learning
• Machine learning model execution environment
• Real-time consumption via API Gateway

 

2. Innovation

Business today is driven by data. Big data gives us a new opportunity for decision making, reducing the uncertainty of what will happen in the future. With CPH you can easily connect your industrial sensors to the data world, see everything that happens in your industrial plants, and relate data that never seemed to be related. You can monitor all the ingested information and stay aware of every problem that occurs in the real world from a single place, powered by a fast and powerful engine.

 

The CPH engine ingests events in less than one second using real-time ingestion: a streaming processing system combined with a database oriented to real-time workloads. It also supports different ways to ingest data, so you can choose depending on the needs of your industry: real-time, near-real-time and batch processing are all options available in CPH. Easy to connect: CPH connects easily to the different visualization tools used by different companies in the data world, so you can keep using the tool you prefer and make learning much easier. If your tool is not supported by the CPH system, you can connect it through the available JDBC connector.
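As a hedged sketch of that JDBC fallback from Python, the jaydebeapi bridge illustrates the idea; the driver class, URL, credentials, jar and query below are all hypothetical, since CPH would supply the actual connector details:

```python
# A hedged sketch of querying a JDBC-exposed store from Python.
# Driver class, JDBC URL, credentials, jar and table are hypothetical.
import jaydebeapi

conn = jaydebeapi.connect(
    "com.example.cph.Driver",                  # hypothetical driver class
    "jdbc:cph://cph.example.com:443/history",  # hypothetical JDBC URL
    ["user", "password"],                      # hypothetical credentials
    "cph-jdbc-driver.jar",                     # hypothetical driver jar
)
cursor = conn.cursor()
cursor.execute("SELECT sensor_id, ts, value FROM events WHERE site = 'plant_a'")
for row in cursor.fetchall():
    print(row)
conn.close()
```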

 

3. Technology

Cloud Process History (CPH) is a solution developed on cloud technology, with a low-cost architecture that solves the challenges of mass historization of data generated mainly by industrial facilities, making it available through direct-consumption applications (APIs), visualization tools and DataLabs for developing algorithms. It also has components that enable connection to advanced analytical platforms (RapidMiner, KNIME, etc.).

Benefits:
• Scalability – CPH is built on the technological infrastructure of Amazon Web Services, taking advantage of scalability in processing, unlimited storage capacity and pay-per-use.
• Data ingestion – Cloud Process History allows real-time ingestion of all types of events using factory and computer protocols. It also has a module for processing and enriching stored data.
• Democratization – CPH allows free access to data, especially for business analysts, plant engineers and data scientists.

A Humane Guide to Graph Database Internals

Databases are everywhere, but did you ever wonder what goes on inside the box? In this talk we’ll dive into the internals of Neo4j – a popular graph database – and see how its designers deal with common functional and non-functional requirements. In particular we’ll see how data is stored safely and queried performantly by understanding the way Neo4j makes use of the network and file system. So if you’re a curious person looking for a humane and light-hearted introduction to database internals and distributed systems, this talk is for you!

The Rise of Voice Technologies

Machine Learning (ML) is separated into model training and model inference. ML frameworks typically use a data lake like HDFS or S3 to process historical data and train analytic models. Model inference and monitoring at production scale in real time is another common challenge when using a data lake. But it’s possible to avoid such a data store completely by using an event streaming architecture. This talk compares the modern approach to traditional batch and big data alternatives and explains benefits like the simplified architecture, the ability to reprocess events in the same order for training different models, and the possibility to build a scalable, mission-critical ML architecture for real-time predictions with far fewer headaches and problems. The talk explains how this can be achieved leveraging Apache Kafka, Tiered Storage and TensorFlow.
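As a minimal sketch of the pattern (topic name, payload layout and model path are assumptions, not the talk’s actual setup): events are consumed from Kafka and scored by a TensorFlow model as they arrive, with no data lake in between.

```python
# A minimal sketch of streaming model inference: score Kafka events with a
# TensorFlow model as they arrive. Topic, payload and model are assumptions.
import json
import numpy as np
import tensorflow as tf
from confluent_kafka import Consumer

model = tf.keras.models.load_model("model.h5")  # hypothetical trained model

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "inference-service",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["sensor-events"])  # hypothetical topic

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    features = np.array([json.loads(msg.value())["features"]])  # assumed payload
    prediction = model.predict(features, verbose=0)
    print(f"prediction: {prediction[0]}")
```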

Building Notebook-based AI Pipelines with Elyra and Kubeflow

A typical machine learning pipeline begins as a series of preprocessing steps followed by experimentation, optimization and model-tuning, and, finally, deployment. Jupyter notebooks have become a hugely popular tool for data scientists and other machine learning practitioners to explore and experiment as part of this workflow, due to the flexibility and interactivity they provide. However, with notebooks it is often a challenge to move from the experimentation phase to creating a robust, modular and production-grade end-to-end AI pipeline. Elyra is a set of open-source, AI-centric extensions to JupyterLab. Elyra provides a visual editor for building notebook-based pipelines that simplifies the conversion of multiple notebooks into batch jobs or workflows. These workflows can be executed both locally (during the experimentation phase) and on Kubernetes via Kubeflow Pipelines for production deployment. In this way, Elyra combines the flexibility and ease-of-use of notebooks and JupyterLab with the production-grade qualities of Kubeflow (and, in future, potentially other Kubernetes-based orchestration platforms). In this talk I introduce Elyra and its capabilities, then give a deep dive into Elyra’s pipeline editor and the underlying pipeline execution mechanics, showing a demo of using Elyra to construct an end-to-end analytics and machine learning pipeline. I will also explore how to integrate and scale out model-tuning as well as deployment via Kubeflow Serving.
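To make the execution mechanics concrete, here is a hedged sketch of the kind of Kubeflow Pipelines (kfp v1 style) definition a notebook pipeline ultimately compiles down to; image names and notebook paths are hypothetical, and Elyra generates its own equivalent from the visual editor:

```python
# A hedged sketch of a two-step notebook pipeline for Kubeflow Pipelines.
# Container image and notebook paths are hypothetical.
import kfp
from kfp import dsl

@dsl.pipeline(name="notebook-pipeline", description="preprocess -> train")
def notebook_pipeline():
    preprocess = dsl.ContainerOp(
        name="preprocess",
        image="my-registry/notebook-runner:latest",  # hypothetical image
        command=["papermill", "preprocess.ipynb", "out/preprocess.ipynb"],
    )
    train = dsl.ContainerOp(
        name="train",
        image="my-registry/notebook-runner:latest",
        command=["papermill", "train.ipynb", "out/train.ipynb"],
    )
    train.after(preprocess)  # run the training notebook after preprocessing

if __name__ == "__main__":
    kfp.compiler.Compiler().compile(notebook_pipeline, "pipeline.yaml")
```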

Predicted risk of future chronic diseases

Medicsen presented at last year’s BIG THINGS conference a glucose-prediction software that generated plenty of interest. Since then, we have created new functionalities based on relevant tech innovations to improve healthcare outcomes and satisfaction. We have created a software product that predicts the risk of a patient currently having a disease and their future likelihood of developing a chronic condition (initially focused on diabetes), all based on GDPR-compliant data and oriented towards improving the patient’s health and the provider’s cost of services.

Optimize business and patient outcomes:
• Cost savings (understand workflow and reduce risk)
• Differential value of service (personalized approach)
• New business opportunities (predicted future costs)

Improve patients’ health and satisfaction, reduce the cost of care, and open new business opportunities with an algorithm that tracks data in real time, predicts future problems and offers valuable insights and recommendations through a multi-platform interactive dashboard: better understanding workflows, calculating the total cost of problems and efficiently prioritizing interventions by their return-on-investment potential. Accurate, safe and affordable.

EXPERT ANALYSIS + PREDICTIVE ALGORITHMS + SUPREME DATA VISUALIZATION
• CLASSIFICATION of patients into risk groups according to their current health state and the potential of future chronic conditions arising
• PREDICTION of the likelihood of future disease based on GDPR-compliant, non-biometric data
• Integration with ANALYSIS of business variables (costs, times…)
• MEDICAL ACTS as the only necessary data points
• ADAPTED to the particularities of the health provider
• Graphic and interactive DISPLAY of the information through an online dashboard that can be accessed from multiple platforms

We know that GDPR is a real liability for these companies, so we added a differential value point: our technology DOES NOT USE BIOMETRIC DATA. This means we don’t need to know the results of a blood test, only that the patient took it, so we only need data that healthcare companies CAN use without compliance teams having trouble. If additional data is available, we can use it, but it is not required.

OBSERVE: Automatic data tracking connected to the current databases of the healthcare provider to create an enriched layer of aggregated data and a platform to visualize it (dashboard), hosted on the cloud and working on any device.

DISCOVER:
• Unbury hidden insights that drive patients’ health and cost generation
• Classify patients into groups according to the most relevant variables for each case
• Individual patient or patient-group health risk scores and costs; likelihood of having a certain disease

PREDICT:
• Model the complex interplay of disease progression and service utilization to forecast the future progression of individual and group risk and costs, anticipating chronic illnesses and consumption of services
• Who might get sick? Predict population health based on future clinical patterns
• Disease detection/prediction algorithms

DECIDE:
• Obtain recommendations and insights to guide decisions on:
  o Optimal policy to insure
  o Creating new plans for new verticals of clients
  o Optimizing resource allocation among patients and centers
  o Interventions to reduce future patient and group risk and costs

Medicsen has validated this technology with international healthcare providers and we are now working on standardizing data access and analysis to decrease the cost of adaptation.

Foundations of Data Teams

Successful data projects are built on solid foundations. What happens when we’re misled or unaware of what a solid foundation for data teams means? When a data team is missing or understaffed, the entire project is at risk of failure. This talk will cover the importance of a solid foundation and what management should do to fix it.

Contact Tracing and Digital Epidemiology: a scientific and ethical review

Over the past several months, Plaiground, Minsait’s new AI company, has led the design of different alternatives that use geo-positioning and contact-tracing technology to break the chains of transmission of COVID-19. The best known of these initiatives has been the development of this Contact Tracing app. In this talk we will describe the ethical dilemmas between privacy and security, the areas for improvement and the lessons learned for the future.

How to Mitigate Adversarial Attacks to Computer Vision Systems

CNNs, the specialized neural networks for Computer Vision tasks, are used in sensitive contexts and exposed in the wild. While extremely accurate, they are also sensitive to imperceptible perturbations that cannot be detected by human eyes. For this reason they have been targeted by hackers, who implement AI-based techniques for their malicious purposes. During the presentation I am going to explain defense strategies that mitigate the effect of such attacks and make neural networks more robust to them, while keeping the impact on model accuracy and implementation costs minimal.
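As one concrete illustration of such a defense, here is a minimal sketch of adversarial training with the Fast Gradient Sign Method (FGSM); the model interface, labels and epsilon value are illustrative assumptions, and the talk’s actual strategies may differ:

```python
# A minimal sketch of adversarial training with FGSM: craft perturbed inputs
# along the sign of the loss gradient, then train on clean + adversarial data.
import tensorflow as tf

def fgsm_examples(model, x, y, eps=0.01):
    """Craft adversarial examples by stepping along the loss gradient's sign."""
    x = tf.convert_to_tensor(x)
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = tf.keras.losses.sparse_categorical_crossentropy(y, model(x))
    grad = tape.gradient(loss, x)
    return tf.clip_by_value(x + eps * tf.sign(grad), 0.0, 1.0)

def adversarial_train_step(model, optimizer, x, y):
    """Train on a mix of clean and adversarial inputs to improve robustness."""
    x_adv = fgsm_examples(model, x, y)
    x_mix = tf.concat([x, x_adv], axis=0)
    y_mix = tf.concat([y, y], axis=0)
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(
            tf.keras.losses.sparse_categorical_crossentropy(y_mix, model(x_mix)))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```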

Introduction to data streaming

While “software is eating the world”, those who are able to best manage the huge mass of data will emerge on top. The batch processing model has been faithfully serving us for decades. However, it might have reached the end of its usefulness for all but some very specific use-cases. As the pace of business increases, most of the time decision makers prefer slightly wrong data sooner to 100% accurate data later. Stream processing – or data streaming – matches this usage exactly: instead of managing the entire bulk of data, manage pieces of it as soon as they become available. In this talk, I’ll define the context in which the old batch processing model was born, the reasons behind the new stream processing one, how they compare, their pros and cons, and a list of existing technologies implementing the latter with their most prominent characteristics. I’ll conclude by describing in detail one possible use-case of data streaming that is not possible with batches: displaying in (near) real time all trains in Switzerland and their position on a map. I’ll go through all the requirements and the design. If time allows, I’ll try to impress attendees with a demo.
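As a toy illustration of the contrast (plain Python, not from the talk): a batch computation needs the whole dataset before answering, while a streaming one refines its answer as each element arrives.

```python
# Batch vs. streaming, in miniature: the streaming version yields an updated
# answer for every new element instead of waiting for the entire dataset.
def batch_mean(values):
    return sum(values) / len(values)          # needs the entire bulk of data

def streaming_mean(stream):
    count, total = 0, 0.0
    for value in stream:                      # process each piece on arrival
        count += 1
        total += value
        yield total / count                   # an answer is available immediately

for estimate in streaming_mean([3.0, 5.0, 4.0]):
    print(estimate)                           # 3.0, then 4.0, then 4.0
```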

JavaScript, Neurology, & Machine Learning

An important skill people in tech need to have is learning how to combine unusual, unrelated concepts into things real people can use every day. JavaScript is the most popular programming language in the world and it has the capabilities to handle massive data streams quickly. It’s already being used in machine learning, so why not take it to the next level and add neurology? You can learn how a $500 EEG sensor module can be used to improve internet and technology accessibility for the disabled. Using machine learning, we can make predictions on what specific brain signals mean and map them to real-world problems like controlling a car with your mind or being able to change the format of a website so you can peruse it without the use of an assistive reading device. By the end of this talk, attendees will know how you can use Node.js to handle brain signals and use them in a way that makes huge differences for users.

Next-Generation Data Warehouses: Building a Spatial Data Science Stack

Location data is being generated at an unprecedented rate and right now is even more important as we navigate the new normal: a world where social distancing, contact tracing & spatial data analysis are critical for governments to respond rapidly and for businesses to survive. With a growing number of Data Scientists being asked to create spatial models for their organizations, it’s important to consider how their stack of data warehouses and AI / machine learning platforms can come together to deliver accurate & relevant predictions. In this session, Javier de la Torre (founder and Chief Strategy Officer of CARTO) will look at real examples of Spatial Data Science in action in verticals such as cities and government, retail and CPG. He will show how new types of Big Data such as human mobility, credit card transactions and social media data (among others) can help the organizations of tomorrow respond to changing citizen and consumer behavior through effective location-based market analysis.

The COVID-19 impact on Consumption in Real Time and High Definition

Observing the evolution of the economy in real time and high definition can be essential when evaluating the economic impact of an event with few precedents in the world economy like the current COVID-19 crisis. Although it is not the first or the last time we are faced with a disaster, expected or unexpected, the characteristics of this one make it almost unique. That is why an accurate diagnosis of “what, how, when and where” is quite important. The analysis of granular data in real time with Big Data helps us precisely to solve most of these questions and, consequently, allows economic agents, institutions and policymakers to have quick and precise answers for decision making.

Every day, banks, payment systems providers, and other financial intermediaries record and store massive amounts of individual transaction records arising from the normal course of economic life. Financial and payment systems throughout the world generate a vast amount of naturally occurring, digitally recorded transaction data, but national statistical agencies mainly rely on surveys of much smaller scale for constructing official economic series. This paper considers transaction data from credit and debit cards from BBVA, the second largest bank in Spain, which also has a major market presence in numerous other countries, as an alternative source of information for measuring consumption, a key component of GDP. In particular, we process more than 6 billion transactions collected from BBVA cardholders and BBVA-operated point-of-sale terminals from seven countries, 2.1 billion of which arise in Spain. We analyze the data along three different dimensions: as a coincident indicator for aggregate and subnational consumption; as a detailed household consumption survey; and as a mobility index.

While card spending growth is more volatile than non-durable consumption growth, normalized spending correlates strongly with official consumption measures. In the cross section, patterns in card spending match those in official household budget surveys very closely. The implication is that card spending can stand in for consumption surveys in environments where official data is not available, for example due to reporting delays or to insufficient geographic or household detail.

We apply the idea of card spending as a consumption survey to the COVID-19 crisis in Spain, where we present four findings: (1) a strong consumption reaction to lockdown and its easing at the national and regional levels; (2) a rapid, V-shaped consumption recovery in the aggregate; (3) an adjustment of the average consumption basket during lockdown towards the goods basket of low-income households; (4) a divergence in mobility patterns during lockdown according to income, in which poorer households travel more during the workweek. Exploring the relation between mobility and disease incidence, we show that in the absence of direct mobility proxies, card transactions in transportation categories can be used as a mobility proxy at narrow geographical and socioeconomic levels of analysis. Therefore, our main conclusion is that transaction data provides high-quality information about household consumption, which makes it a potentially important input into national statistics and research on household consumption, as well as for businesses and policymakers, who need rapid and accurate answers about what is happening in real time to measure the impact of COVID-19 and of the policy interventions made to limit its spread.
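A hedged sketch of the paper’s core consistency check (the series below are made up, not BBVA data): growth in normalized card spending should correlate strongly with growth in official consumption measures.

```python
# Toy check: correlation between card-spending growth and official
# consumption growth. All numbers are illustrative, not real data.
import pandas as pd

card_spending = pd.Series([100, 95, 60, 55, 80, 92])        # illustrative index
official_consumption = pd.Series([100, 96, 65, 58, 83, 90])  # illustrative index

growth_card = card_spending.pct_change().dropna()
growth_official = official_consumption.pct_change().dropna()
print(growth_card.corr(growth_official))  # correlation of growth rates
```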

Fine-tuning language models for spanish NLP tasks

Natural Language Processing (NLP) is nowadays one of the main points of focus of artificial intelligence and machine learning technologies. While conversational agents such as Siri or Alexa are the most visible representatives of NLP, the field finds wide application in search engines, chatbots, customer service, opinion mining, and so on. The high levels of success that such NLP solutions have achieved in recent years are mostly fueled by three factors: the public availability of very large datasets (corpora) of web text, the fast upscaling of specialized hardware capabilities (GPUs and TPUs), and the improvements of deep learning models adapted for language.

Focusing on this last point, the so-called “language models” have proven to be quite effective in leveraging the large datasets available. A language model is a deep artificial neural network trained on unlabeled corpora with the aim of modelling the distribution of words (or word pieces) in a particular language. In this way, and while trained in an unsupervised fashion, a language model is able to perform NLP tasks such as filling gaps in sentences or generating text following a cue. Furthermore, large language models such as GPT-2, RoBERTa, T5 or BART have proven to be quite effective when used as foundations to build supervised models addressing more specific or downstream NLP tasks like text classification, named entity recognition or textual entailment. Further specialized language models such as DialoGPT, Pegasus or mBART have presented even better results for complex tasks such as context-free conversation, summarization and translation. And the extremely large model GPT-3 has presented impressive results in a wider variety of NLP tasks while being trained in a purely unsupervised manner.

However, most of the language models that are available as open-source tools focus solely on the English language. While models for other languages do exist (BETO, CamemBERT, RobBERT, GreekBERT, …), they are usually trained on smaller corpora than English models, thus producing lower-quality results. Multilingual versions of some of the most popular language models do exist, though they usually underperform compared to monolingual models when tested on tasks other than machine translation.

As an interdisciplinary team of experts in data science and computational linguistics, in this talk we will present our experience in applying language models to solve NLP tasks in a language other than English: Spanish. Although Spanish is currently spoken by about half a billion people in the world, it falls far behind English in the amount of NLP resources available. The most frequently used Spanish language model is BETO, trained on the Spanish Unannotated Corpora (SUC). While this Spanish-only model produces better results than multilingual models, there is still plenty of room for improvement when compared to English models. We will present how we have re-trained BETO using a larger corpus to improve results in downstream NLP tasks. In particular, we have made use of the OSCAR corpus, about 9 times larger than SUC, together with some semi-automated corpus cleaning strategies, to improve the BETO model. Results on a variety of text classification and named entity recognition tasks have shown that this approach is practical and effective in producing a better fit.
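As a minimal sketch of the downstream fine-tuning step described above, using Hugging Face transformers: the checkpoint name is BETO’s public identifier, while the toy dataset, labels and training settings are illustrative assumptions, not the talk’s actual experiments.

```python
# A minimal sketch of fine-tuning a Spanish language model (BETO) for text
# classification. Toy data, labels and hyperparameters are illustrative.
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "dccuchile/bert-base-spanish-wwm-cased"  # BETO checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

texts = ["Me encanta este producto", "No volveré a comprar aquí"]  # toy data
labels = [1, 0]
encodings = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")

class ToyDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: v[idx] for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1),
    train_dataset=ToyDataset(encodings, labels),
)
trainer.train()
```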

The Data Cloud: A Vision for Data in a Multi-Cloud World

You’ve heard of the infrastructure cloud provided by AWS, Azure, and GCP because it has revolutionized the way organizations run the servers and tools that power their business. You’ve heard of the application cloud provided by companies such as Salesforce, Workday, and ServiceNow that’s brought the “front office” to the cloud, with massive efficiencies of scale along the way. Now we are seeing the emergence of a third layer, one devoted to data, analytics, and cross-cloud collaboration: the Data Cloud.

The trends that have brought us the Data Cloud should come as no surprise. As the adoption of both the infrastructure and application clouds has accelerated, businesses have been left with massive amounts of data spread across multiple different systems and cloud provider platforms. It is common for large corporations to have operations and data sitting in two or more infrastructure clouds like AWS, Azure, and Google, along with even more data in applications like Salesforce, Workday, Coupa and ServiceNow. This has left them looking for a way to un-silo, govern, and secure all of this private data. What’s more, there is increasing demand from within the business to actually use this data for analytics, business intelligence, machine learning and artificial intelligence. Running these workloads on top of unified and secured cross-cloud data is the vision for the Data Cloud.

It all sounds promising, but what would you need in order to accomplish all of this? First of all, you need a way to combine and unify disparate data across multiple cloud platforms. Second, you would need to be able to support data of all types, including structured and semi-structured data. Third, you’d need to be able to run any number of workloads on top of your unified data in order to actually put this valuable resource to work. One of the most important workloads would be the ability to easily and securely share all of your newly unified data with trusted partners or customers, and augment it with outside data from third-party providers.

If an organization were able to accomplish all of that, the potential benefits could be enormous. Although all data is valuable, it becomes infinitely more so when it is combined with other contextual data sets. To use a pertinent example, imagine having all of your product usage data in a data lake on AWS. Separately, you would likely have data about your customers in Salesforce and Marketo. Combining these datasets would enable you to create an end-to-end vision of the customer lifecycle that simply isn’t possible with the data in disparate silos. Having that vision of the full customer lifecycle enables the types of insights that drive new revenue streams, efficiencies, and a better customer experience. Of course, that is just a single example of the way that the Data Cloud can drive value.

These are some of the concepts that we will explore in this presentation. You’ll walk away with a deep understanding of the future of data in a multi-cloud world. You’ll see how the world’s most data-centric organizations are embracing the Data Cloud and the potential it provides. Last but not least, you’ll learn how you can take maximum advantage of the opportunity the Data Cloud presents you with.

Technically Right, Effectively Wrong: The Data Product No Customer Wants

From Gartner and VentureBeat to NewVantage, the surveys say the same thing: ~85% of analytics, big data, and AI projects will fail, despite massive $ investments. Why are customers and employees not engaging with these data products and services? Often, they weren’t designed around user needs, wants, and behavior. A “people first, technology second” approach can minimize the chance of failure and drive your analytics/AI/data/product team to create innovative and indispensable software solutions.

Federated Machine Learning with TensorFlow

“Traditional” machine learning systems require a centralized training process, along with separate systems for collecting and processing information, which tend to increase the complexity of the learning process and, in many cases, limit what information can be obtained due to the lack of privacy, the cost of the infrastructure and the volume of data that must be stored before training. To mitigate some of these problems, Google has developed what is known as Federated Machine Learning, which allows “mobile” devices to learn collaboratively while keeping the training data on the device itself: the predictive models generated locally are shared and used to improve a shared model stored in the cloud, with small contributions from all the local models. This process preserves the privacy of the information and reduces the amount of data sent from the devices and stored in the cloud. Join me in this talk to discover how Federated Machine Learning works. I will present the theoretical concepts behind this new learning model, and we will see in practice how a model-generation system based on federated learning could be implemented using TensorFlow.
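As a framework-free sketch of the core idea described above (a toy linear model with made-up data, not the talk’s TensorFlow implementation): each client trains locally and only model weights, never raw data, are averaged into the global model.

```python
# A minimal federated averaging (FedAvg) sketch: local updates on-device,
# weighted averaging in the "cloud". Model and data are illustrative.
import numpy as np

def local_update(global_weights, local_data, lr=0.1):
    """One gradient step of local training on the device's own data."""
    X, y = local_data
    preds = X @ global_weights
    grad = X.T @ (preds - y) / len(y)
    return global_weights - lr * grad     # raw data never leaves the device

def federated_average(client_weights, client_sizes):
    """Weighted average of local models, proportional to local dataset size."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

rng = np.random.default_rng(0)
global_w = np.zeros(3)
clients = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(5)]

for _ in range(10):                        # federated rounds
    local_models = [local_update(global_w, data) for data in clients]
    global_w = federated_average(local_models, [len(d[1]) for d in clients])
print(global_w)
```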

Optimus Price: Dynamically Adapting to the Marketplace

Price optimization is a complex task, especially for a two-sided marketplace where maintaining the balance between demand and supply is key to the health of the business. In the case of Cabify, deviations from such equilibrium can be harmful in several ways. Increasing prices too much will dissuade users from requesting rides, but by lowering them we may over-burden our fleet or, in the worst case, even lose the drivers’ interest in favor of more profitable competitors. The challenge of finding the perfect balance point is made even more arduous by its non-stationarity. Dramatic shocks like the ongoing global pandemic can shift the market to a different state, and we must be able to adapt promptly.

At Cabify, we have teams of experts that analyze the market and make calls on the best prices for each city based on their intuition and knowledge of the local particularities. Besides, we have developed tools that enable our experts to quantitatively evaluate their decisions in controlled experiments. In these experiments, new pricing schemes are tested on a random subset of users to understand how they react to them. This approach, while it has proven successful until now, is showing its limitations in terms of scalability and readiness for a business that is rapidly growing in size and complexity. We cannot rely so heavily on human experts for every step of the process any more: we need to step up our game and build an automatic pricing system.

Our challenge consists of making automatic decisions under high uncertainty. For this reason, we started exploring the use of Reinforcement Learning techniques for pricing optimization. The problem at hand consists of choosing the best price for each type of ride, with the handicap of not knowing in advance how users will respond to these prices. Our problem is similar to the Multi-Armed Bandit problem, which consists of allocating a fixed set of resources among competing choices without knowing which one is the best. We believe that this framework, widely used in online marketing to serve ads, can be adapted to the pricing task. Different algorithms were tested, and in the end a custom solution was built that leverages the natural structure of the problem: the cheaper a price is, the more likely it is to be accepted.

Once the automatic pricing system was built, we tested it in a controlled experiment before fully enabling it. We satisfactorily found that it worked as expected: by increasing or decreasing prices, this system was capable of finding a better price that improved our drivers’ earnings significantly. While making this step ahead, however, we could already see in which direction the next one should be. The data from our experiments revealed that not all types of metrics improved. We were able to spot that in some cases mid-term user retention was being negatively affected by the new prices, even if in the short term they were bringing better revenues. A high price does not only influence users’ instant decision on a journey, but also their overall experience and the chance of using Cabify in the future. By developing a myopic solution we were ignoring the impact on long-term retention, thereby risking losing users in the long run. We are currently improving this solution to include these long-term factors in the pricing decision.

While still a work in progress, this solution is a good example of how Reinforcement Learning is not only good for creating invincible gaming AIs, but also for having a positive impact on real-world business.
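A toy Thompson-sampling sketch of this bandit framing (all prices and acceptance rates are made-up assumptions, not Cabify’s model): each candidate price is an arm, the reward is revenue, and cheaper prices are more likely to be accepted.

```python
# Thompson sampling over price arms: sample acceptance beliefs from Beta
# posteriors and pick the arm with the highest expected revenue.
import numpy as np

rng = np.random.default_rng(42)
prices = np.array([5.0, 7.0, 9.0, 11.0])
true_accept = np.array([0.9, 0.7, 0.45, 0.2])   # cheaper => more likely accepted
wins = np.ones(len(prices))                      # Beta(1, 1) priors
losses = np.ones(len(prices))

for _ in range(5000):
    sampled_accept = rng.beta(wins, losses)        # sample acceptance beliefs
    arm = int(np.argmax(prices * sampled_accept))  # maximize expected revenue
    accepted = rng.random() < true_accept[arm]
    wins[arm] += accepted
    losses[arm] += not accepted

best = np.argmax(prices * wins / (wins + losses))  # posterior-mean revenue
print(f"learned best price: {prices[best]}")
```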

Industrializing your Machine Learning workflows in the cloud

The data science practice is moving faster than ever, but its evolution, together with the maturity of the wide range of Machine Learning tools available, is becoming increasingly complex to manage. Currently, organizations of all sizes are looking for ways to increase efficiency while reducing time to market through fully automated ML pipelines, most of them embracing the advantages that the public cloud provides today. Bringing those ML pipelines to the cloud with modern CI/CD operational models involves a series of cross-team challenges and associated best practices, known as MLOps, which sometimes also involves full transformations in the structure of the teams in the organization.

In this session, we will explore the current state of the art for industrializing Machine Learning workflows in the cloud through MLOps pipelines, and how some of the most innovative companies in the world have solved these new challenges. We will cover the best practices in modern solutions for important technical principles like consistency, flexibility, reproducibility, reusability, scalability, and auditability, and consider several technical options for orchestrating the pipelines with native cloud services in Amazon Web Services (AWS), like Amazon SageMaker or AWS Step Functions, as well as possible integrations with modern open-source alternatives like Kubernetes and Kubeflow Pipelines. Each alternative will be analyzed with its advantages and disadvantages, along with ways to increase efficiency following the MLOps principles discussed.

We will also comment on some real examples to cover the most common concerns we see in companies: Are you familiar with common design anti-patterns for MLOps like the “Superhero data-scientist dependency”, the “ML black-box”, or the “Deep embedded failure”? Are your teams prepared for operating and supporting machine learning workloads in production? Are you properly documenting and tracking model creation and its lineage and changes? Have you fully automated the end-to-end development and deployment pipeline of your machine learning workloads? Are you properly monitoring and logging the model hosting? Are you considering automated re-training workflows when new data is available? And after all… how can you accelerate the whole process again? These examples incorporate new and interesting technical and architectural concepts that are arising in many companies around the world, like “ML-lakes”, “ML factories”, or “universal ML pipelines”.

Finally, we will explore how to combine the establishment of solid cloud IT governance mechanisms in ML workloads for security, compliance, and spend management, while staying agile and innovating with the speed of the cloud in the development of ML projects for self-service access, fast experimentation, and quick response to changes. This is particularly important for companies using machine learning in areas like finance or insurance, but still applicable to any type of company that is embracing ML at scale in the cloud. By the end of the session you should have a clear view of the motivation behind MLOps, the technical alternatives currently available in the cloud for implementing modern machine learning workflows with fully automated pipelines, and how to accelerate the whole journey for you and your organization.
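As a hedged sketch of the kind of pipeline definition discussed above, using the SageMaker Python SDK (script name, IAM role, S3 paths and pipeline name are all hypothetical; running it requires AWS credentials and resources):

```python
# A hedged sketch of a one-step automated training pipeline in SageMaker.
# Entry point, role ARN, bucket and pipeline name are hypothetical.
from sagemaker.sklearn.estimator import SKLearn
from sagemaker.workflow.steps import TrainingStep
from sagemaker.workflow.pipeline import Pipeline

ROLE = "arn:aws:iam::123456789012:role/SageMakerRole"  # hypothetical role

estimator = SKLearn(
    entry_point="train.py",          # hypothetical training script
    framework_version="1.2-1",
    instance_type="ml.m5.xlarge",
    role=ROLE,
)

train_step = TrainingStep(
    name="TrainModel",
    estimator=estimator,
    inputs={"train": "s3://my-bucket/train/"},  # hypothetical dataset location
)

pipeline = Pipeline(name="mlops-demo-pipeline", steps=[train_step])
pipeline.upsert(role_arn=ROLE)   # register (or update) the pipeline definition
pipeline.start()                 # trigger an execution
```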

Using Graphs for promotion management at El Corte Inglés

El Corte Inglés is a leading business group in Spain, and its Retail business line, from department stores to small stores, has led this sector in Spain over the last century. We will talk about how El Corte Inglés has managed to improve its business process for managing promotions through the implementation of graph technology using the Neo4j platform. To maintain its leadership in an increasingly competitive market, El Corte Inglés is constantly on the move to adapt to new trends and maintain closeness with all its customers through all its channels, with a special focus on the online channel, given the acceleration this medium has had as a result of the COVID-19 pandemic.

Challenge: to understand the challenge from the point of view of the business process, we have to explain it in a little more detail. The El Corte Inglés product catalogue today includes around 50 million SKUs, all of which may be promoted in some way. Promotions have to be uploaded and associated with the products; once this is done, it is necessary to flatten the information in order to reach all the channels. Once the information has been “flattened” or processed, the content must be “exploded”: that is the way promotions reach the channels so that users can see and use those offers. One of the characteristics of the traditional systems was that they operated in nightly batch processes that took about 8 hours to calculate and distribute the information, a costly process that required several systems, precisely because it had to search for the relationships between the entities contained in its database through JOINs. This methodology is not viable today, since there are sales channels open 24 hours a day that need to be updated in real time. We are going to talk about the traditional technologies, the architecture, and the reasons why they were neither efficient nor scalable.

Solution: El Corte Inglés evaluated many technologies and finally understood the value of Neo4j, as the leader in native graph databases, to manage this challenge. We bet on this platform to evolve this entire promotion management system. In 2019, work began to implement this technology. We will talk about the first phases of the implementation and the objectives achieved since then with the new approach, also reviewing the current architecture to highlight the efficiency that graphs bring to the management of promotions.

Looking to the future, especially in this new post-pandemic scenario, business needs have evolved and El Corte Inglés has seen how the use of graph databases can provide benefits beyond the management of promotions, so we will talk about the medium- and long-term challenges of El Corte Inglés in terms of the evolution of this project and its probable use in other use cases directly linked to this process, which make up the steps towards the digitization of a retail giant determined to remain the leader in its market. The next planned steps are real-time integration with the web front-end and consolidating data from various business areas, such as customer position and warehouse control, so that the calculation of promotions is based not only on the product but also on highly relevant variables that take into account more data, specifically the client and their context, and the stock and delivery costs of each product, with the aim of offering an increasingly personalized experience to customers through all available channels.
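To illustrate why the graph model avoids the JOIN-heavy batch recalculation, here is a hedged sketch of a promotion-to-product traversal with the Neo4j Python driver; the node labels, properties and credentials are hypothetical, not El Corte Inglés’s actual schema:

```python
# A hedged sketch: resolving active promotions for a channel as a single
# relationship traversal in Cypher. Schema and credentials are hypothetical.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (p:Promotion {active: true})-[:APPLIES_TO]->(s:SKU)-[:SOLD_IN]->(c:Channel)
WHERE c.name = $channel
RETURN s.id AS sku, p.discount AS discount
"""

with driver.session() as session:
    for record in session.run(query, channel="online"):
        print(record["sku"], record["discount"])
driver.close()
```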

Building Continuous Integration Pipeline With Jupyter Notebooks on GCP

It is a common belief that Jupyter Notebooks and production are not compatible. This leads to several consequences:
* Only a very few ML models and experiments evolve to a production environment
* It takes a very long time to deliver even a simple model to production, even for the most valuable and business-critical ML models
We believe that this can be drastically improved by providing a way to use Jupyter Notebooks directly in a production environment, and in this talk we will show you how to do it.
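The talk does not name its tooling; as one concrete illustration of executing notebooks headlessly inside a pipeline, papermill runs a parameterized notebook and saves the executed copy (the paths and parameters below are hypothetical):

```python
# Running a notebook as a production job with papermill. The notebook names,
# parameters and GCS path are hypothetical.
import papermill as pm

pm.execute_notebook(
    "train_model.ipynb",            # hypothetical source notebook
    "train_model_output.ipynb",     # executed copy with outputs, for auditing
    parameters={"learning_rate": 0.01, "data_path": "gs://my-bucket/data.csv"},
)
```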

The Many Models Pattern – A Case of Site Level Energy Demand Forecasting

As companies mature through their Machine Learning journey, a pattern of “many models” often emerges. In the real world, many problems can be too complex to be solved by a single machine learning model. Whether that be predicting sales for each individual store, building a predictive maintenance model for thousands of oil wells, or tailoring an experience to individual users, building a model for each instance can lead to improved results on many machine learning problems, as opposed to training a single model to make predictions for all instances. However, the infrastructure, procedures and level of automation required to operate this kind of pattern pose a challenge at all levels. The combined use of cloud services, machine learning traceability systems and DevOps practices allows any business to build a scalable platform that uplifts the technical foundations of their data and data science capabilities in standardization, rapid model development and deployment at scale, unlocking existing business value and creating new opportunities to accelerate business growth.

In this talk we present the many models pattern and how it can be applied to an energy business to quickly prototype and launch a site-level energy demand forecasting use case, leveraging the end-to-end capabilities of the Azure stack. This many-models solution predicts energy demand right down at the meter level and is used to plan energy trading responses to grid demand. The use case incorporates thousands of individual models that are trained with specific data, with predictions every few minutes (resulting in millions of calls per day), and includes parallelized and selective model retraining triggered by drift. It incorporates parallel at-scale training (leveraging on-demand compute), end-to-end model and code management, automated MLOps deployments, real-time availability of models via web services, Kubernetes model hosting and performance monitoring. The solution leverages an “Infrastructure as Code” and “Continuous Integration/Continuous Delivery” approach that ensures data science and engineering resources are transferable across business initiatives, with all ML artefacts fully code-controlled, managed and documented.

The target audience for this session is either:
– Data product owners who want to learn about the many models pattern, the business problems it can solve and the implications of building this solution.
– Data scientists, data engineers and machine learning engineers who have worked with machine learning models and would like to learn how to manage the lifecycle of many models.

By attending the session, the audience will:
– Understand the kind of business problems where a many models pattern may apply.
– Get an overview of the best practices needed to bring this solution to life.
– Understand how to design a machine learning solution with thousands of parallelized model trainings.
– Know how to package thousands of models into a handful of groups and deploy them into containers.
– Learn how to make thousands of models accessible in real time via a single API.
– Learn how to build the MLOps pipelines required to operationalize this solution.
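A minimal sketch of the pattern itself (data layout and model choice are illustrative assumptions, not the production Azure solution): one independent forecaster is trained per site on that site’s own history.

```python
# Many-models in miniature: one model per site, trained only on that
# site's data. The dataset and linear model are illustrative.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "site": ["A", "A", "A", "B", "B", "B"],
    "hour": [1, 2, 3, 1, 2, 3],
    "demand": [10.0, 12.0, 15.0, 40.0, 38.0, 35.0],
})

models = {}
for site, group in df.groupby("site"):   # one model per instance
    models[site] = LinearRegression().fit(group[["hour"]], group["demand"])

# Each site gets predictions from its own dedicated model.
print(models["A"].predict([[4]]), models["B"].predict([[4]]))
```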

Loc Analytics: How data & AI can change Game Localization

The Loc Analytics team will talk about how EA’s video game localization is being transformed by data and AI: how this culture change is being made, and which tools have been created that are key to it. The creation of the Loc Data Hub, the platform that centralizes the data generated around the localization of a game by different internal and external sources, together with multiple other measures, is the cornerstone of this transformation.

Tokenizing Flight’s CO2 emissions in COVID times

The aviation sector is responsible for about 2% of global greenhouse gas (CO2) emissions, and the sector is committed to reducing its net CO2 emissions by 2050 to 50% of what they were in 2005. Considering that by 2037 the number of passengers will have doubled, at a constant 3.7% annual growth, the aviation sector must improve its fuel efficiency by 2% every year to achieve this goal, through better air operations (air navigation procedures), better technology (engines and materials) and better fuels (biofuels). In addition, airlines must offset their emissions to be CO2 neutral with environmental projects through offsetting schemes, for example EU-ETS in Europe and CORSIA, the new ongoing programme promoted by the United Nations and IATA.

The CORSIA and EU-ETS programmes are based on compensating emissions by financing a reduction in emissions elsewhere, and require detailed and specific procedures for data monitoring, verification and reporting, involving several stakeholders: airlines, independent auditors, national and supranational government agencies and, finally, companies with environmental projects (renewable energy, carbon and methane sequestration, energy efficiency, technological process improvements) and carbon markets. The complexity and uncertainty of data measuring, data gathering and the data validation process, plus the fine-grained tracking of data reporting and its custody, from a single flight to the national/supranational data registry, are key to the quality of emissions data and its compensation through the whole process.

We developed a POC named ARETA (Aviation Real Time Environmental Token Accreditation): an automated data platform that supports the complex offsetting scheme processes, based on the Monitoring, Verification and Reporting paradigm proposed in the offsetting schemes. The solution also acts as a carbon trading platform. Our main POC functionalities are:
• Estimating aviation CO2 emissions in near real time from real-time open flight data.
• Tokenizing the emissions and trading them through a blockchain.
• Creating a data-governed big data repository.
• Visualizing dashboards with the data collected and processed for further use cases, some of them in near real time.
• Estimating, in near real time, the related air quality emissions (NOx, CO and HC) at airports and in en-route flight segments.

We simulated the participation of all the stakeholders in the same unified platform, simplifying the process and the emission units (carbon tonnes), setting up a source-of-truth data repository using Big Data and Blockchain capabilities, guiding the verification process through Smart Contracts and providing dashboards to analyze the data. Every step of the POC takes Data Governance disciplines into consideration. For developing the POC we used open source tools such as Neo4j, Spark, Elastic, MongoDB, Redis, BigchainDB and Kafka, with Python as the core language.

We will show:
• The architecture and the pipeline of ARETA, covering the Big Data, Blockchain and Data Governance domains used in this POC.
• The Data Governance we applied in the POC: data lineage, data glossary/dictionary, data quality, metadata and master data management.
• The dashboards we built.
• And one more thing: unexpectedly, we started the POC in March in the middle of the COVID scenario, so we also extended the use cases to analyze some of its impacts; we will show you some data and visualizations about the COVID scenarios.
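As a toy sketch of the near-real-time estimation step (flight data and burn rates are illustrative assumptions, not ARETA’s actual pipeline), CO2 can be approximated from fuel burn using the standard factor of roughly 3.16 kg of CO2 per kg of jet fuel:

```python
# Toy CO2 estimate for a flight segment from its fuel burn.
# Burn rate and duration are illustrative assumptions.
CO2_PER_KG_FUEL = 3.16  # kg CO2 emitted per kg of jet fuel burned

def estimate_co2(flight_minutes: float, burn_rate_kg_per_min: float) -> float:
    """Estimate CO2 in kg for a flight segment from its fuel burn."""
    return flight_minutes * burn_rate_kg_per_min * CO2_PER_KG_FUEL

# e.g. a 90-minute segment at an assumed 40 kg/min burn rate
print(f"{estimate_co2(90, 40):,.0f} kg CO2")  # ~11,376 kg
```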

DataQuality with Artificial Intelligence

The project consists of a solution for relating data sources that lack a shared identifier, linking them instead through several common fields that often do not match exactly and are not always all populated. This is a solution adapted to today’s needs, where we want to improve the quality of the stored information in order to exploit it analytically and also enrich it with external sources. Based on a person’s information, the algorithm checks both exact matches of the fields and matches after applying a certain preprocessing adapted to each field in question, as well as calculating distances to measure the degree of similarity between the two values that have not matched. In addition, according to the matching fields and the degrees of similarity, in the case of non-exact matches, a quality level is assigned to the candidate found, which measures the confidence that the candidate is the same person as the incoming one. The algorithm returns the candidate(s) found along with the quality of each candidate, so that depending on the degree of quality some actions or others are taken.
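A minimal sketch of that matching logic (fields, normalization and thresholds are illustrative assumptions): normalize each field, check for exact matches, fall back to a similarity distance, and combine the scores into a quality level.

```python
# Fuzzy record matching in miniature: per-field normalization, exact-match
# check, similarity fallback, and an aggregated quality score.
from difflib import SequenceMatcher

def normalize(value: str) -> str:
    return " ".join(value.lower().split())

def field_similarity(a: str, b: str) -> float:
    a, b = normalize(a), normalize(b)
    return 1.0 if a == b else SequenceMatcher(None, a, b).ratio()

def match_quality(incoming: dict, candidate: dict,
                  fields=("name", "address")) -> float:
    scores = [field_similarity(incoming[f], candidate[f])
              for f in fields if incoming.get(f) and candidate.get(f)]
    return sum(scores) / len(scores) if scores else 0.0

person = {"name": "José García", "address": "Calle Mayor 1"}
candidate = {"name": "Jose Garcia", "address": "C. Mayor 1"}
print(round(match_quality(person, candidate), 2))  # quality score in [0, 1]
```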

Unsupervised real-time anomaly detection and root cause estimation

Nowadays, an increasing number of business problems rely on the analysis of real-time metrics. Typical use cases range from credit fraud detection to predictive maintenance. Also, we are moving towards an era where all sensors and devices are connected to the internet (i.e. IoT), monitoring the performance of different KPIs. For this reason, it is crucial to extend and refine real-time analytics on streaming data sources to serve fast-developing sectors such as Smart Cities, Industry 4.0, Smart Healthcare, etc.

In this talk we will focus on unsupervised real-time anomaly detection. For this type of setup, it is standard practice to set thresholds for the detection of anomalies. Examples include naïvely choosing fixed constant upper and lower bounds, or estimating a threshold based on the MAE with respect to previous data points. Such thresholds may be suitable during training but may not be accurate enough once we put our model in production. Typically, they degrade with time, detecting too many or too few anomalies, which forces us to change them later.

We present an unsupervised real-time anomaly detector built on a forecaster based on LSTM neural networks. Anomalies in this case are data points which lie outside the confidence interval of the predictions. As we know, confidence intervals are not straightforwardly obtained from this type of neural network, in contrast to well-known models such as ARIMA. For that reason, we propose a way to obtain them using stochastic dropout. We place a Dropout layer after each of the LSTM layers used in the model. Once the model is trained, we bootstrap enough iterations to obtain the desired confidence intervals: in every iteration each layer has a random dropout value which disconnects weights between neurons at random. By following this procedure, stochastic bootstrapping makes the confidence intervals more reliable, since the width of the interval does not depend on a dropout rate set beforehand. With this technique the model can also adapt the width of the confidence intervals, in metric space, in such a way that the anomalies detected are true outliers in the time series.

In addition, we present an automated way to detect the root cause of such anomalies in cases where we have several metrics at our disposal. To that end, we focus on anomalous points that occur simultaneously in a subset of all the monitored metrics. To estimate the root cause, we analyze correlations between the metrics that belong to this subset, such as cosine similarity and correlation of the tokenized metric names. We supplement this with events which may affect the behavior of the monitored metrics: new releases of an app whose metrics we are monitoring, the weather for the use of transportation services, etc. Altogether, this gives us common features for all the anomalous points which may point towards the origin of the detected anomalous behavior, and which can be communicated in real time.
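A minimal sketch of the confidence-interval technique described above (architecture, window length and data are illustrative assumptions): keep dropout active at inference time and bootstrap many stochastic forward passes.

```python
# Monte Carlo dropout for prediction intervals: run the model with
# training=True so the Dropout layer stays stochastic, then take percentiles.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(24, 1)),
    tf.keras.layers.Dropout(0.2),          # stays stochastic at inference
    tf.keras.layers.Dense(1),
])

x = np.random.rand(1, 24, 1).astype("float32")   # toy window of 24 time steps
samples = np.stack([model(x, training=True).numpy() for _ in range(200)])

lower, upper = np.percentile(samples, [2.5, 97.5])  # 95% confidence interval
observed = 0.9                                      # hypothetical next value
print("anomaly" if not (lower <= observed <= upper) else "normal")
```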

Corrosion Detection in Repsol Refineries

We apply Computer Vision techniques together with advanced analytical techniques, strongly supported by industrial knowledge, to detect the probability of interior and exterior corrosion in pipes, integrating different data sources such as flow rates, compositions, thickness measurements, etc. The system has been built for 30 of the 3,500 lines at the Tarragona refinery for internal corrosion, and for 7 lines at the Tarragona refinery and 8 lines at the Coruña refinery for external corrosion. It allows inspections to be performed more accurately and effectively.

Advances in AI – What’s in it for the Enterprise?

There have been amazing advances in AI over the last decade, and they have led to huge impact. But what has driven these advances? Answering this will help us understand why the impact seems to be much more consumer-centric, and what it would take to bring it to businesses. Does the same technology that delivered Google Photos and Translate have something for the enterprise as well?

How can noise help us see the world better?

OKRA CEO Dr Loubna Bouarfa will share her perspective on how noise helps us see the truth more clearly. This is not only the case for human beings, but also for AI.

The endless repetition of the same Groundhog Day, the same challenges and tasks, forces us to discover the opportunities for a better world. Just like machines, by experiencing very subtle variations of one same thing over and over again we learn to identify the ground truth, we learn to find a way as humans and societies to become better and stronger.

Loubna Bouarfa will reveal how humans and machines are not so different after all. She will share some learnings from AI concepts such as invariants, overfitting and regularisation, helping us to unveil our human power and learn from machines in times of uncertainty.

Data Science to fight against COVID-19

In her talk, Nuria will describe the work done within the Commissioner for AI Strategy and Data Science against COVID-19 for the President of the Valencian Region. As commissioner, she has led a multi-disciplinary team of 20+ scientists who have volunteered since March 2020, working on 4 large areas: (1) human mobility modeling; (2) computational epidemiological models (both metapopulation and individual models); (3) predictive models; (4) citizen surveys, with the launch of the covid19impactsurvey, one of the largest citizen surveys about COVID-19 to date, with over 300,000 answers (https://covid19impactsurvey.org).

She will describe some of the work carried out in each of these areas and share the lessons learned in this very special collaborative initiative between civil society at large (through the survey), the scientific community (through the Expert Group) and a public administration (through the Commissioner at the Presidency level).

Markov Logic: A Step Toward AI

Intelligent systems must be able to handle the complexity and uncertainty of the real world. Markov logic enables this by unifying first-order logic and probabilistic graphical models into a single representation. Many deep architectures are instances of Markov logic. An extensive suite of learning and inference algorithms for Markov logic has been developed, along with open source implementations like Alchemy. Markov logic has been applied to natural language understanding, information extraction and integration, robotics, social network analysis, computational biology, and many other areas.
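For reference, the standard Markov logic formulation makes the unification concrete: each first-order formula carries a weight, and the probability of a world grows with the number of formulas it satisfies.

```latex
P(X = x) \;=\; \frac{1}{Z}\,\exp\!\Big(\sum_i w_i\, n_i(x)\Big)
```

Here $w_i$ is the weight of formula $i$, $n_i(x)$ is the number of true groundings of formula $i$ in world $x$, and $Z$ is the partition function normalizing the distribution.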

Automated semantic segmentation of large datasets

Semantic segmentation is the classification of every pixel in an image or video. The segmentation partitions a digital image into multiple objects in order to simplify or change the representation of the image into something more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications, ranging from perception in autonomous driving to cancer cell segmentation for medical diagnosis. Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data, and is further compounded by advances in the cloud technologies that provide the storage and compute for such applications. Semantically segmented datasets are a key requirement for improving the accuracy of the inference engines built on them, and improving the accuracy and efficiency of these systems directly affects the business value for organizations developing such functionality as part of their AI strategy.

This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance and accuracy. Scientists and engineers leverage domain-specific features and tools that support the entire workflow: labeling the ground truth, handling data from a wide variety of sources and formats, developing models, and finally deploying those models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines for big datasets. These environments let engineers develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.

An illustration of these techniques leverages MATLAB and its application-specific toolboxes and domain-specific functionality to accelerate the development of advanced driver-assistance systems (ADAS). It uses the CamVid dataset [3] to create and train a fully convolutional network (FCN-8s) [4] initialized with VGG-16 [5] weights. The model is then deployed at scale to execute optimally on NVIDIA® GPUs using GPU Coder™. The same approach applies to other fully convolutional networks such as SegNet [6] or U-Net [7]. The trained models can be operationalized to scale against large datasets on Spark-based systems in the cloud while measuring performance and accuracy; the end outcome is faster semantic labeling of large datasets and better-performing labeling and inference pipelines.

The architecture of a typical system is discussed both from the perspective of the domain and subject matter experts who develop the convolutional neural network models, and from that of the IT/OT personas responsible for deploying the models, ETL pipelines, storage, and other underlying infrastructure. DevOps/MLOps maturity in such applications is discussed, along with a high-level business viewpoint on the need for, and value of, well-architected systems and practices. Finally, the need for governance and lifecycle management of the models and the underlying data is also discussed. The performance benchmarks presented for these end-to-end workflows underline how engineers can scale their semantic segmentation workloads against large datasets of image and video data.

References:
[1] Shapiro, Linda G., and George C. Stockman. Computer Vision, pp. 279–325. New Jersey: Prentice-Hall, 2001. ISBN 0-13-030796-3.
[2] Barghout, Lauren, and Lawrence W. Lee. “Perceptual information processing system.” Paravue Inc. U.S. Patent Application 10/618,543, filed July 11, 2003.
[3] Brostow, G. J., J. Fauqueur, and R. Cipolla. “Semantic object classes in video: A high-definition ground truth database.” Pattern Recognition Letters, Vol. 30, Issue 2, 2009, pp. 88–97.
[4] Long, J., E. Shelhamer, and T. Darrell. “Fully Convolutional Networks for Semantic Segmentation.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[5] Simonyan, K., and A. Zisserman. “Very Deep Convolutional Networks for Large-Scale Image Recognition.” arXiv:1409.1556, 2014.
[6] Badrinarayanan, V., A. Kendall, and R. Cipolla. “SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation.” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 39, Issue 12, 2017.
[7] Ronneberger, O., P. Fischer, and T. Brox. “U-Net: Convolutional Networks for Biomedical Image Segmentation.” MICCAI, 2015.
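The talk's implementation uses MATLAB and its toolboxes; purely as a hedged, framework-neutral illustration of the FCN-8s idea (a VGG-16 backbone, per-class 1x1 scoring convolutions, skip connections from pool3/pool4, and learned upsampling), here is a minimal PyTorch sketch. The class count and layer slicing are assumptions, not the presenters' code.

```python
# Hedged PyTorch sketch of FCN-8s [4] on a VGG-16 [5] backbone.
# NUM_CLASSES = 11 is an assumption (a common CamVid class count).
import torch
import torch.nn as nn
import torchvision

NUM_CLASSES = 11

class FCN8s(nn.Module):
    def __init__(self, num_classes=NUM_CLASSES):
        super().__init__()
        vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1")
        f = vgg.features
        self.to_pool3 = f[:17]    # stride-8 features (256 channels)
        self.to_pool4 = f[17:24]  # stride-16 features (512 channels)
        self.to_pool5 = f[24:]    # stride-32 features (512 channels)
        # 1x1 convolutions score each feature map per class.
        self.score5 = nn.Conv2d(512, num_classes, 1)
        self.score4 = nn.Conv2d(512, num_classes, 1)
        self.score3 = nn.Conv2d(256, num_classes, 1)
        # Learned upsampling fuses the skip connections (the "8s" in FCN-8s).
        self.up2a = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2, padding=1)
        self.up2b = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2, padding=1)
        self.up8 = nn.ConvTranspose2d(num_classes, num_classes, 16, stride=8, padding=4)

    def forward(self, x):  # x: (N, 3, H, W) with H and W divisible by 32
        p3 = self.to_pool3(x)
        p4 = self.to_pool4(p3)
        p5 = self.to_pool5(p4)
        s = self.up2a(self.score5(p5)) + self.score4(p4)
        s = self.up2b(s) + self.score3(p3)
        return self.up8(s)  # per-pixel class scores at input resolution

logits = FCN8s()(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 11, 224, 224])
```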

Ray Project: Business Perspectives

This talk examines business perspectives about the Ray Project from RISELab, hailed as a successor to Apache Spark. Ray is a simple-to-use open source library in Python or Java, which provides multiple patterns for distributed systems: mix and match as needed for a given business use case – without tight coupling of applications with underlying frameworks. Warning: this talk may change the way your organization approaches AI.


Nearly 15 years ago, the speaker served as a “sounding board” for ideas about a new kind of computing service model, subsequently known as cloud computing. What has changed in a decade and a half? To paraphrase the UC Berkeley RISELab, one of the fundamental changes underway circa 2020 is: “Pay to execute a block of code, rather than pay for allocating resources on which to execute code.” That may sound simple, but as shown by the commercial success of services such as Snowflake, the business implications can be staggering.

An important observation we’ll explore is that hardware is now evolving more rapidly than software, which in turn is evolving more rapidly than process. The economics of data analytics circa 2005 and the hardware used then – which shaped Big Data frameworks such as Hadoop, Spark, etc. – addressed the needs of ecommerce workloads such as log file aggregation at scale. Today, in a time of AI adoption, the most valuable IT workloads must address a new set of needs: differentiating gradients within the context of networked data. Hardware and software have both changed dramatically to address these new workloads, as seen by TensorFlow, PyTorch, etc. Process … not so much.


Ray supplies a much-needed control layer for distributing and optimizing workloads across hybrid architectures, while being mindful of the economics of computing. This concept, dubbed the “infinite laptop,” is especially important for computing needs such as deep learning. It becomes even more crucial for the growing segment of AI technologies that require more sophisticated computing, such as reinforcement learning, AutoML, and knowledge graphs. Moreover, the use of Ray fits into existing software engineering processes more seamlessly than prior generations of distributed systems. We’ll draw from primary material by industry thought leaders such as Ion Stoica, David Patterson, Jeff Dean, Ben Lorica, and others to look at the architectural implications, as well as consider large use cases in business.
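As a hedged taste of what “simple-to-use” means in practice, here is a minimal Python sketch of Ray's remote-task pattern; the workload itself is illustrative.

```python
# Minimal sketch of Ray's remote-task pattern; the workload is illustrative.
import ray

ray.init()  # starts a local cluster; use ray.init(address="auto") to join one

@ray.remote
def score(batch):
    # Placeholder for any business workload, e.g. scoring a model on a batch.
    return sum(batch) / len(batch)

# Fan out ten tasks across the cluster, then gather the results.
futures = [score.remote(list(range(i, i + 10))) for i in range(0, 100, 10)]
print(ray.get(futures))
```

The same decorator mechanism extends to stateful actors and to Ray's higher-level libraries, which is the "mix and match" of patterns described above.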


From Acorn to Oak: Seeding Federated Learning with Physical Models

Federated learning promises to improve data security, model accuracy and system resilience. Operational challenges dominate the time required to bring these promises to production: obtaining training data, comparing learning strategies and maintaining model integrity despite network unreliability. Techniques to address each of these problems are well-known: generating training data from physically accurate models, for example. But addressing each of these issues with individual applications creates inefficiencies: data scientists and architects must navigate a complex collection of parts rather than a seamless, integrated solution. An efficient system must integrate best-in-class services and minimize or eliminate the boundaries between them.

We develop, tune and deploy an anomaly-detecting machine learning model to demonstrate the enterprise benefits of streaming data from physical models into a federated learning architecture. Accurate physical models produce multiple training data sets, each training a single machine learning model to recognize a specific anomaly. Federated learning combines the individual machine learning models into a robust, production-ready classifier. Integrating streaming data into the development process mimics the production environment, enabling data scientists to validate their solutions under real-world conditions. A single platform for developing training, federated learning and classification algorithms enables a rapid feedback loop for model evolution, and sharing a networked datastore between the development environment and the production system provides a mechanism for continuous training and redeployment of improved models.

Our system uses MATLAB for algorithm development and validation, Kafka for streaming data management, MATLAB Production Server to host classification algorithms, Redis for machine learning model deployment and Grafana to monitor the production system and display alerts for detected anomalies. Simulink models provide physically accurate synthetic data for the training data sets. We show how on-premises hosting speeds development and then scale the solution horizontally via integration with cloud-based platforms.

We present both our architecture and a demonstration of the system in development and production. We will walk through the end-to-end workflow, with particular emphasis on the integration of streaming data into the development environment and the benefits to data scientists of a simulated production environment. We show how physical models accelerate bootstrapping the system by providing training data without requiring access to real-world assets, and how model parameterization allows injection of behavioral anomalies into the data stream without damaging or destroying those assets.

We discuss the system in the context of MLOps, highlighting operational successes and areas for future growth. In particular, the use of design principles such as dependency inversion allowed us to create a production-quality architecture focused on system integration and cooperation. Throughout, we emphasize the importance of knowing your core competencies and competitive advantages and using that understanding to choose between software development and component integration. We identify the strengths of our platform (algorithm development and physical model-based design) and show how that knowledge shaped the architecture of a federated machine learning system. Separating configuration from code, for example, was particularly important: provisioning strategies like infrastructure as code require architectures to be externally configurable, but designing an externally configurable system requires additional effort to choose, name and scope configuration parameters. We conclude with a summary of the effect of such architectural tradeoffs in an operational system as they inform the system’s evolution.
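As a hedged sketch of the central aggregation step only: the talk's system is built on MATLAB, Kafka, Redis and Grafana, so the Python below is purely illustrative, and the weighted-averaging scheme (federated averaging) and sample counts are assumptions.

```python
# Hedged sketch of combining per-anomaly models via federated averaging.
import numpy as np

def federated_average(models, weights=None):
    """Combine per-model parameter vectors into a single global model.

    models:  list of 1-D parameter arrays, one per anomaly-specific model.
    weights: optional importance weights, e.g. local training-set sizes.
    """
    weights = np.ones(len(models)) if weights is None else np.asarray(weights, float)
    weights = weights / weights.sum()
    return sum(w * m for w, m in zip(weights, models))

# Example: three detectors trained on synthetic data from physical models,
# weighted by how many simulated samples each one saw.
local_models = [np.random.randn(10) for _ in range(3)]
global_model = federated_average(local_models, weights=[1000, 250, 500])
print(global_model.shape)  # (10,)
```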

Accelerating repurposing of drugs for COVID-19 through Quantum-Inspired Computing

The recent pandemic has already damaged economies around the world and demands a fast, accurate response. There are several ways of tackling the virus, ranging from blocking its entry into cells to inhibiting its replication; either way, a treatment is urgently needed. Given the length of time required for a new drug to be approved, repurposing already-approved drugs is a valuable option for accelerating the drug discovery process.

Virtual screening plays an important role in the early stages of drug discovery. The process generally takes a long time to execute, since it typically relies on measuring similarities among molecules, a computationally heavy and expensive exercise and a major challenge for today’s computers. Most of the well-known methods for this type of evaluation use 2D molecular fingerprints to encode structural information. Although efficient in terms of execution time, these methods fail to consider relevant aspects of molecular structure.
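As a hedged illustration of the 2D-fingerprint baseline just described: a standard similarity measure over such fingerprints is the Tanimoto coefficient. The fingerprints below are toy bit sets, not a real encoding such as ECFP.

```python
# Minimal sketch of 2D-fingerprint similarity via the Tanimoto coefficient,
# with fingerprints represented as sets of "on" bit positions.
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: |A intersect B| / |A union B|."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

mol_a = {1, 4, 7, 9, 15}  # toy fingerprint of an approved drug
mol_b = {1, 4, 8, 9, 21}  # toy fingerprint of a reference compound
print(f"{tanimoto(mol_a, mol_b):.2f}")  # 3 shared bits / 7 total = 0.43
```

This is fast but, as the paragraph notes, it sees only which substructure bits are set, not the 3D geometry of the molecules.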

Considering the 3D structural properties of molecules increases the accuracy of the results, at the expense of higher computing times. Using the Digital Annealer, the mathematical model can handle this kind of information with shorter execution times. Additionally, the solutions provided by the Digital Annealer report both the percentage of similarity between the molecules being compared and the specific domains that are similar. The latter information is key in helping experts review the results and better informs decision making for further validation, significantly reducing turnaround time and optimizing the entire process.
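The abstract does not spell out the Digital Annealer's mathematical model, but such annealers minimize quadratic unconstrained binary optimization (QUBO) problems. As a hedged, toy illustration: the energy of a binary vector x is xᵀQx, and the annealer searches for the minimizing assignment. The matrix below is invented for clarity and brute-forced rather than annealed.

```python
# Hedged, toy illustration of the QUBO form a digital annealer minimizes:
# energy(x) = x^T Q x over binary vectors x. A real annealer replaces the
# brute-force loop below with massively parallel stochastic search.
import itertools
import numpy as np

def solve_qubo_bruteforce(Q):
    n = Q.shape[0]
    best_x, best_e = None, float("inf")
    for bits in itertools.product([0, 1], repeat=n):
        x = np.array(bits)
        e = float(x @ Q @ x)
        if e < best_e:
            best_x, best_e = x, e
    return best_x, best_e

# Toy Q: diagonal terms reward selecting matched 3D features; off-diagonal
# penalties forbid inconsistent matches (all values are illustrative only).
Q = np.array([
    [-2.0,  4.0,  0.0],
    [ 0.0, -1.5,  4.0],
    [ 0.0,  0.0, -1.0],
])
x, e = solve_qubo_bruteforce(Q)
print(x, e)  # selects features 0 and 2 (energy -3.0), avoiding conflicts
```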

To accelerate this research, the project has been carried out in collaboration with the Department of Infectious Diseases at King’s College London. King’s College London and Fujitsu are collaborating, using Fujitsu’s quantum-inspired technology, the Digital Annealer, to find similarities between already-approved molecules and the desired properties of future COVID-19 treatments.

Streaming Machine Learning with Apache Kafka and without another Data Lake

Machine learning (ML) is separated into model training and model inference. ML frameworks typically use a data lake like HDFS or S3 to process historical data and train analytic models. Model inference and monitoring at production scale in real time is another common challenge when using a data lake. But it is possible to avoid such a data store entirely by using an event streaming architecture. This talk compares the modern approach to traditional batch and big data alternatives and explains its benefits: a simplified architecture, the ability to reprocess events in the same order when training different models, and the possibility of building a scalable, mission-critical ML architecture for real-time predictions with far fewer headaches and problems. The talk explains how this can be achieved by leveraging Apache Kafka, Tiered Storage, and TensorFlow.
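As a hedged illustration of the pattern (embedding model inference directly in the event stream instead of round-tripping through a data lake), here is a minimal Python sketch using the kafka-python client and TensorFlow. The topic names, message schema, and model file are assumptions.

```python
# Hedged sketch of real-time model inference inside a Kafka event stream.
import json
import numpy as np
import tensorflow as tf
from kafka import KafkaConsumer, KafkaProducer  # kafka-python client

model = tf.keras.models.load_model("model.h5")  # assumed pre-trained model

consumer = KafkaConsumer(
    "sensor-events",  # assumed input topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Score each incoming event and publish the prediction to an output topic.
for record in consumer:
    features = np.array([record.value["features"]])
    score = float(model.predict(features, verbose=0)[0][0])
    producer.send("predictions", {"id": record.value["id"], "score": score})
```

Because Kafka retains the event log (and Tiered Storage extends that retention cheaply), the same stream can later be replayed in order to train or retrain different models, which is the reprocessing benefit described above.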

From Containers to Kubernetes Operators for Datastores

“Containers are the new ZIP format to distribute software” is a fitting description of today’s development world. However, it is not always that easy, and this talk highlights the development of a container strategy for datastores over time:

  • Docker images: A new distribution model.
  • Docker Compose: Local demos and a little more.
  • Helm Chart: Going from demo to production.
  • Kubernetes Operator: Full control over upgrades, scaling, and more.

Besides the strategy, we also discuss specific technical details and hurdles that appeared during development, and why the future will (for now) be a combination of Helm chart and Operator.
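As a hedged sketch of that final Operator stage (not the talk's actual codebase), a minimal operator can be written in Python with the Kopf framework; the custom resource group, version, and plural below are hypothetical.

```python
# Minimal sketch of a Kubernetes operator using the Kopf framework.
# The custom resource (example.com/v1, plural "datastores") is hypothetical.
import kopf

@kopf.on.create("example.com", "v1", "datastores")
def create_fn(spec, name, namespace, logger, **kwargs):
    replicas = spec.get("replicas", 1)
    logger.info(f"Creating datastore {name} in {namespace} with {replicas} replicas")
    # A real operator would create StatefulSets and Services via the
    # Kubernetes API here, and reconcile them on update/resume events.
    return {"phase": "Created"}  # Kopf stores this in the resource's status
```

Run with `kopf run operator.py` against a cluster where the matching CustomResourceDefinition exists; the handler then fires whenever a `datastores` resource is created, which is the "full control" the list item above refers to.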