Big Data Spain

17th ~ 18th NOV 2016 MADRID, SPAIN #BDS16

From data to numbers to knowledge: semantic embeddings

Thursday 17th

from 18:15 to 18:30

Theatre 18



One of the main promises of Big Data has been the power to tackle data variety. Unstructured data such as images, natural language or videos should be ingested and analyzed with the same ease as traditional, structured SQL data. Except it's never that easy.

The missing and seldom mentioned link in unstructured data processing is feature computation. Unstructured data cannot be directly processed by standard data analysis and machine learning methods, so a feature generation step is required to transform it into a structured representation. This feature generation process is generally application dependent, and it takes the time of image, video or language experts to achieve optimal results.

A rising trend in machine learning to overcome this problem is to use semantic embeddings of objects of any kind, mapping text or images to a structured vector space where comparison and analysis become easy. Techniques such as word2vec, deep convolutional networks and recurrent networks have produced surprising results in the semantic understanding of words, sentences or documents, and on very high-level tasks such as artistic style representation. More importantly, most of these techniques work in an unsupervised fashion, making it possible to build powerful semantic interpreters from large unlabeled corpora of documents or images.

In this talk I will provide a quick review of the applications of semantic embedding methods. Starting with text analysis, we will see how the simple yet effective word2vec method is able to generate semantic vectors of words, giving rise to the powerful concept of semantic algebra, where simple mathematical operations can be applied to word semantics, such as “king - man + woman = queen”. Pushing this further with the aid of recurrent networks, similar semantic ideas can be applied to whole sentences or documents, finding practical applications in automated translation, language generation, or chatbots.
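The semantic algebra idea can be illustrated with a minimal sketch. The vectors below are hand-made toy values chosen for illustration (an assumption, not embeddings learned by word2vec), and the `analogy` helper is a hypothetical name. It computes a - b + c and returns the remaining word whose vector is closest by cosine similarity.

```python
import math

# Hand-made toy vectors (an assumption for illustration, NOT learned by word2vec).
# Dimensions, in order: royalty, male, female, fruit.
VECTORS = {
    "king":  [1.0, 1.0, 0.0, 0.0],
    "queen": [1.0, 0.0, 1.0, 0.0],
    "man":   [0.0, 1.0, 0.0, 0.0],
    "woman": [0.0, 0.0, 1.0, 0.0],
    "apple": [0.0, 0.0, 0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy(a, b, c, vectors=VECTORS):
    """Return the word closest to a - b + c, excluding the three input words."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


print(analogy("king", "man", "woman"))  # prints "queen"
```

With real embeddings the result of the arithmetic rarely lands exactly on a word vector, which is why the nearest neighbour is taken and the input words are excluded from the search.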

After this, I will present the key ideas behind deep convolutional networks, which can transform an image into a vector useful for recognition and classification tasks, as well as for the previously impossible feat of artistic style transfer between paintings. These techniques can be combined with the text analysis approaches presented earlier, giving rise to new and surprising applications such as automated generation of image captions.

All these embedding methods provide a deeper and more effective way of dealing with complex data sources, opening new business opportunities and making it more true than ever that Big Data is about coping with variety. But the most surprising fact of this “embedding revolution” might be that all of the underlying technologies are readily available as open source libraries, with their foundations thoroughly explained in publicly available papers. The only real requirement for putting them to use is having the right experts, able to turn them into practical and effective solutions.

Álvaro Barbero

Instituto de Ingeniería del Conocimiento

Chief Data Scientist