
Solving Natural Language problems with scarce data

Technical talk | English

Theatre 21: Track 5

Wednesday - 11.00 to 11.40 - Technical


Boosted by recent advances in Deep Learning, the field of Natural Language Processing (NLP) is flourishing, producing remarkable solutions for highly complex problems such as machine translation, question answering, semantic similarity and textual entailment, to name a few. The key to these successes has been the development of very large neural network models able to capture long-term relationships between words. Unfortunately, large models need large datasets to produce sensible predictions. This makes the approach seemingly impractical for real-life problems where only a handful of texts are available for model training, and especially difficult for languages with barely any open datasets.

The turning point in this trend has been the development of techniques that allow effective transfer learning in NLP models, in the form of Language Models. Transfer learning has been a cornerstone of image processing applications for the last 5 years, making it possible to take very deep neural networks pre-trained on general image databases and apply them to more specific vision tasks. For a long time, however, this approach failed to find success with text data because of the more complex structure of language: characters, words, suffixes, prefixes, long-distance relationships, etc. These hurdles impeded the development of effective transfer learning in NLP until recently.

A Language Model is a large deep network trained in an unsupervised way to model the distribution of words in a given language. It can be thought of as an extension of the now classic Word Embedding techniques (word2vec, GloVe, fastText): it learns not only a representation for each word, but also a complex mixing model that combines the representation of each word with those of its neighboring words, thus producing a contextualized representation. Such a language model can be trained on large, unlabeled corpora such as Wikipedia, Twitter or internet dumps like Common Crawl, which removes the need for manual labeling. The transfer step happens when the language model is later fine-tuned on a specific downstream NLP task. Even if data for the downstream task is scarce, the knowledge transferred from the pre-trained language model can produce high-quality results on complex NLP challenges. As a result, this strategy of language modelling + downstream fine-tuning has become the new standard in NLP.
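The difference between a static word embedding and a contextualized representation can be illustrated with a toy sketch. The 2-dimensional vectors and the `contextualize` function below are hypothetical, hand-picked for illustration only. A real language model learns both the embeddings and the mixing weights from data, and stacks many such layers.

```python
import math

# Hypothetical 2-d static embeddings, hand-picked for illustration only.
STATIC = {
    "river": [1.0, 0.0],
    "money": [0.0, 1.0],
    "bank": [0.5, 0.5],
}


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def contextualize(tokens):
    """One parameter-free attention-style mixing step.

    Each output vector is a softmax-weighted average of the static
    vectors of all words in the sentence, so the result for a word
    depends on its neighbors.
    """
    vecs = [STATIC[t] for t in tokens]
    mixed_vecs = []
    for query in vecs:
        scores = [sum(a * b for a, b in zip(query, key)) for key in vecs]
        weights = softmax(scores)
        mixed = [
            sum(w * v[d] for w, v in zip(weights, vecs))
            for d in range(len(query))
        ]
        mixed_vecs.append(mixed)
    return mixed_vecs


# The static vector for "bank" is the same in both sentences,
# but its contextualized vector is pulled towards its neighbor.
bank_in_river = contextualize(["river", "bank"])[1]
bank_in_money = contextualize(["money", "bank"])[1]
print(bank_in_river, bank_in_money)
```

In this sketch the word "bank" ends up closer to "river" in one sentence and closer to "money" in the other, which a plain lookup table of word vectors cannot express.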

In this talk I will introduce the concept of language models and review some of the state-of-the-art approaches to building such models (BERT, GPT-2 and XLNet), delving into their network architectures and training strategies. Then I will show how these pre-trained language models can be fine-tuned on small datasets to produce high-quality results in downstream NLP tasks, using the open-source PyTorch-Transformers library. This library is built on top of the PyTorch deep learning framework, and makes it easy to load pre-trained language models and fine-tune them.
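The fine-tuning idea can be sketched without any deep learning library. The example below is not the PyTorch-Transformers API: it is a toy stand-in, assuming a hypothetical table of "pre-trained" word vectors in place of a real language model and a four-sentence labeled dataset invented for illustration. It adds a small classification head and updates both the head and the pre-trained vectors, using a smaller learning rate for the latter.

```python
import math

# Hypothetical "pre-trained" 2-d word vectors, standing in for a language model.
PRETRAINED = {
    "great": [0.9, 0.1], "good": [0.8, 0.2],
    "awful": [0.1, 0.9], "bad": [0.2, 0.8],
    "movie": [0.5, 0.5], "plot": [0.5, 0.5],
}

# Tiny labeled downstream dataset (invented): 1 = positive, 0 = negative.
TRAIN = [
    (["great", "movie"], 1),
    (["good", "plot"], 1),
    (["awful", "movie"], 0),
    (["bad", "plot"], 0),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def encode(tokens, emb):
    """Sentence vector = average of its word vectors."""
    dim = len(emb[tokens[0]])
    return [sum(emb[t][d] for t in tokens) / len(tokens) for d in range(dim)]


def predict(tokens, emb, w, b):
    h = encode(tokens, emb)
    return sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)


def mean_loss(data, emb, w, b):
    """Average binary cross-entropy over the dataset."""
    total = 0.0
    for tokens, y in data:
        p = predict(tokens, emb, w, b)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)


def fine_tune(data, pretrained, epochs=200, lr_head=0.5, lr_emb=0.05):
    """Train a new head and gently update a copy of the pre-trained vectors."""
    emb = {t: list(v) for t, v in pretrained.items()}  # keep the original intact
    w, b = [0.0, 0.0], 0.0  # freshly initialized classification head
    for _ in range(epochs):
        for tokens, y in data:
            h = encode(tokens, emb)
            p = sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)
            err = p - y  # gradient of the loss w.r.t. the pre-sigmoid score
            # Update the pre-trained vectors with a small learning rate.
            for t in tokens:
                for d in range(len(w)):
                    emb[t][d] -= lr_emb * err * w[d] / len(tokens)
            # Update the head with a larger learning rate.
            for d in range(len(w)):
                w[d] -= lr_head * err * h[d]
            b -= lr_head * err
    return emb, w, b


loss_before = mean_loss(TRAIN, PRETRAINED, [0.0, 0.0], 0.0)
tuned_emb, head_w, head_b = fine_tune(TRAIN, PRETRAINED)
loss_after = mean_loss(TRAIN, tuned_emb, head_w, head_b)
print(loss_before, loss_after)
```

With a real pre-trained model the structure is similar, but the table of vectors is replaced by a deep network whose weights are loaded from disk, and the head is trained on its contextualized outputs.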

Disclaimer: I’m not a developer of PyTorch-Transformers or the language models described above. Therefore, this talk will be focused on the theoretical grounds of these methods and on my practical experience in applying them.