17 November. 16.45 - 17.25 | Garage

Natural Language Processing (NLP) is nowadays one of the main focus areas of artificial intelligence and machine learning. While conversational agents such as Siri or Alexa are the most visible representatives of NLP, the field finds wide application in search engines, chatbots, customer service, opinion mining, and more. The high levels of success that NLP solutions have achieved in recent years are mostly fueled by three factors: the public availability of very large datasets (corpora) of web text, the rapid scaling of specialized hardware (GPUs and TPUs), and improvements in deep learning models adapted for language.

Focusing on this last point, the so-called “language models” have proven quite effective at leveraging the large datasets available. A language model is a deep artificial neural network trained on unlabeled corpora with the aim of modelling the distribution of words (or word pieces) in a particular language. In this way, although trained in an unsupervised fashion, a language model is able to perform NLP tasks such as filling gaps in sentences or generating text following a cue. Furthermore, large language models such as GPT-2, RoBERTa, T5 or BART have proven quite effective when used as foundations for supervised models addressing more specific, downstream NLP tasks such as text classification, named entity recognition or textual entailment. Further specialized language models such as DialoGPT, Pegasus or mBART have achieved even better results on complex tasks such as open-domain conversation, summarization and translation. And the extremely large model GPT-3 has shown impressive results across a wider variety of NLP tasks while being trained in a purely unsupervised manner.

However, most of the language models available as open-source tools focus solely on the English language. While models for other languages do exist (BETO, CamemBERT, RobBERT, GreekBERT, …), they are usually trained on smaller corpora than English models and therefore produce lower-quality results. Multilingual versions of some of the most popular language models also exist, but they usually underperform monolingual models when tested on tasks other than machine translation.

As an interdisciplinary team of experts in data science and computational linguistics, in this talk we will present our experience applying language models to solve NLP tasks in a language other than English: Spanish. Although Spanish is currently spoken by about half a billion people in the world, it falls far behind English in the amount of NLP resources available. The most frequently used Spanish language model is BETO, trained on the Spanish Unannotated Corpora (SUC). While this Spanish-only model produces better results than multilingual models, there is still plenty of room for improvement compared to English models. We will present how we have re-trained BETO on a larger corpus to improve results on downstream NLP tasks. In particular, we have used the OSCAR corpus, about 9 times larger than SUC, together with semi-automated corpus cleaning strategies to improve the BETO model. Results on a variety of text classification and named entity recognition tasks show that this approach is a practical and effective way to produce a better model.
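As an illustration of the gap-filling behaviour described above, the minimal sketch below asks a Spanish masked language model for the most likely words at a masked position. It assumes the Hugging Face transformers library and the publicly released BETO checkpoint dccuchile/bert-base-spanish-wwm-cased; the example sentence is our own and not taken from the talk.

```python
# Minimal sketch: gap filling ("fill-mask") with the public BETO checkpoint.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="dccuchile/bert-base-spanish-wwm-cased")

# The model ranks candidate tokens for the [MASK] position by probability.
for prediction in fill_mask("Madrid es la [MASK] de España."):
    print(prediction["token_str"], round(prediction["score"], 3))
```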
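The sketch below is an illustrative, simplified version of the kind of continued pretraining described in the talk, not the authors' actual pipeline: it streams the Spanish portion of OSCAR from the Hugging Face Hub, applies a crude length filter as a stand-in for the semi-automated cleaning, and continues BETO's masked-language-model training. The dataset identifier, the filtering heuristic and all hyperparameters are assumptions chosen for illustration.

```python
# Illustrative sketch only: continue masked-language-model pretraining of BETO
# on Spanish web text. Dataset id, cleaning filter and hyperparameters are
# placeholder assumptions, not the configuration used for the talk.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

checkpoint = "dccuchile/bert-base-spanish-wwm-cased"  # public BETO checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Spanish split of OSCAR, streamed so the full corpus is never stored locally.
corpus = load_dataset("oscar", "unshuffled_deduplicated_es",
                      split="train", streaming=True)

# Very rough stand-in for the semi-automated cleaning: drop short documents.
corpus = corpus.filter(lambda doc: len(doc["text"].split()) > 20)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["id", "text"])

# Standard masked-language-model objective: mask 15% of tokens and predict them.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="beto-oscar", max_steps=10_000,
                           per_device_train_batch_size=8, learning_rate=5e-5),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```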