Talk | Technical | English

When looking for applications of Natural Language Processing models, the legal sector clearly stands out as a prime candidate. Professional lawyers can spend a significant share of their time organizing and reviewing large volumes of documents, either on paper or as digital scans of varying quality. Automated tools that help classify and navigate all this information are therefore an invaluable aid in reducing time and costs.


In the last few years, the advent of large language models such as BETO or GPT has greatly broadened the applicability and efficacy of Natural Language Processing (NLP) solutions. Using a language model as the foundation of an NLP solution has made it possible to produce state-of-the-art results in highly complex tasks such as machine translation, question answering, summarization, language generation and many others. Language models seem to have become a kind of magic wand that can solve any NLP task, as long as one chooses the correct pre-trained model and tunes it appropriately. And certainly, the results that can be achieved on open datasets by following this recipe are impressive. But when the rubber hits the road in actual applications, real-world data proves to be far more difficult to handle: poor-quality scans, documents running to hundreds or thousands of pages, and the lack of public datasets for the problem at hand are just a few of the challenges that must be overcome.


In this talk we will present the details of “Mapa del Expediente”, a joint R&D project between IIC and the law firm Garrigues. The project applies the latest advances in Spanish language models to organize and classify all the documentary information relating to a case, helping the lawyer navigate and review it.


The system we developed can work with raw PDF files made up of image scans, running to thousands of pages and combining in a single file many different kinds of documents with no index or clear-cut boundaries. Using a pipeline of custom language models, optical character recognition and preprocessing tools, our system is able to extract digital text from the PDF file, discard pages with no useful information, break the file down into its logical documents, classify each of them into a taxonomy, and detect mentions of relevant entities such as persons or companies. This produces a highly structured version of the case files, which can then be integrated with a fuzzy search engine and visualization tools to allow easy navigation through all the information, as well as to produce graphs revealing the connections between the individuals, organizations and documents in the case.
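
As a rough illustration of how such a pipeline fits together, the Python sketch below chains the stages that follow OCR: page filtering, document splitting, classification and entity detection. It is not the project's implementation. Every name in it (`Document`, `drop_uninformative`, `split_documents`, `toy_starts_new_doc`, `toy_classify`, `toy_entities`, `run_pipeline`, `SAMPLE_PAGES`) is hypothetical, the input is assumed to be text already extracted by OCR, and simple rules and a regular expression stand in for the custom language models.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Document:
    """One logical document recovered from the scanned case file."""
    pages: List[str]
    label: str = ""
    entities: List[str] = field(default_factory=list)


def drop_uninformative(pages: List[str], min_chars: int = 20) -> List[str]:
    """Discard pages whose OCR text is too short to carry information."""
    return [p for p in pages if len(p.strip()) >= min_chars]


def split_documents(pages: List[str],
                    starts_new_doc: Callable[[str], bool]) -> List[Document]:
    """Group consecutive pages into logical documents."""
    docs: List[Document] = []
    for page in pages:
        if not docs or starts_new_doc(page):
            docs.append(Document(pages=[page]))
        else:
            docs[-1].pages.append(page)
    return docs


def toy_starts_new_doc(page: str) -> bool:
    # Assumption: a page opening with an all-uppercase title starts a document.
    # A learned boundary detector would replace this rule.
    return page.strip().splitlines()[0].isupper()


def toy_classify(doc: Document) -> str:
    # Assumption: keyword rules stand in for a fine-tuned classifier.
    text = " ".join(doc.pages).lower()
    if "contract" in text:
        return "contract"
    if "invoice" in text:
        return "invoice"
    return "other"


# Assumption: only company names ending in "S.L." or "S.A." are detected;
# a real system would use a named-entity recognition model.
COMPANY_RE = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)* S\.[AL]\.")


def toy_entities(doc: Document) -> List[str]:
    return COMPANY_RE.findall("\n".join(doc.pages))


def run_pipeline(pages: List[str]) -> List[Document]:
    """Filter pages, split into documents, then classify and tag each one."""
    docs = split_documents(drop_uninformative(pages), toy_starts_new_doc)
    for doc in docs:
        doc.label = toy_classify(doc)
        doc.entities = toy_entities(doc)
    return docs


SAMPLE_PAGES = [
    "SERVICE CONTRACT\nThis contract is signed between Acme Iberia S.L. and the client.",
    "Clause 2. The contract term is twelve months from signature.",
    "   ",
    "INVOICE 0042\nInvoice issued by Beta Legal S.A. for services rendered.",
]

if __name__ == "__main__":
    for d in run_pipeline(SAMPLE_PAGES):
        print(d.label, len(d.pages), d.entities)
```

Keeping each stage as a separate function with a plain input and output means that any rule-based stand-in can be swapped for a trained model without touching the rest of the chain.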


Mapa del Expediente is the product of an interdisciplinary team of experts in computational linguistics, data scientists, computer engineers and lawyers. This has allowed us to create and annotate our own corpora, develop custom tools and fine-tune all language models to the project's needs, which has proven key to its success. As part of this talk we will also introduce LegalBETO, a Spanish language model developed within this project and specialized for the legal domain. LegalBETO produces the best results in benchmarks run on real case file documents, outperforming all publicly available Spanish language models, both general-purpose and legal-domain.