
Disentangling risks, activity and performance through US corporate reports

Business talk | English

Theatre 17: Track 2

Thursday - 15.25 to 16.05 - Business


- - -

Big Data and data science techniques allow us to measure and analyze text using natural language processing (NLP), also known as text mining or computational linguistics. Information encoded as text can complement and enrich the structured databases traditionally used in economic research. Using statistical techniques and computational tools, we quantify text by extracting meaning from words; that is, we convert text into data. This novel approach, which combines traditional economic tools with emerging Big Data methods, has many applications and great potential for economic research. In this project, we analyze risk interconnectedness, uncertainty and economic performance in the US through corporate reports, using advanced natural language processing and machine learning techniques. We develop a set of indicators for economic, risk and sectoral analysis in order to gain a deeper understanding of the US economy and business performance, based on the idiosyncratic narrative of the reports submitted by US companies from 1995 to the present.

We obtain the US corporate reports from the filings submitted by US companies to the SEC (Securities and Exchange Commission), which serves as a repository of over 21 million corporate filings. Federal securities laws require publicly reporting companies to disclose information on an ongoing basis. We use quarterly reports (Form 10-Q), which include unaudited financial statements and provide a continuing view of a company's financial position during the year.
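As an illustration of this retrieval step, the SEC publishes quarterly form indexes listing every filing in a simple columnar text format; a minimal sketch of filtering Form 10-Q entries from such an index might look like the following (the sample lines and the column layout are illustrative assumptions, not real filings):

```python
import re

# A few lines in the style of an EDGAR quarterly form index
# (illustrative sample data, not real filings).
SAMPLE_INDEX = """\
10-K        Example Corp             1234567  1999-03-31  edgar/data/1234567/0001.txt
10-Q        Example Corp             1234567  1999-05-15  edgar/data/1234567/0002.txt
10-Q        Another Industries Inc   7654321  1999-05-14  edgar/data/7654321/0003.txt
"""

def extract_10q_entries(index_text):
    """Return (company, cik, date, path) tuples for 10-Q filings only."""
    entries = []
    for line in index_text.splitlines():
        # Columns are separated by runs of two or more spaces,
        # so single spaces inside company names are preserved.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) == 5 and fields[0] == "10-Q":
            form, company, cik, date, path = fields
            entries.append((company, cik, date, path))
    return entries

entries = extract_10q_entries(SAMPLE_INDEX)
# Keeps only the two 10-Q rows, dropping the 10-K.
```

In practice the referenced documents would then be downloaded and their text sections extracted before any linguistic processing.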

With the massive amount of extracted text, we apply a wide range of natural language processing algorithms to clean, process and test the data. We then use computational linguistics and unsupervised learning to identify the topics most discussed in the reports. Specifically, we use dynamic topic modeling based on Latent Dirichlet Allocation (LDA) (Blei, Ng, and Jordan 2003), a Bayesian model that assumes each document is generated by a latent mixture of topics, with the mixtures following a Dirichlet prior distribution. To introduce time-series dependencies into the data-generating process, we use the dynamic topic model (DTM), a particular case of the Structural Topic Model (STM) in which each time period has a separate topic model and periods are linked via smoothly evolving parameters.
Moreover, we apply sentiment analysis, using both lexicon-based and machine learning techniques, to capture the perception of those topics and its evolution over time. This gives us a better picture of the main concerns of US businesses, how they evolve over time and their impact on economic activity.
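The lexicon approach can be sketched as follows; the word lists here are a tiny illustrative subset in the spirit of finance-oriented lexicons such as Loughran-McDonald, not the actual dictionaries used in the project:

```python
# Tiny illustrative positive/negative word lists (assumed, not the real lexicon).
POSITIVE = {"growth", "profit", "improvement", "gain"}
NEGATIVE = {"loss", "risk", "decline", "litigation", "uncertainty"}

def lexicon_sentiment(text):
    """Net sentiment in [-1, 1]: (positive - negative) / matched words."""
    tokens = text.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    matched = pos + neg
    return 0.0 if matched == 0 else (pos - neg) / matched

score = lexicon_sentiment("Revenue growth offset the litigation loss and decline")
# growth: +1; litigation, loss, decline: -3  ->  (1 - 3) / 4 = -0.5
```

Averaging such scores over the documents assigned to a topic in each quarter yields a sentiment series per topic.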

Additionally, we use word2vec, a word-embedding method based on a shallow neural network, to identify the words most closely related to selected risk, activity and uncertainty keywords. This methodology helps us build several dictionaries for monitoring strategic indicators of US economic performance over time. This new set of text-based information offers highly granular data, filling the gaps between official data releases and improving our understanding of the behavior and driving forces of US business performance.

Given the size and firm heterogeneity of our sample, we analyze the main identified topics by sector of activity according to the Standard Industrial Classification used by the SEC, thus identifying both sector-specific risks and broader risks common to several sectors.

The methodologies and tools we have developed prove highly useful for assessing risk management and for understanding the behavior and driving forces of business performance. Many signals precede a downturn, and tracking these trends can help anticipate it, making this an important early warning tool for economic analysis.