Language Models Fine-Tuning for Automatic Format Reconstruction of SEC Financial Filings / Lombardo, G.; Trimigno, G.; Pellegrino, M.; Cagnoni, S.. - In: IEEE ACCESS. - ISSN 2169-3536. - 12:(2024), pp. 31249-31261. [10.1109/ACCESS.2024.3370444]
Language Models Fine-Tuning for Automatic Format Reconstruction of SEC Financial Filings
Lombardo G. (Methodology); Trimigno G. (Software); Pellegrino M. (Validation); Cagnoni S. (Supervision)
2024-01-01
Abstract
The analysis of financial reports is a crucial task for investors and regulators, especially of the mandatory annual reports (10-K filings) that the SEC (Securities and Exchange Commission) requires from public companies in the American stock market and that provide crucial information about them. Although the SEC suggests a specific document format to standardize and simplify analysis, in recent years several companies have introduced their own format and content organization, making both human and automatic knowledge extraction inherently more difficult. In this research work, we investigate different Transformer-based neural language models (bidirectional, autoregressive, and autoencoder-based approaches) to automatically reconstruct an SEC-like format of the documents, framed as a sentence-level multi-class classification task with 18 classes. In particular, we propose a bidirectional fine-tuning procedure to specialize pre-trained language models on this task. We make the resulting novel transformer model, named SEC-former, publicly available. We evaluate SEC-former in three different scenarios: 1) topic-detection performance; 2) document similarity (TF-IDF bag-of-words and Doc2Vec) with respect to original, trusted financial reports, since this operation is leveraged for portfolio-optimization tasks; and 3) a real use case involving a public company that does not follow the SEC format but provides a human-supervised reference for reconstructing it.
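The TF-IDF bag-of-words similarity mentioned in the second evaluation scenario can be illustrated with a minimal, standard-library-only sketch that compares a reconstructed report against its original via cosine similarity. The tokenization (plain whitespace split), the IDF weighting variant, and the example sentences below are illustrative assumptions, not the authors' actual pipeline:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TF-IDF vectors over a shared vocabulary for a list of tokenized docs."""
    n = len(docs)
    vocab = sorted({t for doc in docs for t in doc})
    # document frequency and a smoothed log-IDF (one of several common variants)
    df = {t: sum(1 for doc in docs if t in doc) for t in vocab}
    idf = {t: math.log(n / df[t]) + 1.0 for t in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts as term frequency
        vectors.append([tf[t] * idf[t] for t in vocab])
    return vectors

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical fragments of a reconstructed vs. an original 10-K section
reconstructed = "item 1a risk factors the company faces market risk".split()
original = "item 1a risk factors the firm faces market risk".split()

vecs = tfidf_vectors([reconstructed, original])
similarity = cosine(vecs[0], vecs[1])
```

A high cosine score between the reconstructed document and the trusted original indicates that the reconstruction preserves the report's lexical content; Doc2Vec plays the analogous role with dense paragraph embeddings instead of sparse term vectors.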