论文标题
Fetilda:长期财务文本文档的鳍状嵌入框架的有效框架
FETILDA: An Effective Framework For Fin-tuned Embeddings For Long Financial Text Documents
论文作者
论文摘要
非结构化数据,尤其是文本,在各个领域继续迅速增长。特别是,在金融领域,有大量累积的非结构化财务数据,例如公司定期向监管机构提交的文本披露文件,例如证券和交易委员会(SEC)。这些文档通常很长,并且倾向于包含有关公司绩效的宝贵信息。因此,从这些长文本文档中学习预测模型是非常兴趣的,尤其是用于预测数字关键绩效指标(KPI)。尽管在训练有素的语言模型(LMS)中取得了长足的进步,这些模型从大量的文本数据中学习,但他们仍然在有效的长期文档表示方面挣扎。我们的工作满足了这种批判性的需求,即如何开发更好的模型来从长文本文档中提取有用的信息,并学习有效的功能,这些功能可以利用软件财务和风险信息进行文本回归(预测)任务。在本文中,我们提出并实施了一个深度学习框架,该框架将长文档分为大块,并利用预先培训的LMS处理和将块汇总为矢量表示,然后进行自我注意力,以提取有价值的文档级特征。我们根据美国银行的10-K公共披露报告以及美国公司提交的另一个报告的数据集评估了我们的模型。总体而言,我们的框架优于文本建模的强大基线方法以及仅使用数值数据的基线回归模型。我们的工作提供了更好的见解,即如何利用预先训练的域特异性和微调的长输入LMS代表长文档可以提高文本数据的表示质量,从而有助于改善预测分析。
Unstructured data, especially text, continues to grow rapidly in various domains. In particular, in the financial sphere, there is a wealth of accumulated unstructured financial data, such as the textual disclosure documents that companies submit on a regular basis to regulatory agencies, such as the Securities and Exchange Commission (SEC). These documents are typically very long and tend to contain valuable soft information about a company's performance. It is therefore of great interest to learn predictive models from these long textual documents, especially for forecasting numerical key performance indicators (KPIs). Whereas there has been a great progress in pre-trained language models (LMs) that learn from tremendously large corpora of textual data, they still struggle in terms of effective representations for long documents. Our work fills this critical need, namely how to develop better models to extract useful information from long textual documents and learn effective features that can leverage the soft financial and risk information for text regression (prediction) tasks. In this paper, we propose and implement a deep learning framework that splits long documents into chunks and utilizes pre-trained LMs to process and aggregate the chunks into vector representations, followed by self-attention to extract valuable document-level features. We evaluate our model on a collection of 10-K public disclosure reports from US banks, and another dataset of reports submitted by US companies. Overall, our framework outperforms strong baseline methods for textual modeling as well as a baseline regression model using only numerical data. Our work provides better insights into how utilizing pre-trained domain-specific and fine-tuned long-input LMs in representing long documents can improve the quality of representation of textual data, and therefore, help in improving predictive analyses.