Paper Title

Efficient Classification of Long Documents Using Transformers

Authors

Hyunji Hayley Park, Yogarshi Vyas, Kashif Shah

Abstract

Several methods have been proposed for classifying long textual documents using Transformers. However, there is a lack of consensus on a benchmark to enable a fair comparison among different approaches. In this paper, we provide a comprehensive evaluation of the relative efficacy measured against various baselines and diverse datasets -- both in terms of accuracy as well as time and space overheads. Our datasets cover binary, multi-class, and multi-label classification tasks and represent various ways information is organized in a long text (e.g. information that is critical to making the classification decision is at the beginning or towards the end of the document). Our results show that more complex models often fail to outperform simple baselines and yield inconsistent performance across datasets. These findings emphasize the need for future studies to consider comprehensive baselines and datasets that better represent the task of long document classification to develop robust models.
