Paper title
Unsupervised Word Polysemy Quantification with Multiresolution Grids of Contextual Embeddings
Paper authors
Paper abstract
The number of senses of a given word, or polysemy, is a very subjective notion, which varies widely across annotators and resources. We propose a novel method to estimate polysemy, based on simple geometry in the contextual embedding space. Our approach is fully unsupervised and purely data-driven. We show through rigorous experiments that our rankings are well correlated (with strong statistical significance) with 6 different rankings derived from well-known human-constructed resources such as WordNet, OntoNotes, Oxford, and Wikipedia, for 6 different standard metrics. We also visualize and analyze the correlation between the human rankings. A valuable by-product of our method is the ability to sample, at no extra cost, sentences containing different senses of a given word. Finally, the fully unsupervised nature of our method makes it applicable to any language. Code and data are publicly available at https://github.com/ksipos/polysemy-assessment. The paper was accepted as a long paper at EACL 2021.
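The abstract does not spell out the geometry, so the following is only a minimal sketch of one plausible reading of "multiresolution grids": measure how much of a low-dimensional embedding space a word's contextual vectors occupy, at several grid resolutions. The function name `grid_coverage_score`, the `levels` parameter, the rescaling of coordinates to [0, 1], and the per-level weighting are all assumptions made for illustration, not details taken from the paper or its repository.

```python
from typing import Sequence


def grid_coverage_score(points: Sequence[Sequence[float]], levels: int = 3) -> float:
    """Hypothetical polysemy score: weighted share of occupied grid cells.

    `points` are the dimension-reduced contextual vectors of one word,
    with every coordinate already rescaled to [0, 1] (an assumption).
    At level l each axis is split into 2**l bins.
    """
    if not points:
        return 0.0
    dim = len(points[0])
    score = 0.0
    total_weight = 0.0
    for level in range(1, levels + 1):
        bins = 2 ** level
        # Map each point to the index of the cell that contains it;
        # a coordinate of exactly 1.0 is clamped into the last bin.
        occupied = {
            tuple(min(int(x * bins), bins - 1) for x in p)
            for p in points
        }
        coverage = len(occupied) / bins ** dim
        # Assumed weighting: finer grids count more than coarse ones.
        weight = 2.0 ** (level - levels)
        score += weight * coverage
        total_weight += weight
    return score / total_weight


# Vectors packed into one region versus vectors spread over four regions.
tight = [(0.10, 0.10), (0.12, 0.11), (0.13, 0.10)]
spread = [(0.1, 0.1), (0.9, 0.9), (0.1, 0.9), (0.9, 0.1)]
print(grid_coverage_score(tight) < grid_coverage_score(spread))
```

Under this reading, a word whose occurrences scatter across many cells would rank as more polysemous than one whose occurrences stay in a single cell, and the occupied cells themselves give a natural way to pick sentences illustrating different senses.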