Paper Title

Sinhala Sentence Embedding: A Two-Tiered Structure for Low-Resource Languages

Paper Authors

Gihan Weeraprameshwara, Vihanga Jayawickrama, Nisansa de Silva, Yudhanjaya Wijeratne

Paper Abstract

In the process of numerically modeling natural languages, developing language embeddings is a vital step. However, it is challenging to develop functional embeddings for resource-poor languages such as Sinhala, for which sufficiently large corpora, effective language parsers, and other required resources are difficult to find. Under such conditions, exploiting existing models to devise an efficacious embedding methodology for numerically representing text can be quite fruitful. This paper explores the effectiveness of several one-tiered and two-tiered embedding architectures in representing Sinhala text in the sentiment analysis domain. Our findings show that a two-tiered embedding architecture, in which the lower tier consists of a word embedding and the upper tier of a sentence embedding, performs better than one-tiered word embeddings, achieving a maximum F1 score of 88.04% in contrast to the 83.76% achieved by word embedding models. Furthermore, embeddings in hyperbolic space are also developed and compared with Euclidean embeddings in terms of performance. A sentiment data set consisting of Facebook posts and associated reactions has been used for this research. To effectively compare the performance of the different embedding systems, the same deep neural network structure has been trained on the sentiment data with each of the embedding systems used to encode the associated text.
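
To make the two-tiered architecture concrete, below is a minimal sketch under illustrative assumptions: gensim FastText stands in for the lower-tier word embedding, and a small Keras LSTM stands in for the upper-tier sentence encoder feeding the sentiment classifier. The toy corpus, dimensions, and model choices are placeholders, not the paper's exact configuration.

```python
# Minimal two-tiered sketch: a lower-tier word embedding (gensim FastText,
# assumed here) feeds per-token vectors into an upper-tier sentence encoder
# (a small Keras LSTM, assumed here), whose output serves as the sentence
# embedding for a sentiment classifier.
import numpy as np
from gensim.models import FastText
from tensorflow import keras
from tensorflow.keras import layers

# Toy tokenized corpus standing in for Sinhala Facebook posts,
# with two reaction-derived sentiment classes.
corpus = [
    ["good", "movie", "today"],
    ["bad", "day", "today"],
    ["great", "movie"],
    ["terrible", "day"],
]
labels = np.array([0, 1, 0, 1])

# Tier 1: train the word embedding on the raw corpus.
dim = 32
ft = FastText(sentences=corpus, vector_size=dim, window=3,
              min_count=1, epochs=20)

# Encode each post as a zero-padded sequence of word vectors.
max_len = max(len(s) for s in corpus)
X = np.zeros((len(corpus), max_len, dim), dtype="float32")
for i, sent in enumerate(corpus):
    for j, tok in enumerate(sent):
        X[i, j] = ft.wv[tok]

# Tier 2: the LSTM pools word vectors into a sentence embedding,
# which a softmax head maps to sentiment classes.
clf = keras.Sequential([
    keras.Input(shape=(max_len, dim)),
    layers.Masking(mask_value=0.0),         # ignore zero padding
    layers.LSTM(16),                        # upper-tier sentence embedding
    layers.Dense(2, activation="softmax"),  # sentiment classifier head
])
clf.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])
clf.fit(X, labels, epochs=10, verbose=0)
```

The LSTM here is only one possible upper tier; averaging the word vectors or a Doc2Vec-style encoder would fill the same slot. The hyperbolic variant compared in the paper would replace the Euclidean word vectors with ones trained in a hyperbolic (Poincaré ball) space; this sketch covers only the Euclidean path.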
