Neural Compound-Word (Sandhi) Generation and Splitting in Sanskrit Language

Authors

Sushant Dave, Arun Kumar Singh, Prathosh A. P., Brejesh Lall

Abstract

This paper describes neural-network-based approaches to the formation and splitting of compound words in the Sanskrit language, processes known respectively as Sandhi and Vichchhed. Sandhi is essential to the morphological analysis of Sanskrit texts, because it transforms words at their boundaries. The rules of Sandhi formation are well defined but complex, sometimes optional, and in some cases require knowledge about the nature of the words being compounded. Sandhi splitting, or Vichchhed, is an even more difficult task given its non-uniqueness and context dependence. In this work, we formulate the problem as a sequence-to-sequence prediction task, using modern deep learning techniques. Being the first fully data-driven technique, we demonstrate that our model achieves better accuracy than existing methods on multiple standard datasets, despite not using any additional lexical or morphological resources. The code is being made available at https://github.com/IITD-DataScience/Sandhi_Prakarana
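
To make the sequence-to-sequence framing concrete, the sketch below shows one way that Sandhi generation and Sandhi splitting could be posed as character-level source/target pairs. It is an illustration only, not the authors' pipeline: the "+" separator, the helper names (make_pairs, build_vocab, encode) and the two IAST-transliterated examples are assumptions introduced here, and no model is trained.

```python
# Illustrative only: frame Sandhi generation and splitting as
# character-level sequence-to-sequence pairs. The "+" separator and all
# helper names are assumptions, not taken from the paper or its repository.

EXAMPLES = [
    # (first word, second word, compound) in IAST transliteration
    ("vidyā", "ālayaḥ", "vidyālayaḥ"),  # ā + ā merges into ā
    ("deva", "indraḥ", "devendraḥ"),    # a + i merges into e
]


def make_pairs(examples, sep="+"):
    """Return (generation_pairs, splitting_pairs) as (source, target) strings."""
    # Generation: the two words joined by a separator map to the compound.
    generation = [(f"{a}{sep}{b}", c) for a, b, c in examples]
    # Splitting: the same data with source and target swapped.
    splitting = [(c, f"{a}{sep}{b}") for a, b, c in examples]
    return generation, splitting


def build_vocab(pairs):
    """Map every character seen in the pairs to an integer id (0 is kept for padding)."""
    chars = sorted({ch for src, tgt in pairs for ch in src + tgt})
    return {ch: i + 1 for i, ch in enumerate(chars)}


def encode(text, vocab):
    """Turn a string into the list of character ids a seq2seq model would consume."""
    return [vocab[ch] for ch in text]


generation_pairs, splitting_pairs = make_pairs(EXAMPLES)
vocab = build_vocab(generation_pairs + splitting_pairs)

for src, tgt in generation_pairs + splitting_pairs:
    print(src, "->", tgt, "|", encode(src, vocab))
```

Both directions use the same data with source and target swapped. Splitting is harder in practice because, as the abstract notes, one compound can have several valid splits; this toy data does not show that ambiguity.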
