BLASER: A Text-Free Speech-to-Speech Translation Evaluation Metric

Paper title

BLASER: A Text-Free Speech-to-Speech Translation Evaluation Metric

Authors

Mingda Chen, Paul-Ambroise Duquenne, Pierre Andrews, Justine Kao, Alexandre Mourachko, Holger Schwenk, Marta R. Costa-jussà

Abstract

End-to-End speech-to-speech translation (S2ST) is generally evaluated with text-based metrics. This means that generated speech has to be automatically transcribed, making the evaluation dependent on the availability and quality of automatic speech recognition (ASR) systems. In this paper, we propose a text-free evaluation metric for end-to-end S2ST, named BLASER, to avoid the dependency on ASR systems. BLASER leverages a multilingual multimodal encoder to directly encode the speech segments for source input, translation output and reference into a shared embedding space and computes a score of the translation quality that can be used as a proxy to human evaluation. To evaluate our approach, we construct training and evaluation sets from more than 40k human annotations covering seven language directions. The best results of BLASER are achieved by training with supervision from human rating scores. We show that when evaluated at the sentence level, BLASER correlates significantly better with human judgment compared to ASR-dependent metrics including ASR-SENTBLEU in all translation directions and ASR-COMET in five of them. Our analysis shows combining speech and text as inputs to BLASER does not increase the correlation with human scores, but best correlations are achieved when using speech, which motivates the goal of our research. Moreover, we show that using ASR for references is detrimental for text-based metrics.
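
The abstract describes scoring a translation from embeddings of the source, the translation output, and the reference in a shared space. The following is a minimal sketch of that idea, not the authors' implementation. It assumes the score is the mean of two cosine similarities (source vs. translation, reference vs. translation), and it uses random vectors as stand-ins for encoder outputs. The function names `cosine` and `embedding_score` are hypothetical.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def embedding_score(src, mt, ref):
    """Assumed scoring rule: mean of cos(src, mt) and cos(ref, mt)."""
    return 0.5 * (cosine(src, mt) + cosine(ref, mt))


# Placeholder embeddings. A real system would obtain these from a
# multilingual speech encoder, which is not part of this sketch.
rng = np.random.default_rng(0)
ref = rng.normal(size=16)
src = ref + 0.1 * rng.normal(size=16)      # source close to the reference
good_mt = ref + 0.1 * rng.normal(size=16)  # translation close to the reference
bad_mt = rng.normal(size=16)               # unrelated translation

good = embedding_score(src, good_mt, ref)
bad = embedding_score(src, bad_mt, ref)
print(f"good translation: {good:.3f}, unrelated translation: {bad:.3f}")
```

In this toy setup, a translation whose embedding lies near the source and reference scores higher than an unrelated one. The supervised variant mentioned in the abstract would instead learn the scoring function from human ratings.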
