Title
Don't Discard Fixed-Window Audio Segmentation in Speech-to-Text Translation
Authors
Abstract
For real-life applications, it is crucial that end-to-end spoken language translation models perform well on continuous audio, without relying on human-supplied segmentation. For online spoken language translation, where models need to start translating before the full utterance is spoken, most previous work has ignored the segmentation problem. In this paper, we compare various methods for improving models' robustness towards segmentation errors and different segmentation strategies in both offline and online settings and report results on translation quality, flicker and delay. Our findings on five different language pairs show that a simple fixed-window audio segmentation can perform surprisingly well given the right conditions.
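To make the abstract's central idea concrete: fixed-window segmentation simply cuts the continuous audio stream into equal-length chunks, with no voice-activity detection or sentence-boundary model. The sketch below is a hypothetical illustration (not the paper's code); the function name, window length, and overlap parameter are assumptions for the example.

```python
def fixed_window_segments(samples, sample_rate, window_sec, stride_sec=None):
    """Split a continuous waveform into fixed-length windows.

    `samples` is a flat sequence of audio samples. With the default
    stride (stride_sec == window_sec) the windows are non-overlapping;
    a smaller stride yields overlapping windows. The final segment may
    be shorter than `window_sec`.
    """
    window = int(sample_rate * window_sec)
    stride = window if stride_sec is None else int(sample_rate * stride_sec)
    return [samples[i:i + window] for i in range(0, len(samples), stride)]

# Example: 10 s of silence at 16 kHz, cut into 4-second windows.
audio = [0.0] * (16000 * 10)
segments = fixed_window_segments(audio, sample_rate=16000, window_sec=4.0)
print([len(s) for s in segments])  # [64000, 64000, 32000]
```

Each segment would then be fed to the translation model independently, which is why segment boundaries falling mid-word or mid-sentence make robustness to segmentation errors the key concern the paper studies.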