ESB: A Benchmark For Multi-Domain End-to-End Speech Recognition

Paper title

ESB: A Benchmark For Multi-Domain End-to-End Speech Recognition

Authors

Sanchit Gandhi, Patrick von Platen, Alexander M. Rush

Abstract

Speech recognition applications cover a range of different audio and text distributions, with different speaking styles, background noise, transcription punctuation and character casing. However, many speech recognition systems require dataset-specific tuning (audio filtering, punctuation removal and normalisation of casing), therefore assuming a-priori knowledge of both the audio and text distributions. This tuning requirement can lead to systems failing to generalise to other datasets and domains. To promote the development of multi-domain speech systems, we introduce the End-to-end Speech Benchmark (ESB) for evaluating the performance of a single automatic speech recognition (ASR) system across a broad set of speech datasets. Benchmarked systems must use the same data pre- and post-processing algorithm across datasets - assuming the audio and text data distributions are a-priori unknown. We compare a series of state-of-the-art (SoTA) end-to-end (E2E) systems on this benchmark, demonstrating how a single speech system can be applied and evaluated on a wide range of data distributions. We find E2E systems to be effective across datasets: in a fair comparison, E2E systems achieve within 2.6% of SoTA systems tuned to a specific dataset. Our analysis reveals that transcription artefacts, such as punctuation and casing, pose difficulties for ASR systems and should be included in evaluation. We believe E2E benchmarking over a range of datasets promotes the research of multi-domain speech recognition systems. ESB is available at https://huggingface.co/esb.
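
To make the point about transcription artefacts concrete, the following is a minimal sketch of how word error rate (WER) changes when punctuation and casing are kept in the reference versus stripped before scoring. It is not ESB's evaluation code: the function names `word_error_rate` and `strip_case_and_punctuation` and the example sentences are assumptions made for illustration only.

```python
import string


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def strip_case_and_punctuation(text: str) -> str:
    """Hypothetical dataset-specific normalisation: lower-case and drop ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


# Made-up example: the system output has the right words but no casing or punctuation.
reference = "Hello, world. How are you?"
hypothesis = "hello world how are you"

# Scoring against the orthographic reference penalises every word whose
# casing or punctuation differs.
orthographic_wer = word_error_rate(reference, hypothesis)

# Scoring after normalising both sides hides those differences.
normalised_wer = word_error_rate(
    strip_case_and_punctuation(reference),
    strip_case_and_punctuation(hypothesis),
)

print(f"orthographic WER: {orthographic_wer:.2f}")
print(f"normalised WER:   {normalised_wer:.2f}")
```

In this toy case, four of the five reference words differ only in casing or punctuation, so the orthographic score is much worse than the normalised one even though the spoken words are all recognised correctly. This is the gap that is hidden when normalisation is tuned per dataset.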
