Paper Title

ESB: A Benchmark For Multi-Domain End-to-End Speech Recognition

Authors

Sanchit Gandhi, Patrick von Platen, Alexander M. Rush

Abstract

Speech recognition applications cover a range of different audio and text distributions, with different speaking styles, background noise, transcription punctuation and character casing. However, many speech recognition systems require dataset-specific tuning (audio filtering, punctuation removal and normalisation of casing), and therefore assume a priori knowledge of both the audio and text distributions. This tuning requirement can lead to systems failing to generalise to other datasets and domains. To promote the development of multi-domain speech systems, we introduce the End-to-end Speech Benchmark (ESB) for evaluating the performance of a single automatic speech recognition (ASR) system across a broad set of speech datasets. Benchmarked systems must use the same data pre- and post-processing algorithm across datasets, under the assumption that the audio and text data distributions are a priori unknown. We compare a series of state-of-the-art (SoTA) end-to-end (E2E) systems on this benchmark, demonstrating how a single speech system can be applied and evaluated on a wide range of data distributions. We find E2E systems to be effective across datasets: in a fair comparison, E2E systems achieve within 2.6% of SoTA systems tuned to a specific dataset. Our analysis reveals that transcription artefacts, such as punctuation and casing, pose difficulties for ASR systems and should be included in evaluation. We believe E2E benchmarking over a range of datasets promotes research on multi-domain speech recognition systems. ESB is available at https://huggingface.co/esb.
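The evaluation constraint described in the abstract (one system, identical post-processing on every dataset, casing and punctuation kept in the score) can be illustrated with a minimal sketch. It assumes word error rate as the metric, which the abstract does not name, and every identifier, dataset name and transcript below is hypothetical toy data rather than part of ESB.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        curr = [i]
        for j, h in enumerate(hyp_words, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def postprocess(text):
    """One dataset-agnostic step: collapse whitespace, keep casing and punctuation."""
    return " ".join(text.split())


def evaluate(system, datasets):
    """Corpus-level word error rate per dataset, using the same post-processing everywhere."""
    scores = {}
    for name, samples in datasets.items():
        edits = total_words = 0
        for audio_id, reference in samples:
            ref = postprocess(reference).split()
            hyp = postprocess(system(audio_id)).split()
            edits += edit_distance(ref, hyp)
            total_words += len(ref)
        scores[name] = edits / total_words
    return scores


# Hypothetical stand-in for an ASR system: a lookup of canned transcripts.
toy_outputs = {
    "a1": "Hello, world.",
    "a2": "hello world",  # right words, wrong casing and punctuation
    "b1": "It costs five dollars.",
}
toy_datasets = {
    "toy_read": [("a1", "Hello, world."), ("a2", "Hello, world.")],
    "toy_meeting": [("b1", "It costs five dollars.")],
}

scores = evaluate(toy_outputs.get, toy_datasets)
print(scores)
```

In this toy setup the second clip counts as two word errors even though the words are right, because casing and punctuation are part of the reference. This is the kind of transcription artefact the abstract argues should be included in evaluation.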
