Paper title

Cross-modal variational inference for bijective signal-symbol translation

Authors

Axel Chemla--Romeu-Santos, Stavros Ntalampiras, Philippe Esling, Goffredo Haus, Gérard Assayag

Abstract

Extraction of symbolic information from signals is an active field of research enabling numerous applications, especially in the Musical Information Retrieval domain. This complex task, which is also related to other topics such as pitch extraction or instrument recognition, is a demanding subject that has given rise to numerous approaches, mostly based on advanced signal processing algorithms. However, these techniques are often non-generic: they allow the extraction of definite physical properties of the signal (pitch, octave), but not arbitrary vocabularies or more general annotations. On top of that, these techniques are one-sided, meaning that they can extract symbolic data from an audio signal but cannot perform the reverse process, symbol-to-signal generation. In this paper, we propose a bijective approach for signal/symbol translation by turning this problem into a density estimation task over the signal and symbolic domains, both considered as related random variables. We estimate this joint distribution with two different variational auto-encoders, one for each domain, whose inner representations are forced to match through an additive constraint. This allows both models to learn and generate separately while also enabling signal-to-symbol and symbol-to-signal inference. In this article, we test our models on pitch, octave and dynamics symbols, which constitute a fundamental step towards music transcription and label-constrained audio generation. In addition to its versatility, this system is rather light during training and generation, while allowing several interesting creative uses that we outline at the end of the article.
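
To make the idea of two auto-encoders tied together by an additive latent constraint more concrete, here is a minimal sketch of what such a training objective could look like. It is not taken from the paper: the squared-distance form of the matching term, the standard normal prior, and all names (`kl_to_standard_normal`, `latent_match_penalty`, `joint_loss`, `match_weight`) are assumptions chosen for illustration only.

```python
import math


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) for a diagonal Gaussian."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))


def latent_match_penalty(mu_signal, mu_symbol):
    """Squared distance between the two encoders' latent means.

    Assumed form of the matching constraint; the paper may use another one.
    """
    return sum((a - b) ** 2 for a, b in zip(mu_signal, mu_symbol))


def joint_loss(recon_signal, recon_symbol,
               mu_x, logvar_x, mu_y, logvar_y, match_weight=1.0):
    """Two negative ELBOs (one per domain) plus an additive matching term.

    recon_signal / recon_symbol are reconstruction errors (negative
    log-likelihoods) already computed by each decoder.
    """
    neg_elbo_signal = recon_signal + kl_to_standard_normal(mu_x, logvar_x)
    neg_elbo_symbol = recon_symbol + kl_to_standard_normal(mu_y, logvar_y)
    match = latent_match_penalty(mu_x, mu_y)
    return neg_elbo_signal + neg_elbo_symbol + match_weight * match
```

Under these assumptions, each auto-encoder keeps its own reconstruction and prior terms, so either one can be trained or sampled on its own, while the matching term pulls the two latent codes for a paired (signal, symbol) example together. Translation in either direction would then amount to encoding with one model and decoding with the other.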
