我们研究了预训练的神经模型的鲁棒性特性,以自动语音识别。机器学习中的现实生活数据通常非常嘈杂,几乎永远不会干净,这可以归因于各种因素,具体取决于域,例如异常值,随机噪声和对抗性噪声。因此,我们为各种任务开发的模型应该对这种嘈杂的数据具有强大的稳健性,这导致了强大的机器学习的蓬勃发展。我们认为在自动语音识别的情况下考虑了这个重要问题。随着预训练模型的日益普及,分析和理解此类模型对噪声的鲁棒性是一个重要问题。在这项工作中,我们对LibrisPeech和Timit数据集进行了预训练的神经模型Wav2Vec2,Hubert和Distilhubert的鲁棒性分析。我们使用不同种类的尖锐机制,并测量由推理时间和标准单词错误率指标量化的模型性能。当在层之间注入噪声时,我们还对WAV2VEC2模型进行了深入的层分析,从而使我们能够在高级别上预测每个层学习的内容。最后,对于此模型,我们可视化整个层中错误的传播,并比较它在清洁数据与嘈杂数据上的行为。我们的实验构成了Pasad等人的预测。 [2021],还为未来的工作提出了有趣的方向。
translated by 谷歌翻译
Recently proposed self-supervised learning approaches have been successful for pre-training speech representation models. The utility of these learned representations has been observed empirically, but not much has been studied about the type or extent of information encoded in the pre-trained representations themselves. Developing such insights can help understand the capabilities and limits of these models and enable the research community to more efficiently develop their usage for downstream applications. In this work, we begin to fill this gap by examining one recent and successful pre-trained model (wav2vec 2.0), via its intermediate representation vectors, using a suite of analysis tools. We use the metrics of canonical correlation, mutual information, and performance on simple downstream tasks with non-parametric probes, in order to (i) query for acoustic and linguistic information content, (ii) characterize the evolution of information across model layers, and (iii) understand how fine-tuning the model for automatic speech recognition (ASR) affects these observations. Our findings motivate modifying the fine-tuning protocol for ASR, which produces improved word error rates in a low-resource setting.
translated by 谷歌翻译
自动语音识别系统为应用程序创建了激动人心的可能性,但是它们还为系统窃听的机会提供了机会。我们提出了一种方法来伪装一个人的声音,这些系统来自这些系统,而不会对房间里的人之间的谈话不方便。标准对策攻击在实时流动情况下无效,因为信号的特性将在执行攻击时发生变化。我们介绍了预测攻击,通过预测将来最有效的攻击预测攻击来实现实时性能。在实时约束下,我们的方法在通过字错误率通过字错误率测量的基本电咨询器中,我们的方法堵塞了37x的基线,而通过字符错误率测量。我们还展示了我们的方法在物理环境中实际上是在物理距离的现实环境中。
translated by 谷歌翻译
虽然已显示自动语音识别易受对抗性攻击的影响,但对这些攻击的防御仍然滞后。现有的,天真的防御可以用自适应攻击部分地破坏。在分类任务中,随机平滑范式已被证明是有效的卫生模型。然而,由于它们的复杂性和其输出的顺序性,难以将此范例应用于ASR任务。我们的论文通过利用更具增强和流动站投票的语音专用工具来设计一个对扰动强大的ASR模型来克服了一些这些挑战。我们应用了最先进的攻击的自适应版本,例如难以察觉的ASR攻击,我们的模型,并表明我们最强大的防守对所有使用听不清噪声的攻击是强大的,并且只能以非常高的扭曲破碎。
translated by 谷歌翻译
Prior works on improving speech quality with visual input typically study each type of auditory distortion separately (e.g., separation, inpainting, video-to-speech) and present tailored algorithms. This paper proposes to unify these subjects and study Generalized Speech Enhancement, where the goal is not to reconstruct the exact reference clean signal, but to focus on improving certain aspects of speech. In particular, this paper concerns intelligibility, quality, and video synchronization. We cast the problem as audio-visual speech resynthesis, which is composed of two steps: pseudo audio-visual speech recognition (P-AVSR) and pseudo text-to-speech synthesis (P-TTS). P-AVSR and P-TTS are connected by discrete units derived from a self-supervised speech model. Moreover, we utilize self-supervised audio-visual speech model to initialize P-AVSR. The proposed model is coined ReVISE. ReVISE is the first high-quality model for in-the-wild video-to-speech synthesis and achieves superior performance on all LRS3 audio-visual enhancement tasks with a single model. To demonstrates its applicability in the real world, ReVISE is also evaluated on EasyCom, an audio-visual benchmark collected under challenging acoustic conditions with only 1.6 hours of training data. Similarly, ReVISE greatly suppresses noise and improves quality. Project page: https://wnhsu.github.io/ReVISE.
translated by 谷歌翻译
We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any fine-tuning. When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.
translated by 谷歌翻译
Deep learning (DL) has become a driving force and has been widely adopted in many domains and applications with competitive performance. In practice, to solve the nontrivial and complicated tasks in real-world applications, DL is often not used standalone, but instead contributes as a piece of gadget of a larger complex AI system. Although there comes a fast increasing trend to study the quality issues of deep neural networks (DNNs) at the model level, few studies have been performed to investigate the quality of DNNs at both the unit level and the potential impacts on the system level. More importantly, it also lacks systematic investigation on how to perform the risk assessment for AI systems from unit level to system level. To bridge this gap, this paper initiates an early exploratory study of AI system risk assessment from both the data distribution and uncertainty angles to address these issues. We propose a general framework with an exploratory study for analyzing AI systems. After large-scale (700+ experimental configurations and 5000+ GPU hours) experiments and in-depth investigations, we reached a few key interesting findings that highlight the practical need and opportunities for more in-depth investigations into AI systems.
translated by 谷歌翻译
自我监督学习(SSL)在语音识别方面取得了巨大的成功,而有限的探索已尝试完成其他语音处理任务。由于语音信号包含多方面的信息,包括说话者身份,副语言学,口语内容等,学习所有语音任务的通用表示都具有挑战性。为了解决该问题,我们提出了一个新的预培训模型WAVLM,以解决全堆栈的下游语音任务。 Wavlm共同学习了蒙面的语音预测和预训练。通过这种方式,WAVLM不仅可以通过掩盖的语音预测来保持语音内容建模能力,而且还可以通过语音denoing来提高非ASR任务的潜力。此外,WAVLM还采用封闭式的变压器结构的封闭相对位置偏置,以更好地捕获输入语音的序列排序。我们还将培训数据集从60k小时扩展到94K小时。 WAVLM大型在精湛的基准上实现了最先进的性能,并在其代表性基准上为各种语音处理任务带来了重大改进。代码和预培训模型可在https://aka.ms/wavlm上找到。
translated by 谷歌翻译
Through solving pretext tasks, self-supervised learning leverages unlabeled data to extract useful latent representations replacing traditional input features in the downstream task. In audio/speech signal processing, a wide range of features where engineered through decades of research efforts. As it turns out, learning to predict such features (a.k.a pseudo-labels) has proven to be a particularly relevant pretext task, leading to useful self-supervised representations which prove to be effective for downstream tasks. However, methods and common practices for combining such pretext tasks for better performance on the downstream task have not been explored and understood properly. In fact, the process relies almost exclusively on a computationally heavy experimental procedure, which becomes intractable with the increase of the number of pretext tasks. This paper introduces a method to select a group of pretext tasks among a set of candidates. The method we propose estimates calibrated weights for the partial losses corresponding to the considered pretext tasks during the self-supervised training process. The experiments conducted on automatic speech recognition, speaker and emotion recognition validate our approach, as the groups selected and weighted with our method perform better than classic baselines, thus facilitating the selection and combination of relevant pseudo-labels for self-supervised representation learning.
translated by 谷歌翻译
Many self-supervised speech models, varying in their pre-training objective, input modality, and pre-training data, have been proposed in the last few years. Despite impressive empirical successes on downstream tasks, we still have a limited understanding of the properties encoded by the models and the differences across models. In this work, we examine the intermediate representations for a variety of recent models. Specifically, we measure acoustic, phonetic, and word-level properties encoded in individual layers, using a lightweight analysis tool based on canonical correlation analysis (CCA). We find that these properties evolve across layers differently depending on the model, and the variations relate to the choice of pre-training objective. We further investigate the utility of our analyses for downstream tasks by comparing the property trends with performance on speech recognition and spoken language understanding tasks. We discover that CCA trends provide reliable guidance to choose layers of interest for downstream tasks and that single-layer performance often matches or improves upon using all layers, suggesting implications for more efficient use of pre-trained models.
translated by 谷歌翻译
我们介绍重要的是,通过向语音的不重要区域添加噪音而不是重要地区来增加语音分类和识别模型的技术来增加语音分类和识别模型的技术。通过培训的数据增强代理预测每个话语的重要性,以最大限度地提高它增加的噪声量,同时最小化其对识别性能的影响。我们的方法的有效性在谷歌语音命令(GSC)数据集中的两个版本上说明了。在标准GSC测试集上,与传统噪声增强相比,它实现了23.3%的相对差错率降低,该噪声增强在不考虑它可能最有效的地方的情况下对语音应用噪声。它还提供了25.4%的错误率与基线相比没有数据增强的基线。此外,所提出的重要名称优于常规噪声增强和两个测试集上的基线,并添加了附加噪声。
translated by 谷歌翻译
Self-supervised pre-training of a speech foundation model, followed by supervised fine-tuning, has shown impressive quality improvements on automatic speech recognition (ASR) tasks. Fine-tuning separate foundation models for many downstream tasks are expensive since the foundation model is usually very big. Parameter-efficient fine-tuning methods (e.g. adapter, sparse update methods) offer an alternative paradigm where a small set of parameters are updated to adapt the foundation model to new tasks. However, these methods still suffer from a high computational memory cost and slow training speed because they require backpropagation through the entire neural network at each step. In the paper, we analyze the performance of features at different layers of a foundation model on the speech recognition task and propose a novel hierarchical feature fusion method for resource-efficient transfer learning from speech foundation models. Experimental results show that the proposed method can achieve better performance on speech recognition task than existing algorithms with fewer number of trainable parameters, less computational memory cost and faster training speed. After combining with Adapters at all layers, the proposed method can achieve the same performance as fine-tuning the whole model with $97\%$ fewer trainable encoder parameters and $53\%$ faster training speed.
translated by 谷歌翻译
自我监督的语音识别模型需要大量标记的培训数据,以学习自动语音识别(ASR)的高保真表示,这是计算要求且耗时的,从而阻碍了这些模型在资源受限环境中的使用。我们考虑确定最佳数据子集以训练ASR的自我监督语音模型的任务。我们表达了一个令人惊讶的观察,即用于采样最有用的示例中使用的数据集修剪策略并没有比随机的子集选择在微调自我监督的ASR任务上更好。然后,我们提出了Cowerage算法,以在自我监督的ASR中更好地子集选择,该算法是基于我们的发现,即确保基于培训单词错误率(WER)在早期训练时期的范围覆盖示例,可以提高概括性能。在WAV2VEC 2.0模型和TIMIT,LibrisPeech和LjSpeech数据集上进行的广泛实验显示了COWERAGE的有效性,比现有数据集修剪方法和随机采样的绝对改善高达17%。我们还证明,培训实例的覆盖范围可确保包括语音多样的示例,从而在自我监督的语音识别模型中更好地测试准确性。
translated by 谷歌翻译
We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned. Experiments using all labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state of the art on the 100 hour subset while using 100 times less labeled data. Using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER. This demonstrates the feasibility of speech recognition with limited amounts of labeled data. 1 1 Code and models are available at https://github.com/pytorch/fairseq Preprint. Under review.
translated by 谷歌翻译
常规的自动语音识别系统不会产生标点符号,这对于语音识别结果的可读性很重要。随后的自然语言处理任务(例如机器翻译)也需要它们。标点符号预测模型上有许多作品将标点符号插入语音识别结果中作为后处理。但是,这些研究并未利用声学信息进行标点符号预测,并且直接受语音识别错误的影响。在这项研究中,我们提出了一个端到端模型,该模型将语音作为输入并输出标点的文本。在使用声学信息时,该模型有望在语音识别错误方面可靠地预测标点符号。我们还建议使用辅助损失,以使用中间层和未插入文本的输出来训练模型。通过实验,我们将提出的模型的性能与级联系统的性能进行比较。所提出的模型比级联系统获得更高的标点符号预测准确性,而无需牺牲语音识别错误率。还证明,使用中间输出针对未插入文本的多任务学习有效。此外,与级联系统相比,提出的模型仅具有约1/7的参数。
translated by 谷歌翻译
学龄前评估至关重要,因为它为教师和父母提供了有关儿童成长和成长的关键知识。冠状病毒大流行强调了在线评估学龄前儿童的必要性。这种在线测试需要各种技术,从Web应用程序开发到各种标准(例如语音识别)的各种人工智能模型。由于声学的波动和儿童和成人之间语音频率的差异,因此很难采用自动语音识别(ASR)系统,因为它们是在成年人的声音上预先训练的。此外,培训新模型需要大量数据。为了解决此问题,我们使用具有新的预训练目标的WAV2VEC 2.0模型为认知测试系统构建了ASR,称为随机频率音调(RFP),而我们的新数据集则在无意义的单词(MW)和New DataSet上进行了测试(MW)和快速自动命名(RAN)测试。由于这两个测试的特殊性,我们探索了许多模型,包括卷积神经网络(CNN)和WAV2VEC 2.0模型。我们的新方法在CommonVoice数据集的波斯部分上达到6.45的单词错误率(WER)。此外,我们的新方法在零和少数场景中产生积极的结果。
translated by 谷歌翻译
Self-supervised pre-trained transformers have improved the state of the art on a variety of speech tasks. Due to the quadratic time and space complexity of self-attention, they usually operate at the level of relatively short (e.g., utterance) segments. In this paper, we study the use of context, i.e., surrounding segments, during fine-tuning and propose a new approach called context-aware fine-tuning. We attach a context module on top of the last layer of a pre-trained model to encode the whole segment into a context embedding vector which is then used as an additional feature for the final prediction. During the fine-tuning stage, we introduce an auxiliary loss that encourages this context embedding vector to be similar to context vectors of surrounding segments. This allows the model to make predictions without access to these surrounding segments at inference time and requires only a tiny overhead compared to standard fine-tuned models. We evaluate the proposed approach using the SLUE and Librilight benchmarks for several downstream tasks: Automatic speech recognition (ASR), named entity recognition (NER), and sentiment analysis (SA). The results show that context-aware fine-tuning not only outperforms a standard fine-tuning baseline but also rivals a strong context injection baseline that uses neighboring speech segments during inference.
translated by 谷歌翻译
在我们以前的工作中,我们提出了一个歧视性自动编码器(DCAE)进行语音识别。 DCAE将两个训练方案结合在一起。首先,由于DCAE的目标是学习编码器映射,因此重建语音和输入语音之间的平方误差被最小化。其次,在代码层中,基于框架的语音嵌入是通过最小化地面真相标签和预测的Triphone-State分数之间的分类跨熵来获得的。 DCAE是根据Kaldi工具包开发的,通过将各种TDNN模型视为编码器。在本文中,我们进一步提出了三个新版本的DCAE。首先,使用了一个新的目标函数,该函数使用了地面真相和预测的Triphone-State序列之间的分类跨膜和相互信息。所得的DCAE称为基于链的DCAE(C-DCAE)。为了应用于强大的语音识别,我们将C-DCAE进一步扩展到层次结构和平行结构,从而导致HC-DCAE和PC-DCAE。在这两个模型中,重建的嘈杂语音与输入嘈杂语音以及增强语音和参考清洁语音之间的误差之间的误差都归功于目标函数。 WSJ和Aurora-4 Corpora的实验结果表明,我们的DCAE模型优于基线系统。
translated by 谷歌翻译
我们提出了一种用于计算自动语音识别(ASR)中错误率的新方法。这个新的指标是针对包含半字符的语言,可以以不同形式编写相同的字符。我们在印地语中实施了我们的方法论,这是指示上下文中的主要语言之一,我们认为这种方法可扩展到包含大型字符集的其他类似语言。我们称我们的指标替代单词错误率(AWER)和替代字符错误率(ACER)。我们使用wav2Vec 2.0 \ cite {baevski2020wav2vec}训练我们的ASR模型。此外,我们使用语言模型来改善我们的模型性能。我们的结果表明,在分析单词和角色级别的错误率方面有了显着提高,ASR系统的可解释性提高了高达$ 3 $ \%的AWER,印地语的ACER $ 7 $ \%。我们的实验表明,在具有复杂发音的语言中,有多种写单词而不改变其含义的方式。在这种情况下,Awer和Acer将更有用,而不是将其作为指标。此外,我们通过新的公制脚本为印地语开了一个21小时的新基准测试数据集。
translated by 谷歌翻译
在过去的几年里,已经表明,深度学习系统在对抗性示例的攻击中非常脆弱。基于神经网络的自动语音识别(ASR)系统也不例外。有针对性的和未确定的攻击可以以这样的方式修改音频输入信号,使得人类仍然识别相同的单词,而ASR系统被转向以预测不同的转录。在本文中,我们提出了一种防御机制,该防御机制通过在向ASR系统馈送输入之前,通过应用慢速特征分析,低通滤波器或两者来删除来自音频信号的快速变化功能。我们对在这种方式预处理的数据训练的混合ASR模型进行了实证分析。虽然所产生的模型在良性数据上表现得非常好,但它们对针对性的对抗攻击进行了更高的稳健性:我们的最终建议的模型显示了与基线模型类似的清洁数据的性能,同时具有比较强大的四倍。
translated by 谷歌翻译