我们提出了一种多阶段的多代码书(MSMC)方法,用于高性能神经TTS合成。基于矢量定量的,变异的自动编码器(VQ-VAE)的特征分析仪用于编码语音训练数据的MEL频谱图,通过在多个阶段中逐渐减小为MSMC表示(MSMCR),并使用不同的时间分辨率进行逐步降低,并使用多个VQ对其进行量化它们代码书分别。通过最大程度地减少重建均方根误差(MSE)和“三重态损耗”的合并损失,对多阶段预测指标进行了训练,以逐步将输入文本序列映射到MSMCR。在合成中,神经声码器将预测的MSMCR转换为最终的语音波形。拟议的方法是由女演讲者通过16小时的英语TTS数据库进行了训练和测试。拟议的TTS的MOS得分为4.41,其表现优于基线,MOS为3.62。拟议的TTS的紧凑版本仍然可以保留高MOS得分。消融研究表明,多个阶段和多个代码手册都可以有效地实现高TTS性能。
translated by 谷歌翻译
阿尔茨海默氏病(AD)的早期诊断对于促进预防性护理和延迟进展至关重要。基于语音的自动广告筛选系统为其他临床筛查技术提供了一种非侵入性,更可扩展的替代方案。此类专业数据的稀缺性会导致模型选择和特征学习的不确定性。为此,本文调查了功能和模型组合方法的使用,以改善Bert和Roberta预先训练的文本编码有限数据的域微调的鲁棒性,然后在将结果的嵌入功能馈入后端分类器集合之前通过多数投票制定最终的广告检测决定。在ADRESS20挑战数据集上进行的实验表明,使用模型和功能组合在系统开发中获得了一致的性能改进。使用手册和ASR语音转录本在ADRESS20测试集上分别获得了91.67%和93.75%的最先进的AD检测精度,该准确的准确性是由48位老年人组成的。
translated by 谷歌翻译
近年来见证了自动扬声器验证(ASV)的非凡发展。但是,先前的作品表明,最新的ASV模型非常容易受到语音欺骗的攻击,而最近提出的高性能欺骗对策(CM)模型仅专注于独立的反欺骗任务,而忽略了该模型随后的发言人验证过程。如何将CM和ASV集成在一起仍然是一个悬而未决的问题。最近发生了欺骗意识的说话者验证(SASV)挑战,即当共同优化CM和ASV子系统时,可以提供更好的性能。在挑战的情况下,参与者提出的集成系统必须同时拒绝冒名顶替者和欺骗目标扬声器的攻击,这些攻击者直觉有效地与可靠,欺骗的ASV系统的期望相匹配。这项工作着重于基于融合的SASV解决方案,并提出了一个多模型融合框架,以利用多个最先进的ASV和CM模型的功能。拟议的框架将SASV-EER从8.75%提高到1.17 \%,与SASV挑战中最佳基线系统相比,相对改善为86%。
translated by 谷歌翻译
情感识别是需要自然与人类互动的人工智能系统的关键属性。但是,由于情感的固有歧义,任务定义仍然是一个空旷的问题。在本文中,提出了一种基于dirichlet的新型贝叶斯训练损失,以提议言语情感识别,该言语识别为言语识别而建模,它将人类注释者分配给不同的情感类别的单次注释时产生的单热标签的不确定性。一个额外的指标用于通过具有高标签不确定性的检测测试说法来评估性能。这消除了一个主要局限性,即情绪分类系统仅考虑大多数注释者就情感阶级一致的标签来考虑话语。此外,研究了一种常见的方法,以利用通过平均单热标签获得的连续价值“软”标签。我们提出了一个双分支模型的结构,用于以每种能力为基础的情绪分类,该结构在广泛使用的Iemocap数据集上实现了最新的分类结果。基于此,进行了不确定性估计实验。当在Precision-Recall曲线下,当检测高不确定性的话语的情况下,可以通过对软标签的Kullback-Leibler Divergence训练损失来实现最佳性能。使用MSP播客数据集验证了所提出的方法的通用性,该数据集产生了相同的结果模式。
translated by 谷歌翻译
一个名为语音处理通用性能基准(Superb)的排行榜,它旨在基准测试各种下游语音任务的共享自我监督学习(SSL)语音模型的性能,并推动了研究用于语音表示学习。 SuperB演示语音SSL上游模型通过仅限最小的调整来提高各种下游任务的性能。由于自我监督学习上游模型的范式,其次是下游任务,在语音界引起更多关注,表征此类范例的对抗性稳健性是高优先级的。在本文中,我们首次尝试在零知识对手和有限知识对手的袭击下调查此类范例的对抗脆弱性。实验结果表明,Superb提出的范例严重易受有限的知识对手的影响,零知识对手产生的攻击是可转移性的。 XAB测试验证了制作的对抗性攻击的难以察觉。
translated by 谷歌翻译
A recent study has shown a phenomenon called neural collapse in that the within-class means of features and the classifier weight vectors converge to the vertices of a simplex equiangular tight frame at the terminal phase of training for classification. In this paper, we explore the corresponding structures of the last-layer feature centers and classifiers in semantic segmentation. Based on our empirical and theoretical analysis, we point out that semantic segmentation naturally brings contextual correlation and imbalanced distribution among classes, which breaks the equiangular and maximally separated structure of neural collapse for both feature centers and classifiers. However, such a symmetric structure is beneficial to discrimination for the minor classes. To preserve these advantages, we introduce a regularizer on feature centers to encourage the network to learn features closer to the appealing structure in imbalanced semantic segmentation. Experimental results show that our method can bring significant improvements on both 2D and 3D semantic segmentation benchmarks. Moreover, our method ranks 1st and sets a new record (+6.8% mIoU) on the ScanNet200 test leaderboard. Code will be available at https://github.com/dvlab-research/Imbalanced-Learning.
translated by 谷歌翻译
Weakly-supervised object localization aims to indicate the category as well as the scope of an object in an image given only the image-level labels. Most of the existing works are based on Class Activation Mapping (CAM) and endeavor to enlarge the discriminative area inside the activation map to perceive the whole object, yet ignore the co-occurrence confounder of the object and context (e.g., fish and water), which makes the model inspection hard to distinguish object boundaries. Besides, the use of CAM also brings a dilemma problem that the classification and localization always suffer from a performance gap and can not reach their highest accuracy simultaneously. In this paper, we propose a casual knowledge distillation method, dubbed KD-CI-CAM, to address these two under-explored issues in one go. More specifically, we tackle the co-occurrence context confounder problem via causal intervention (CI), which explores the causalities among image features, contexts, and categories to eliminate the biased object-context entanglement in the class activation maps. Based on the de-biased object feature, we additionally propose a multi-teacher causal distillation framework to balance the absorption of classification knowledge and localization knowledge during model training. Extensive experiments on several benchmarks demonstrate the effectiveness of KD-CI-CAM in learning clear object boundaries from confounding contexts and addressing the dilemma problem between classification and localization performance.
translated by 谷歌翻译
Witnessing the impressive achievements of pre-training techniques on large-scale data in the field of computer vision and natural language processing, we wonder whether this idea could be adapted in a grab-and-go spirit, and mitigate the sample inefficiency problem for visuomotor driving. Given the highly dynamic and variant nature of the input, the visuomotor driving task inherently lacks view and translation invariance, and the visual input contains massive irrelevant information for decision making, resulting in predominant pre-training approaches from general vision less suitable for the autonomous driving task. To this end, we propose PPGeo (Policy Pre-training via Geometric modeling), an intuitive and straightforward fully self-supervised framework curated for the policy pretraining in visuomotor driving. We aim at learning policy representations as a powerful abstraction by modeling 3D geometric scenes on large-scale unlabeled and uncalibrated YouTube driving videos. The proposed PPGeo is performed in two stages to support effective self-supervised training. In the first stage, the geometric modeling framework generates pose and depth predictions simultaneously, with two consecutive frames as input. In the second stage, the visual encoder learns driving policy representation by predicting the future ego-motion and optimizing with the photometric error based on current visual observation only. As such, the pre-trained visual encoder is equipped with rich driving policy related representations and thereby competent for multiple visuomotor driving tasks. Extensive experiments covering a wide span of challenging scenarios have demonstrated the superiority of our proposed approach, where improvements range from 2% to even over 100% with very limited data. Code and models will be available at https://github.com/OpenDriveLab/PPGeo.
translated by 谷歌翻译
In this work, we focus on instance-level open vocabulary segmentation, intending to expand a segmenter for instance-wise novel categories without mask annotations. We investigate a simple yet effective framework with the help of image captions, focusing on exploiting thousands of object nouns in captions to discover instances of novel classes. Rather than adopting pretrained caption models or using massive caption datasets with complex pipelines, we propose an end-to-end solution from two aspects: caption grounding and caption generation. In particular, we devise a joint Caption Grounding and Generation (CGG) framework based on a Mask Transformer baseline. The framework has a novel grounding loss that performs explicit and implicit multi-modal feature alignments. We further design a lightweight caption generation head to allow for additional caption supervision. We find that grounding and generation complement each other, significantly enhancing the segmentation performance for novel categories. We conduct extensive experiments on the COCO dataset with two settings: Open Vocabulary Instance Segmentation (OVIS) and Open Set Panoptic Segmentation (OSPS). The results demonstrate the superiority of our CGG framework over previous OVIS methods, achieving a large improvement of 6.8% mAP on novel classes without extra caption data. Our method also achieves over 15% PQ improvements for novel classes on the OSPS benchmark under various settings.
translated by 谷歌翻译
Nearest-Neighbor (NN) classification has been proven as a simple and effective approach for few-shot learning. The query data can be classified efficiently by finding the nearest support class based on features extracted by pretrained deep models. However, NN-based methods are sensitive to the data distribution and may produce false prediction if the samples in the support set happen to lie around the distribution boundary of different classes. To solve this issue, we present P3DC-Shot, an improved nearest-neighbor based few-shot classification method empowered by prior-driven data calibration. Inspired by the distribution calibration technique which utilizes the distribution or statistics of the base classes to calibrate the data for few-shot tasks, we propose a novel discrete data calibration operation which is more suitable for NN-based few-shot classification. Specifically, we treat the prototypes representing each base class as priors and calibrate each support data based on its similarity to different base prototypes. Then, we perform NN classification using these discretely calibrated support data. Results from extensive experiments on various datasets show our efficient non-learning based method can outperform or at least comparable to SOTA methods which need additional learning steps.
translated by 谷歌翻译