Most speech enhancement (SE) models learn a point estimate and do not make use of uncertainty estimation in the learning process. In this paper, we show that modeling heteroscedastic uncertainty by minimizing a multivariate Gaussian negative log-likelihood (NLL) improves SE performance at no extra cost. During training, our approach augments a model learning complex spectral mapping with a temporary submodel to predict the covariance of the enhancement error at each time-frequency bin. Due to unrestricted heteroscedastic uncertainty, the covariance introduces an undersampling effect that is detrimental to SE performance. To mitigate undersampling, our approach inflates the uncertainty lower bound and weights each loss component by its uncertainty, effectively compensating severely undersampled components with more penalties. Our multivariate setting reveals common covariance assumptions such as scalar and diagonal matrices. By weakening these assumptions, we show that the NLL achieves superior performance compared to popular losses including the mean squared error (MSE), mean absolute error (MAE), and scale-invariant signal-to-distortion ratio (SI-SDR).
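As a rough illustration of the heteroscedastic NLL idea described above, the sketch below computes the negative log-likelihood of a univariate Gaussian per time-frequency bin, with a lower bound on the predicted variance. It is only the scalar special case, not the multivariate loss from the abstract. The function names and the `min_var` floor are assumptions made for this example, and the uncertainty-based weighting of loss components is not shown.

```python
import math


def gaussian_nll(target, mean, log_var, min_var=1e-3):
    """Negative log-likelihood of one value under a univariate Gaussian.

    The variance is predicted in the log domain and clamped from below.
    min_var is a hypothetical floor chosen for this sketch.
    """
    var = max(math.exp(log_var), min_var)
    return 0.5 * (math.log(var) + (target - mean) ** 2 / var + math.log(2 * math.pi))


def mean_nll(targets, means, log_vars, min_var=1e-3):
    """Average the per-bin NLL over all time-frequency bins."""
    losses = [
        gaussian_nll(t, m, lv, min_var)
        for t, m, lv in zip(targets, means, log_vars)
    ]
    return sum(losses) / len(losses)
```

With the variance fixed to 1 (`log_var = 0`), this loss differs from half the squared error only by a constant, which is one way to see the MSE as a special case with a fixed scalar covariance.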
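Although deep-learning-based speech enhancement systems have made rapid progress in improving the quality of speech signals, they can still produce outputs that contain artifacts and sound unnatural. We propose a novel approach to speech enhancement aimed at improving the perceptual quality and naturalness of enhanced signals by optimizing for key characteristics of speech. We first identify key acoustic parameters that correlate well with voice quality (e.g., jitter, shimmer, and spectral flux) and then propose objective functions aimed at reducing the difference between clean speech and enhanced speech with respect to these features. The full set of acoustic features is the extended Geneva Acoustic Parameter Set (eGeMAPS), which includes 25 different attributes associated with the perception of speech. Given the non-differentiable nature of these feature computations, we first build differentiable estimators of the eGeMAPS and then use them to fine-tune existing speech enhancement systems. Our approach is generic and can be applied to any existing deep-learning-based enhancement system to further improve the enhanced speech signals. Experimental results on the Deep Noise Suppression (DNS) Challenge dataset show that our approach can improve state-of-the-art deep-learning-based enhancement systems.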
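Efficient neural network backbones for mobile devices are often optimized for metrics such as FLOPs or parameter count. However, these metrics may not correlate well with the latency of the network when deployed on a mobile device. Therefore, we perform an extensive analysis of different metrics by deploying several mobile-friendly networks on a mobile device. We identify and analyze architectural and optimization bottlenecks in recent efficient neural networks and provide ways to mitigate these bottlenecks. To this end, we design an efficient backbone, MobileOne, which achieves an inference time under 1 ms on an iPhone12 with 75.9% top-1 accuracy on ImageNet. We show that MobileOne achieves state-of-the-art performance among efficient architectures while being faster on mobile devices. Our best model obtains similar performance on ImageNet to MobileFormer while being 38x faster. At similar latency, our model obtains 2.3% better top-1 accuracy on ImageNet than EfficientNet. Furthermore, we show that our model generalizes to multiple tasks: image classification, object detection, and semantic segmentation, with significant improvements in latency and accuracy compared to existing efficient architectures when deployed on a mobile device.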
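We present RemixIT, a simple yet effective self-supervised method for training speech enhancement without the need for a single isolated in-domain speech or noise waveform. Our approach overcomes the limitations of previous methods, which depend on clean in-domain target signals and are therefore sensitive to any domain mismatch between train and test samples. RemixIT is based on a continuous self-training scheme in which a teacher model pre-trained on out-of-domain data infers estimated pseudo-target signals for in-domain mixtures. Then, by permuting the estimated clean and noise signals and remixing them together, we generate a new set of bootstrapped mixtures and corresponding pseudo-targets, which are used to train the student network. Conversely, the teacher periodically refines its estimates using the updated parameters of the latest student models. Experimental results on multiple speech enhancement datasets and tasks not only show the superiority of our method over prior approaches but also show that RemixIT can be combined with any separation model and applied to any semi-supervised and unsupervised domain adaptation task. Our analysis, paired with empirical evidence, sheds light on the inner workings of our self-training scheme, in which the student model keeps obtaining better performance while observing severely degraded pseudo-targets.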
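The permute-and-remix step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: waveforms are plain Python lists, the name `remix_batch` is hypothetical, and the teacher and student networks are left out. It assumes the teacher has already produced one speech estimate and one noise estimate per mixture in the batch.

```python
import random


def remix_batch(est_speech, est_noise, rng=None):
    """Build bootstrapped mixtures from teacher estimates.

    est_speech and est_noise are lists of equal-length waveforms
    (lists of floats), one pair per input mixture. The noise estimates
    are shuffled across the batch and added back to the speech estimates.
    """
    rng = rng or random.Random()
    perm = list(range(len(est_noise)))
    rng.shuffle(perm)  # random permutation of the batch indices
    shuffled_noise = [est_noise[i] for i in perm]
    mixtures = [
        [s + n for s, n in zip(speech, noise)]
        for speech, noise in zip(est_speech, shuffled_noise)
    ]
    # The student would be trained to recover (est_speech, shuffled_noise)
    # from the new mixtures.
    return mixtures, est_speech, shuffled_noise
```

Passing a seeded `random.Random` instance as `rng` makes the permutation reproducible, which can help when debugging a training loop.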
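Representation learning from unlabeled data has been of major interest in artificial intelligence research. While self-supervised speech representation learning has been popular in the speech research community, very few works have comprehensively analyzed audio representation learning for non-speech audio tasks. In this paper, we propose a self-supervised audio representation learning method and apply it to a variety of downstream non-speech audio tasks. We combine the well-known Wav2Vec 2.0 framework, which has shown success in self-supervised learning for speech tasks, with parameter-efficient conformer architectures. Our self-supervised pre-training can reduce the need for labeled data by two-thirds. On the AudioSet benchmark, we achieve a mean average precision (mAP) score of 0.415, which is a new state of the art on this dataset for audio-only self-supervised learning. Our fine-tuned conformers also surpass the performance of previous systems pre-trained in a supervised way on several downstream tasks. We further discuss the important design considerations for both pre-training and fine-tuning.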
Quadruped robots are currently used in industrial robotics as a mechanical aid to automate several routine tasks. However, the use of such a robot in a domestic setting is still largely a research topic. This paper discusses the design and virtual simulation of such a robot, capable of detecting and understanding human emotions, generating its gait, and responding via sounds and expressions on a screen. To this end, we use a combination of reinforcement learning and software engineering concepts to simulate a quadruped robot that can understand emotions, navigate various terrains, detect sound sources, and respond to emotions using audio-visual feedback. This paper aims to establish a framework for simulating a quadruped robot that is emotionally intelligent and can primarily respond to audio-visual stimuli using motor or audio responses. Emotion detection from speech did not match ERANNs or Zeta Policy learning but still reached an accuracy of 63.5%. The video emotion detection system produced results nearly on par with the state of the art, with an accuracy of 99.66%. Owing to its "on-policy" learning process, the PPO algorithm learned very quickly, allowing the simulated dog to demonstrate a remarkably seamless gait across the different cadences and variations. This enabled the quadruped robot to respond to generated stimuli, allowing us to conclude that it functions as predicted and satisfies the aim of this work.
Searching long egocentric videos with natural language queries (NLQ) has compelling applications in augmented reality and robotics, where a fluid index into everything that a person (agent) has seen before could augment human memory and surface relevant information on demand. However, the structured nature of the learning problem (free-form text query inputs, localized video temporal window outputs) and its needle-in-a-haystack nature make it both technically challenging and expensive to supervise. We introduce Narrations-as-Queries (NaQ), a data augmentation strategy that transforms standard video-text narrations into training data for a video query localization model. Validating our idea on the Ego4D benchmark, we find it has tremendous impact in practice. NaQ improves multiple top models by substantial margins (even doubling their accuracy), and yields the very best results to date on the Ego4D NLQ challenge, soundly outperforming all challenge winners in the CVPR and ECCV 2022 competitions and topping the current public leaderboard. Beyond achieving the state of the art for NLQ, we also demonstrate unique properties of our approach, such as gains on long-tail object queries and the ability to perform zero-shot and few-shot NLQ.
A Machine Translation (MT) system generally aims to automatically render a source language into a target language while retaining the original context, using various Natural Language Processing (NLP) techniques. Among these NLP methods is Statistical Machine Translation (SMT), which uses probabilistic and statistical techniques to analyze information and perform the conversion. This paper discusses the development of bilingual SMT models for translating English to fifteen low-resource Indian Languages (ILs) and vice versa. At the outset, all 15 languages are introduced with a short description relevant to our experimental needs. Further, a detailed analysis of the Samanantar and OPUS datasets for model building, along with a standard benchmark dataset (Flores-200) for fine-tuning and testing, is carried out as part of our experiment. Different preprocessing approaches are proposed in this paper to handle noise in the datasets. To create the system, the MOSES open-source SMT toolkit is explored. Distance reordering is utilized with the aim of capturing grammar rules and context-dependent adjustments through a phrase reordering categorization framework. In our experiment, the quality of the translation is evaluated using standard metrics such as BLEU, METEOR, and RIBES.
We introduce Argoverse 2 (AV2) - a collection of three datasets for perception and forecasting research in the self-driving domain. The annotated Sensor Dataset contains 1,000 sequences of multimodal data, encompassing high-resolution imagery from seven ring cameras, and two stereo cameras in addition to lidar point clouds, and 6-DOF map-aligned pose. Sequences contain 3D cuboid annotations for 26 object categories, all of which are sufficiently-sampled to support training and evaluation of 3D perception models. The Lidar Dataset contains 20,000 sequences of unlabeled lidar point clouds and map-aligned pose. This dataset is the largest ever collection of lidar sensor data and supports self-supervised learning and the emerging task of point cloud forecasting. Finally, the Motion Forecasting Dataset contains 250,000 scenarios mined for interesting and challenging interactions between the autonomous vehicle and other actors in each local scene. Models are tasked with the prediction of future motion for "scored actors" in each scenario and are provided with track histories that capture object location, heading, velocity, and category. In all three datasets, each scenario contains its own HD Map with 3D lane and crosswalk geometry - sourced from data captured in six distinct cities. We believe these datasets will support new and existing machine learning research problems in ways that existing datasets do not. All datasets are released under the CC BY-NC-SA 4.0 license.
Cashews are grown by over 3 million smallholders in more than 40 countries worldwide as a principal source of income. As the third largest cashew producer in Africa, Benin has nearly 200,000 smallholder cashew growers contributing 15% of the country's national export earnings. However, a lack of information on where and how cashew trees grow across the country hinders decision-making that could support increased cashew production and poverty alleviation. By leveraging 2.4-m Planet Basemaps and 0.5-m aerial imagery, newly developed deep learning algorithms, and large-scale ground truth datasets, we successfully produced the first national map of cashew in Benin and characterized the expansion of cashew plantations between 2015 and 2021. In particular, we developed a SpatioTemporal Classification with Attention (STCA) model to map the distribution of cashew plantations, which can fully capture texture information from discriminative time steps during a growing season. We further developed a Clustering Augmented Self-supervised Temporal Classification (CASTC) model to distinguish high-density versus low-density cashew plantations by automatic feature extraction and optimized clustering. Results show that the STCA model has an overall accuracy of 80% and the CASTC model achieved an overall accuracy of 77.9%. We found that the cashew area in Benin has doubled from 2015 to 2021 with 60% of new plantation development coming from cropland or fallow land, while encroachment of cashew plantations into protected areas has increased by 70%. Only half of cashew plantations were high-density in 2021, suggesting high potential for intensification. Our study illustrates the power of combining high-resolution remote sensing imagery and state-of-the-art deep learning algorithms to better understand tree crops in the heterogeneous smallholder landscape.