Transformer-based speech recognition models have achieved great success by exploiting every frame in the feature extraction process. In particular, the self-attention (SA) heads in the lower layers capture various phonetic characteristics through the query-key dot product, which is designed to compute pairwise relationships between frames. In this paper, we propose a variant of SA that extracts more representative phonetic features. The proposed phonetic self-attention (phSA) consists of two different types of phonetic attention: one is similarity-based and the other is content-based. In short, similarity-based attention captures the correlation between frames, while content-based attention considers each frame on its own, unaffected by other frames. We identify which parts of the original dot-product equation correspond to the two different attention patterns and improve each part with simple modifications. Our experiments on phoneme classification and speech recognition show that replacing SA with phSA in the lower layers improves recognition performance without increasing latency or parameter size.
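The split of the attention logit into a pairwise similarity term and a per-frame content term can be sketched as follows. This is a minimal, framework-free toy illustration of that decomposition, not the authors' phSA implementation; `phsa_scores` and `content_bias` are hypothetical names.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def phsa_scores(queries, keys, content_bias):
    """Attention weights whose logits are the sum of a pairwise
    similarity term (query-key dot product, depends on both frames)
    and a per-frame content term (depends on the attended frame only)."""
    T = len(queries)
    attn = []
    for i in range(T):
        logits = []
        for j in range(T):
            similarity = sum(q * k for q, k in zip(queries[i], keys[j]))
            logits.append(similarity + content_bias[j])
        attn.append(softmax(logits))
    return attn
```

With a zero content bias this reduces to plain dot-product attention; a nonzero bias lets a frame attract attention regardless of the query.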
Generalized zero-shot learning (GZSL) is a technique for training a deep learning model to identify unseen classes using class attributes. In this paper, we propose a new GZSL technique that significantly improves GZSL classification performance. The key idea of the proposed method, henceforth referred to as semantic feature extraction-based GZSL (SE-GZSL), is to use semantic features containing only attribute-related information when learning the relationship between images and attributes. In doing so, we can remove the interference, if any, caused by attribute-irrelevant information contained in the image features. To train a network to extract such semantic features, we present two novel loss functions: 1) a mutual information-based loss to capture all attribute-related information in the image features, and 2) a similarity-based loss to remove unwanted attribute-irrelevant information. Through extensive experiments on various datasets, we show that the proposed SE-GZSL technique outperforms conventional GZSL approaches by a large margin.
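The two objectives can be illustrated with a toy combined loss: one term pulls the extracted semantic feature toward the class attribute vector, and a second term penalizes similarity to the attribute-irrelevant residual. This is a hedged stand-in using simple cosine terms, not the paper's actual loss functions; `se_gzsl_loss` and its arguments are hypothetical names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def se_gzsl_loss(sem_feat, attribute, residual_feat, lam=1.0):
    """Toy combined objective: an attribute-matching term (standing in
    for the mutual information-based loss) plus a penalty on similarity
    to the attribute-irrelevant residual (standing in for the
    similarity-based loss)."""
    match = 1.0 - cosine(sem_feat, attribute)          # capture attribute info
    leak = max(0.0, cosine(sem_feat, residual_feat))   # suppress irrelevant info
    return match + lam * leak
```

The loss is zero only when the semantic feature aligns with the attribute and is orthogonal to (or opposes) the residual.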
How can we accurately identify new memory workloads while classifying known memory workloads? Verifying DRAM (Dynamic Random Access Memory) using various workloads is an important task to guarantee the quality of DRAM. A crucial component in the process is open-set recognition, which aims to detect new workloads not seen in the training phase. Despite its importance, however, existing open-set recognition methods are unsatisfactory in terms of accuracy since they fail to exploit the characteristics of workload sequences. In this paper, we propose Acorn, an accurate open-set recognition method capturing the characteristics of workload sequences. Acorn extracts two types of feature vectors to capture sequential patterns and spatial locality patterns in memory access. Acorn then uses the feature vectors to accurately classify a subsequence into one of the known classes or identify it as the unknown class. Experiments show that Acorn achieves state-of-the-art accuracy, giving up to 37 percentage points higher unknown-class detection accuracy while achieving known-class classification accuracy comparable to that of existing methods.
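The final classify-or-reject step common to open-set recognition can be sketched as a confidence threshold over known-class scores. This illustrates the general decision rule only, not Acorn's actual classifier; the function name and threshold value are assumptions.

```python
def open_set_classify(scores, threshold=0.5):
    """Return the best-scoring known class, or 'unknown' when no
    known class is confident enough. `scores` maps class name to a
    confidence score, e.g. a softmax probability."""
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "unknown"
```

Any sequence whose two feature vectors land far from every known class would produce uniformly low scores and be routed to `"unknown"`.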
Inspired by the impressive performance of recent face image editing methods, several studies have naturally extended these methods to the face video editing task. One of the main challenges here is temporal consistency among edited frames, which is still unresolved. To this end, we propose a novel face video editing framework based on diffusion autoencoders that can successfully extract decomposed features of identity and motion from a given video, for the first time as a face video editing model. This modeling allows us to edit the video by simply manipulating the temporally invariant feature in the desired direction to maintain consistency. Another unique strength of our model is that, since it is based on diffusion models, it provides both reconstruction and editing capabilities at the same time, and it is robust to corner cases in in-the-wild face videos (e.g., occluded faces), unlike existing GAN-based methods.
Thanks to the development of 2D keypoint detectors, monocular 3D human pose estimation (HPE) via 2D-to-3D uplifting approaches has achieved remarkable improvements. Still, monocular 3D HPE is a challenging problem due to the inherent depth ambiguities and occlusions. To handle this problem, many previous works exploit temporal information to mitigate such difficulties. However, there are many real-world applications where frame sequences are not accessible. This paper focuses on reconstructing a 3D pose from a single 2D keypoint detection. Rather than exploiting temporal information, we alleviate the depth ambiguity by generating multiple 3D pose candidates that map to an identical 2D keypoint. We build a novel diffusion-based framework to effectively sample diverse 3D poses from an off-the-shelf 2D detector. By replacing the conventional denoising U-Net with a graph convolutional network to account for the correlation between human joints, our approach achieves further performance improvements. We evaluate our method on the widely adopted Human3.6M and HumanEva-I datasets. Comprehensive experiments are conducted to prove the efficacy of the proposed method, and they confirm that our model outperforms state-of-the-art multi-hypothesis 3D HPE methods.
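One common way to consume multiple 3D pose hypotheses is to select the candidate whose projection best matches the detected 2D keypoints. The sketch below illustrates that selection step only, under assumed toy data; `select_pose` and `project` are hypothetical names, and the paper's own evaluation over the hypothesis set may differ.

```python
def select_pose(candidates, keypoints_2d, project):
    """Pick the 3D pose candidate whose 2D projection has the lowest
    squared reprojection error against the detected keypoints.
    `project` maps a 3D pose (list of (x, y, z) joints) to 2D points."""
    def reproj_err(pose3d):
        return sum((px - qx) ** 2 + (py - qy) ** 2
                   for (px, py), (qx, qy) in zip(project(pose3d), keypoints_2d))
    return min(candidates, key=reproj_err)
```

Because all candidates are generated to be consistent with the same 2D detection, this tie-break mainly discriminates among the residual depth hypotheses.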
Generalized Labeled Multi-Bernoulli (GLMB) densities arise in a host of multi-object system applications, analogous to Gaussians in single-object filtering. However, computing the GLMB filtering density requires solving NP-hard problems. To alleviate this computational bottleneck, we develop a linear-complexity Gibbs sampling framework for GLMB density computation. Specifically, we propose a tempered Gibbs sampler that exploits the structure of the GLMB filtering density to achieve an $\mathcal{O}(T(P+M))$ complexity, where $T$ is the number of iterations of the algorithm, and $P$ and $M$ are the numbers of hypothesized objects and measurements, respectively. This innovation enables an $\mathcal{O}(T(P+M+\log(T))+PM)$ complexity implementation of the GLMB filter. Convergence of the proposed Gibbs sampler is established, and numerical studies are presented to validate the proposed GLMB filter implementation.
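The linear per-iteration cost comes from resampling one coordinate of the assignment vector at a time. The following is a generic coordinate-wise Gibbs sketch over object-to-measurement assignments, with an assumed user-supplied conditional `weight` function standing in for the actual GLMB conditionals; it is not the paper's tempered sampler.

```python
import random

def gibbs_assignments(P, M, weight, T, seed=0):
    """Coordinate-wise Gibbs sampling over assignment vectors: each of
    the P hypothesized objects takes one of the M measurements or 0
    (missed detection), with no measurement shared by two objects.
    Each iteration resamples a single coordinate from its conditional,
    costing O(P + M), hence O(T(P + M)) over T iterations."""
    rng = random.Random(seed)
    gamma = [0] * P  # start from the all-missed assignment
    samples = []
    for _ in range(T):
        p = rng.randrange(P)
        # candidate values: 0 (missed) plus measurements not taken by others
        taken = {g for i, g in enumerate(gamma) if i != p and g != 0}
        cands = [0] + [m for m in range(1, M + 1) if m not in taken]
        ws = [weight(p, m, gamma) for m in cands]
        # draw the new value for coordinate p proportionally to ws
        r = rng.random() * sum(ws)
        acc = 0.0
        for m, w in zip(cands, ws):
            acc += w
            if r <= acc:
                gamma[p] = m
                break
        samples.append(tuple(gamma))
    return samples
```

In a real GLMB implementation the visited assignment vectors are deduplicated and ranked to form the truncated filtering density.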
Traversability estimation for mobile robots in off-road environments requires more than the conventional semantic segmentation used in constrained environments such as on-road conditions. Recently, approaches that learn traversability from past driving experiences in a self-supervised manner have been emerging, as they can significantly reduce human labeling costs and labeling errors. However, self-supervised data only provide supervision for the regions actually traversed, inducing epistemic uncertainty due to the scarcity of negative information. Negative data are rarely harvested, as the system can be severely damaged while logging them. To mitigate this uncertainty, we introduce a deep metric learning-based method that incorporates unlabeled data together with a few positive and negative prototypes, jointly learning semantic segmentation and traversability regression. To rigorously evaluate the proposed framework, we introduce a new evaluation metric that comprehensively assesses both segmentation and regression. Additionally, we construct a driving dataset, `Dtrail', in off-road environments with a mobile robot platform, which contains a wide variety of negative data. We examine our method on Dtrail as well as the publicly available SemanticKITTI dataset.
Single-image 3D human reconstruction aims to reconstruct the 3D textured surface of the human body given a single image. While implicit function-based methods have recently achieved reasonable reconstruction performance, they still show degraded quality in both surface geometry and texture from unobserved views. In response, to generate a realistic textured surface, we propose ReFu, a coarse-to-fine approach that refines the projected backside-view image and fuses the refined image to predict the final human body. To suppress the diffused occupancy that causes noise in projection images and reconstructed meshes, we propose to train the occupancy probability by simultaneously utilizing 2D and 3D supervision with occupancy-based volume rendering. We also introduce a refinement architecture that generates detail-preserving backside-view images with front-to-back warping. Extensive experiments demonstrate that our method achieves state-of-the-art performance in 3D human reconstruction from a single image, showing enhanced geometry and texture quality from unobserved views.
Recently, numerous studies have investigated cooperative traffic systems using vehicle-to-everything (V2X) communication. Unfortunately, when multiple autonomous vehicles are deployed while exposed to communication failure, conflicts can arise between the vehicles' ideal conditions, leading to adversarial situations on the road. In South Korea, virtual and real-world urban autonomous multi-vehicle races were held in March and November of 2021, respectively. During the competition, multiple vehicles were involved simultaneously, which required maneuvers such as overtaking low-speed vehicles, negotiating intersections, and obeying traffic laws. In this study, we introduce a fully autonomous driving software stack to deploy a competitive driving model, which enabled us to win the urban autonomous multi-vehicle races. We evaluate module-based systems such as navigation, perception, and planning in real and virtual environments. Additionally, we analyze traffic after collecting the position data of multiple vehicles over communication to gain additional insight into a multi-agent autonomous driving scenario. Finally, we propose a method for analyzing traffic in order to compare the spatial distribution of multiple autonomous vehicles. We study the similarity distribution between each team's driving log data to determine the impact of competitive autonomous driving on the traffic environment.
To navigate autonomous vehicles safely and successfully in unstructured environments, the traversability of terrain should vary according to the driving capability of the vehicle. Actual driving experience can be used in a self-supervised fashion to learn vehicle-specific traversability. However, existing methods for learning self-supervised traversability are not scalable to learning the traversability of various vehicles. In this work, we introduce a scalable framework for learning self-supervised traversability, which can learn traversability directly from vehicle-terrain interactions without any human supervision. We train a neural network that predicts the proprioceptive experience a vehicle undergoes from a 3D point cloud. Using a novel PU learning method, the network simultaneously identifies untraversable regions where the estimates can be overconfident. With driving data of various vehicles gathered from simulation and the real world, we show that our framework is capable of learning the self-supervised traversability of various vehicles. By integrating our framework with a model predictive controller, we demonstrate that the estimated traversability results in effective navigation that achieves distinct maneuvers depending on the driving characteristics of the vehicles. In addition, experimental results validate the ability of our method to identify and avoid untraversable regions.
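The PU learning component can be illustrated with the standard non-negative PU risk estimator (in the style of Kiryo et al.), shown here as a generic stand-in rather than the paper's exact objective; `prior` is the assumed positive-class (traversable) fraction among unlabeled samples, and only positive (traversed) and unlabeled scores are required.

```python
def nnpu_risk(pos_scores, unl_scores, prior,
              loss=lambda z: max(0.0, 1.0 - z)):
    """Non-negative PU risk: positives contribute directly, while the
    negative-class risk is estimated from unlabeled data with the
    positive contribution subtracted out, clipped at zero to keep the
    estimate non-negative. `loss(z)` scores a prediction against a
    target of +1 (hinge loss by default)."""
    rp_pos = sum(loss(s) for s in pos_scores) / len(pos_scores)   # positives as +1
    rp_neg = sum(loss(-s) for s in pos_scores) / len(pos_scores)  # positives as -1
    ru_neg = sum(loss(-s) for s in unl_scores) / len(unl_scores)  # unlabeled as -1
    return prior * rp_pos + max(0.0, ru_neg - prior * rp_neg)
```

Minimizing such a risk lets the network treat unlabeled terrain as a noisy negative set, which is what allows untraversable regions to be identified without ever logging dangerous negative drives.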