Point cloud completion is a generation and estimation problem that starts from partial point clouds, and it plays a vital role in 3D computer vision applications. Progress in deep learning (DL) has markedly improved the capability and robustness of point cloud completion. However, the quality of completed point clouds still needs to improve to meet practical requirements. This work therefore presents a comprehensive survey of existing methods, including point-based, convolution-based, graph-based, and generative model-based approaches. The survey compares these methods to stimulate further research. In addition, it summarizes the commonly used datasets and illustrates applications of point cloud completion. Finally, we discuss possible research trends in this rapidly expanding field.
Translated by Google Translate
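Recently, 3D vision-and-language tasks have attracted growing research interest. Compared with other vision-and-language tasks, the 3D visual question answering (VQA) task is less explored and is more susceptible to language priors and co-reference ambiguity. Meanwhile, several recently proposed 3D VQA datasets do not support the 3D VQA task well because of their limited scale and annotation methods. In this work, we formally define and address a 3D grounded VQA task by collecting a new 3D VQA dataset, called FE-3DGQA, with diverse and relatively free-form questions as well as dense and completely grounded bounding box annotations. To obtain more explainable answers, we label the objects that appear in the complex QA pairs with different semantic types, including answer-grounded objects (both those that appear in the question and those that do not) and contextual objects for the answer-grounded objects. We also propose a new 3D VQA framework to effectively predict completely visually grounded and explainable answers. Extensive experiments show that our newly collected benchmark dataset can be used effectively to evaluate various 3D VQA methods from different aspects, and that our newly proposed framework achieves state-of-the-art performance on the new benchmark dataset. Both the newly collected dataset and our code will be publicly available at http://github.com/zlccccc/3dgqa.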
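There is usually a large imbalance between foreground points (i.e., objects) and background points in outdoor LiDAR point clouds. This imbalance hinders cutting-edge detectors from focusing on informative regions to produce accurate 3D object detection results. This paper proposes a novel object detection network based on semantic point-voxel feature interaction, called PV-RCNN++. Unlike most existing methods, PV-RCNN++ explores semantic information to enhance the quality of object detection. First, a semantic segmentation module is proposed to retain more discriminative foreground keypoints. This module guides PV-RCNN++ to integrate more object-related point and voxel features in the key regions. Then, to make points and voxels interact efficiently, we use a voxel query based on Manhattan distance to quickly sample voxel features around keypoints. Compared with ball query, this voxel query reduces the time complexity from O(N) to O(K). Furthermore, to avoid learning only local features, an attention-based residual PointNet module is designed to expand the receptive field and adaptively aggregate neighboring voxel features into keypoints. Extensive experiments on the KITTI dataset show that PV-RCNN++ achieves 81.60%, 40.18%, and 68.21% 3D mAP on Car, Pedestrian, and Cyclist, respectively, which is comparable to or even better than the state of the art.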
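The Manhattan-distance voxel query mentioned above can be illustrated with a minimal sketch. This is not the PV-RCNN++ implementation: the sparse dictionary of voxels, the function names (`manhattan_offsets`, `voxel_index`, `voxel_query`), and the reading of N as the total number of voxels and K as the number of candidate neighbor cells are all assumptions made for illustration.

```python
import math
from itertools import product


def manhattan_offsets(radius):
    """All integer 3D offsets whose Manhattan (L1) norm is at most `radius`."""
    rng = range(-radius, radius + 1)
    return [
        (dx, dy, dz)
        for dx, dy, dz in product(rng, rng, rng)
        if abs(dx) + abs(dy) + abs(dz) <= radius
    ]


def voxel_index(point, voxel_size):
    """Integer grid cell that contains a 3D point."""
    return tuple(math.floor(c / voxel_size) for c in point)


def voxel_query(keypoint, voxel_features, voxel_size, radius):
    """Collect non-empty voxels within a Manhattan radius of the keypoint's cell.

    `voxel_features` is a hypothetical sparse map {(i, j, k): feature}.
    The loop visits a fixed set of candidate cells, so its cost depends on
    the neighborhood size and not on how many voxels the scene contains.
    """
    cx, cy, cz = voxel_index(keypoint, voxel_size)
    hits = []
    for dx, dy, dz in manhattan_offsets(radius):
        key = (cx + dx, cy + dy, cz + dz)
        if key in voxel_features:
            hits.append((key, voxel_features[key]))
    return hits
```

A ball query, by contrast, would compare the keypoint against every candidate point to test a Euclidean radius, which is where the dependence on the full set size comes from.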
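In this paper, we address the problem of forecasting the trajectory of an egocentric camera wearer (ego-person) in crowded spaces. The trajectory forecasting ability learned from data of different camera wearers walking around in the real world can be transferred to assist people with impairments in navigation, and to instill human navigation behaviours in mobile robots, enabling better human-robot interaction. To this end, a novel egocentric human trajectory forecasting dataset was constructed, containing real trajectories of people navigating crowded spaces while wearing a camera, together with rich extracted contextual data. We extract and use three different modalities to forecast the trajectory of the camera wearer: his or her past trajectory, the past trajectories of nearby people, and the environment, such as scene semantics or scene depth. A Transformer-based encoder-decoder neural network model, integrated with a novel cascaded cross-attention mechanism that fuses the multiple modalities, has been designed to predict the future trajectory of the camera wearer. Extensive experiments have been conducted, and the results show that our model outperforms state-of-the-art methods in egocentric human trajectory forecasting.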
Knowledge graphs (KGs) serve as a key component of various natural language processing applications. Commonsense knowledge graphs (CKGs) are a special type of KG in which entities and relations are composed of free-form text. However, previous work on KG completion and CKG completion suffers from long-tail relations and newly added relations that do not have many known triples for training. In light of this, few-shot KG completion (FKGC), which requires the strengths of both graph representation learning and few-shot learning, has been proposed to address the problem of limited annotated data. In this paper, we comprehensively survey previous attempts at such tasks in the form of a series of methods and applications. Specifically, we first introduce FKGC challenges, commonly used KGs, and CKGs. Then we systematically categorize and summarize existing works in terms of the type of KG and the methods. Finally, we present applications of FKGC models to prediction tasks in different areas and share our thoughts on future research directions for FKGC.
Unsupervised domain adaptation (UDA) for semantic segmentation is a promising task freeing people from heavy annotation work. However, domain discrepancies in low-level image statistics and high-level contexts compromise the segmentation performance over the target domain. A key idea to tackle this problem is to perform both image-level and feature-level adaptation jointly. Unfortunately, there is a lack of such unified approaches for UDA tasks in the existing literature. This paper proposes a novel UDA pipeline for semantic segmentation that unifies image-level and feature-level adaptation. Concretely, for image-level domain shifts, we propose a global photometric alignment module and a global texture alignment module that align images in the source and target domains in terms of image-level properties. For feature-level domain shifts, we perform global manifold alignment by projecting pixel features from both domains onto the feature manifold of the source domain; and we further regularize category centers in the source domain through a category-oriented triplet loss and perform target domain consistency regularization over augmented target domain images. Experimental results demonstrate that our pipeline significantly outperforms previous methods. In the commonly tested GTA5$\rightarrow$Cityscapes task, our proposed method using Deeplab V3+ as the backbone surpasses previous SOTA by 8%, achieving 58.2% in mIoU.
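As a rough illustration of what a category-oriented triplet loss over category centers might look like, the sketch below pulls a feature toward the center of its own category and pushes it away from the nearest other center by a margin. The abstract does not give the exact formulation, so the Euclidean distance, the hinge form, the `margin` value, and the function names are all assumptions.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def category_triplet_loss(feature, label, centers, margin=1.0):
    """Hinge-style triplet loss against category centers (illustrative only).

    `centers` is a hypothetical map {category: center vector} holding at
    least two categories. The positive is the feature's own center, the
    negative is the closest center of any other category.
    """
    positive = euclidean(feature, centers[label])
    negative = min(
        euclidean(feature, center)
        for category, center in centers.items()
        if category != label
    )
    return max(0.0, positive - negative + margin)
```

The loss is zero once a feature is closer to its own center than to any other center by at least the margin, so only confusable features contribute.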
Given the increasingly intricate forms of partial differential equations (PDEs) in physics and related fields, computationally solving PDEs without analytic solutions inevitably suffers from a trade-off between accuracy and efficiency. Recent advances in neural operators, a class of mesh-independent, neural-network-based PDE solvers, suggest that this challenge can be overcome. In this emerging direction, the Koopman neural operator (KNO) is a representative example and outperforms other state-of-the-art alternatives in terms of accuracy and efficiency. Here we present KoopmanLab, a self-contained and user-friendly PyTorch module of the Koopman neural operator family for solving partial differential equations. Beyond the original version of KNO, we develop multiple new variants of KNO based on different neural network architectures to improve the general applicability of our module. These variants are validated by mesh-independent and long-term prediction experiments on representative PDEs (e.g., the Navier-Stokes equation and the Bateman-Burgers equation) and on ERA5 (i.e., one of the largest high-resolution data sets of global-scale climate fields). These demonstrations suggest the potential of KoopmanLab for diverse applications of partial differential equations.
Different people speak with diverse personalized speaking styles. Although existing one-shot talking head methods have made significant progress in lip sync, natural facial expressions, and stable head motions, they still cannot generate diverse speaking styles in the final talking head videos. To tackle this problem, we propose a one-shot style-controllable talking face generation framework. In a nutshell, we aim to attain a speaking style from an arbitrary reference speaking video and then drive the one-shot portrait to speak with the reference speaking style and another piece of audio. Specifically, we first develop a style encoder to extract dynamic facial motion patterns of a style reference video and then encode them into a style code. Afterward, we introduce a style-controllable decoder to synthesize stylized facial animations from the speech content and style code. In order to integrate the reference speaking style into generated videos, we design a style-aware adaptive transformer, which enables the encoded style code to adjust the weights of the feed-forward layers accordingly. Thanks to the style-aware adaptation mechanism, the reference speaking style can be better embedded into synthesized videos during decoding. Extensive experiments demonstrate that our method is capable of generating talking head videos with diverse speaking styles from only one portrait image and an audio clip while achieving authentic visual effects. Project Page: https://github.com/FuxiVirtualHuman/styletalk.
The Transformer has achieved impressive success in various computer vision tasks. However, most existing studies require pretraining the Transformer backbone on a large-scale labeled dataset (e.g., ImageNet) to achieve satisfactory performance, and such a dataset is usually unavailable for medical images. Additionally, due to the gap between medical and natural images, the improvement provided by ImageNet pretrained weights degrades significantly when the weights are transferred to medical image processing tasks. In this paper, we propose Bootstrap Own Latent of Transformer (BOLT), a self-supervised learning approach designed specifically for medical image classification with a Transformer backbone. BOLT consists of two networks, namely the online and target branches, for self-supervised representation learning. Concretely, the online network is trained to predict the target network's representation of the same patch embedding tokens under a different perturbation. To make the most of the Transformer given limited medical data, we propose an auxiliary difficulty ranking task, in which the Transformer is required to identify which branch (i.e., online or target) is processing the more difficult perturbed tokens. Overall, the Transformer is driven to distill transformation-invariant features from the perturbed tokens so that it can simultaneously measure difficulty and maintain the consistency of the self-supervised representations. The proposed BOLT is evaluated on three medical image processing tasks, i.e., skin lesion classification, knee fatigue fracture grading, and diabetic retinopathy grading. The experimental results validate the superiority of BOLT for medical image classification compared with ImageNet pretrained weights and state-of-the-art self-supervised learning approaches.
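The online/target setup described above resembles the general bootstrap-your-own-latent recipe, which can be sketched as follows. This is an illustration of that general recipe under stated assumptions, not the BOLT training code: the abstract does not say how the target branch is updated or which loss is used, so the normalized regression loss, the moving-average update, the `momentum` value, and all function names here are assumptions.

```python
import math


def normalize(v):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def consistency_loss(online_prediction, target_representation):
    """Squared distance between unit vectors, i.e. 2 - 2 * cosine similarity."""
    p = normalize(online_prediction)
    z = normalize(target_representation)
    return 2.0 - 2.0 * sum(a * b for a, b in zip(p, z))


def ema_update(target_weights, online_weights, momentum=0.99):
    """Move target weights toward online weights (assumed update rule)."""
    return [
        momentum * t + (1.0 - momentum) * o
        for t, o in zip(target_weights, online_weights)
    ]
```

In such a scheme only the online branch receives gradients from the loss, while the target branch changes slowly and provides a stable regression target. The auxiliary difficulty ranking task from the abstract is not modeled here.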
Nearest-Neighbor (NN) classification has proven to be a simple and effective approach for few-shot learning. Query data can be classified efficiently by finding the nearest support class based on features extracted by pretrained deep models. However, NN-based methods are sensitive to the data distribution and may produce false predictions if the samples in the support set happen to lie near the distribution boundary between classes. To solve this issue, we present P3DC-Shot, an improved nearest-neighbor-based few-shot classification method empowered by prior-driven data calibration. Inspired by the distribution calibration technique, which uses the distribution or statistics of the base classes to calibrate the data for few-shot tasks, we propose a novel discrete data calibration operation that is more suitable for NN-based few-shot classification. Specifically, we treat the prototypes representing each base class as priors and calibrate each support sample based on its similarity to the different base prototypes. Then, we perform NN classification using these discretely calibrated support data. Results from extensive experiments on various datasets show that our efficient, non-learning-based method can outperform, or is at least comparable to, SOTA methods that need additional learning steps.
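The calibrate-then-classify idea can be sketched in a few lines. The abstract does not specify the calibration formula, so everything concrete below is an assumption: the cosine similarity, the softmax weighting with a `temperature`, the convex mix controlled by `alpha`, and the function names. It shows only the general shape of using base-class prototypes as priors before a nearest-neighbor lookup.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores, temperature):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def calibrate(support, base_prototypes, alpha=0.5, temperature=0.1):
    """Shift a support feature toward similar base prototypes (assumed rule)."""
    weights = softmax([cosine(support, p) for p in base_prototypes], temperature)
    prior = [
        sum(w * p[i] for w, p in zip(weights, base_prototypes))
        for i in range(len(support))
    ]
    return [(1.0 - alpha) * s + alpha * q for s, q in zip(support, prior)]


def nearest_class(query, calibrated_supports):
    """Return the label whose calibrated support is most similar to the query."""
    return max(
        calibrated_supports,
        key=lambda label: cosine(query, calibrated_supports[label]),
    )
```

No parameters are trained in this sketch, which matches the non-learning character the abstract describes. With `alpha=0.0` the calibration step leaves the support feature unchanged and the method reduces to plain NN classification.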