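Traditional approaches for learning 3D object categories use either synthetic data or manual supervision. In this paper, we propose a method that requires no manual annotations and is instead cued by observing objects from a moving vantage point. Our system builds on two innovations: a Siamese viewpoint factorization network that robustly aligns different videos without explicitly comparing 3D shapes; and a 3D shape completion network that can extract the full shape of an object from partial observations. We also demonstrate the benefits of configuring networks to perform probabilistic predictions, as well as of geometry-aware data augmentation schemes. We obtain state-of-the-art results on publicly available benchmarks.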
translated by Google Translate
Understanding the 3D world without supervision is currently a major challenge in computer vision as the annotations required to supervise deep networks for tasks in this domain are expensive to obtain on a large scale. In this paper, we address the problem of unsupervised viewpoint estimation. We formulate this as a self-supervised learning task, where image reconstruction provides the supervision needed to predict the camera viewpoint. Specifically, we make use of pairs of images of the same object at training time, from unknown viewpoints, to self-supervise training by combining the viewpoint information from one image with the appearance information from the other. We demonstrate that using a perspective spatial transformer allows efficient viewpoint learning, outperforming existing unsupervised approaches on synthetic data, and obtains competitive results on the challenging PASCAL3D+ dataset.
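Traditional approaches for learning 3D object categories have been predominantly trained and evaluated on synthetic datasets, because real 3D-annotated category-centric data has been unavailable. Our main goal is to facilitate advances in this field by collecting real-world data at a magnitude similar to the existing synthetic counterparts. The principal contribution of this work is thus a large-scale dataset, called Common Objects in 3D, with real multi-view images of object categories annotated with camera poses and ground-truth 3D point clouds. The dataset contains a total of 1.5 million frames from nearly 19,000 videos capturing objects from 50 MS-COCO categories and, as such, it is larger than alternatives in terms of both the number of categories and the number of objects. We exploit this new dataset to conduct one of the first large-scale "in-the-wild" evaluations of several new-view-synthesis and category-centric 3D reconstruction methods. Finally, we contribute a novel neural rendering method that leverages the powerful Transformer to reconstruct an object given a small number of its views. The CO3D dataset is available at https://github.com/facebookresearch/co3d.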
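Modern computer vision has moved beyond the domain of internet photo collections and into the physical world, guiding camera-equipped robots and autonomous cars through unstructured environments. To enable these embodied agents to interact with real-world objects, cameras are increasingly being used as depth sensors, reconstructing the environment for a variety of downstream reasoning tasks. Machine-learning-aided depth perception, or depth estimation, predicts the distance for each pixel in an image. While impressive strides have been made in depth estimation, significant challenges remain: (1) ground-truth depth labels are difficult to collect at scale, (2) camera information is typically assumed to be known but is often unreliable, and (3) restrictive camera assumptions are common, even though a great variety of camera types and lenses are used in practice. In this thesis, we focus on relaxing these assumptions and describe contributions toward the ultimate goal of turning cameras into truly generic depth sensors.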
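We present ROCA, a novel end-to-end approach that retrieves and aligns 3D CAD models from a shape database to a single input image. This enables 3D perception of an observed scene from a 2D RGB observation, characterized as a lightweight, compact, clean CAD representation. Core to our approach is our differentiable alignment optimization based on dense 2D-3D object correspondences and Procrustes alignment. ROCA can thus provide robust CAD alignment while simultaneously informing CAD retrieval, by leveraging the 2D-3D correspondences to learn geometrically similar CAD models. Experiments on real-world imagery from ScanNet show that ROCA significantly improves on the state of the art, from 9.5% to 17.6% in retrieval-aware CAD alignment accuracy.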
Per-pixel ground-truth depth data is challenging to acquire at scale. To overcome this limitation, self-supervised learning has emerged as a promising alternative for training models to perform monocular depth estimation. In this paper, we propose a set of improvements, which together result in both quantitatively and qualitatively improved depth maps compared to competing self-supervised methods. Research on self-supervised monocular training usually explores increasingly complex architectures, loss functions, and image formation models, all of which have recently helped to close the gap with fully-supervised methods. We show that a surprisingly simple model, and associated design choices, lead to superior predictions. In particular, we propose (i) a minimum reprojection loss, designed to robustly handle occlusions, (ii) a full-resolution multi-scale sampling method that reduces visual artifacts, and (iii) an auto-masking loss to ignore training pixels that violate camera motion assumptions. We demonstrate the effectiveness of each component in isolation, and show high quality, state-of-the-art results on the KITTI benchmark.
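As a rough illustration of items (i) and (iii), the sketch below takes the per-pixel minimum of the photometric error over several warped source frames, then keeps only pixels where the warped error beats the error of the unwarped source frames. It is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the photometric error is a plain L1 difference, and the source frames are assumed to have been warped into the target view already.

```python
import numpy as np


def photometric_error(a, b):
    """Per-pixel L1 error between two H x W x 3 images (assumed error measure)."""
    return np.abs(a - b).mean(axis=-1)


def min_reprojection_loss(target, warped_sources, raw_sources=None):
    """Hypothetical helper illustrating a minimum reprojection loss.

    target:         H x W x 3 target frame.
    warped_sources: source frames already warped into the target view.
    raw_sources:    optional unwarped source frames; when given, pixels whose
                    warped error is not lower than the unwarped error are
                    ignored (an auto-masking style test).
    """
    errors = np.stack([photometric_error(target, w) for w in warped_sources])
    # Taking the minimum over sources lets a pixel that is occluded in one
    # source frame still be explained by another source frame.
    min_error = errors.min(axis=0)
    if raw_sources is None:
        return float(min_error.mean())

    identity = np.stack(
        [photometric_error(target, s) for s in raw_sources]
    ).min(axis=0)
    mask = min_error < identity  # keep pixels where warping actually helps
    if not mask.any():
        return 0.0
    return float(min_error[mask].mean())
```

Averaging the per-source errors instead of taking the minimum would penalise every pixel that is occluded in at least one source frame. The minimum avoids this as long as the pixel is visible in some other source.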
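Advances in deep learning recognition have led to accurate object detection from 2D images. However, these 2D perception methods are insufficient for complete 3D world information. Meanwhile, advanced 3D shape estimation approaches focus on the shape itself, without considering metric scale. These methods cannot determine the accurate location and orientation of objects. To tackle this problem, we propose a framework that jointly estimates a metric-scale shape and pose from a single RGB image. Our framework has two branches: the Metric Scale Object Shape branch (MSOS) and the Normalized Object Coordinate Space branch (NOCS). The MSOS branch estimates the metric-scale shape observed in camera coordinates. The NOCS branch predicts the normalized object coordinate space (NOCS) map and performs a similarity transformation with the depth map rendered from the predicted metric-scale mesh to obtain the 6D pose and size. Additionally, we introduce Normalized Object Center Estimation (NOCE) to estimate the geometrically aligned distance from the camera to the object center. We validate our method on both synthetic and real-world datasets to evaluate category-level object pose and shape.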
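We propose a transformer-based neural network architecture for multi-object 3D reconstruction from RGB videos. It relies on two alternative ways to represent its knowledge: as a global 3D grid of features and as an array of view-specific 2D grids. We progressively exchange information between the two with a dedicated bidirectional attention mechanism. We exploit knowledge about the image formation process to significantly sparsify the attention weight matrix, making our architecture feasible in terms of memory and computation. We attach a DETR-style head on top of the 3D feature grid to detect the objects in the scene and to predict their 3D pose and 3D shape. Compared to previous methods, our architecture is single-stage and end-to-end trainable, and it can reason holistically about a scene from multiple video frames without needing a brittle tracking step. We evaluate our method on the challenging Scan2CAD dataset, where we outperform (1) recent state-of-the-art methods for 3D object pose estimation from RGB videos; and (2) a strong alternative method combining multi-view stereo with RGB-D CAD alignment. We plan to release our source code.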
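Humans can perceive scenes in 3D from a handful of 2D views. For AI agents, the ability to recognize a scene from any viewpoint given only a few images enables them to interact efficiently with the scene and its objects. In this work, we attempt to endow machines with this ability. We propose a model that takes as input a few RGB images of a new scene and recognizes the scene from novel viewpoints by segmenting it into semantic categories, all without access to the RGB images from those views. We pair 2D scene recognition with an implicit 3D representation and learn from multi-view 2D annotations of hundreds of scenes, without any 3D supervision beyond camera poses. We experiment on challenging datasets and demonstrate our model's ability to jointly capture the semantics and geometry of novel scenes with diverse layouts, object types, and shapes.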
Training a Neural Radiance Field (NeRF) without pre-computed camera poses is challenging. Recent advances in this direction demonstrate the possibility of jointly optimising a NeRF and camera poses in forward-facing scenes. However, these methods still face difficulties during dramatic camera movement. We tackle this challenging problem by incorporating undistorted monocular depth priors. These priors are generated by correcting scale and shift parameters during training, with which we are then able to constrain the relative poses between consecutive frames. This constraint is achieved using our proposed novel loss functions. Experiments on real-world indoor and outdoor scenes show that our method can handle challenging camera trajectories and outperforms existing methods in terms of novel view rendering quality and pose estimation accuracy.
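To make the scale and shift parameters concrete: a monocular depth network typically predicts depth only up to an unknown affine transform, so a corrected depth map has the form `scale * depth + shift`. The sketch below fits those two numbers in closed form against a reference depth map. This is an illustrative assumption only: the abstract says the parameters are corrected during training, and the function name and the use of a reference map are hypothetical.

```python
import numpy as np


def fit_scale_shift(mono_depth, reference_depth):
    """Least-squares scale and shift so that scale * mono + shift ~ reference.

    Hypothetical helper for illustration; the abstract describes correcting
    these parameters during training rather than fitting them to a reference.
    """
    x = mono_depth.ravel()
    y = reference_depth.ravel()
    # Design matrix with one column for the scale and one for the shift.
    A = np.stack([x, np.ones_like(x)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(scale), float(shift)


def undistort_depth(mono_depth, scale, shift):
    """Apply the affine correction to a monocular depth map."""
    return scale * mono_depth + shift
```

Once depth maps from consecutive frames are corrected into a common metric, they can be compared geometrically. This is what makes them usable as a constraint on the relative pose between frames.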
A key technical challenge in performing 6D object pose estimation from RGB-D image is to fully leverage the two complementary data sources. Prior works either extract information from the RGB image and depth separately or use costly post-processing steps, limiting their performances in highly cluttered scenes and real-time applications. In this work, we present DenseFusion, a generic framework for estimating 6D pose of a set of known objects from RGB-D images. DenseFusion is a heterogeneous architecture that processes the two data sources individually and uses a novel dense fusion network to extract pixel-wise dense feature embedding, from which the pose is estimated. Furthermore, we integrate an end-to-end iterative pose refinement procedure that further improves the pose estimation while achieving near real-time inference. Our experiments show that our method outperforms state-of-the-art approaches in two datasets, YCB-Video and LineMOD. We also deploy our proposed method to a real robot to grasp and manipulate objects based on the estimated pose. Our code and video are available at https://sites.google.com/view/densefusion/.
We present an unsupervised learning framework for the task of monocular depth and camera motion estimation from unstructured video sequences. In common with recent work [10,14,16], we use an end-to-end learning approach with view synthesis as the supervisory signal. In contrast to the previous work, our method is completely unsupervised, requiring only monocular video sequences for training. Our method uses single-view depth and multiview pose networks, with a loss based on warping nearby views to the target using the computed depth and pose. The networks are thus coupled by the loss during training, but can be applied independently at test time. Empirical evaluation on the KITTI dataset demonstrates the effectiveness of our approach: 1) monocular depth performs comparably with supervised methods that use either ground-truth pose or depth for training, and 2) pose estimation performs favorably compared to established SLAM systems under comparable input settings.
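The warping step behind this view-synthesis loss can be sketched as follows: each target pixel is back-projected with its predicted depth, moved by the predicted relative pose, and re-projected into the nearby view, which gives the location at which to sample that view. This is a minimal NumPy sketch under a pinhole camera assumption. The function name and argument layout are hypothetical, and the differentiable bilinear sampling a real training pipeline would need is omitted.

```python
import numpy as np


def project_to_source(depth, K, T_target_to_source):
    """Map every target pixel to its coordinates in a source view.

    depth:              H x W predicted depth of the target frame.
    K:                  3 x 3 pinhole intrinsics (assumed shared by both views).
    T_target_to_source: 4 x 4 rigid transform from target to source camera.
    Returns a 2 x H x W array of (u, v) source-pixel coordinates.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pixels = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1).astype(float)

    # Back-project pixels to 3D points in the target camera frame.
    points = (np.linalg.inv(K) @ pixels) * depth.reshape(1, -1)

    # Move the points into the source camera frame.
    points_h = np.vstack([points, np.ones((1, points.shape[1]))])
    cam_src = (T_target_to_source @ points_h)[:3]

    # Perspective projection into the source image.
    proj = K @ cam_src
    uv = proj[:2] / proj[2:3]
    return uv.reshape(2, h, w)
```

Sampling the source image at the returned coordinates yields a synthesized target view. Its photometric difference from the real target frame depends on both the depth and the pose predictions, which is why the two networks are coupled through the loss.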
Estimating 6D poses of objects from images is an important problem in various applications such as robot manipulation and virtual reality. While direct regression of images to object poses has limited accuracy, matching rendered images of an object against the input image can produce accurate results. In this work, we propose a novel deep neural network for 6D pose matching named DeepIM. Given an initial pose estimation, our network is able to iteratively refine the pose by matching the rendered image against the observed image. The network is trained to predict a relative pose transformation using a disentangled representation of 3D location and 3D orientation and an iterative training process. Experiments on two commonly used benchmarks for 6D pose estimation demonstrate that DeepIM achieves large improvements over state-of-the-art methods. We furthermore show that DeepIM is able to match previously unseen objects.
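Despite significant progress in the past few years, challenges remain for depth estimation from a single monocular image. First, it is nontrivial to train a metric-depth prediction model that generalizes well to diverse scenes, mainly because of limited training data. Researchers have therefore built large-scale relative-depth datasets, which are much easier to collect. However, existing relative-depth estimation models often fail to recover accurate 3D scene shapes because of the unknown depth shift caused by training on relative-depth data. We tackle this problem here and attempt to estimate accurate scene shapes by training on large-scale relative-depth data and estimating the depth shift. To do so, we propose a two-stage framework that first predicts depth up to an unknown scale and shift from a single monocular image, and then exploits 3D point cloud data to predict the depth shift and the camera's focal length, which allow us to recover 3D scene shapes. Because the two modules are trained separately, we do not need strictly paired training data. In addition, we propose an image-level normalized regression loss and a normal-based geometry loss to improve training with relative-depth annotations. We test our depth model on nine unseen datasets and achieve state-of-the-art performance in zero-shot evaluation. Code is available at: https://git.io/depth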
We introduce an approach for recovering the 6D pose of multiple known objects in a scene captured by a set of input images with unknown camera viewpoints. First, we present a single-view single-object 6D pose estimation method, which we use to generate 6D object pose hypotheses. Second, we develop a robust method for matching individual 6D object pose hypotheses across different input images in order to jointly estimate camera viewpoints and 6D poses of all objects in a single consistent scene. Our approach explicitly handles object symmetries, does not require depth measurements, is robust to missing or incorrect object hypotheses, and automatically recovers the number of objects in the scene. Third, we develop a method for global scene refinement given multiple object hypotheses and their correspondences across views. This is achieved by solving an object-level bundle adjustment problem that refines the poses of cameras and objects to minimize the reprojection error in all views. We demonstrate that the proposed method, dubbed CosyPose, outperforms current state-of-the-art results for single-view and multi-view 6D object pose estimation by a large margin on two challenging benchmarks: the YCB-Video and T-LESS datasets. Code and pre-trained models are available on the project webpage.
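Building upon recent progress in novel view synthesis, we propose its application to improve monocular depth estimation. In particular, we propose a novel training method split into three main steps. First, the prediction results of a monocular depth network are warped to an additional viewpoint. Second, we apply an additional image synthesis network, which corrects and improves the quality of the warped RGB image. The output of this network is required to look as similar as possible to the ground-truth view by minimizing the pixel-wise RGB reconstruction error. Third, we reapply the same monocular depth estimation to the synthesized second viewpoint and ensure that the depth predictions are consistent with the associated ground-truth depth. Experimental results show that our method achieves state-of-the-art or comparable performance on the KITTI and NYU-Depth-v2 datasets with a lightweight and simple vanilla U-Net architecture.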
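6D object pose estimation is one of the fundamental problems in computer vision and robotics research. While many recent efforts have been made to generalize pose estimation to novel object instances within the same category (category-level 6D pose estimation), it is still restricted to constrained environments given the limited amount of annotated data. In this paper, we collect Wild6D, a new unlabeled RGBD object video dataset with diverse instances and backgrounds. We use this data to generalize category-level 6D object pose estimation in the wild with semi-supervised learning. We propose a new model, called Rendering for Pose estimation network (RePoNet), which is jointly trained using the free ground truth that comes with synthetic data and a silhouette-matching objective function on real-world data. Without using any 3D annotations on real data, our method outperforms state-of-the-art methods by a large margin on the previous dataset and on our Wild6D test set (with manual annotations for evaluation). Project page with Wild6D data: https://oasisyang.github.io/semi-pose.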
We present a learnt system for multi-view stereopsis. In contrast to recent learning based methods for 3D reconstruction, we leverage the underlying 3D geometry of the problem through feature projection and unprojection along viewing rays. By formulating these operations in a differentiable manner, we are able to learn the system end-to-end for the task of metric 3D reconstruction. End-to-end learning allows us to jointly reason about shape priors while conforming to geometric constraints, enabling reconstruction from much fewer images (even a single image) than required by classical approaches as well as completion of unseen surfaces. We thoroughly evaluate our approach on the ShapeNet dataset and demonstrate the benefits over classical approaches and recent learning based methods.
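We propose a three-stage 6 DoF object detection method called DPODv2 (Dense Pose Object Detector) that relies on dense correspondences. We combine a 2D object detector with a dense correspondence estimation network and a multi-view pose refinement method to estimate a full 6 DoF pose. Unlike other deep learning methods that are typically restricted to monocular RGB images, we propose a unified deep learning network that allows different imaging modalities to be used (RGB or depth). Moreover, we propose a novel pose refinement method based on differentiable rendering. The main concept is to compare predicted and rendered correspondences in multiple views to obtain a pose that is consistent with the predicted correspondences in all views. Our proposed method is evaluated rigorously on different data modalities and types of training data in a controlled setup. The main conclusion is that RGB excels at correspondence estimation, while depth contributes to pose accuracy if good 3D-3D correspondences are available. Naturally, their combination achieves the best overall performance. We perform an extensive evaluation and an ablation study to analyze and validate the results on several challenging datasets. DPODv2 achieves excellent results on all of them while remaining fast and scalable, independent of the data modality used and the type of training data.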
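Learning deformable 3D objects from 2D images is often an ill-posed problem. Existing methods rely on explicit supervision to establish multi-view correspondences, such as template shape models and keypoint annotations, which restricts their applicability to objects "in the wild". A more natural way of establishing correspondences is to watch videos of objects moving around. In this paper, we present DOVE, a method that learns textured 3D models from monocular videos available online, without keypoint, viewpoint, or template shape supervision. By resolving symmetry-induced pose ambiguities and leveraging temporal correspondences in videos, the model automatically learns to factor out 3D shape, articulated pose, and texture from each individual RGB frame, and is ready for single-image inference at test time. In the experiments, we show that existing methods fail to learn sensible 3D shapes without additional keypoint or template supervision, whereas our method produces temporally consistent 3D models that can be animated and rendered from arbitrary viewpoints.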