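Explainability methods for NLP systems encounter a version of the fundamental problem of causal inference: for a given ground-truth input text, we never truly observe the counterfactual texts necessary for isolating the causal effects of model representations on outputs. In response, many explainability methods make no use of counterfactual texts, assuming they will be unavailable. In this paper, we show that robust causal explainability methods can be created using approximate counterfactuals, which can be written by humans to approximate a specific counterfactual or simply sampled using metadata-guided heuristics. The core of our proposal is the Causal Proxy Model (CPM). A CPM explains a black-box model $\mathcal{N}$ because it is trained to have the same actual input/output behavior as $\mathcal{N}$ while creating neural representations that can be intervened upon to simulate the counterfactual input/output behavior of $\mathcal{N}$. Furthermore, we show that the best CPM for $\mathcal{N}$ performs comparably to $\mathcal{N}$ in making factual predictions, which means that the CPM can simply replace $\mathcal{N}$, leading to more explainable deployed models. Our code is available at https://github.com/frankaging/causal-proxy-model.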
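Distillation efforts have led to language models that are more compact, without serious drops in performance. The standard approach to distillation trains a student model against two objectives: a task-specific objective (e.g., language modeling) and an imitation objective that encourages the hidden states of the student model to be similar to those of the larger teacher model. In this paper, we show that it is beneficial to augment distillation with a third objective that encourages the student to imitate the causal computation process of the teacher through interchange intervention training (IIT). IIT pushes the student model to become a causal abstraction of the teacher model, that is, a simpler model with the same causal structure. IIT is fully differentiable, easily implemented, and combines flexibly with other objectives. Compared with standard distillation of BERT, distillation via IIT results in lower perplexity on Wikipedia (masked language modeling) and improvements on the GLUE benchmark (natural language understanding), SQuAD (question answering), and CoNLL-2003 (named entity recognition).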
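In many areas, we have well-founded insights about causal structure that would be useful to bring into our trained models while still allowing them to learn in a data-driven fashion. To achieve this, we present the new method of interchange intervention training (IIT). In IIT, we (1) align variables in a causal model with representations in a neural model and (2) train the neural model to match the counterfactual behavior of the causal model on a base input when the aligned representations in both models are set to the values they would take for a second, source input. IIT is fully differentiable, combines flexibly with other objectives, and guarantees that the target causal model is a causal abstraction of the neural model when its loss is minimized. We evaluate IIT on a structured vision task (MNIST-PVR) and a navigational instruction task (ReaSCAN). We compare IIT against multi-task training objectives and data augmentation. In all our experiments, IIT achieves the best results and produces neural models that are more interpretable in the sense that they realize the target causal model.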
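The interchange intervention described in this abstract can be sketched on a toy example. This is a minimal illustration under stated assumptions, not the authors' implementation: the causal model (S = x0 + x1, output = S * x2), the three-weight "network", the alignment of S with hidden unit h0, and every function name below are hypothetical. The sketch only computes the loss for one (base, source) pair; in actual training that loss would be minimized with respect to the weights.

```python
# Toy interchange intervention. All names and both models are hypothetical.

def causal_model(x, s_override=None):
    """High-level causal model: S = x0 + x1, output = S * x2.
    If s_override is given, S is set to that value (an intervention)."""
    s = x[0] + x[1] if s_override is None else s_override
    return s * x[2], s

def neural_model(x, w, h0_override=None):
    """Tiny stand-in for a network: h0 = w0*x0 + w1*x1, h1 = w2*x2,
    output = h0 * h1. h0 is the unit we align with the causal variable S."""
    h0 = w[0] * x[0] + w[1] * x[1]
    if h0_override is not None:
        h0 = h0_override  # overwrite the aligned representation
    h1 = w[2] * x[2]
    return h0 * h1, h0

def iit_loss(base, source, w):
    """Squared error between the causal model's counterfactual output and the
    neural model's output under the same (aligned) interchange intervention."""
    # Counterfactual target: run on base, but with S taken from the source run.
    _, s_src = causal_model(source)
    target, _ = causal_model(base, s_override=s_src)
    # Same intervention in the neural model: h0 taken from the source run.
    _, h_src = neural_model(source, w)
    pred, _ = neural_model(base, w, h0_override=h_src)
    return (pred - target) ** 2

if __name__ == "__main__":
    base, source = (1, 2, 3), (4, 5, 6)
    print(iit_loss(base, source, (1, 1, 1)))  # weights that realize the causal model
    print(iit_loss(base, source, (2, 1, 1)))  # weights that do not
```

With weights (1, 1, 1) the hidden unit h0 computes exactly S, so the intervened network agrees with the causal model and the loss is 0; with weights (2, 1, 1) the two disagree and the loss is positive.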
We address the task of open-world class-agnostic object detection, i.e., detecting every object in an image by learning from a limited number of base object classes. State-of-the-art RGB-based models suffer from overfitting the training classes and often fail at detecting novel-looking objects. This is because RGB-based models primarily rely on appearance similarity to detect novel objects and are also prone to overfitting short-cut cues such as textures and discriminative parts. To address these shortcomings of RGB-based object detectors, we propose incorporating geometric cues such as depth and normals, predicted by general-purpose monocular estimators. Specifically, we use the geometric cues to train an object proposal network for pseudo-labeling unannotated novel objects in the training set. Our resulting Geometry-guided Open-world Object Detector (GOOD) significantly improves detection recall for novel object categories and already performs well with only a few training classes. Using a single "person" class for training on the COCO dataset, GOOD surpasses SOTA methods by 5.0% AR@100, a relative improvement of 24%.
Neural fields have revolutionized the area of 3D reconstruction and novel view synthesis of rigid scenes. A key challenge in making such methods applicable to articulated objects, such as the human body, is to model the deformation of 3D locations between the rest pose (a canonical space) and the deformed space. We propose a new articulation module for neural fields, Fast-SNARF, which finds accurate correspondences between canonical space and posed space via iterative root finding. Fast-SNARF is a drop-in replacement in functionality to our previous work, SNARF, while significantly improving its computational efficiency. We contribute several algorithmic and implementation improvements over SNARF, yielding a speed-up of $150\times$. These improvements include voxel-based correspondence search, pre-computing the linear blend skinning function, and an efficient software implementation with CUDA kernels. Fast-SNARF enables efficient and simultaneous optimization of shape and skinning weights given deformed observations without correspondences (e.g. 3D meshes). Because learning of deformation maps is a crucial component in many 3D human avatar methods and since Fast-SNARF provides a computationally efficient solution, we believe that this work represents a significant step towards the practical creation of 3D virtual humans.
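The idea of recovering a canonical point from a deformed one by iterative root finding can be illustrated in 2D. This is a hedged sketch, not Fast-SNARF itself: the abstract does not specify the solver, so Newton's method with a finite-difference Jacobian is used here purely for illustration, and the two-bone setup, the sigmoid skinning weight, and all names and constants are assumptions.

```python
import math

THETA = 0.5      # hypothetical rotation of bone 1 (radians)
T = (0.1, 0.0)   # hypothetical translation of bone 1

def skin_weight(x):
    """Skinning weight of bone 1 at canonical point x; bone 0 gets 1 - w."""
    return 1.0 / (1.0 + math.exp(-2.0 * x[0]))

def forward_lbs(x):
    """Linear blend skinning: blend an identity bone with a rigid bone."""
    w = skin_weight(x)
    c, s = math.cos(THETA), math.sin(THETA)
    bx = c * x[0] - s * x[1] + T[0]
    by = s * x[0] + c * x[1] + T[1]
    return ((1 - w) * x[0] + w * bx, (1 - w) * x[1] + w * by)

def find_canonical(target, iters=50, eps=1e-6, tol=1e-20):
    """Solve forward_lbs(x) = target for x with Newton iterations,
    starting from the deformed point itself."""
    x = list(target)
    for _ in range(iters):
        fx = forward_lbs(x)
        r = (fx[0] - target[0], fx[1] - target[1])  # residual
        if r[0] * r[0] + r[1] * r[1] < tol:
            break
        # Central-difference Jacobian, stored column by column.
        cols = []
        for j in range(2):
            xp, xm = list(x), list(x)
            xp[j] += eps
            xm[j] -= eps
            fp, fm = forward_lbs(xp), forward_lbs(xm)
            cols.append(((fp[0] - fm[0]) / (2 * eps),
                         (fp[1] - fm[1]) / (2 * eps)))
        a, c_ = cols[0]  # df0/dx0, df1/dx0
        b, d = cols[1]   # df0/dx1, df1/dx1
        det = a * d - b * c_
        # Newton step x <- x - J^{-1} r, using the closed-form 2x2 inverse.
        x[0] -= (d * r[0] - b * r[1]) / det
        x[1] -= (-c_ * r[0] + a * r[1]) / det
    return tuple(x)

if __name__ == "__main__":
    deformed = forward_lbs((0.3, 0.2))
    canonical = find_canonical(deformed)
    print(canonical, forward_lbs(canonical), deformed)
```

Because the skinning weights depend on the unknown canonical point, the forward map has no closed-form inverse, which is why an iterative solver is needed at all.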
We present a unified formulation and model for three motion and 3D perception tasks: optical flow, rectified stereo matching and unrectified stereo depth estimation from posed images. Unlike previous specialized architectures for each specific task, we formulate all three tasks as a unified dense correspondence matching problem, which can be solved with a single model by directly comparing feature similarities. Such a formulation calls for discriminative feature representations, which we achieve using a Transformer, in particular the cross-attention mechanism. We demonstrate that cross-attention enables integration of knowledge from another image via cross-view interactions, which greatly improves the quality of the extracted features. Our unified model naturally enables cross-task transfer since the model architecture and parameters are shared across tasks. We outperform RAFT with our unified model on the challenging Sintel dataset, and our final model that uses a few additional task-specific refinement steps outperforms or compares favorably to recent state-of-the-art methods on 10 popular flow, stereo and depth datasets, while being simpler and more efficient in terms of model design and inference speed.
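Matching by directly comparing feature similarities can be sketched in one dimension. The following is an illustrative assumption rather than the paper's model: it takes per-pixel feature vectors as given (no Transformer), scores every candidate position by a dot product, turns the scores into weights with a softmax, and reads off the displacement as the expected matching position minus the query position. The function names and the `temperature` parameter are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def match_1d(feats_a, feats_b, temperature=0.05):
    """For each position i in feats_a, return the estimated displacement
    to its best-matching position in feats_b."""
    flows = []
    for i, fa in enumerate(feats_a):
        # Dot-product similarity against every candidate position.
        sims = [sum(p * q for p, q in zip(fa, fb)) / temperature
                for fb in feats_b]
        weights = softmax(sims)
        # Soft argmax: expected candidate index under the matching weights.
        expected = sum(w * j for j, w in enumerate(weights))
        flows.append(expected - i)
    return flows

if __name__ == "__main__":
    e = [[1.0 if k == j else 0.0 for k in range(4)] for j in range(4)]
    feats_a = [e[0], e[1], e[2]]        # three pixels with distinct features
    feats_b = [e[3], e[0], e[1], e[2]]  # same content shifted right by one
    print(match_1d(feats_a, feats_b))
```

In this framing, optical flow, stereo, and depth differ mainly in which candidate positions are compared (a 2D neighborhood, a scanline, or samples along an epipolar line), which is what makes a single shared matching model plausible.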
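The aftermath of air raids can still be seen decades after the devastating events. Unexploded ordnance (UXO) is an immense danger to human life and the environment. By assessing wartime images, experts can infer the occurrence of a dud. The current manual analysis process is expensive and time-consuming, so automated detection of bomb craters using deep learning is a promising way to improve the UXO disposal process. However, these methods require a large amount of manually labeled training data. This work leverages domain adaptation with moon surface images to address the problem of automated bomb crater detection with deep learning under the constraint of limited training data. This paper contributes to both academia and practice (1) by providing a solution approach for automated bomb crater detection with limited training data and (2) by demonstrating the usability and associated challenges of using synthetic images for domain adaptation.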
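We present e3nn, a generalized framework for creating E(3) equivariant trainable functions, also known as Euclidean neural networks. e3nn naturally operates on geometry and geometric tensors that describe systems in 3D and transform predictably under a change of coordinate system. At the core of e3nn are equivariant operations such as the TensorProduct class or the spherical harmonics functions, which can be composed to create more complex modules such as convolutions and attention mechanisms. These core operations of e3nn can be used to efficiently articulate Tensor Field Networks, 3D Steerable CNNs, Clebsch-Gordan Networks, SE(3) Transformers, and other E(3) equivariant networks.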
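What "transform predictably under a change of coordinate system" means can be checked numerically with plain Python. The sketch below does not use the e3nn API; it is a hypothetical helper that tests rotation equivariance of a two-vector operation f, that is, whether f(Ra, Rb) equals R f(a, b). The cross product passes this test, while an elementwise product does not.

```python
import math

def rot_z(angle):
    """3x3 rotation matrix about the z axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def apply(m, v):
    """Matrix-vector product for a 3x3 matrix and a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

def cross(a, b):
    """Cross product of two 3-vectors (rotation equivariant)."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def elementwise(a, b):
    """Elementwise product (not rotation equivariant in general)."""
    return [p * q for p, q in zip(a, b)]

def equivariance_gap(f, r, a, b):
    """Largest componentwise difference between f(Ra, Rb) and R f(a, b)."""
    lhs = f(apply(r, a), apply(r, b))
    rhs = apply(r, f(a, b))
    return max(abs(p - q) for p, q in zip(lhs, rhs))

if __name__ == "__main__":
    r = rot_z(0.7)
    a, b = [1.0, 2.0, 3.0], [0.5, -1.0, 2.0]
    print(equivariance_gap(cross, r, a, b))        # near zero
    print(equivariance_gap(elementwise, r, a, b))  # clearly nonzero
```

An equivariant network is built so that every layer passes a check of this kind by construction, rather than having to learn the symmetry from data.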
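State-of-the-art 3D-aware generative models rely on coordinate-based MLPs to parameterize 3D radiance fields. While demonstrating impressive results, querying an MLP for every sample along each ray leads to slow rendering. Therefore, existing approaches often render low-resolution feature maps and process them with an upsampling network to obtain the final image. Albeit efficient, neural rendering often entangles viewpoint and content such that changing the camera pose results in unwanted changes of geometry or appearance. Motivated by recent results in voxel-based novel view synthesis, we investigate the utility of sparse voxel grid representations for fast and 3D-consistent generative modeling in this paper. Our results demonstrate that monolithic MLPs can indeed be replaced by 3D convolutions when combining sparse voxel grids with progressive growing, free space pruning, and appropriate regularization. To obtain a compact representation of the scene and allow for scaling to higher voxel resolutions, our model disentangles the foreground object (modeled in 3D) from the background (modeled in 2D). In contrast to existing approaches, our method requires only a single forward pass to generate a full 3D scene. It hence allows for efficient rendering from arbitrary viewpoints while yielding 3D-consistent results with high visual fidelity.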
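Large-scale training data with high-quality annotations is critical for training semantic and instance segmentation models. Unfortunately, pixel-wise annotation is labor-intensive and costly, raising the demand for more efficient labeling strategies. In this work, we present a novel 3D-to-2D label transfer method, Panoptic NeRF, which aims to obtain per-pixel 2D semantic and instance labels from easy-to-obtain coarse 3D bounding primitives. Our method uses NeRF as a differentiable tool to unify coarse 3D annotations and 2D semantic cues transferred from existing datasets. We demonstrate that this combination allows for improved geometry guided by semantic information, enabling rendering of accurate semantic maps across multiple views. Furthermore, this fusion process resolves label ambiguity in the coarse 3D annotations and filters noise in the 2D predictions. By inferring in 3D space and rendering to 2D labels, our 2D semantic and instance labels are multi-view consistent by design. Experimental results show that Panoptic NeRF outperforms existing label transfer methods on challenging urban scenes of the KITTI-360 dataset.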