We propose a methodology that systematically applies deep explanation algorithms on a dataset-wide basis to compare different types of visual recognition backbones, such as convolutional networks (CNNs), global attention networks, and local attention networks. Examining both qualitative visualizations and quantitative statistics across the dataset helps us gain intuitions that are not merely anecdotal, but are supported by statistics computed over the entire dataset. Specifically, we propose two methods. The first, sub-explanation counting, systematically searches for minimally sufficient explanations of all images and counts the number of sub-explanations for each network. The second, called cross-testing, computes salient regions using one network and then evaluates performance by showing only these regions as an image to other networks. Through a combination of qualitative insights and quantitative statistics, we show that 1) there are significant differences between the salient features of CNNs and attention models; and 2) the occlusion robustness of local attention models and global attention models may come from different decision-making mechanisms.
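To make the cross-testing procedure concrete, here is a minimal PyTorch sketch of the evaluation step, assuming a saliency map has already been computed with a source network; the function and argument names are illustrative and not the authors' code.

```python
import torch

def cross_test(source_saliency, target_model, image, label, keep_fraction=0.1):
    """Cross-testing sketch: keep only the most salient pixels found with one
    network and measure how well a *different* network classifies the result.

    source_saliency: HxW tensor of saliency scores computed with network A
    target_model:    network B to be evaluated on the masked image
    image:           1x3xHxW input tensor
    """
    # Keep the top-k most salient pixels, zero out the rest.
    k = int(keep_fraction * source_saliency.numel())
    threshold = source_saliency.flatten().topk(k).values.min()
    mask = (source_saliency >= threshold).float()   # HxW binary mask
    masked_image = image * mask                     # broadcast over channels

    with torch.no_grad():
        probs = target_model(masked_image).softmax(dim=-1)
    return probs[0, label].item()                   # confidence of network B
```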
We introduce PointConvFormer, a novel building block for point cloud based deep neural network architectures. Inspired by generalization theory, PointConvFormer combines ideas from point convolution, where filter weights are based only on relative position, and Transformers, which utilize feature-based attention. In PointConvFormer, the feature difference between points in the neighborhood serves as an indicator to re-weight the convolutional weights. Hence, we preserve the invariances of the point convolution operation, while attention is used to select the relevant points in the neighborhood for convolution. To validate the effectiveness of PointConvFormer, we experiment on semantic segmentation and scene flow estimation tasks on point clouds, using datasets including ScanNet, SemanticKITTI, FlyingThings3D, and KITTI. Our results show that PointConvFormer outperforms classic convolutions, regular Transformers, and voxelized sparse convolution approaches with smaller and more efficient networks. Visualizations show that PointConvFormer behaves similarly to convolution on flat surfaces, whereas the neighborhood selection effect is stronger at object boundaries, indicating that it gets the best of both worlds.
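A toy PyTorch sketch of the re-weighting idea described above; the module structure and tensor layout are assumptions made for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class PointConvFormerSketch(nn.Module):
    """Toy sketch of the PointConvFormer idea: position-based convolution
    weights re-weighted by attention computed from feature differences."""

    def __init__(self, in_dim, out_dim, hidden=32):
        super().__init__()
        # Weight function of relative position only (point convolution part).
        self.weight_mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(), nn.Linear(hidden, in_dim * out_dim))
        # Scalar attention from feature differences (Transformer-like part).
        self.attn_mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.in_dim, self.out_dim = in_dim, out_dim

    def forward(self, rel_pos, neigh_feat, center_feat):
        # rel_pos:     (N, K, 3)  neighbor positions relative to each center point
        # neigh_feat:  (N, K, C)  neighbor features
        # center_feat: (N, C)     center point features
        w = self.weight_mlp(rel_pos).view(*rel_pos.shape[:2], self.in_dim, self.out_dim)
        # Attention from feature differences selects relevant neighbors.
        diff = neigh_feat - center_feat.unsqueeze(1)        # (N, K, C)
        attn = torch.softmax(self.attn_mlp(diff), dim=1)    # (N, K, 1)
        weighted = (neigh_feat * attn).unsqueeze(-1) * w    # (N, K, C, out_dim)
        return weighted.sum(dim=(1, 2))                     # (N, out_dim)
```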
Video Object Segmentation (VOS) is fundamental to video understanding. Transformer-based methods have shown significant performance improvements on semi-supervised VOS. However, existing work faces challenges in segmenting visually similar objects in close proximity to each other. In this paper, we propose a novel Bilateral Attention Transformer in Motion-Appearance Neighboring space (BATMAN) for semi-supervised VOS. It captures object motion in the video via a novel optical flow calibration module that fuses the segmentation mask with the optical flow estimate to improve within-object flow smoothness and reduce noise at object boundaries. This calibrated optical flow is then employed in our novel bilateral attention, which computes the correspondence between the query and reference frames in the neighboring bilateral space, considering both motion and appearance. Extensive experiments validate BATMAN by outperforming all existing state-of-the-art methods on all four popular VOS benchmarks: YouTube-VOS 2019 (85.0%), YouTube-VOS 2018 (85.3%), DAVIS 2017 Val/Test-dev (86.2%/82.2%), and DAVIS 2016 (92.5%).
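One plausible reading of the flow-calibration step, sketched below: the segmentation mask pulls per-pixel flow inside the object toward the object's mean flow, smoothing within-object motion while leaving the background untouched. This is an illustrative simplification, not BATMAN's actual module.

```python
import torch

def calibrate_flow(flow, mask, alpha=0.5):
    """Toy illustration of fusing a segmentation mask with optical flow:
    blend per-pixel flow inside the object toward the object's mean flow,
    which smooths within-object motion and suppresses boundary noise.

    flow: (2, H, W) optical flow;  mask: (H, W) soft object mask in [0, 1]
    """
    m = mask.unsqueeze(0)                                              # (1, H, W)
    mean_flow = (flow * m).sum(dim=(1, 2)) / m.sum().clamp(min=1e-6)   # (2,)
    smoothed = mean_flow.view(2, 1, 1).expand_as(flow)
    # Blend toward the mean only where the mask is confident.
    return flow * (1 - alpha * m) + smoothed * (alpha * m)
```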
Recently, Transformers have become popular for learning and inference in the spatiotemporal domain. However, their performance relies on storing, and applying attention to, the feature tensor of every frame in a video. Hence, their space and time complexity grow linearly with the length of the video, which can be very costly for long videos. We propose a novel visual memory network architecture for learning and inference problems in the spatiotemporal domain. We maintain a fixed number of memory slots in the memory network and propose a Gumbel-Softmax-based algorithm to learn an adaptive strategy for updating this memory. Finally, the architecture is benchmarked on video object segmentation (VOS) and video prediction problems. We demonstrate that our memory architecture achieves state-of-the-art results, outperforming Transformer-based and other recent methods on video prediction, while maintaining constant memory capacity independent of the sequence length.
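The memory-update policy can be pictured with the following PyTorch sketch, in which a Gumbel-Softmax draw selects which slot to overwrite; the scoring network and the update rule are assumptions made for illustration, not the paper's design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelMemoryUpdate(nn.Module):
    """Sketch of a fixed-size memory updated with a Gumbel-Softmax policy:
    a small network scores which of the S slots to overwrite with the
    current frame's feature, and the choice stays differentiable."""

    def __init__(self, num_slots, dim):
        super().__init__()
        self.memory = nn.Parameter(torch.zeros(num_slots, dim), requires_grad=False)
        self.scorer = nn.Linear(2 * dim, 1)

    def forward(self, frame_feat, tau=1.0):
        # frame_feat: (dim,) pooled feature of the incoming frame
        expanded = frame_feat.unsqueeze(0).expand_as(self.memory)      # (S, dim)
        scores = self.scorer(torch.cat([self.memory, expanded], dim=-1)).squeeze(-1)
        # One-hot slot selection via straight-through Gumbel-Softmax.
        choice = F.gumbel_softmax(scores, tau=tau, hard=True)          # (S,)
        updated = self.memory * (1 - choice).unsqueeze(-1) + \
                  choice.unsqueeze(-1) * frame_feat
        self.memory.data = updated.detach()
        return updated
```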
Attention maps are a popular way of explaining the decisions of convolutional networks for image classification. Typically, a single attention map is produced for each image of interest, which assigns weights to pixels based on their importance to the classification. A single attention map, however, provides an incomplete understanding, since there are often many other maps that explain the classification equally well. In this paper, we introduce structured attention graphs (SAGs), which compactly represent sets of attention maps for an image by capturing how different combinations of image regions affect the classifier's confidence. We propose a method to compute SAGs and a visualization for SAGs so that deeper insight can be gained into a classifier's decisions. We conduct a user study comparing the use of SAGs to traditional attention maps for answering counterfactual questions about image classification. Our results show that users are more often correct when answering comparative counterfactual questions based on SAGs than with the baselines.
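The measurement underlying a SAG can be sketched as follows: enumerate combinations of candidate image regions, show each combination to the classifier, and record the resulting confidence. This is a brute-force illustration with hypothetical inputs, not the paper's search procedure.

```python
import itertools
import torch

def score_region_subsets(model, image, region_masks, label, max_regions=4):
    """For every small combination of candidate image regions, show only those
    regions to the classifier and record its confidence, so one can see which
    region sets are (minimally) sufficient for the prediction.

    image:        1x3xHxW tensor
    region_masks: list of HxW binary tensors (e.g., superpixel patches)
    """
    scores = {}
    for r in range(1, max_regions + 1):
        for subset in itertools.combinations(range(len(region_masks)), r):
            mask = torch.zeros_like(region_masks[0])
            for i in subset:
                mask = torch.maximum(mask, region_masks[i])
            with torch.no_grad():
                conf = model(image * mask).softmax(dim=-1)[0, label].item()
            scores[subset] = conf
    return scores   # e.g., keep subsets whose confidence exceeds a threshold
```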
Unlike images which are represented in regular dense grids, 3D point clouds are irregular and unordered, hence applying convolution on them can be difficult. In this paper, we extend the dynamic filter to a new convolution operation, named PointConv. PointConv can be applied on point clouds to build deep convolutional networks. We treat convolution kernels as nonlinear functions of the local coordinates of 3D points, consisting of weight and density functions. For a given point, the weight functions are learned with multi-layer perceptron networks and the density functions through kernel density estimation. The most important contribution of this work is a novel reformulation proposed for efficiently computing the weight functions, which allowed us to dramatically scale up the network and significantly improve its performance. The learned convolution kernel can be used to compute translation-invariant and permutation-invariant convolution on any point set in the 3D space. Besides, PointConv can also be used as deconvolution operators to propagate features from a subsampled point cloud back to its original resolution. Experiments on ModelNet40, ShapeNet, and ScanNet show that deep convolutional neural networks built on PointConv are able to achieve state-of-the-art on challenging semantic segmentation benchmarks on 3D point clouds. Besides, our experiments converting CIFAR-10 into a point cloud showed that networks built on PointConv can match the performance of convolutional networks in 2D images of a similar structure.
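For illustration, the density term mentioned above could be computed with a straightforward Gaussian kernel density estimate over the point coordinates; the bandwidth and normalization below are assumptions for this sketch, not the paper's exact formulation.

```python
import torch

def gaussian_kde_density(points, bandwidth=0.1):
    """Sketch of a density estimate usable to re-weight a point convolution:
    an isotropic Gaussian kernel density estimate over 3D point coordinates.

    points: (N, 3) point cloud; returns (N,) density per point.
    """
    dists2 = torch.cdist(points, points) ** 2                 # (N, N) squared distances
    kernel = torch.exp(-dists2 / (2 * bandwidth ** 2))
    # Average kernel response, normalized by the 3D Gaussian constant.
    return kernel.mean(dim=1) / ((2 * torch.pi) ** 1.5 * bandwidth ** 3)
```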
The existence of metallic implants in projection images for cone-beam computed tomography (CBCT) introduces undesired artifacts which degrade the quality of reconstructed images. In order to reduce metal artifacts, projection inpainting is an essential step in many metal artifact reduction algorithms. In this work, a hybrid network combining the shift window (Swin) vision transformer (ViT) and a convolutional neural network is proposed as a baseline network for the inpainting task. To incorporate metal information into the Swin ViT-based encoder, metal-conscious self-embedding and neighborhood-embedding methods are investigated. Both methods have improved the performance of the baseline network. Furthermore, by choosing an appropriate window size, the model with neighborhood-embedding achieves the lowest mean absolute error of 0.079 in metal regions and the highest peak signal-to-noise ratio of 42.346 in CBCT projections. Finally, the effectiveness of metal-conscious embedding is demonstrated on both simulated and real cadaver CBCT data, where it enhances the inpainting capability of the baseline network.
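A hedged sketch of what a metal-conscious embedding might look like: the per-patch metal fraction is embedded and added to the patch tokens before the Swin encoder. The module below illustrates the general idea only; it is not the self-embedding or neighborhood-embedding method from the paper.

```python
import torch
import torch.nn as nn

class MetalMaskEmbedding(nn.Module):
    """Illustrative sketch (not the paper's exact module): inject metal-mask
    information into patch tokens by embedding the per-patch metal fraction
    and adding it to the token features before the transformer encoder."""

    def __init__(self, patch_size, dim):
        super().__init__()
        self.patch = nn.AvgPool2d(patch_size)   # per-patch metal fraction
        self.embed = nn.Linear(1, dim)

    def forward(self, tokens, metal_mask):
        # tokens:     (B, L, dim) patch tokens;  metal_mask: (B, 1, H, W) in {0, 1}
        frac = self.patch(metal_mask).flatten(2).transpose(1, 2)   # (B, L, 1)
        return tokens + self.embed(frac)
```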
During orthopedic surgery, the insertion of metallic implants or screws is often performed under a mobile C-arm system. Due to the high attenuation of metal, severe metal artifacts occur in 3D reconstructions, which greatly degrade image quality. To reduce the artifacts, many metal artifact reduction algorithms have been developed, and inpainting the metal regions in the projection domain is an essential step. In this work, a score-based generative model is trained on simulated knee projections, and the inpainted image is obtained by removing noise during a conditional resampling process. The results show that the images inpainted by the score-based generative model contain more detailed information and achieve the lowest mean absolute error and the highest peak signal-to-noise ratio compared with interpolation-based and CNN-based methods. Moreover, the score-based model can also recover projections with large circular and rectangular masks, demonstrating its generalization to inpainting tasks.
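A schematic single step of such conditional resampling might look like the following; the sampler interface is hypothetical, the noise schedule is omitted, and this is a reading of the general idea rather than the exact sampler used in the paper.

```python
import torch

def conditional_resampling_step(x_t, measured, metal_mask, denoise_step, t, noise_level):
    """One schematic reverse step of score-based inpainting: the generative
    update fills the metal region, while the known (non-metal) region is
    re-noised from the measured projection and pasted back at every step.

    x_t:          current noisy sample
    measured:     measured projection (reliable outside the metal mask)
    metal_mask:   1 inside metal, 0 elsewhere
    denoise_step: one reverse-diffusion / score update, e.g. x_prev = f(x_t, t)
    """
    x_prev = denoise_step(x_t, t)                            # generative update
    # Noise the known pixels to the matching noise level ...
    known = measured + noise_level * torch.randn_like(measured)
    # ... and keep them, letting the model fill only the metal region.
    return metal_mask * x_prev + (1 - metal_mask) * known
```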
Fiducial markers are commonly used in navigation-assisted minimally invasive spine surgery (MISS), where they help transfer image coordinates into real-world coordinates. In practice, these markers may lie outside the field-of-view (FOV) due to the limited detector size of the C-arm cone-beam computed tomography (CBCT) system used intraoperatively. As a consequence, the reconstructed markers in CBCT volumes suffer from artifacts and have distorted shapes, which poses an obstacle for navigation. In this work, we propose two fiducial marker detection methods: direct detection from the distorted markers (direct method) and detection after marker recovery (recovery method). To directly detect the distorted markers in the reconstructed volume, an efficient automatic marker detection method using two neural networks and a conventional circle detection algorithm is proposed. For marker recovery, a task-specific learning strategy is proposed to recover the markers from severely truncated data; afterwards, a conventional marker detection algorithm is applied for position detection. The two methods are evaluated on simulated and real data, and both achieve a marker registration error smaller than 0.2 mm. Our experiments demonstrate that the direct method detects distorted markers accurately, and that the recovery method with task-specific learning is highly robust and generalizes well across data sets. In addition, task-specific learning can accurately reconstruct other structures of interest from severely truncated data, e.g., ribs for image-guided needle biopsy, which enables new potential applications of CBCT systems.
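As an illustration of the conventional circle-detection component, a Hough transform over a reconstructed slice could be used to propose marker candidates; the parameters below are placeholders, and the paper combines this step with the two neural networks.

```python
import cv2
import numpy as np

def detect_marker_circles(slice_2d, min_radius=2, max_radius=15):
    """Hedged illustration of conventional circle detection on one CBCT slice:
    normalize, denoise, then run the Hough circle transform to get candidate
    marker centers and radii."""
    img = cv2.normalize(slice_2d, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    img = cv2.medianBlur(img, 3)   # suppress reconstruction noise
    circles = cv2.HoughCircles(img, cv2.HOUGH_GRADIENT, dp=1, minDist=10,
                               param1=100, param2=20,
                               minRadius=min_radius, maxRadius=max_radius)
    return [] if circles is None else circles[0].tolist()   # [x, y, r] per circle
```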
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes the image and point cloud tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly, by encoding the 3D points into multi-modal features. The core design of CMT is quite simple while its performance is impressive. CMT obtains 73.0% NDS on the nuScenes benchmark. Moreover, CMT is strongly robust even if the LiDAR is missing. Code will be released at https://github.com/junjie18/CMT.
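A rough sketch, based only on the abstract, of how image and point cloud tokens might be fused without explicit view transformation: both token sets receive position embeddings derived from associated 3D coordinates, are concatenated into one sequence, and are decoded by a set of object queries. The dimensions, heads, and box parameterization below are arbitrary assumptions, not CMT's released architecture.

```python
import torch
import torch.nn as nn

class CrossModalTokenFusion(nn.Module):
    """Sketch of cross-modal token fusion: camera and LiDAR tokens are tagged
    with 3D position embeddings, concatenated, and attended to by queries."""

    def __init__(self, dim=256, num_queries=100, num_heads=8):
        super().__init__()
        self.pos_embed = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.queries = nn.Embedding(num_queries, dim)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, num_heads, batch_first=True), num_layers=1)
        self.box_head = nn.Linear(dim, 10)   # e.g. center, size, yaw (sin/cos), velocity

    def forward(self, img_tokens, img_coords3d, pts_tokens, pts_coords3d):
        # *_tokens: (B, N, dim) features;  *_coords3d: (B, N, 3) associated 3D points
        memory = torch.cat([img_tokens + self.pos_embed(img_coords3d),
                            pts_tokens + self.pos_embed(pts_coords3d)], dim=1)
        q = self.queries.weight.unsqueeze(0).expand(memory.size(0), -1, -1)
        return self.box_head(self.decoder(q, memory))        # (B, num_queries, 10)
```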