Most existing Image Restoration (IR) models are task-specific, which can not be generalized to different degradation operators. In this work, we propose the Denoising Diffusion Null-Space Model (DDNM), a novel zero-shot framework for arbitrary linear IR problems, including but not limited to image super-resolution, colorization, inpainting, compressed sensing, and deblurring. DDNM only needs a pre-trained off-the-shelf diffusion model as the generative prior, without any extra training or network modifications. By refining only the null-space contents during the reverse diffusion process, we can yield diverse results satisfying both data consistency and realness. We further propose an enhanced and robust version, dubbed DDNM+, to support noisy restoration and improve restoration quality for hard tasks. Our experiments on several IR tasks reveal that DDNM outperforms other state-of-the-art zero-shot IR methods. We also demonstrate that DDNM+ can solve complex real-world applications, e.g., old photo restoration.
translated by 谷歌翻译
如今,大规模数据集的大型培训大型模型已成为深度学习的关键主题。具有较高表示能力和可传递性的预训练模型取得了巨大的成功,并在自然语言处理和2D视觉中占据了许多下游任务。但是,鉴于有限的训练数据相对不便,因此将这种预处理的调整范式促进这种预处理的调整范式是非平凡的。在本文中,我们提供了一个新的观点,即利用3D域中的预训练的2D知识来解决此问题,以新颖的点对像素来调整预训练的图像模型,以较小的参数成本提示点云分析。遵循促使工程的原理,我们将点云转换为具有几何形状的投影和几何学吸引着色的色彩图像,以适应预训练的图像模型,在点云分析的端到端优化期间,其权重冻结了任务。我们进行了广泛的实验,以证明与提议的点对像素提示合作,更好的预训练图像模型将导致在3D视觉中始终如一地表现更好的性能。享受图像预训练领域的繁荣发展,我们的方法在Scanobjectnn的最困难环境中获得了89.3%的精度,超过了传统的点云模型,具有较少的可训练参数。我们的框架在模型网分类和塑形部分分割方面还表现出非常具竞争力的性能。代码可从https://github.com/wangzy22/p2p获得
translated by 谷歌翻译
最近,基于骨架的动作识别已经取得了快速进步和卓越的性能。在本文中,我们在跨数据集设置下调查了这个问题,这是现实情况下的新,务实且具有挑战性的任务。遵循无监督的域适应(UDA)范式,该动作标签仅在源数据集上可用,但在训练阶段的目标数据集中无法使用。与UDA的常规基于对抗性学习的方法不同,我们利用一个自学计划来减少两个基于骨架的动作数据集之间的域移动。我们的灵感来自Compism,Compism是20世纪初期的艺术类型,它破坏并重新组装了物体以传达更大的背景。通过分割和定制时间段或人体部位,我们设计了两个自制的学习分类任务,以探索基于骨架的动作的时间和空间依赖性,并提高模型的概括能力。我们在六个基于骨架的动作识别的数据集上进行实验,包括三个大规模数据集(NTU RGB+D,PKU-MMD和动力学),在其中建立了新的跨数据库设置和基准。广泛的结果表明,我们的方法优于最先进的方法。我们的模型和所有比较方法的源代码均可在https://github.com/shanice-l/st-cubism上获得。
translated by 谷歌翻译
我们呈现Point-Bert,一种用于学习变压器的新范式,以概括BERT对3D点云的概念。灵感来自BERT,我们将屏蔽点建模(MPM)任务设计为预列火车点云变压器。具体地,我们首先将点云划分为几个本地点修补程序,并且具有离散变化性AutoEncoder(DVAE)的点云标记器被设计为生成包含有意义的本地信息的离散点令牌。然后,我们随机掩盖了一些输入点云的补丁并将它们送入骨干变压器。预训练目标是在销售器获得的点代币的监督下恢复蒙面地点的原始点令牌。广泛的实验表明,拟议的BERT风格的预训练策略显着提高了标准点云变压器的性能。配备了我们的预培训策略,我们表明,纯变压器架构对ModelNet40的准确性为93.8%,在ScanObjectnn的最艰难的设置上的准确性为83.1%,超越精心设计的点云模型,手工制作的设计更少。我们还证明,Point-Bert从新的任务和域中获悉的表示,我们的模型在很大程度上推动了几个射击点云分类任务的最先进。代码和预先训练的型号可在https://github.com/lulutang0608/pint -bert上获得
translated by 谷歌翻译
Deep learning has revolutionized human society, yet the black-box nature of deep neural networks hinders further application to reliability-demanded industries. In the attempt to unpack them, many works observe or impact internal variables to improve the model's comprehensibility and transparency. However, existing methods rely on intuitive assumptions and lack mathematical guarantees. To bridge this gap, we introduce Bort, an optimizer for improving model explainability with boundedness and orthogonality constraints on model parameters, derived from the sufficient conditions of model comprehensibility and transparency. We perform reconstruction and backtracking on the model representations optimized by Bort and observe an evident improvement in model explainability. Based on Bort, we are able to synthesize explainable adversarial samples without additional parameters and training. Surprisingly, we find Bort constantly improves the classification accuracy of various architectures including ResNet and DeiT on MNIST, CIFAR-10, and ImageNet.
translated by 谷歌翻译
With the continuously thriving popularity around the world, fitness activity analytic has become an emerging research topic in computer vision. While a variety of new tasks and algorithms have been proposed recently, there are growing hunger for data resources involved in high-quality data, fine-grained labels, and diverse environments. In this paper, we present FLAG3D, a large-scale 3D fitness activity dataset with language instruction containing 180K sequences of 60 categories. FLAG3D features the following three aspects: 1) accurate and dense 3D human pose captured from advanced MoCap system to handle the complex activity and large movement, 2) detailed and professional language instruction to describe how to perform a specific activity, 3) versatile video resources from a high-tech MoCap system, rendering software, and cost-effective smartphones in natural environments. Extensive experiments and in-depth analysis show that FLAG3D contributes great research value for various challenges, such as cross-domain human action recognition, dynamic human mesh recovery, and language-guided human action generation. Our dataset and source code will be publicly available at https://andytang15.github.io/FLAG3D.
translated by 谷歌翻译
With the rising industrial attention to 3D virtual modeling technology, generating novel 3D content based on specified conditions (e.g. text) has become a hot issue. In this paper, we propose a new generative 3D modeling framework called Diffusion-SDF for the challenging task of text-to-shape synthesis. Previous approaches lack flexibility in both 3D data representation and shape generation, thereby failing to generate highly diversified 3D shapes conforming to the given text descriptions. To address this, we propose a SDF autoencoder together with the Voxelized Diffusion model to learn and generate representations for voxelized signed distance fields (SDFs) of 3D shapes. Specifically, we design a novel UinU-Net architecture that implants a local-focused inner network inside the standard U-Net architecture, which enables better reconstruction of patch-independent SDF representations. We extend our approach to further text-to-shape tasks including text-conditioned shape completion and manipulation. Experimental results show that Diffusion-SDF is capable of generating both high-quality and highly diversified 3D shapes that conform well to the given text descriptions. Diffusion-SDF has demonstrated its superiority compared to previous state-of-the-art text-to-shape approaches.
translated by 谷歌翻译
Robotic force-based compliance control is a preferred approach to achieve high-precision assembly tasks. When the geometric features of assembly objects are asymmetric or irregular, reinforcement learning (RL) agents are gradually incorporated into the compliance controller to adapt to complex force-pose mapping which is hard to model analytically. Since force-pose mapping is strongly dependent on geometric features, a compliance controller is only optimal for current geometric features. To reduce the learning cost of assembly objects with different geometric features, this paper is devoted to answering how to reconfigure existing controllers for new assembly objects with different geometric features. In this paper, model-based parameters are first reconfigured based on the proposed Equivalent Theory of Compliance Law (ETCL). Then the RL agent is transferred based on the proposed Weighted Dimensional Policy Distillation (WDPD) method. The experiment results demonstrate that the control reconfiguration method costs less time and achieves better control performance, which confirms the validity of proposed methods.
translated by 谷歌翻译
Data mixing strategies (e.g., CutMix) have shown the ability to greatly improve the performance of convolutional neural networks (CNNs). They mix two images as inputs for training and assign them with a mixed label with the same ratio. While they are shown effective for vision transformers (ViTs), we identify a token fluctuation phenomenon that has suppressed the potential of data mixing strategies. We empirically observe that the contributions of input tokens fluctuate as forward propagating, which might induce a different mixing ratio in the output tokens. The training target computed by the original data mixing strategy can thus be inaccurate, resulting in less effective training. To address this, we propose a token-label alignment (TL-Align) method to trace the correspondence between transformed tokens and the original tokens to maintain a label for each token. We reuse the computed attention at each layer for efficient token-label alignment, introducing only negligible additional training costs. Extensive experiments demonstrate that our method improves the performance of ViTs on image classification, semantic segmentation, objective detection, and transfer learning tasks. Code is available at: https://github.com/Euphoria16/TL-Align.
translated by 谷歌翻译
The pretrain-finetune paradigm in modern computer vision facilitates the success of self-supervised learning, which tends to achieve better transferability than supervised learning. However, with the availability of massive labeled data, a natural question emerges: how to train a better model with both self and full supervision signals? In this paper, we propose Omni-suPErvised Representation leArning with hierarchical supervisions (OPERA) as a solution. We provide a unified perspective of supervisions from labeled and unlabeled data and propose a unified framework of fully supervised and self-supervised learning. We extract a set of hierarchical proxy representations for each image and impose self and full supervisions on the corresponding proxy representations. Extensive experiments on both convolutional neural networks and vision transformers demonstrate the superiority of OPERA in image classification, segmentation, and object detection. Code is available at: https://github.com/wangck20/OPERA.
translated by 谷歌翻译