In recent years, generative models have advanced significantly thanks to the success of diffusion models. The success of these models is often attributed to their use of guidance techniques, such as classifier and classifier-free methods, which provide effective mechanisms to trade off fidelity against diversity. However, these methods cannot guide a generated image to be aware of its geometric configuration, e.g., depth, which hinders the application of diffusion models to areas that require a certain level of depth awareness. To address this limitation, we propose a novel guidance approach for diffusion models that uses estimated depth information derived from the rich intermediate representations of diffusion models. To do this, we first present a label-efficient depth estimation framework using the internal representations of diffusion models. At the sampling phase, we utilize two guidance techniques to self-condition the generated image using the estimated depth map: the first uses pseudo-labeling, and the second uses a depth-domain diffusion prior. Experiments and extensive ablation studies demonstrate the effectiveness of our method in guiding diffusion models toward geometrically plausible image generation. The project page is available at https://ku-cvlab.github.io/DAG/.
Translated by Google Translate
Diffusion-based generative models have achieved remarkable success in image generation. Their guidance formulation allows an external model to control the generation process in a plug-and-play manner for various tasks without fine-tuning the diffusion model. However, the direct use of publicly available off-the-shelf models for guidance fails due to their poor performance on noisy inputs. The existing practice is therefore to fine-tune the guidance models with labeled data corrupted with noise. In this paper, we argue that this practice has limitations in two aspects: (1) handling inputs with widely varying noise levels is too hard for a single model; (2) collecting labeled datasets hinders scaling up to various tasks. To tackle these limitations, we propose a novel strategy that leverages multiple experts, where each expert is specialized in a particular noise range and guides the reverse process at its corresponding timesteps. However, as it is infeasible to manage multiple networks and utilize labeled data, we present a practical guidance framework termed Practical Plug-And-Play (PPAP), which leverages parameter-efficient fine-tuning and data-free knowledge transfer. We conduct exhaustive ImageNet class-conditional generation experiments to show that our method can successfully guide diffusion with few trainable parameters and no labeled data. Finally, we show that image classifiers, depth estimators, and semantic segmentation models can guide the publicly available GLIDE through our framework in a plug-and-play manner.
Conditional diffusion probabilistic models can model the distribution of natural images and can generate diverse and realistic samples based on given conditions. However, their results are often unrealistic, with observable color shifts and textures. We believe that this issue results from the divergence between the probability distribution learned by the model and the distribution of natural images. The delicate conditions gradually enlarge the divergence during each sampling timestep. To address this issue, we introduce a new method that brings the predicted samples to the training data manifold using a pretrained unconditional diffusion model. The unconditional model acts as a regularizer and reduces the divergence introduced by the conditional model at each sampling step. We perform comprehensive experiments to demonstrate the effectiveness of our approach on super-resolution, colorization, turbulence removal, and image-deraining tasks. The improvements obtained by our method suggest that the priors can be incorporated as a general plugin for improving conditional diffusion models.
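Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks compared with Generative Adversarial Nets (GANs). Recent work on semantic image synthesis mainly follows the de facto GAN-based approaches, which may lead to unsatisfactory quality or diversity of generated images. In this paper, we propose a novel DDPM-based framework for semantic image synthesis. Previous conditional diffusion models feed the semantic layout and the noisy image together as input to a U-Net structure, which may not fully leverage the information in the input semantic mask; our framework instead processes the semantic layout and the noisy image differently. It feeds the noisy image to the encoder of the U-Net structure, while the semantic layout is fed to the decoder through multi-layer spatially-adaptive normalization operators. To further improve the generation quality and semantic interpretability in semantic image synthesis, we introduce a classifier-free guidance sampling strategy, which takes into account the scores of an unconditional model in the sampling process. Extensive experiments on three benchmark datasets demonstrate the effectiveness of our proposed method, which achieves state-of-the-art performance in terms of fidelity (FID) and diversity (LPIPS).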
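The classifier-free guidance step mentioned in this abstract can be illustrated with a minimal sketch. It assumes one common parameterization, in which the guided noise prediction extrapolates from the unconditional prediction toward the conditional one; the abstract does not give the exact form used by the authors, and the function name, the `scale` parameter, and the toy values are hypothetical.

```python
def classifier_free_guidance(eps_cond, eps_uncond, scale):
    """Blend conditional and unconditional noise predictions element-wise.

    Assumed parameterization (not taken from the abstract):
        eps_guided = eps_uncond + scale * (eps_cond - eps_uncond)
    scale = 0 gives the unconditional prediction, scale = 1 the conditional
    one, and scale > 1 pushes further toward the condition.
    """
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


# Toy noise predictions for three pixels (hypothetical values).
eps_cond = [0.5, -1.0, 2.0]
eps_uncond = [0.0, -0.5, 1.0]

guided = classifier_free_guidance(eps_cond, eps_uncond, scale=1.5)
print(guided)
```

In a real sampler the two predictions would come from the same network evaluated with and without the semantic layout, and the blend would be applied at every denoising step.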
Virtual reality and augmented reality (XR) bring increasing demand for 3D content. However, creating high-quality 3D content requires tedious work that a human expert must do. In this work, we study the challenging task of lifting a single image to a 3D object and, for the first time, demonstrate the ability to generate a plausible 3D object with 360° views that correspond well with the given reference image. By conditioning on the reference image, our model can fulfill the everlasting curiosity for synthesizing novel views of objects from images. Our technique sheds light on a promising direction of easing the workflows for 3D artists and XR designers. We propose a novel framework, dubbed NeuralLift-360, that utilizes a depth-aware neural radiance representation (NeRF) and learns to craft the scene guided by denoising diffusion models. By introducing a ranking loss, our NeuralLift-360 can be guided with rough depth estimation in the wild. We also adopt a CLIP-guided sampling strategy for the diffusion prior to provide coherent guidance. Extensive experiments demonstrate that our NeuralLift-360 significantly outperforms existing state-of-the-art baselines. Project page: https://vita-group.github.io/NeuralLift-360/
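Denoising Diffusion Probabilistic Models (DDPMs) enable conditional image generation from prior noise to real data by introducing an independent noise-aware classifier that provides conditional gradient guidance at each time step of the denoising process. However, because the classifier can easily discriminate an incompletely generated image that has only high-level structure, the gradient, which is a form of class-information guidance, tends to vanish early, causing the conditional generation process to collapse into the unconditional process. To address this problem, we propose two simple but effective approaches from two perspectives. For the sampling procedure, we introduce the entropy of the predicted distribution as a measure of the guidance vanishing level and propose an entropy-aware scaling method to adaptively recover the conditional semantic guidance. For the training stage, we propose entropy-aware optimization objectives to alleviate overconfident predictions on noisy data. On ImageNet1000 256x256, with our proposed sampling scheme and trained classifier, the pretrained conditional and unconditional DDPM models achieve FID improvements of 10.89% (4.59 to 4.09) and 43.5% (12 to 6.78), respectively.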
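The entropy measurement this abstract relies on can be sketched as follows. This is only an illustration of the quantity being measured: the abstract does not state the scaling rule itself, so no rescaling formula is shown, and the function names and toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of classifier logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy of a distribution, divided by log(num_classes).

    Returns a value in [0, 1]: close to 1 for an undecided classifier,
    close to 0 for a confident one (the regime in which the abstract
    says the guidance gradient tends to vanish).
    """
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


# Hypothetical logits: an undecided and a highly confident prediction.
uniform = normalized_entropy(softmax([0.0, 0.0, 0.0, 0.0]))
confident = normalized_entropy(softmax([10.0, 0.0, 0.0, 0.0]))
print(uniform, confident)
```

A sampler could monitor this value at each denoising step and enlarge the guidance scale when it becomes small; how exactly the scale should depend on the entropy is the paper's contribution and is not reproduced here.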
Semi-Supervised Learning (SSL) has recently achieved success in various fields such as image classification, object detection, and semantic segmentation, which typically require a lot of labour to construct ground truth. In the depth estimation task in particular, annotating training data is very costly and time-consuming, so the recent SSL regime seems an attractive solution. In this paper, for the first time, we introduce a novel framework for semi-supervised learning of monocular depth estimation networks, using consistency regularization to mitigate the reliance on large ground-truth depth data. We propose a novel data augmentation approach, called K-way disjoint masking, which allows the network to learn how to reconstruct invisible regions so that the model not only becomes robust to perturbations but also generates globally consistent output depth maps. Experiments on the KITTI and NYU-Depth-v2 datasets demonstrate the effectiveness of each component in our pipeline, robustness to the use of fewer and fewer annotated images, and superior results compared to other state-of-the-art semi-supervised methods for monocular depth estimation. Our code is available at https://github.com/KU-CVLAB/MaskingDepth.
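As a rough illustration of what a "K-way disjoint" split of an image could look like, the sketch below partitions patch indices into K non-overlapping groups. It is one plausible reading of the name, not the authors' implementation: the function name, the seeding, and the round-robin split are assumptions, and how each group is then used (kept visible or hidden) is not specified by the abstract.

```python
import random


def k_way_disjoint_partition(num_patches, k, seed=None):
    """Split patch indices 0..num_patches-1 into k disjoint groups.

    The indices are shuffled and dealt round-robin, so every patch lands
    in exactly one group and group sizes differ by at most one.
    """
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    return [sorted(indices[i::k]) for i in range(k)]


# Hypothetical setup: a 4x4 grid of patches split four ways.
groups = k_way_disjoint_partition(num_patches=16, k=4, seed=0)
for group in groups:
    print(group)
```

Because the groups are disjoint and jointly cover the image, the K masked views together see every patch, which is what makes a consistency loss between their depth predictions meaningful.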
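Controllable image synthesis models allow the creation of diverse images based on text instructions or guidance from an example image. Recently, denoising diffusion probabilistic models have been shown to generate more realistic imagery than prior methods, and have been successfully demonstrated in unconditional and class-conditional settings. We explore fine-grained, continuous control of this model class and introduce a novel unified framework for semantic diffusion guidance, which allows either language or image guidance, or both. Guidance is injected into a pretrained unconditional diffusion model using the gradient of image-text or image matching scores. We explore CLIP-based textual guidance as well as both content- and style-based image guidance in a unified form. Our text-guided synthesis approach can be applied to datasets without associated text annotations. We conduct experiments on the FFHQ and LSUN datasets and show results on fine-grained text-guided image synthesis, synthesis of images related to a style or content example image, and examples with both textual and image guidance.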
We show that diffusion models can achieve image sample quality superior to the current state-of-the-art generative models. We achieve this on unconditional image synthesis by finding a better architecture through a series of ablations. For conditional image synthesis, we further improve sample quality with classifier guidance: a simple, compute-efficient method for trading off diversity for fidelity using gradients from a classifier. We achieve an FID of 2.97 on ImageNet 128×128, 4.59 on ImageNet 256×256, and 7.72 on ImageNet 512×512, and we match BigGAN-deep even with as few as 25 forward passes per sample, all while maintaining better coverage of the distribution. Finally, we find that classifier guidance combines well with upsampling diffusion models, further improving FID to 3.94 on ImageNet 256×256 and 3.85 on ImageNet 512×512. We release our code at https://github.com/openai/guided-diffusion.
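Classifier guidance, as named in the abstract above, is commonly implemented by shifting the mean of each reverse-diffusion step along the gradient of the classifier's log-probability. The sketch below assumes a diagonal covariance and takes the gradient as a given input; the function name, argument names, and toy numbers are hypothetical.

```python
def guided_mean(mean, variance, grad_log_prob, scale):
    """Shift a reverse-step mean along the classifier gradient.

    Assumes a diagonal covariance, so each coordinate is updated as
        mean_i + scale * variance_i * grad_i
    where grad_i is d/dx_i of log p(y | x_t) from a noise-aware classifier.
    A larger scale trades diversity for fidelity.
    """
    return [m + scale * v * g for m, v, g in zip(mean, variance, grad_log_prob)]


# Toy two-dimensional step (hypothetical values).
mean = [0.0, 1.0]
variance = [0.25, 0.25]
grad_log_prob = [2.0, -4.0]

shifted = guided_mean(mean, variance, grad_log_prob, scale=2.0)
print(shifted)
```

In a full sampler the next sample would be drawn from a Gaussian with this shifted mean and the unchanged variance, with the gradient recomputed by the classifier at every timestep.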
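Diffusion models (DMs) have shown great potential for high-quality image synthesis. However, when it comes to producing images with complex scenes, how to properly describe both the global structure of an image and its object details remains a challenging task. In this paper, we present Frido, a Feature Pyramid Diffusion model that performs a multi-scale coarse-to-fine denoising process for image synthesis. Our model decomposes an input image into scale-dependent vector-quantized features, followed by coarse-to-fine gating for producing the image output. During the above multi-scale representation learning stage, additional input conditions such as text, scene graphs, or image layouts can be further exploited. Thus, Frido can also be applied to conditional or cross-modality image synthesis. We conduct extensive experiments over various unconditional and conditional image generation tasks, ranging from text-to-image synthesis and layout-to-image to scene-graph-to-image and label-to-image. More specifically, we achieve state-of-the-art FID scores on five benchmarks, namely layout-to-image on COCO and OpenImages, scene-graph-to-image on COCO and Visual Genome, and label-to-image on COCO. Code is available at https://github.com/davidhalladay/frido.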
We explore a new class of diffusion models based on the transformer architecture. We train latent diffusion models of images, replacing the commonly-used U-Net backbone with a transformer that operates on latent patches. We analyze the scalability of our Diffusion Transformers (DiTs) through the lens of forward pass complexity as measured by Gflops. We find that DiTs with higher Gflops -- through increased transformer depth/width or increased number of input tokens -- consistently have lower FID. In addition to possessing good scalability properties, our largest DiT-XL/2 models outperform all prior diffusion models on the class-conditional ImageNet 512x512 and 256x256 benchmarks, achieving a state-of-the-art FID of 2.27 on the latter.
In this paper, we propose the first diffusion-based face swapping framework, called DiffFace, which is composed of training an ID-conditional DDPM, sampling with facial guidance, and target-preserving blending. Specifically, in the training process, the ID-conditional DDPM is trained to generate face images with the desired identity. In the sampling process, we use off-the-shelf facial expert models to make the model transfer the source identity while faithfully preserving target attributes. During this process, to preserve the background of the target image and obtain the desired face swapping result, we additionally propose a target-preserving blending strategy. It helps our model keep the attributes of the target face from noise while transferring the source facial identity. In addition, without any re-training, our model can flexibly apply additional facial guidance and adaptively control the ID-attributes trade-off to achieve the desired results. To the best of our knowledge, this is the first approach that applies the diffusion model to the face swapping task. Compared with previous GAN-based approaches, by taking advantage of the diffusion model for the face swapping task, DiffFace offers benefits such as training stability, high fidelity, diversity of the samples, and controllability. Extensive experiments show that our DiffFace is comparable or superior to the state-of-the-art methods on several standard face swapping benchmarks.
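We propose DiffuStereo, a novel system that uses only sparse cameras (8 in this work) for high-quality 3D human reconstruction. At its core is a novel diffusion-based stereo module, which introduces diffusion models, a powerful type of generative model, into an iterative stereo matching network. To this end, we design a new diffusion kernel and additional stereo constraints to facilitate stereo matching and depth estimation in the network. We further present a multi-level stereo network architecture to handle high-resolution (up to 4K) inputs without requiring an unaffordable memory footprint. Given a set of sparse-view color images of a human, the proposed multi-level diffusion-based stereo network can produce highly accurate depth maps, which are then converted into a high-quality 3D human model through an efficient multi-view fusion strategy. Overall, our method enables automatic reconstruction of human models with quality on par with high-end dense-view camera rigs, and this is achieved with a much more lightweight hardware setup. Experiments show that our method outperforms state-of-the-art methods both qualitatively and quantitatively.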
2D-to-3D reconstruction is an ill-posed problem, yet humans are good at solving this problem due to their prior knowledge of the 3D world developed over years. Driven by this observation, we propose NeRDi, a single-view NeRF synthesis framework with general image priors from 2D diffusion models. Formulating single-view reconstruction as an image-conditioned 3D generation problem, we optimize the NeRF representations by minimizing a diffusion loss on its arbitrary view renderings with a pretrained image diffusion model under the input-view constraint. We leverage off-the-shelf vision-language models and introduce a two-section language guidance as conditioning inputs to the diffusion model. This is essentially helpful for improving multiview content coherence as it narrows down the general image prior conditioned on the semantic and visual features of the single-view input image. Additionally, we introduce a geometric loss based on estimated depth maps to regularize the underlying 3D geometry of the NeRF. Experimental results on the DTU MVS dataset show that our method can synthesize novel views with higher quality even compared to existing methods trained on this dataset. We also demonstrate our generalizability in zero-shot NeRF synthesis for in-the-wild images.
Diffusion models have emerged as the state-of-the-art for image generation, among other tasks. Here, we present an efficient diffusion-based model for 3D-aware generation of neural fields. Our approach pre-processes training data, such as ShapeNet meshes, by converting them to continuous occupancy fields and factoring them into a set of axis-aligned triplane feature representations. Thus, our 3D training scenes are all represented by 2D feature planes, and we can directly train existing 2D diffusion models on these representations to generate 3D neural fields with high quality and diversity, outperforming alternative approaches to 3D-aware generation. Our approach requires essential modifications to existing triplane factorization pipelines to make the resulting features easy to learn for the diffusion model. We demonstrate state-of-the-art results on 3D generation on several object classes from ShapeNet.
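Finding accurate correspondences among different views is the Achilles' heel of unsupervised Multi-View Stereo (MVS). Existing methods are built upon the assumption that corresponding pixels share similar photometric features. However, multi-view images in real scenarios observe non-Lambertian surfaces and experience occlusions. In this work, we propose a novel approach with neural rendering (RC-MVSNet) to solve such ambiguity issues of correspondences among views. Specifically, we impose a depth rendering consistency loss to constrain the geometry features close to the object surface in order to alleviate occlusions. Concurrently, we introduce a reference view synthesis loss to generate consistent supervision, even for non-Lambertian surfaces. Extensive experiments on the DTU and Tanks&Temples benchmarks demonstrate that our RC-MVSNet approach achieves state-of-the-art performance among unsupervised MVS frameworks and competitive performance compared with many supervised methods. The code is released at https://github.com/BOESE0601/RC-MVSNET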
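Digital art synthesis is receiving increasing attention in the multimedia community because it engages the public with art effectively. Current digital art synthesis methods usually use single-modality inputs as guidance, thereby limiting the expressiveness of the model and the diversity of the generated results. To solve this problem, we propose the multimodal guided artwork diffusion (MGAD) model, a diffusion-based digital artwork generation approach that utilizes multimodal prompts as guidance to control the classifier-free diffusion model. Additionally, the contrastive language-image pretraining (CLIP) model is used to unify the text and image modalities. Extensive experimental results on the quality and quantity of the generated digital art paintings confirm the effectiveness of combining the diffusion model with multimodal guidance. Code is available at https://github.com/haha-lisa/mgad-multimodal-guided-artwork-diffusion.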
Object compositing based on 2D images is a challenging problem since it typically involves multiple processing stages such as color harmonization, geometry correction and shadow generation to generate realistic results. Furthermore, annotating training data pairs for compositing requires substantial manual effort from professionals, and is hardly scalable. Thus, with the recent advances in generative models, in this work, we propose a self-supervised framework for object compositing by leveraging the power of conditional diffusion models. Our framework can holistically address the object compositing task in a unified model, transforming the viewpoint, geometry, color and shadow of the generated object while requiring no manual labeling. To preserve the input object's characteristics, we introduce a content adaptor that helps to maintain categorical semantics and object appearance. A data augmentation method is further adopted to improve the fidelity of the generator. Our method outperforms relevant baselines in both realism and faithfulness of the synthesized result images in a user study on various real-world images.
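We present a novel method for exemplar-based image translation, called matching interleaved diffusion models (MIDMs). Most existing methods for this task are formulated as a GAN-based matching-then-generation framework. However, in this framework, matching errors induced by the difficulty of semantic matching across domains, e.g., sketch and photo, can easily propagate to the generation step, which in turn leads to degenerated results. Motivated by the recent success of diffusion models in overcoming the shortcomings of GANs, we incorporate diffusion models to overcome these limitations. Specifically, we formulate a diffusion-based matching-and-generation framework that interleaves cross-domain matching and diffusion steps in the latent space by iteratively feeding the intermediate warp into the noising process and denoising it to generate a translated image. In addition, to improve the reliability of the diffusion process, we design a confidence-aware process using cycle consistency to consider only confident regions during translation. Experimental results show that our MIDMs generate more plausible images than state-of-the-art methods.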
Score-based diffusion models have captured widespread attention and fueled fast progress on recent vision generative tasks. In this paper, we focus on the diffusion model backbone, which has been largely neglected so far. We systematically explore vision Transformers as diffusion learners for various generative tasks. With our improvements, the performance of the vanilla ViT-based backbone (IU-ViT) is boosted to be on par with traditional U-Net-based methods. We further provide a hypothesis on the implication of disentangling the generative backbone as an encoder-decoder structure and show proof-of-concept experiments verifying the effectiveness of a stronger encoder for generative tasks with ASymmetriC ENcoder Decoder (ASCEND). Our improvements achieve competitive results on CIFAR-10, CelebA, LSUN, CUB Bird and large-resolution text-to-image tasks. To the best of our knowledge, we are the first to successfully train a single diffusion model on a text-to-image task beyond 64x64 resolution. We hope this will motivate people to rethink the modeling choices and the training pipelines for diffusion-based generative models.