Existing 3D-aware image synthesis approaches mainly focus on generating a single canonical object and show limited capacity in composing a complex scene containing a variety of objects. This work presents DisCoScene: a 3Daware generative model for high-quality and controllable scene synthesis. The key ingredient of our method is a very abstract object-level representation (i.e., 3D bounding boxes without semantic annotation) as the scene layout prior, which is simple to obtain, general to describe various scene contents, and yet informative to disentangle objects and background. Moreover, it serves as an intuitive user control for scene editing. Based on such a prior, the proposed model spatially disentangles the whole scene into object-centric generative radiance fields by learning on only 2D images with the global-local discrimination. Our model obtains the generation fidelity and editing flexibility of individual objects while being able to efficiently compose objects and the background into a complete scene. We demonstrate state-of-the-art performance on many scene datasets, including the challenging Waymo outdoor dataset. Project page: https://snap-research.github.io/discoscene/
translated by 谷歌翻译
Video generation requires synthesizing consistent and persistent frames with dynamic content over time. This work investigates modeling the temporal relations for composing video with arbitrary length, from a few frames to even infinite, using generative adversarial networks (GANs). First, towards composing adjacent frames, we show that the alias-free operation for single image generation, together with adequately pre-learned knowledge, brings a smooth frame transition without compromising the per-frame quality. Second, by incorporating the temporal shift module (TSM), originally designed for video understanding, into the discriminator, we manage to advance the generator in synthesizing more consistent dynamics. Third, we develop a novel B-Spline based motion representation to ensure temporal smoothness to achieve infinite-length video generation. It can go beyond the frame number used in training. A low-rank temporal modulation is also proposed to alleviate repeating contents for long video generation. We evaluate our approach on various datasets and show substantial improvements over video generation baselines. Code and models will be publicly available at https://genforce.github.io/StyleSV.
translated by 谷歌翻译
Generative adversarial network (GAN) is formulated as a two-player game between a generator (G) and a discriminator (D), where D is asked to differentiate whether an image comes from real data or is produced by G. Under such a formulation, D plays as the rule maker and hence tends to dominate the competition. Towards a fairer game in GANs, we propose a new paradigm for adversarial training, which makes G assign a task to D as well. Specifically, given an image, we expect D to extract representative features that can be adequately decoded by G to reconstruct the input. That way, instead of learning freely, D is urged to align with the view of G for domain classification. Experimental results on various datasets demonstrate the substantial superiority of our approach over the baselines. For instance, we improve the FID of StyleGAN2 from 4.30 to 2.55 on LSUN Bedroom and from 4.04 to 2.82 on LSUN Church. We believe that the pioneering attempt present in this work could inspire the community with better designed generator-leading tasks for GAN improvement.
translated by 谷歌翻译
Diffusion models, which learn to reverse a signal destruction process to generate new data, typically require the signal at each step to have the same dimension. We argue that, considering the spatial redundancy in image signals, there is no need to maintain a high dimensionality in the evolution process, especially in the early generation phase. To this end, we make a theoretical generalization of the forward diffusion process via signal decomposition. Concretely, we manage to decompose an image into multiple orthogonal components and control the attenuation of each component when perturbing the image. That way, along with the noise strength increasing, we are able to diminish those inconsequential components and thus use a lower-dimensional signal to represent the source, barely losing information. Such a reformulation allows to vary dimensions in both training and inference of diffusion models. Extensive experiments on a range of datasets suggest that our approach substantially reduces the computational cost and achieves on-par or even better synthesis performance compared to baseline methods. We also show that our strategy facilitates high-resolution image synthesis and improves FID of diffusion model trained on FFHQ at $1024\times1024$ resolution from 52.40 to 10.46. Code and models will be made publicly available.
translated by 谷歌翻译
Generative models, as an important family of statistical modeling, target learning the observed data distribution via generating new instances. Along with the rise of neural networks, deep generative models, such as variational autoencoders (VAEs) and generative adversarial network (GANs), have made tremendous progress in 2D image synthesis. Recently, researchers switch their attentions from the 2D space to the 3D space considering that 3D data better aligns with our physical world and hence enjoys great potential in practice. However, unlike a 2D image, which owns an efficient representation (i.e., pixel grid) by nature, representing 3D data could face far more challenges. Concretely, we would expect an ideal 3D representation to be capable enough to model shapes and appearances in details, and to be highly efficient so as to model high-resolution data with fast speed and low memory cost. However, existing 3D representations, such as point clouds, meshes, and recent neural fields, usually fail to meet the above requirements simultaneously. In this survey, we make a thorough review of the development of 3D generation, including 3D shape generation and 3D-aware image synthesis, from the perspectives of both algorithms and more importantly representations. We hope that our discussion could help the community track the evolution of this field and further spark some innovative ideas to advance this challenging task.
translated by 谷歌翻译
通过区分真实和合成样品,鉴别器在训练生成对抗网络(GAN)中起着至关重要的作用。尽管实际数据分布保持不变,但由于发电机的发展,合成分布一直变化,从而影响鉴别器的BI分类任务的相应变化。我们认为,对其容量进行即时调整的歧视者可以更好地适应这种时间变化的任务。一项全面的实证研究证实,所提出的培训策略称为Dynamicd,改善了合成性能,而不会产生任何其他计算成本或培训目标。在不同的数据制度下开发了两个容量调整方案,用于培训gan:i)给定足够数量的培训数据,歧视者从逐渐增加的学习能力中受益,ii)ii)当培训数据受到限制时,逐渐减少层宽度的宽度减轻。歧视者的过度问题。在一系列数据集上进行的2D和3D感知图像合成任务的实验证实了我们的动力学的普遍性以及对基准的实质性改进。此外,Dynamicd与其他歧视器改进方法(包括数据增强,正规化器和预训练)具有协同作用,并且在将学习gans合并时会带来连续的性能增长。
translated by 谷歌翻译
自我训练在半监督学习中表现出巨大的潜力。它的核心思想是使用在标记数据上学习的模型来生成未标记样本的伪标签,然后自我教学。为了获得有效的监督,主动尝试通常会采用动量老师进行伪标签的预测,但要观察确认偏见问题,在这种情况下,错误的预测可能会提供错误的监督信号并在培训过程中积累。这种缺点的主要原因是,现行的自我训练框架充当以前的知识指导当前状态,因为老师仅与过去的学生更新。为了减轻这个问题,我们提出了一种新颖的自我训练策略,该策略使模型可以从未来学习。具体而言,在每个培训步骤中,我们都会首先优化学生(即,在不将其应用于模型权重的情况下缓存梯度),然后用虚拟未来的学生更新老师,最后要求老师为伪标记生产伪标签目前的学生作为指导。这样,我们设法提高了伪标签的质量,从而提高了性能。我们还通过深入(FST-D)和广泛(FST-W)窥视未来,开发了我们未来自我训练(FST)框架的两个变体。将无监督的域自适应语义分割和半监督语义分割的任务作为实例,我们在广泛的环境下实验表明了我们方法的有效性和优越性。代码将公开可用。
translated by 谷歌翻译
尽管无监督的异常检测迅速发展,但现有的方法仍需要训练不同对象的单独模型。在这项工作中,我们介绍了完成具有统一框架的多个类别的异常检测。在如此具有挑战性的环境下,流行的重建网络可能属于“相同的快捷方式”,在这种捷径中,正常样本和异常样本都可以很好地恢复,因此无法发现异常值。为了解决这一障碍,我们取得了三个改进。首先,我们重新审视完全连接的层,卷积层以及注意力层的配方,并确认查询嵌入(即注意层内)在防止网络学习快捷键方面的重要作用。因此,我们提出了一个层的查询解码器,以帮助建模多级分布。其次,我们采用一个邻居掩盖的注意模块,以进一步避免从输入功能到重建的输出功能的信息泄漏。第三,我们提出了一种功能抖动策略,即使使用嘈杂的输入,也敦促模型恢复正确的消息。我们在MVTEC-AD和CIFAR-10数据集上评估了我们的算法,在该数据集中,我们通过足够大的利润率超过了最先进的替代方案。例如,当在MVTEC-AD中学习15个类别的统一模型时,我们在异常检测的任务(从88.1%到96.5%)和异常定位(从89.5%到96.8%)上超过了第二个竞争者。代码将公开可用。
translated by 谷歌翻译
反转生成对抗网络(GAN)可以使用预训练的发电机来促进广泛的图像编辑任务。现有方法通常采用gan的潜在空间作为反转空间,但观察到空间细节的恢复不足。在这项工作中,我们建议涉及发电机的填充空间,以通过空间信息补充潜在空间。具体来说,我们替换具有某些实例感知系数的卷积层中使用的恒定填充(例如,通常为零)。通过这种方式,可以适当地适当地适应了预训练模型中假定的归纳偏差以适合每个单独的图像。通过学习精心设计的编码器,我们设法在定性和定量上提高了反演质量,超过了现有的替代方案。然后,我们证明了这样的空间扩展几乎不会影响天然甘纳的歧管,因此我们仍然可以重复使用甘斯(Gans)对各种下游应用学到的先验知识。除了在先前的艺术中探讨的编辑任务外,我们的方法还可以进行更灵活的图像操纵,例如对面部轮廓和面部细节的单独控制,并启用一种新颖的编辑方式,用户可以高效地自定义自己的操作。
translated by 谷歌翻译
尽管在生成对抗网络(GAN)的潜在空间中,语义发现迅速发展,但现有方法要么仅限于找到全局属性,要么依靠许多细分掩码来识别本地属性。在这项工作中,我们提出了一种高效的算法,以分解甘恩学到的关于任意图像区域的潜在语义。具体而言,我们重新审视了预先训练的gan的局部操纵任务,并将基于区域的语义发现作为双重优化问题。通过适当定义的广义雷利商,我们设法解决了这个问题,而无需任何注释或培训。对各种最先进的GAN模型的实验结果证明了我们的方法的有效性,以及它优于先前艺术在精确控制,区域鲁棒性,实施速度和使用简单性方面的优势。
translated by 谷歌翻译