We present Muse, a text-to-image Transformer model that achieves state-of-the-art image generation performance while being significantly more efficient than diffusion or autoregressive models. Muse is trained on a masked modeling task in discrete token space: given the text embedding extracted from a pre-trained large language model (LLM), Muse is trained to predict randomly masked image tokens. Compared to pixel-space diffusion models, such as Imagen and DALL-E 2, Muse is significantly more efficient due to the use of discrete tokens and requiring fewer sampling iterations; compared to autoregressive models, such as Parti, Muse is more efficient due to the use of parallel decoding. The use of a pre-trained LLM enables fine-grained language understanding, translating to high-fidelity image generation and the understanding of visual concepts such as objects, their spatial relationships, pose, cardinality etc. Our 900M parameter model achieves a new SOTA on CC3M, with an FID score of 6.06. The Muse 3B parameter model achieves an FID of 7.88 on zero-shot COCO evaluation, along with a CLIP score of 0.32. Muse also directly enables a number of image editing applications without the need to fine-tune or invert the model: inpainting, outpainting, and mask-free editing. More results are available at https://muse-model.github.io
translated by 谷歌翻译
We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle various video synthesis tasks with a single model. We introduce a 3D tokenizer to quantize a video into spatial-temporal visual tokens and propose an embedding method for masked video token modeling to facilitate multi-task learning. We conduct extensive experiments to demonstrate the quality, efficiency, and flexibility of MAGVIT. Our experiments show that (i) MAGVIT performs favorably against state-of-the-art approaches and establishes the best-published FVD on three video generation benchmarks, including the challenging Kinetics-600. (ii) MAGVIT outperforms existing methods in inference time by two orders of magnitude against diffusion models and by 60x against autoregressive models. (iii) A single MAGVIT model supports ten diverse generation tasks and generalizes across videos from different visual domains. The source code and trained models will be released to the public at https://magvit.cs.cmu.edu.
translated by 谷歌翻译
非自动进取的生成变压器最近表现出令人印象深刻的图像产生性能,并且比自动回归对应物更快。但是,从视觉令牌的真实关节分布中进行的最佳并行采样仍然是一个开放的挑战。在本文中,我们介绍了代币批评,这是一种辅助模型,用于指导非自动性生成变压器的采样。鉴于掩盖和重建的真实图像,对代币批判性模型进行了训练,以区分哪种视觉令牌属于原始图像,哪些是由生成变压器采样的。在非自动回归迭代采样过程中,令牌批评者用于选择要接受的代币以及拒绝和重新取样的代币。再加上最先进的生成变压器令牌 - 批判性可显着提高其性能,并且在挑战性的课堂条件化成像生成中,就产生的图像质量和多样性之间的权衡取舍了最近的扩散模型和gan 。
translated by 谷歌翻译
创建视觉布局是图形设计的重要步骤。当我们寻求比例和多样化的视觉设计时,这种布局的自动生成很重要。在自动布局的作品上,专注于无条件生成,其中模型在忽略用户需要进行特定问题的同时生成布局。为了提前有条件布局,我们介绍了BLT,双向布局变压器。 BLT与自回归解码不同,因为它首先生成满足用户输入的布局,然后迭代地改进布局。我们验证了具有各种保真度量的多个基准测试模型。我们的结果表明,最先进的布局变压器模型的两个主要进步。首先,我们的模型授权布局变压器来满足可控布局的制作。其次,我们的模型削减了自回归解码的线性推理时间达到恒定的复杂度,从而在推理时间以制定布局实现4x-10x的加速。
translated by 谷歌翻译
积极的数据增强是视觉变压器(VIT)的强大泛化能力的关键组成部分。一种这样的数据增强技术是对抗性培训;然而,许多先前的作品表明,这通常会导致清洁的准确性差。在这项工作中,我们展示了金字塔对抗训练,这是一种简单有效的技术来提高韦维尔的整体性能。我们将其与“匹配”辍学和随机深度正则化配对,这采用了干净和对抗样品的相同辍学和随机深度配置。类似于Advprop的CNNS的改进(不直接适用于VIT),我们的金字塔对抗性训练会破坏分销准确性和vit和相关架构的分配鲁棒性之间的权衡。当Imagenet-1K数据训练时,它导致ImageNet清洁准确性的182美元的vit-B模型的精确度,同时由7美元的稳健性指标同时提高性能,从$ 1.76 \%$至11.45 \%$。我们为Imagenet-C(41.4 MCE),Imagenet-R($ 53.92 \%$),以及Imagenet-Sketch(41.04美元\%$)的新的最先进,只使用vit-b / 16骨干和我们的金字塔对抗训练。我们的代码将在接受时公开提供。
translated by 谷歌翻译
我们使用条件扩散模型介绍调色板,这是一种简单而一般的框架,可用于图像到图像到图像转换。在四个具有挑战性的图像到图像转换任务(着色,染色,un折叠和JPEG减压),调色板优于强大的GaN和回归基线,并建立了新的最新状态。这是在没有特定于任务特定的超参数调整,架构定制或任何辅助损耗的情况下实现的,展示了理想的一般性和灵活性。我们揭示了使用$ l_2 $与vs. $ l_1 $损失在样本多样性上的越来越多的影响,并通过经验架构研究表明自我关注的重要性。重要的是,我们倡导基于想象项目的统一评估协议,并报告包括预先训练的Reset-50的FID,成立得分,分类准确度的多个样本质量评分,以及针对各种基线的参考图像的感知距离。我们预计这一标准化评估协议在推进图像到图像翻译研究方面发挥着关键作用。最后,我们表明,在3个任务(着色,染色,JPEG减压)上培训的单个通用调色板模型也表现或优于特定于任务专家的专家对应物。
translated by 谷歌翻译
尽管深度学习使图像介绍方面取得了巨大的飞跃,但当前的方法通常无法综合现实的高频细节。在本文中,我们建议将超分辨率应用于粗糙的重建输出,以高分辨率进行精炼,然后将输出降低到原始分辨率。通过将高分辨率图像引入改进网络,我们的框架能够重建更多的细节,这些细节通常由于光谱偏置而被平滑 - 神经网络倾向于比高频更好地重建低频。为了协助培训大型高度孔洞的改进网络,我们提出了一种渐进的学习技术,其中缺失区域的大小随着培训的进行而增加。我们的缩放,完善和缩放策略,结合了高分辨率的监督和渐进学习,构成了一种框架 - 不合时宜的方法,用于增强高频细节,可应用于任何基于CNN的涂层方法。我们提供定性和定量评估以及消融分析,以显示我们方法的有效性。这种看似简单但功能强大的方法优于最先进的介绍方法。我们的代码可在https://github.com/google/zoom-to-inpaint中找到
translated by 谷歌翻译
The recent increase in public and academic interest in preserving biodiversity has led to the growth of the field of conservation technology. This field involves designing and constructing tools that utilize technology to aid in the conservation of wildlife. In this article, we will use case studies to demonstrate the importance of designing conservation tools with human-wildlife interaction in mind and provide a framework for creating successful tools. These case studies include a range of complexities, from simple cat collars to machine learning and game theory methodologies. Our goal is to introduce and inform current and future researchers in the field of conservation technology and provide references for educating the next generation of conservation technologists. Conservation technology not only has the potential to benefit biodiversity but also has broader impacts on fields such as sustainability and environmental protection. By using innovative technologies to address conservation challenges, we can find more effective and efficient solutions to protect and preserve our planet's resources.
translated by 谷歌翻译
An unbiased scene graph generation (SGG) algorithm referred to as Skew Class-balanced Re-weighting (SCR) is proposed for considering the unbiased predicate prediction caused by the long-tailed distribution. The prior works focus mainly on alleviating the deteriorating performances of the minority predicate predictions, showing drastic dropping recall scores, i.e., losing the majority predicate performances. It has not yet correctly analyzed the trade-off between majority and minority predicate performances in the limited SGG datasets. In this paper, to alleviate the issue, the Skew Class-balanced Re-weighting (SCR) loss function is considered for the unbiased SGG models. Leveraged by the skewness of biased predicate predictions, the SCR estimates the target predicate weight coefficient and then re-weights more to the biased predicates for better trading-off between the majority predicates and the minority ones. Extensive experiments conducted on the standard Visual Genome dataset and Open Image V4 \& V6 show the performances and generality of the SCR with the traditional SGG models.
translated by 谷歌翻译
With the increasing ability of large language models (LLMs), in-context learning (ICL) has become a new paradigm for natural language processing (NLP), where LLMs make predictions only based on contexts augmented with a few training examples. It has been a new trend exploring ICL to evaluate and extrapolate the ability of LLMs. In this paper, we aim to survey and summarize the progress, challenges, and future work in ICL. We first present a formal definition of ICL and clarify its correlation to related studies. Then, we organize and discuss advanced techniques of ICL, including training strategies, prompting strategies, and so on. Finally, we present the challenges of ICL and provide potential directions for further research. We hope our work can encourage more research on uncovering how ICL works and improving ICL in future work.
translated by 谷歌翻译