文本到图像模型最近通过光合现实质量看似准确的样本取得了巨大的成功。但是,随着最先进的语言模型仍在努力评估精确陈述,基于语言模型的图像生成过程也是如此。在这项工作中,我们展示了最先进的文本对图像模型(例如Dall-e)的问题,并通过与Draw基准基准相关的语句生成准确的样本。此外,我们表明剪辑无法始终如一地重新读取这些样品。为此,我们提出了Logicrank,这是一种神经符号推理框架,可以为这种精确要求设置提供更准确的排名系统。Logicrank平稳地集成到文本到图像模型的生成过程中,而且可以用于进一步调整更逻辑的精确模型。
translated by 谷歌翻译
While recent work on text-conditional 3D object generation has shown promising results, the state-of-the-art methods typically require multiple GPU-hours to produce a single sample. This is in stark contrast to state-of-the-art generative image models, which produce samples in a number of seconds or minutes. In this paper, we explore an alternative method for 3D object generation which produces 3D models in only 1-2 minutes on a single GPU. Our method first generates a single synthetic view using a text-to-image diffusion model, and then produces a 3D point cloud using a second diffusion model which conditions on the generated image. While our method still falls short of the state-of-the-art in terms of sample quality, it is one to two orders of magnitude faster to sample from, offering a practical trade-off for some use cases. We release our pre-trained point cloud diffusion models, as well as evaluation code and models, at https://github.com/openai/point-e.
translated by 谷歌翻译
我们介绍了自回归文本到图像(Parti)模型的途径,该模型生成高保真的影像图像并支持涉及复杂组成和世界知识的内容丰富的合成。 Parti将文本对图像生成视为类似于机器翻译的序列到序列建模问题,图像令牌的序列是目标输出,而不是其他语言的文本令牌。这种策略自然可以利用大型语言模型的先前工作,通过扩展数据和模型尺寸,能力和性能的持续进展。我们的方法很简单:首先,Parti使用基于变压器的图像令牌VIT-VQGAN将图像编码为离散令牌的序列。其次,我们通过将编码器二次变压器模型缩放到20B参数来实现一致的质量改进,其新的最新零弹药FID得分为7.23,而MS-Coco的FIDED得分为3.22。我们对本地化叙述以及党的详细分析(P2),这是1600多个英语提示的新的整体基准,证明了Parti在各种类别和难度方面的有效性。我们还探索并突出了我们的模型的局限性,以定义和体现关注重点领域以进一步改进。有关高分辨率图像,请参见https://parti.research.google/。
translated by 谷歌翻译
Spatial understanding is a fundamental aspect of computer vision and integral for human-level reasoning about images, making it an important component for grounded language understanding. While recent large-scale text-to-image synthesis (T2I) models have shown unprecedented improvements in photorealism, it is unclear whether they have reliable spatial understanding capabilities. We investigate the ability of T2I models to generate correct spatial relationships among objects and present VISOR, an evaluation metric that captures how accurately the spatial relationship described in text is generated in the image. To benchmark existing models, we introduce a large-scale challenge dataset SR2D that contains sentences describing two objects and the spatial relationship between them. We construct and harness an automated evaluation pipeline that employs computer vision to recognize objects and their spatial relationships, and we employ it in a large-scale evaluation of T2I models. Our experiments reveal a surprising finding that, although recent state-of-the-art T2I models exhibit high image quality, they are severely limited in their ability to generate multiple objects or the specified spatial relations such as left/right/above/below. Our analyses demonstrate several biases and artifacts of T2I models such as the difficulty with generating multiple objects, a bias towards generating the first object mentioned, spatially inconsistent outputs for equivalent relationships, and a correlation between object co-occurrence and spatial understanding capabilities. We conduct a human study that shows the alignment between VISOR and human judgment about spatial understanding. We offer the SR2D dataset and the VISOR metric to the community in support of T2I spatial reasoning research.
translated by 谷歌翻译
Text-guided image editing can have a transformative impact in supporting creative applications. A key challenge is to generate edits that are faithful to input text prompts, while consistent with input images. We present Imagen Editor, a cascaded diffusion model built, by fine-tuning Imagen on text-guided image inpainting. Imagen Editor's edits are faithful to the text prompts, which is accomplished by using object detectors to propose inpainting masks during training. In addition, Imagen Editor captures fine details in the input image by conditioning the cascaded pipeline on the original high resolution image. To improve qualitative and quantitative evaluation, we introduce EditBench, a systematic benchmark for text-guided image inpainting. EditBench evaluates inpainting edits on natural and generated images exploring objects, attributes, and scenes. Through extensive human evaluation on EditBench, we find that object-masking during training leads to across-the-board improvements in text-image alignment -- such that Imagen Editor is preferred over DALL-E 2 and Stable Diffusion -- and, as a cohort, these models are better at object-rendering than text-rendering, and handle material/color/size attributes better than count/shape attributes.
translated by 谷歌翻译
我们周围的视觉世界可以被描述为结构化的对象和相关关系。只有在底层对象的描述及其相关关系的描述中,可以将房间的图像召唤。虽然在设计可能将各个物体组成的深度神经网络上进行了重大工作,但在构图对象之间的各个关系方面取得了更少的工作。主要困难是,虽然对象的放置是相互独立的,但它们的关系彼此纠缠并依赖。为了规避这个问题,现有的作品主要通过利用文本或图形的形式来通过利用整体编码器来构成关系。在这项工作中,我们建议将每个关系作为非正规化密度(基于能量的模型)表示,使我们能够以分解方式构成单独的关系。我们表明这种分解分解允许模型生成和编辑具有多组关系的场景更忠实地。我们进一步表明,分解使我们的模型能够有效地理解底层关系场景结构。项目页面:https://comushvisual relations.github.io/。
translated by 谷歌翻译
文本指导的图像生成模型,例如DALL-E 2和稳定的扩散,最近受到了学术界和公众的关注。这些模型提供了文本描述,能够生成描绘各种概念和样式的高质量图像。但是,此类模型接受了大量公共数据的培训,并从其培训数据中隐含地学习关系,这些数据并不明显。我们证明,可以通过简单地用视觉上类似的非拉丁字符替换文本描述中的单个字符来触发并注入生成的图像中的常见多模型模型,这些偏见可以被触发并注入生成的图像。这些所谓的同符文更换使恶意用户或服务提供商能够诱导偏见到生成的图像中,甚至使整个一代流程变得无用。我们实际上说明了对DALL-E 2和稳定扩散的这种攻击,例如文本引导的图像生成模型,并进一步表明夹子的行为也相似。我们的结果进一步表明,经过多语言数据训练的文本编码器提供了一种减轻同符替代效果的方法。
translated by 谷歌翻译
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis. Starting from random noise, such text-to-image diffusion models gradually synthesize images in an iterative fashion while conditioning on text prompts. We find that their synthesis behavior qualitatively changes throughout this process: Early in sampling, generation strongly relies on the text prompt to generate text-aligned content, while later, the text conditioning is almost entirely ignored. This suggests that sharing model parameters throughout the entire generation process may not be ideal. Therefore, in contrast to existing works, we propose to train an ensemble of text-to-image diffusion models specialized for different synthesis stages. To maintain training efficiency, we initially train a single model, which is then split into specialized models that are trained for the specific stages of the iterative generation process. Our ensemble of diffusion models, called eDiff-I, results in improved text alignment while maintaining the same inference computation cost and preserving high visual quality, outperforming previous large-scale text-to-image diffusion models on the standard benchmark. In addition, we train our model to exploit a variety of embeddings for conditioning, including the T5 text, CLIP text, and CLIP image embeddings. We show that these different embeddings lead to different behaviors. Notably, the CLIP image embedding allows an intuitive way of transferring the style of a reference image to the target text-to-image output. Lastly, we show a technique that enables eDiff-I's "paint-with-words" capability. A user can select the word in the input text and paint it in a canvas to control the output, which is very handy for crafting the desired image in mind. The project page is available at https://deepimagination.cc/eDiff-I/
translated by 谷歌翻译
数字艺术合成在多媒体社区中受到越来越多的关注,因为有效地与公众参与了艺术。当前的数字艺术合成方法通常使用单模式输入作为指导,从而限制了模型的表现力和生成结果的多样性。为了解决这个问题,我们提出了多模式引导的艺术品扩散(MGAD)模型,该模型是一种基于扩散的数字艺术品生成方法,它利用多模式提示作为控制无分类器扩散模型的指导。此外,对比度语言图像预处理(剪辑)模型用于统一文本和图像模式。关于生成的数字艺术绘画质量和数量的广泛实验结果证实了扩散模型和多模式指导的组合有效性。代码可从https://github.com/haha-lisa/mgad-multimodal-guided-artwork-diffusion获得。
translated by 谷歌翻译
A major goal of multimodal research is to improve machine understanding of images and text. Tasks include image captioning, text-to-image generation, and vision-language representation learning. So far, research has focused on the relationships between images and text. For example, captioning models attempt to understand the semantics of images which are then transformed into text. An important question is: which annotation reflects best a deep understanding of image content? Similarly, given a text, what is the best image that can present the semantics of the text? In this work, we argue that the best text or caption for a given image is the text which would generate the image which is the most similar to that image. Likewise, the best image for a given text is the image that results in the caption which is best aligned with the original text. To this end, we propose a unified framework that includes both a text-to-image generative model and an image-to-text generative model. Extensive experiments validate our approach.
translated by 谷歌翻译
我们提出了仅使用目标文本提示的3D模型的零击生成技术。在没有任何3D监督的情况下,我们的方法变形了极限细分表面的控制形状及其纹理地图和正常地图,以获得与输入文本提示相对应的3D资产,并且可以轻松地部署到游戏或建模应用程序中。我们仅依靠预先训练的剪辑模型,该模型将输入文本提示与我们3D模型的渲染图像进行了分化。虽然先前的作品集中在风格化或对生成模型的必要培训上,但我们直接对网格参数进行优化,以生成形状,纹理或两者兼而有之。为了限制优化以产生合理的网格和纹理,我们使用图像增强量引入了许多技术,并使用预验证的先验,该技术在给定文本嵌入的情况下生成了剪贴图像嵌入。
translated by 谷歌翻译
文本对图像模型提供了前所未有的自由,可以通过自然语言指导创作。然而,尚不清楚如何行使这种自由以生成特定独特概念,修改其外观或以新角色和新颖场景构成它们的图像。换句话说,我们问:我们如何使用语言指导的模型将猫变成绘画,或者想象基于我们喜欢的玩具的新产品?在这里,我们提出了一种简单的方法,可以允许这种创造性自由。我们仅使用3-5个用户提供的概念(例如对象或样式)的图像,我们学会通过在冷冻文本到图像模型的嵌入空间中通过新的“单词”表示它。这些“单词”可以组成自然语言句子,以直观的方式指导个性化的创作。值得注意的是,我们发现有证据表明单词嵌入足以捕获独特而多样的概念。我们将我们的方法比较了各种基线,并证明它可以更忠实地描绘出一系列应用程序和任务的概念。我们的代码,数据和新单词将在以下网址提供:https://textual-inversion.github.io
translated by 谷歌翻译
最近已被证明扩散模型产生高质量的合成图像,尤其是与指导技术配对,以促进忠诚的多样性。我们探索文本条件图像综合问题的扩散模型,并比较了两种不同的指导策略:剪辑指导和自由分类指导。我们发现后者是人类评估者的优选,用于光敏和标题相似度,并且通常产生光素质拟种样品。使用自由分类指导的35亿参数文本条件扩散模型的样本由人类评估者对来自Dall-E的人的人们青睐,即使后者使用昂贵的剪辑重新划分。此外,我们发现我们的模型可以进行微调,以执行图像修复,从而实现强大的文本驱动的图像编辑。我们在过滤的数据集中培训较小的模型,并在https://github.com/openai/glide-text2im释放代码和权重。
translated by 谷歌翻译
我们提出了快速的文本2stylegan,这是一种自然语言界面,可适应预先训练的甘体,以实现文本引导的人脸合成。利用对比性语言图像预训练(剪辑)的最新进展,在培训过程中不需要文本数据。Fast Text2Stylegan被配制为条件变异自动编码器(CVAE),可在测试时为生成的图像提供额外的控制和多样性。我们的模型在遇到新的文本提示时不需要重新训练或微调剂或剪辑。与先前的工作相反,我们不依赖于测试时间的优化,这使我们的方法数量级比先前的工作快。从经验上讲,在FFHQ数据集上,我们的方法提供了与先前的工作相比,自然语言描述中具有不同详细程度的自然语言描述中的图像。
translated by 谷歌翻译
关于文本到图像生成的研究在产生多样化和照片现实的图像方面取得了重大进展,这是由在大规模图像文本数据上训练的扩散和自动回归模型驱动的。尽管最先进的模型可以产生共同实体的高质量图像,但它们通常很难产生不常见的实体的图像,例如“ chortai(dog)”或“ picarones(食物)”。为了解决此问题,我们介绍了检索型的文本对图像生成器(Re-Imagen),这是一种生成模型,它使用检索到的信息来产生高保真和忠实的图像,即使对于稀有或看不见的实体也是如此。给定文本提示,重新构造访问外部多模式知识库以检索相关(图像,文本)对,并将它们用作引用来生成图像。通过此检索步骤,重新构造的知识是对上述实体的高级语义和低级视觉细节的了解,从而提高了其在产生实体视觉外观的准确性。我们在包含(图像,文本,检索)的构造数据集上训练Re-Imagen,以教导该模型在文本提示和检索上扎根。此外,我们制定了一种新的抽样策略,以使文本和检索条件的无分类指南交流,以平衡文本和检索对齐。 Re-Imagen在两个图像生成基准上获得了新的SOTA FID结果,例如Coco(IE,FID = 5.25)和Wikiimage(即FID = 5.82),而无需微调。为了进一步评估该模型的功能,我们介绍了EntityDrawBench,这是一种新的基准测试,可评估从多个视觉域的各种实体的图像生成,从频繁到稀有。人类对EntityDrawBench的评估表明,Re-Imagen与照片现实主义中最好的先前模型相同,但具有明显的忠诚,尤其是在较不频繁的实体上。
translated by 谷歌翻译
利用深度学习的最新进展,文本到图像生成模型目前具有吸引公众关注的优点。其中两个模型Dall-E 2和Imagen已经证明,可以从图像的简单文本描述中生成高度逼真的图像。基于一种称为扩散模型的新型图像生成方法,文本对图像模型可以生产许多不同类型的高分辨率图像,其中人类想象力是唯一的极限。但是,这些模型需要大量的计算资源来训练,并处理从互联网收集的大量数据集。此外,代码库和模型均未发布。因此,它可以防止AI社区尝试这些尖端模型,从而使其结果复制变得复杂,即使不是不可能。在本文中,我们的目标是首先回顾这些模型使用的不同方法和技术,然后提出我们自己的文本模型模型实施。高度基于DALL-E 2,我们引入了一些轻微的修改,以应对所引起的高计算成本。因此,我们有机会进行实验,以了解这些模型的能力,尤其是在低资源制度中。特别是,我们提供了比Dall-e 2的作者(包括消融研究)更深入的分析。此外,扩散模型使用所谓的指导方法来帮助生成过程。我们引入了一种新的指导方法,该方法可以与其他指导方法一起使用,以提高图像质量。最后,我们的模型产生的图像质量相当好,而不必维持最先进的文本对图像模型的重大培训成本。
translated by 谷歌翻译
Research connecting text and images has recently seen several breakthroughs, with models like CLIP, DALL-E 2, and Stable Diffusion. However, the connection between text and other visual modalities, such as lidar data, has received less attention, prohibited by the lack of text-lidar datasets. In this work, we propose LidarCLIP, a mapping from automotive point clouds to a pre-existing CLIP embedding space. Using image-lidar pairs, we supervise a point cloud encoder with the image CLIP embeddings, effectively relating text and lidar data with the image domain as an intermediary. We show the effectiveness of LidarCLIP by demonstrating that lidar-based retrieval is generally on par with image-based retrieval, but with complementary strengths and weaknesses. By combining image and lidar features, we improve upon both single-modality methods and enable a targeted search for challenging detection scenarios under adverse sensor conditions. We also use LidarCLIP as a tool to investigate fundamental lidar capabilities through natural language. Finally, we leverage our compatibility with CLIP to explore a range of applications, such as point cloud captioning and lidar-to-image generation, without any additional training. We hope LidarCLIP can inspire future work to dive deeper into connections between text and point cloud understanding. Code and trained models available at https://github.com/atonderski/lidarclip.
translated by 谷歌翻译
大型文本对图像模型在AI的演变中取得了显着的飞跃,从而使图像从给定的文本提示中实现了高质量和多样化的图像合成。但是,这些模型缺乏在给定的参考集中模仿受试者的外观,并在不同情况下合成它们的新颖性。在这项工作中,我们提出了一种新的方法,用于“个性化”文本图像扩散模型(将它们专门针对用户的需求)。仅作为一个主题的几张图像给出,我们将验证的文本对图像模型(图像,尽管我们的方法不限于特定模型),以便它学会了将唯一标识符与该特定主题结合。一旦将受试者嵌入模型的输出域中,就可以使用唯一标识符来合成主题的完全新颖的光真逼真的图像在不同场景中的上下文化。通过利用具有新的自动构基特异性的先前保存损失的语义先验嵌入到模型中,我们的技术可以在参考图像中未出现的不同场景,姿势,视图和照明条件中合成主题。我们将技术应用于几个以前无用的任务,包括主题重新定义,文本指导的视图合成,外观修改和艺术渲染(所有这些都保留了主题的关键特征)。项目页面:https://dreambooth.github.io/
translated by 谷歌翻译
通过使用图像文本匹配模型的使用,零光学习在计算机视觉中的应用已彻底改变。最值得注意的示例,剪辑,已广泛用于带有文本提示的零摄像分类和指导生成模型。但是,对于输入文本的措辞,夹子的零拍情况不稳定,因此有必要仔细设计所用的提示。我们发现这种不稳定性源于选择性相似性分数,该得分仅基于语义上有意义的输入令牌的子集。为了减轻它,我们提出了一种新颖的基于可解释的方法,该方法增加了损失术语,以确保剪辑专注于输入的所有相关语义部分,此外还采用了以前的作品中使用的夹子相似性损失。当通过及时的工程应用于单发分类时,我们的方法可以提高识别率,而无需进行额外的培训或微调。此外,我们表明使用我们的方法对生成模型的剪辑指导显着改善了生成的图像。最后,我们通过在对象位置进行空间条件来证明对基于文本的图像生成的新颖使用,这是需要将图像解释性热图限制在预定的边界框中。
translated by 谷歌翻译
Recent CLIP-guided 3D optimization methods, e.g., DreamFields and PureCLIPNeRF achieve great success in zero-shot text-guided 3D synthesis. However, due to the scratch training and random initialization without any prior knowledge, these methods usually fail to generate accurate and faithful 3D structures that conform to the corresponding text. In this paper, we make the first attempt to introduce the explicit 3D shape prior to CLIP-guided 3D optimization methods. Specifically, we first generate a high-quality 3D shape from input texts in the text-to-shape stage as the 3D shape prior. We then utilize it as the initialization of a neural radiance field and then optimize it with the full prompt. For the text-to-shape generation, we present a simple yet effective approach that directly bridges the text and image modalities with a powerful text-to-image diffusion model. To narrow the style domain gap between images synthesized by the text-to-image model and shape renderings used to train the image-to-shape generator, we further propose to jointly optimize a learnable text prompt and fine-tune the text-to-image diffusion model for rendering-style image generation. Our method, namely, Dream3D, is capable of generating imaginative 3D content with better visual quality and shape accuracy than state-of-the-art methods.
translated by 谷歌翻译