Image captioning is one of the straightforward tasks that can take advantage of large-scale web-crawled data, which provides rich knowledge about the visual world for a captioning model. However, since web-crawled data contains image-text pairs that are aligned at different levels, the inherent noise (e.g., misaligned pairs) makes it difficult to learn a precise captioning model. While a filtering strategy can effectively remove noisy data, it reduces the amount of learnable knowledge and sometimes brings about a new problem of data deficiency. To get the best of both worlds, we propose a noise-aware learning framework which learns rich knowledge from the entire web-crawled data while being less affected by the noise. This is achieved by the proposed quality-controllable model, which is trained using the alignment levels of the image-text pairs as an additional control signal. The alignment-conditioned training allows the model to generate well-aligned, high-quality captions simply by setting the control signal to the desired alignment level at inference time. Through in-depth analysis, we show that our controllable captioning model is effective in handling noise. In addition, with the two tasks of zero-shot captioning and text-to-image retrieval using generated captions (i.e., self-retrieval), we also demonstrate that our model can produce high-quality captions in terms of descriptiveness and distinctiveness. Code is available at \url{https://github.com/kakaobrain/noc}.
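A minimal sketch of what such alignment-conditioned training might look like, assuming the alignment of each pair is bucketed into a few discrete levels (e.g., from a CLIP-style similarity score) and injected as a learned control embedding; the bucket count, model shape, and names here are illustrative, not taken from the paper:

```python
import torch
import torch.nn as nn

NUM_ALIGN_LEVELS = 4  # hypothetical number of alignment buckets

class ControllableCaptioner(nn.Module):
    """Toy captioner conditioned on an alignment-level control signal."""
    def __init__(self, vocab_size=1000, d_model=256):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.align_emb = nn.Embedding(NUM_ALIGN_LEVELS, d_model)
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, image_feat, align_level, caption_tokens):
        # Control signal: start decoding from image feature + alignment embedding.
        h0 = (image_feat + self.align_emb(align_level)).unsqueeze(0)
        x = self.token_emb(caption_tokens)
        out, _ = self.decoder(x, h0)
        return self.head(out)  # next-token logits

# Training: align_level comes from the measured alignment of each noisy pair.
# Inference: set align_level to the highest bucket to request well-aligned captions.
model = ControllableCaptioner()
image_feat = torch.randn(8, 256)
align_level = torch.randint(0, NUM_ALIGN_LEVELS, (8,))  # e.g., bucketed CLIP score
captions = torch.randint(0, 1000, (8, 12))
logits = model(image_feat, align_level, captions)
```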
Despite the continued evolution of convolutional neural networks (CNNs), their performance depends surprisingly strongly on the choice of hyperparameters. However, efficiently exploring large hyperparameter search spaces remains challenging due to the long training times of modern CNNs. Multi-fidelity optimization enables the exploration of more hyperparameter configurations by terminating unpromising configurations early, but it often leads to selecting suboptimal configurations because training typically converges slowly in the early phase. In this paper, we propose Multi-fidelity Optimization with a Recurring Learning rate (MORL), which incorporates the optimization process of CNNs into multi-fidelity optimization. MORL alleviates the slow-starter problem and achieves more precise low-fidelity approximations. Our comprehensive experiments on general image classification, transfer learning, and semi-supervised learning demonstrate the effectiveness of MORL over other multi-fidelity optimization methods such as the Successive Halving Algorithm (SHA) and Hyperband. Furthermore, it achieves significant performance improvements over hand-tuned hyperparameter configurations within a practical budget.
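One schematic reading of the idea, sketched below under loose assumptions: under successive halving, each rung resumes training with a restarted (recurring) learning-rate schedule that decays within the rung, so that the low-fidelity score better reflects a fully-converged model. The `train_step`/scoring parts are stand-ins, and the schedule choice is illustrative rather than the paper's exact method:

```python
import math
import random

def recurring_lr(base_lr, step, steps_per_rung):
    # The learning rate restarts from base_lr at the start of every rung and
    # decays to ~0 by the rung's end -- the "recurring" schedule.
    return 0.5 * base_lr * (1 + math.cos(math.pi * step / steps_per_rung))

def successive_halving(configs, rungs=3, steps_per_rung=100, eta=2):
    """Schematic SHA loop; training and evaluation are stubbed out."""
    survivors = configs
    for rung in range(rungs):
        scores = []
        for cfg in survivors:
            for step in range(steps_per_rung):
                lr = recurring_lr(cfg["base_lr"], step, steps_per_rung)
                # train_step(cfg, lr)  # one optimization step (omitted)
            scores.append((random.random(), cfg))  # evaluate(cfg) in reality
        scores.sort(key=lambda s: s[0], reverse=True)
        survivors = [cfg for _, cfg in scores[: max(1, len(scores) // eta)]]
    return survivors[0]

best = successive_halving(
    [{"base_lr": 10 ** random.uniform(-4, -1)} for _ in range(8)])
```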
Graph pooling is a crucial operation for encoding hierarchical structures within graphs. Most existing graph pooling methods formulate the problem as a node clustering task, which effectively captures graph topology. Conventional methods ask users to specify an appropriate number of clusters as a hyperparameter, and then assume that all input graphs share the same number of clusters. In inductive settings where the number of clusters can vary, however, the model should be able to represent this variation in its pooling layers in order to learn suitable clusters. We therefore propose GMPool, a novel differentiable graph pooling architecture that automatically determines the appropriate number of clusters based on the input data. The main intuition involves a grouping matrix, defined as a quadratic form of the pooling operator, which induces the use of binary classification probabilities over pairwise combinations of nodes. GMPool first computes the grouping matrix and then decomposes it to obtain the pooling operator. Extensive evaluations on molecular property prediction tasks demonstrate that our method outperforms conventional methods.
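One way to read the grouping-matrix idea: if P is a (soft) operator assigning n nodes to clusters, then G = P Pᵀ is an n×n matrix whose entries behave like probabilities that two nodes share a cluster. A model can predict G from pairwise node features and recover both P and the cluster count by decomposing G. An illustrative numpy sketch, with the rank-selection rule being an assumption of this example rather than the paper's criterion:

```python
import numpy as np

def decompose_grouping_matrix(G, energy=0.95):
    """Recover a pooling operator P (and cluster count) from G ~ P @ P.T."""
    # G is symmetric PSD by construction, so eigendecomposition applies.
    eigvals, eigvecs = np.linalg.eigh(G)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # descending order
    # Keep enough components to explain most of the spectrum; the retained
    # rank plays the role of the automatically-determined number of clusters.
    k = int(np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), energy)) + 1
    P = eigvecs[:, :k] * np.sqrt(np.clip(eigvals[:k], 0, None))
    return P, k

# Toy G with two obvious node groups (in GMPool, G would instead come from a
# network predicting pairwise same-cluster probabilities).
assign = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
G = assign @ assign.T
P, k = decompose_grouping_matrix(G)
print(k)  # 2 clusters recovered
```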
Semi-supervised video object segmentation (VOS) aims to densely track certain designated objects in videos. One of the main challenges in this task is the existence of background distractors that appear similar to the target objects. We propose three novel strategies to suppress such distractors: 1) a spatio-temporally diversified template construction scheme to obtain generalized properties of the target objects; 2) a learnable distance-scoring function to exclude spatially distant distractors by exploiting the temporal consistency between two consecutive frames; and 3) swap-and-attach augmentation, which forces each object to have unique features by providing training samples containing entangled objects. On all public benchmark datasets, our model achieves performance on par with contemporary state-of-the-art approaches, even with real-time performance. Qualitative results also demonstrate the superiority of our approach over existing methods. We believe our approach will be widely used in future VOS research.
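The third strategy is the easiest to make concrete. A minimal sketch of a swap-and-attach-style augmentation, assuming two training samples with object masks: each sample's object is pasted onto the other frame, producing overlapping ("entangled") objects; the exact compositing and mask bookkeeping here are this example's assumptions:

```python
import numpy as np

def swap_and_attach(img_a, mask_a, img_b, mask_b):
    """Illustrative swap-and-attach: paste each sample's object onto the other.

    img_*: HxWx3 float arrays; mask_*: HxW boolean object masks.
    The resulting frames contain overlapping objects, which pushes the model
    to learn features that remain unique per object.
    """
    out_a, out_b = img_a.copy(), img_b.copy()
    out_a[mask_b] = img_b[mask_b]  # attach B's object onto frame A
    out_b[mask_a] = img_a[mask_a]  # attach A's object onto frame B
    # New ground truth: original object minus the area occluded by the pasted one.
    new_mask_a = mask_a & ~mask_b
    new_mask_b = mask_b & ~mask_a
    return out_a, new_mask_a, out_b, new_mask_b

h = w = 64
img_a, img_b = np.random.rand(h, w, 3), np.random.rand(h, w, 3)
mask_a = np.zeros((h, w), bool); mask_a[10:30, 10:30] = True
mask_b = np.zeros((h, w), bool); mask_b[20:40, 20:40] = True
aug = swap_and_attach(img_a, mask_a, img_b, mask_b)
```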
We show that standard Transformers without graph-specific modifications can lead to promising results in graph learning, both in theory and in practice. Given a graph, we simply treat all nodes and edges as independent tokens, augment them with token embeddings, and feed them to a Transformer. With an appropriate choice of token embeddings, we prove that this approach is theoretically at least as expressive as an invariant graph network (2-IGN) composed of equivariant linear layers, which is already more expressive than all message-passing graph neural networks (GNNs). When trained on a large-scale graph dataset (PCQM4Mv2), our method, coined Tokenized Graph Transformer (TokenGT), achieves significantly better results than GNN baselines and competitive results compared to Transformer variants with sophisticated graph-specific inductive biases. Our implementation is available at https://github.com/jw9730/tokengt.
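A minimal sketch of the tokenization described above: every node and every edge becomes one token; a node token carries its feature plus its own node identifier (repeated twice), an edge token carries its feature plus the identifiers of its two endpoints, and a type embedding distinguishes node tokens from edge tokens before a standard Transformer encoder. The dimensions are arbitrary, and the random-normal identifiers stand in for the orthonormal identifiers used in practice:

```python
import torch
import torch.nn as nn

class TinyTokenGT(nn.Module):
    """Sketch of TokenGT-style tokenization: nodes and edges as plain tokens."""
    def __init__(self, feat_dim=16, id_dim=16, d_model=64):
        super().__init__()
        self.proj = nn.Linear(feat_dim + 2 * id_dim, d_model)
        self.type_emb = nn.Embedding(2, d_model)  # 0 = node token, 1 = edge token
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.id_dim = id_dim

    def forward(self, node_feat, edge_feat, edge_index):
        n = node_feat.size(0)
        # Node identifiers; TokenGT uses orthonormal vectors (e.g., random
        # orthogonal or Laplacian eigenvectors) -- random normal here for brevity.
        node_id = torch.randn(n, self.id_dim)
        # Node token: [feature, own id, own id]; edge token: [feature, id_u, id_v].
        node_tok = torch.cat([node_feat, node_id, node_id], dim=-1)
        edge_tok = torch.cat(
            [edge_feat, node_id[edge_index[0]], node_id[edge_index[1]]], dim=-1)
        tokens = self.proj(torch.cat([node_tok, edge_tok], dim=0))
        types = torch.cat([torch.zeros(n, dtype=torch.long),
                           torch.ones(edge_feat.size(0), dtype=torch.long)])
        return self.encoder((tokens + self.type_emb(types)).unsqueeze(0))

model = TinyTokenGT()
out = model(torch.randn(5, 16), torch.randn(7, 16),
            torch.randint(0, 5, (2, 7)))  # 5 nodes, 7 edges
```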
In many data domains, co-occurrence statistics about the joint appearance of objects are powerfully informative. By transforming unsupervised learning problems into decompositions of co-occurrence statistics, spectral algorithms provide transparent and efficient algorithms for posterior inference such as latent topic analysis and community detection. As object vocabularies grow, however, it rapidly becomes more expensive to store co-occurrence statistics and to run inference algorithms on them. Rectifying co-occurrence, the key process that upholds the model assumptions, becomes increasingly important in the presence of rare terms, yet current techniques cannot scale to large vocabularies. We propose novel methods that simultaneously compress and rectify co-occurrence statistics, scaling gracefully with the size of the vocabulary and the dimension of the latent space. We also present new algorithms that learn latent variables from the compressed statistics, and verify that our methods perform comparably to previous approaches on both textual and non-textual data.
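To make the compress-then-rectify recipe concrete, here is a deliberately simplified sketch: project the V×V co-occurrence matrix to a small d×d matrix with a random projection, then rectify by enforcing symmetry, positive semi-definiteness, and normalization in the compressed space. The specific projection and rectification steps are assumptions of this example, not the paper's exact construction:

```python
import numpy as np

def compress_and_rectify(C, d=64, seed=0):
    """Sketch: random-projection compression of a co-occurrence matrix,
    followed by rectification (symmetry + PSD + normalization)."""
    rng = np.random.default_rng(seed)
    V = C.shape[0]
    R = rng.normal(size=(V, d)) / np.sqrt(d)    # Johnson-Lindenstrauss style
    S = R.T @ C @ R                             # d x d compressed statistics
    S = 0.5 * (S + S.T)                         # enforce symmetry
    w, U = np.linalg.eigh(S)
    S = U @ np.diag(np.clip(w, 0, None)) @ U.T  # project onto the PSD cone
    return S / S.sum()                          # normalize like a joint pmf

V = 2000  # vocabulary size; the point is that d << V
counts = np.random.default_rng(1).poisson(0.05, size=(V, V)).astype(float)
C = counts + counts.T
S = compress_and_rectify(C)
```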
Semi-supervised video object segmentation (VOS) aims to track, at the pixel level, the designated objects present in the initial frame of a video. To fully exploit the appearance information of an object, pixel-level matching is widely used in VOS. Conventional feature matching operates in a surjective manner, i.e., only the best matches from the query frame to the reference frame are considered. Each location in the query frame refers to the optimal location in the reference frame regardless of how often each reference-frame location is referenced. This works well in most cases and is robust to rapid appearance variations, but it can cause critical errors when the query frame contains background distractors that look similar to the target object. To mitigate this issue, we introduce a bijective matching mechanism that finds the best matches from the query frame to the reference frame and vice versa. Before finding the best matches for the query-frame pixels, the optimal matches for the reference-frame pixels are first considered to prevent each reference-frame pixel from being over-referenced. Since this mechanism operates in a strict manner, i.e., pixels are connected if and only if they are the sure matches for each other, it can effectively eliminate background distractors. In addition, we propose a mask embedding module to improve the existing mask propagation method. By embedding multiple historic masks with coordinate information, it can effectively capture the position information of a target object.
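The bijective mechanism can be illustrated with a similarity matrix: a query-reference pixel pair is connected only when each pixel is the other's argmax, i.e., a mutual best match. A minimal numpy sketch under that reading, with the feature shapes and normalization as assumptions:

```python
import numpy as np

def bijective_match(query_feats, ref_feats):
    """Keep only mutual best matches between query and reference pixels.

    query_feats: (Nq, C), ref_feats: (Nr, C), L2-normalized descriptors.
    Returns index pairs (q, r) where q's best match is r AND r's best match is q.
    """
    sim = query_feats @ ref_feats.T        # (Nq, Nr) cosine similarities
    best_ref = sim.argmax(axis=1)          # query -> reference direction
    best_query = sim.argmax(axis=0)        # reference -> query direction
    q_idx = np.arange(sim.shape[0])
    mutual = best_query[best_ref] == q_idx  # strict "if and only if" check
    return q_idx[mutual], best_ref[mutual]

rng = np.random.default_rng(0)
q = rng.normal(size=(100, 32)); q /= np.linalg.norm(q, axis=1, keepdims=True)
r = rng.normal(size=(120, 32)); r /= np.linalg.norm(r, axis=1, keepdims=True)
q_ids, r_ids = bijective_match(q, r)  # pixels without a mutual partner
                                      # simply remain unconnected
```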
Many problems in computer vision and machine learning can be cast as learning on hypergraphs that represent higher-order relations. Recent approaches to hypergraph learning extend graph neural networks based on message passing, which is simple yet fundamentally limited in modeling long-range dependencies and in expressive power. On the other hand, tensor-based equivariant neural networks enjoy maximal expressiveness, but their application to hypergraphs has been limited due to heavy computation and strict assumptions on fixed-order hyperedges. We resolve these problems and present Equivariant Hypergraph Neural Networks (EHNN), the first attempt to realize maximally expressive equivariant layers for general hypergraph learning. We also present two practical realizations based on hypernetworks (EHNN-MLP) and self-attention (EHNN-Transformer), which are easy to implement and theoretically more expressive than most message-passing approaches. We demonstrate their capability on a range of hypergraph learning problems, including synthetic k-edge identification, semi-supervised classification, and visual keypoint matching, and report improved performance over strong message-passing baselines. Our implementation is available at https://github.com/jw9730/ehnn.
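The hypernetwork realization can be caricatured as follows: each hyperedge's message is an aggregation over its member nodes, transformed by weights that a small hypernetwork generates from the hyperedge's order, so one layer handles hyperedges of any size. The sketch below is a toy reduction of that idea, with the aggregation, scatter rule, and sizes as assumptions:

```python
import torch
import torch.nn as nn

class ToyOrderConditionedLayer(nn.Module):
    """Caricature of EHNN-MLP: per-hyperedge weights produced by a hypernetwork
    conditioned on the hyperedge order, so one layer covers arbitrary orders."""
    def __init__(self, d=32, max_order=16):
        super().__init__()
        self.order_emb = nn.Embedding(max_order + 1, d)
        self.hyper = nn.Linear(d, d * d)  # hypernetwork: order -> weight matrix

    def forward(self, x, hyperedges):
        # x: (n, d) node features; hyperedges: list of node-index lists.
        msgs = torch.zeros_like(x)
        for nodes in hyperedges:
            order = torch.tensor(len(nodes))
            W = self.hyper(self.order_emb(order)).view(x.size(1), x.size(1))
            m = x[nodes].sum(dim=0) @ W   # order-aware hyperedge message
            msgs[nodes] += m              # scatter back to member nodes
        return torch.relu(x + msgs)

layer = ToyOrderConditionedLayer()
x = torch.randn(6, 32)
out = layer(x, [[0, 1, 2], [2, 3], [1, 3, 4, 5]])  # mixed-order hyperedges
```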
3D-aware image synthesis focuses on preserving spatial consistency while generating high-resolution images with fine details. Recently, the Neural Radiance Field (NeRF) has been introduced for synthesizing novel views with low computational cost and superior performance. While several works investigate a generative NeRF and show remarkable achievements, they cannot handle conditional and continuous feature manipulation in the generation procedure. In this work, we introduce a novel model, called Class-Continuous Conditional Generative NeRF ($\text{C}^{3}$G-NeRF), which can synthesize conditionally manipulated photorealistic 3D-consistent images by projecting conditional features into the generator and the discriminator. The proposed $\text{C}^{3}$G-NeRF is evaluated on three image datasets: AFHQ, CelebA, and Cars. As a result, our model shows strong 3D consistency with fine details and smooth interpolation under conditional feature manipulation. For instance, $\text{C}^{3}$G-NeRF achieves a Fr\'echet Inception Distance (FID) of 7.64 in 3D-aware face image synthesis at a $\text{128}^{2}$ resolution. Additionally, we provide FIDs of generated 3D-aware images for each class of the datasets, since it is possible to synthesize class-conditional images with $\text{C}^{3}$G-NeRF.
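A minimal sketch of the conditioning pattern described above, assuming a NeRF-style MLP whose per-point input is concatenated with a condition vector; blending two class one-hots illustrates the "class-continuous" interpolation. Rendering, the discriminator, and adversarial training are omitted, and all names and sizes are this example's assumptions:

```python
import torch
import torch.nn as nn

class ToyConditionalNeRFGenerator(nn.Module):
    """Sketch: NeRF-style MLP whose input is (position, view dir, condition)."""
    def __init__(self, cond_dim=3, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + 3 + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # RGB + density per sampled point
        )

    def forward(self, xyz, view_dir, cond):
        # The condition is projected into every point query, so the same
        # radiance field renders differently as the condition changes.
        cond = cond.expand(xyz.size(0), -1)
        return self.mlp(torch.cat([xyz, view_dir, cond], dim=-1))

gen = ToyConditionalNeRFGenerator()
xyz = torch.rand(1024, 3)  # points sampled along camera rays
view_dir = torch.nn.functional.normalize(torch.randn(1024, 3), dim=-1)
# "Class-continuous": blend two class one-hots for smooth interpolation.
alpha = 0.3
cond = alpha * torch.tensor([[1., 0., 0.]]) \
     + (1 - alpha) * torch.tensor([[0., 1., 0.]])
rgb_sigma = gen(xyz, view_dir, cond)  # (1024, 4)
```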
Cellular automata (CA) captivate researchers due to the emergent, complex, individualized behavior that simple global rules of interaction enact. Recent advances in the field have combined CA with convolutional neural networks to achieve self-regenerating images. This new branch of CA is called neural cellular automata [1]. The goal of this project is to use the idea of neural cellular automata to grow prediction machines. We place many different convolutional neural networks in a grid. Each conv-net cell outputs a prediction of what the next state will be and minimizes predictive error. Cells receive their neighbors' colors and fitnesses as input, where each cell's fitness score describes how accurate its predictions are. Cells can also move to explore their environment, with some stochasticity applied to movement.
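A stripped-down sketch of this setup: cells on a grid see colors and fitnesses, predict the next grid state, and earn fitness from prediction accuracy. For brevity, a single shared conv net plays all cells, the environment dynamics are a toy shift, and movement/stochasticity are omitted; all of these simplifications are this example's assumptions:

```python
import torch
import torch.nn as nn

# One shared conv net stands in for the grid of per-cell networks; each spatial
# location acts as a cell that sees (color, fitness) and predicts the next state.
net = nn.Sequential(
    nn.Conv2d(2, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 1, kernel_size=1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

grid = torch.rand(1, 1, 16, 16)    # cell "colors"
fitness = torch.zeros_like(grid)   # per-cell fitness channel

for step in range(50):
    inp = torch.cat([grid, fitness], dim=1)   # cells see colors + fitnesses
    pred = net(inp)                           # each cell's next-state guess
    next_grid = torch.roll(grid, 1, dims=-1)  # toy environment dynamics
    err = (pred - next_grid) ** 2
    opt.zero_grad(); err.mean().backward(); opt.step()
    fitness = (-err).detach()                 # accurate cells score higher
    grid = next_grid
```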