We tackle a new problem: multi-view camera and subject registration in the bird's eye view (BEV) without pre-given camera calibration. This is a very challenging problem, since the only input is several RGB images of a multi-person scene taken from different first-person views (FPVs), without a BEV image or calibration of the FPVs, while the output is a unified plane with the localization and orientation of both the subjects and the cameras in a BEV. We propose an end-to-end framework to solve this problem, whose main idea consists of the following parts: i) a view-transform subject detection module that maps each FPV into a virtual BEV, including the localization and orientation of each pedestrian; ii) a geometric-transformation-based method that estimates camera localization and view direction, i.e., the camera registration, in a unified BEV; iii) an aggregation step that uses spatial and appearance information to merge the subjects into the unified BEV. We collect a new large-scale synthetic dataset with rich annotations for evaluation. The experimental results show the remarkable effectiveness of our proposed method.
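To make step ii) concrete, here is a minimal sketch of one way the geometric camera registration could be carried out, assuming each camera's virtual BEV detections have already been matched to a reference view's detections (the paper's aggregation step handles that matching with spatial and appearance cues). The Kabsch/Procrustes alignment, the `register_camera` helper, and the toy data are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch (not the paper's code) of the geometric registration step:
# align one camera's virtual-BEV detections to a reference BEV, assuming the
# subject correspondences between the two views are already known.
import numpy as np

def register_camera(ref_pts, cam_pts):
    """Estimate the 2D rotation R and translation t mapping points detected in a
    camera's virtual BEV (camera at the origin, facing +y) onto the reference BEV,
    via the Kabsch/Procrustes method."""
    ref_c, cam_c = ref_pts.mean(0), cam_pts.mean(0)
    H = (cam_pts - cam_c).T @ (ref_pts - ref_c)        # 2x2 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))             # guard against reflections
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    t = ref_c - R @ cam_c
    return R, t                                        # camera location = t, view direction = R @ [0, 1]

# Toy usage: a camera 5 m to the right of the reference view, rotated 90 degrees.
ref = np.array([[0., 2.], [1., 4.], [-2., 3.]])        # subjects in the reference BEV
theta = np.pi / 2
R_true = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
cam = (ref - np.array([5., 0.])) @ R_true              # the same subjects seen from that camera
R, t = register_camera(ref, cam)
print(np.round(t, 3), np.round(R @ np.array([0., 1.]), 3))   # recovered camera location and view direction
```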
Recent work has shown that large pretrained language models (LMs) not only perform well on a range of natural language processing (NLP) tasks, but also begin to improve on reasoning tasks such as arithmetic induction, symbolic manipulation, and commonsense reasoning as model scale increases. However, it remains unclear what the underlying capabilities of these LMs are. Surprisingly, we find that these models have limitations on certain basic symbolic manipulation tasks such as copying, reversing, and addition. When the total number of symbols or the number of repeated symbols increases, model performance drops rapidly. We investigate the potential causes behind this phenomenon and examine a set of possible remedies, including explicit positional markers, fine-grained computation steps, and LMs with callable programs. Experimental results show that none of these techniques can fully solve even the simplest addition induction problem. Finally, we introduce LMs with a tutor, which demonstrates every single step of the computation. Tutor-equipped LMs achieve 100% accuracy under out-of-distribution (OOD) inputs and repeated symbols, shedding new light on the limits of large LMs in induction.
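To illustrate what "demonstrating every single step" looks like for addition, here is a small sketch that generates a digit-by-digit tutor trace with explicit carries. The exact prompt format used in the paper is not shown here, so the layout and the `addition_tutor_trace` helper are assumptions.

```python
# A minimal sketch of a tutor-style demonstration: spell out every digit-by-digit
# step of an addition (with carries) so the model never has to induce the
# algorithm itself. The trace layout is an illustrative assumption.
def addition_tutor_trace(a: int, b: int) -> str:
    xs, ys = str(a)[::-1], str(b)[::-1]          # process the least-significant digit first
    carry, steps, result = 0, [], []
    for i in range(max(len(xs), len(ys))):
        d1 = int(xs[i]) if i < len(xs) else 0
        d2 = int(ys[i]) if i < len(ys) else 0
        s = d1 + d2 + carry
        steps.append(f"step {i + 1}: {d1} + {d2} + carry {carry} = {s}, "
                     f"write {s % 10}, carry {s // 10}")
        result.append(str(s % 10))
        carry = s // 10
    if carry:
        result.append(str(carry))
        steps.append(f"final carry: write {carry}")
    steps.append(f"answer: {''.join(reversed(result))}")
    return "\n".join(steps)

print(addition_tutor_trace(987, 46))   # ends with "answer: 1033"
```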
Human group detection, which splits a crowd of people into groups, is an important step for video-based human social activity analysis. The core of human group detection is the representation and division of human social relations. In this paper, we propose a new two-stage multi-head framework for human group detection. In the first stage, we propose a human behavior simulator head to learn the social relation feature embedding, which is trained in a self-supervised manner by leveraging the socially grounded multi-person behavior relationship. In the second stage, based on the social relation embedding, we develop a self-attention-inspired network for human group detection. Remarkable performance on two state-of-the-art large-scale benchmarks, i.e., PANDA and JRDB-Group, verifies the effectiveness of the proposed framework. Benefiting from the self-supervised social relation embedding, our method can produce promising results with very few (labeled) training data. We will release the source code to the public.
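As a rough illustration of how a learned relation embedding can be turned into groups, the sketch below thresholds a pairwise social-relation affinity matrix and takes connected components. The paper's actual division stage is an attention-based network, so this simplified baseline, the `detect_groups` helper, and the toy affinities are assumptions.

```python
# A simplified sketch of the second-stage grouping, assuming we already have
# pairwise social-relation scores (e.g., cosine similarity of the learned
# relation embeddings for each pair of people).
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def detect_groups(affinity: np.ndarray, threshold: float = 0.5):
    """Split N people into groups: link pairs whose relation score exceeds the
    threshold and return the connected components as group labels."""
    adj = (affinity > threshold).astype(np.int8)
    np.fill_diagonal(adj, 0)
    n_groups, labels = connected_components(csr_matrix(adj), directed=False)
    return n_groups, labels

# Toy affinity for 5 people: {0, 1, 2} walk together, {3, 4} walk together.
aff = np.array([
    [1.0, 0.9, 0.7, 0.1, 0.0],
    [0.9, 1.0, 0.8, 0.2, 0.1],
    [0.7, 0.8, 1.0, 0.0, 0.1],
    [0.1, 0.2, 0.0, 1.0, 0.9],
    [0.0, 0.1, 0.1, 0.9, 1.0],
])
print(detect_groups(aff))   # (2, array([0, 0, 0, 1, 1]))
```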
We propose a cross-modal attention distillation framework to train a dual-encoder model for vision-language understanding tasks, such as visual reasoning and visual question answering. Dual-encoder models offer faster inference than fusion-encoder models and allow image and text representations to be pre-computed. However, the shallow interaction module used in dual-encoder models is insufficient for complex vision-language understanding tasks. To learn deep interactions between images and text, we introduce cross-modal attention distillation, which uses the image-to-text and text-to-image attention distributions of a fusion-encoder model to guide the training of our dual-encoder model. In addition, we show that applying cross-modal attention distillation in both the pre-training and fine-tuning stages brings further improvements. Experimental results demonstrate that the distilled dual-encoder model achieves competitive performance on visual reasoning, visual entailment, and visual question answering, while enjoying a much faster inference speed than fusion-encoder models. Our code and models will be publicly available at https://github.com/kugwzk/Distilled-DualEncoder.
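A hedged sketch of the core training signal described above: match the student dual encoder's image-to-text and text-to-image attention distributions to the fusion teacher's with a KL term. The feature shapes, the single-head attention, and the `cross_modal_attn_distill_loss` helper are illustrative assumptions; the released code is the reference implementation.

```python
# Illustrative cross-modal attention distillation loss (not the released code).
import torch
import torch.nn.functional as F

def attention_distribution(q, k):
    """Scaled dot-product attention probabilities. q: (B, Nq, D), k: (B, Nk, D)."""
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
    return scores.softmax(dim=-1)

def cross_modal_attn_distill_loss(img_feat_s, txt_feat_s, attn_i2t_teacher, attn_t2i_teacher):
    """KL(teacher || student) on both image-to-text and text-to-image attention maps."""
    attn_i2t_s = attention_distribution(img_feat_s, txt_feat_s)
    attn_t2i_s = attention_distribution(txt_feat_s, img_feat_s)
    loss_i2t = F.kl_div(attn_i2t_s.clamp_min(1e-8).log(), attn_i2t_teacher, reduction="batchmean")
    loss_t2i = F.kl_div(attn_t2i_s.clamp_min(1e-8).log(), attn_t2i_teacher, reduction="batchmean")
    return loss_i2t + loss_t2i

# Toy shapes: batch 2, 50 image patches, 16 text tokens, hidden size 64.
img_s, txt_s = torch.randn(2, 50, 64), torch.randn(2, 16, 64)
attn_i2t_t = torch.rand(2, 50, 16).softmax(dim=-1)   # stand-in for the fusion teacher's attention
attn_t2i_t = torch.rand(2, 16, 50).softmax(dim=-1)
print(cross_modal_attn_distill_loss(img_s, txt_s, attn_i2t_t, attn_t2i_t))
```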
Most recent semantic segmentation methods adopt a fully-convolutional network (FCN) with an encoder-decoder architecture. The encoder progressively reduces the spatial resolution and learns more abstract/semantic visual concepts with larger receptive fields. Since context modeling is critical for segmentation, the latest efforts have been focused on increasing the receptive field, through either dilated/atrous convolutions or inserting attention modules. However, the encoder-decoder based FCN architecture remains unchanged. In this paper, we aim to provide an alternative perspective by treating semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer (i.e., without convolution and resolution reduction) to encode an image as a sequence of patches. With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR). Extensive experiments show that SETR achieves new state of the art on ADE20K (50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on Cityscapes. Particularly, we achieve the first position in the highly competitive ADE20K test server leaderboard on the day of submission.
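The described recipe reads directly as code: a linear patch embedding, a pure transformer encoder with no convolution or resolution reduction, and a simple decoder that reshapes the tokens and bilinearly upsamples to per-pixel logits. The toy `TinySETR` module below is an illustrative sketch of this idea, with made-up sizes, not the released SETR model.

```python
# A compact sketch of the SETR idea: image as a sequence of patches, pure
# transformer encoder, and a "naive" reshape + upsample decoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySETR(nn.Module):
    def __init__(self, img_size=256, patch=16, dim=256, depth=4, heads=8, n_classes=19):
        super().__init__()
        self.patch, self.grid = patch, img_size // patch
        self.embed = nn.Linear(3 * patch * patch, dim)                  # linear patch projection
        self.pos = nn.Parameter(torch.zeros(1, self.grid ** 2, dim))    # learned position embedding
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)              # no convolution, no downsampling
        self.classify = nn.Linear(dim, n_classes)

    def forward(self, x):                                               # x: (B, 3, H, W)
        B, C, H, W = x.shape
        patches = x.unfold(2, self.patch, self.patch).unfold(3, self.patch, self.patch)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, self.grid ** 2, -1)
        tokens = self.encoder(self.embed(patches) + self.pos)           # global context at every layer
        logits = self.classify(tokens).transpose(1, 2)                  # (B, n_classes, L)
        logits = logits.reshape(B, -1, self.grid, self.grid)
        return F.interpolate(logits, size=(H, W), mode="bilinear", align_corners=False)

print(TinySETR()(torch.randn(1, 3, 256, 256)).shape)   # torch.Size([1, 19, 256, 256])
```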
Realizing general inverse design could greatly accelerate the discovery of new materials with user-defined properties. However, state-of-the-art generative models tend to be limited to a specific composition or crystal structure. Here, we present a framework capable of general inverse design (not limited to a given set of elements or crystal structures), featuring a generalized invertible representation that encodes crystals in both real and reciprocal space, together with a property-structured latent space from a variational autoencoder (VAE). In three design cases, the framework generates 142 new crystals with user-defined formation energy, band gap, thermoelectric (TE) power factor, or combinations thereof. These generated crystals, absent from the training database, are validated by first-principles calculations. The success rates (number of first-principles-validated, target-satisfying crystals / number of designed crystals) range from 7.1% to 38.9%. These results represent an important step toward property-driven general inverse design with generative models, although practical challenges remain when coupling them with experimental synthesis.
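As a rough picture of property-driven generation from a VAE latent space, the sketch below nudges a random latent code by gradient descent until a latent property predictor hits a user-defined target, then decodes it. The placeholder `decoder`, `property_head`, and dimensions are assumptions and do not reproduce the paper's invertible crystal representation or its actual sampling scheme.

```python
# An illustrative (hypothetical) property-targeted sampling loop over a VAE latent space.
import torch
import torch.nn as nn

latent_dim, repr_dim = 32, 128
decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, repr_dim))   # placeholder
property_head = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, 1))      # placeholder

def inverse_design(target_property: float, steps: int = 200, lr: float = 0.05):
    z = torch.randn(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = (property_head(z) - target_property).pow(2).mean()   # e.g., match a target band gap
        loss.backward()
        opt.step()
    return decoder(z.detach())   # decoded representation, to be inverted back into a crystal structure

candidate = inverse_design(target_property=1.5)
print(candidate.shape)           # torch.Size([1, 128])
```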
The success of deep learning heavily relies on large-scale data with comprehensive labels, which is more expensive and time-consuming to fetch in 3D compared to 2D images or natural languages. This promotes the potential of utilizing models pretrained with data modalities other than 3D as teachers for cross-modal knowledge transferring. In this paper, we revisit masked modeling in a unified fashion of knowledge distillation, and we show that foundational Transformers pretrained with 2D images or natural languages can help self-supervised 3D representation learning through training Autoencoders as Cross-Modal Teachers (ACT). The pretrained Transformers are transferred as cross-modal 3D teachers using discrete variational autoencoding self-supervision, during which the Transformers are frozen with prompt tuning for better knowledge inheritance. The latent features encoded by the 3D teachers are used as the target of masked point modeling, wherein the dark knowledge is distilled to the 3D Transformer students as foundational geometry understanding. Our ACT pretrained 3D learner achieves state-of-the-art generalization capacity across various downstream benchmarks, e.g., 88.21% overall accuracy on ScanObjectNN. Codes will be released at https://github.com/RunpeiDong/ACT.
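A hedged sketch of the masked-feature distillation objective described above: the frozen cross-modal teacher provides latent features for all point tokens, and the 3D student is trained to reconstruct them at the masked positions. The negative-cosine loss, the shapes, and the `masked_distill_loss` helper are illustrative assumptions rather than the released ACT code.

```python
# Illustrative masked point modeling loss against a frozen cross-modal teacher.
import torch
import torch.nn.functional as F

def masked_distill_loss(student_pred, teacher_feat, mask):
    """student_pred, teacher_feat: (B, N, D); mask: (B, N) bool, True = masked token."""
    s = F.normalize(student_pred[mask], dim=-1)
    t = F.normalize(teacher_feat[mask], dim=-1).detach()   # teacher is frozen: no gradients flow to it
    return 1.0 - (s * t).sum(dim=-1).mean()                # minimizing maximizes cosine similarity

B, N, D = 2, 64, 384
teacher_feat = torch.randn(B, N, D)                        # latent features from the frozen 2D/language teacher
student_pred = torch.randn(B, N, D, requires_grad=True)    # 3D Transformer student predictions
mask = torch.rand(B, N) < 0.6                              # 60% of point tokens are masked
print(masked_distill_loss(student_pred, teacher_feat, mask))
```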
Needle-in-a-haystack problems arise in a wide range of applications, including rare disease prediction, ecological resource management, fraud detection, and material property optimization. A needle-in-a-haystack problem occurs when there is an extreme imbalance of optimum conditions relative to the size of the dataset. For example, only 0.82% of the 146k total materials in the open-access Materials Project database have a negative Poisson's ratio. However, current state-of-the-art optimization algorithms are not designed to find solutions to these challenging multi-dimensional needle-in-a-haystack problems, resulting in slow convergence to the global optimum or pigeonholing into a local minimum. In this paper, we present a Zooming Memory-Based Initialization algorithm, entitled ZoMBI, which builds on conventional Bayesian optimization principles to quickly and efficiently optimize needle-in-a-haystack problems in both less time and fewer experiments by addressing the common convergence and pigeonholing issues. ZoMBI actively extracts knowledge from the best-performing evaluated experiments so far to iteratively zoom in the sampling search bounds toward the global optimum "needle", and then purges the memory of low-performing historical experiments to accelerate computation. We validate the algorithm's performance on two real-world, 5-dimensional needle-in-a-haystack material property optimization datasets: the discovery of auxetic (negative Poisson's ratio) materials and the discovery of materials with a high thermoelectric figure of merit. The ZoMBI algorithm demonstrates a 400x computation-time speed-up compared to traditional Bayesian optimization and efficiently discovers materials in under 100 experiments that are up to 3x more highly optimized than those discovered by current state-of-the-art algorithms.
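A simplified sketch of the zooming memory-based idea for a minimization problem: every few evaluations, shrink the search bounds to the bounding box of the k best points found so far and keep only those points as the surrogate's memory. The acquisition rule, the hyperparameters, the `zombi_like_minimize` helper, and the toy 5-D "needle" below are illustrative choices, not the paper's settings.

```python
# A toy zooming, memory-pruning Bayesian-optimization loop (minimization).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def zombi_like_minimize(f, bounds, n_init=10, n_iter=40, zoom_every=10, k_best=5, seed=0):
    rng = np.random.default_rng(seed)
    dim = bounds.shape[0]
    lo, hi = bounds[:, 0].copy(), bounds[:, 1].copy()
    X = rng.uniform(lo, hi, size=(n_init, dim))
    y = np.array([f(x) for x in X])
    for i in range(n_iter):
        gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
        cand = rng.uniform(lo, hi, size=(512, dim))          # random candidates inside the current bounds
        mu, sd = gp.predict(cand, return_std=True)
        x_next = cand[np.argmin(mu - 2.0 * sd)]              # lower-confidence-bound acquisition
        X, y = np.vstack([X, x_next]), np.append(y, f(x_next))
        if (i + 1) % zoom_every == 0:                        # zoom: keep only the k best points in memory
            best = np.argsort(y)[:k_best]
            X, y = X[best], y[best]
            lo, hi = X.min(axis=0), X.max(axis=0)            # shrink bounds around the best points
    return X[np.argmin(y)], y.min()

# Toy needle: a narrow 5-D Gaussian well hidden in [0, 1]^5.
needle = np.full(5, 0.73)
f = lambda x: -np.exp(-200 * np.sum((x - needle) ** 2))
print(zombi_like_minimize(f, np.tile([0.0, 1.0], (5, 1))))
```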
Sample assignment plays a prominent role in modern object detection methods. However, most existing approaches rely on manual designs to assign positive/negative samples, which do not explicitly establish the relationship between sample assignment and object detection performance. In this work, we propose a novel dynamic sample assignment scheme based on hyperparameter search. We first define the number of positive samples assigned to each ground truth as a hyperparameter and employ a surrogate optimization algorithm to derive the optimal choice. We then design a dynamic sample assignment procedure that selects the optimal number of positives at each training iteration. Experiments show that the resulting HPS-Det brings improved performance over different object detection baselines. Moreover, we analyze the hyperparameter reusability when transferring across datasets and backbones for object detection, which demonstrates the superiority and versatility of our method.
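To make the searched quantity concrete, the sketch below assigns each ground-truth box its top-k anchors by IoU as positives, with k being the hyperparameter that the surrogate search and the dynamic procedure would tune. The IoU ranking and the `assign_topk_positives` helper are illustrative stand-ins for the detector's actual matching cost.

```python
# Illustrative top-k positive assignment, where k is the searched hyperparameter.
import numpy as np

def iou(anchors, gt):
    """anchors: (N, 4), gt: (4,) in (x1, y1, x2, y2)."""
    x1 = np.maximum(anchors[:, 0], gt[0]); y1 = np.maximum(anchors[:, 1], gt[1])
    x2 = np.minimum(anchors[:, 2], gt[2]); y2 = np.minimum(anchors[:, 3], gt[3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / (area_a + area_g - inter + 1e-9)

def assign_topk_positives(anchors, gt_boxes, k):
    """Return a (num_gt, k) index array of positive anchors for each ground truth."""
    return np.stack([np.argsort(-iou(anchors, gt))[:k] for gt in gt_boxes])

anchors = np.random.rand(100, 2) * 80
anchors = np.hstack([anchors, anchors + 20])          # 100 random 20x20 anchors
gt_boxes = np.array([[10, 10, 35, 35], [50, 40, 80, 70]], dtype=float)
print(assign_topk_positives(anchors, gt_boxes, k=4))  # 4 positives per ground truth
```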
Reference-based line-art colorization is a challenging task in computer vision. The color, texture, and shading are rendered based on an abstract sketch, which relies heavily on precise long-range dependency modeling between the sketch and the reference. Popular techniques for bridging the cross-modal information and modeling the long-range dependency employ the attention mechanism. However, in the context of reference-based line-art colorization, several techniques intensify the existing training difficulty of attention, for instance, the self-supervised training protocol and GAN-based losses. To understand the training instability, we examine the gradient flow of attention and observe gradient conflict among the attention branches. This phenomenon motivates us to alleviate the gradient issue by preserving the dominant gradient branch while removing the conflicting one. We propose a novel attention mechanism trained with this strategy, Stop-Gradient Attention (SGA), which outperforms the attention baseline by a large margin with better training stability. Compared with state-of-the-art modules in line-art colorization, our approach demonstrates significant improvements in Fréchet Inception Distance (FID, up to 27.21%) and Structural Similarity Index Measure (SSIM, up to 25.67%) on several benchmarks. The code of SGA is available at https://github.com/kunkun0w0/sga.
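A hedged sketch of the stop-gradient idea in a cross-attention block: the attention map is computed from detached sketch and reference features, so gradients reach the input features only through the value branch, removing the conflicting branch from backpropagation. Which branch SGA actually detaches, and the surrounding architecture, follow the released code at https://github.com/kunkun0w0/sga; the `StopGradientCrossAttention` module below is an illustration only.

```python
# Illustrative cross-attention with a stop-gradient on the similarity branch.
import torch
import torch.nn as nn

class StopGradientCrossAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # queries from sketch features
        self.k = nn.Linear(dim, dim)   # keys from reference features
        self.v = nn.Linear(dim, dim)   # values from reference features
        self.scale = dim ** -0.5

    def forward(self, sketch_feat, ref_feat):
        # Similarity branch sees detached inputs: no gradient reaches the features through it.
        attn = (self.q(sketch_feat.detach()) @ self.k(ref_feat.detach()).transpose(-1, -2)) * self.scale
        attn = attn.softmax(dim=-1)
        return attn @ self.v(ref_feat)   # gradients reach ref_feat only via the value branch

sketch = torch.randn(2, 196, 256, requires_grad=True)
ref = torch.randn(2, 196, 256, requires_grad=True)
out = StopGradientCrossAttention(256)(sketch, ref)
out.sum().backward()
print(sketch.grad is None, ref.grad is not None)   # True True: only the value branch carries gradients
```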