Recently, the success of pre-training in the text domain has been fully extended to vision, audio, and cross-modal scenarios. The pre-training models proposed for different modalities show a rising trend of homogeneity in their model structures, which opens up the opportunity to implement different pre-training models within a uniform framework. In this paper, we present TencentPretrain, a toolkit supporting pre-training models of different modalities. The core feature of TencentPretrain is its modular design. The toolkit uniformly divides pre-training models into five components: embedding, encoder, target embedding, decoder, and target. Since almost all common modules are provided for each component, users can choose the desired modules from different components to build a complete pre-training model. The modular design enables users to efficiently reproduce existing pre-training models or build brand-new ones. We test the toolkit on text, vision, and audio benchmarks and show that it can match the performance of the original implementations.
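To make the five-component design concrete, here is a minimal sketch of how such modules might be composed; the class and argument names are illustrative, not TencentPretrain's actual API.

```python
# Hypothetical sketch of the five-component composition described above;
# names are illustrative, not TencentPretrain's actual API.
import torch.nn as nn

class PretrainModel(nn.Module):
    def __init__(self, embedding, encoder, tgt_embedding, decoder, target):
        super().__init__()
        self.embedding = embedding          # e.g. word / patch / speech embedding
        self.encoder = encoder              # e.g. a Transformer encoder
        self.tgt_embedding = tgt_embedding  # optional, for encoder-decoder models
        self.decoder = decoder              # optional, e.g. a Transformer decoder
        self.target = target                # pre-training objective, e.g. an MLM head

    def forward(self, src, tgt=None, labels=None):
        hidden = self.encoder(self.embedding(src))
        if self.decoder is not None:
            hidden = self.decoder(self.tgt_embedding(tgt), memory=hidden)
        return self.target(hidden, labels)  # returns the pre-training loss
```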
This paper focuses on analyzing and improving the commonsense ability of recent popular vision-language (VL) models. Despite their great success, we observe that existing VL models still lack commonsense knowledge and reasoning ability (e.g., "Lemons are sour"), which is a vital component of artificial general intelligence. Through our analysis, we find that one important reason is that existing large-scale VL datasets do not contain much commonsense knowledge, which motivates us to improve the commonsense of VL models from the data perspective. Rather than collecting a new VL training dataset, we propose a more scalable strategy: "Data Augmentation with kNowledge graph linearization for CommonsensE capability" (DANCE). It can be viewed as a data augmentation technique that injects commonsense knowledge into existing VL datasets on the fly during training. More specifically, we leverage a commonsense knowledge graph (e.g., ConceptNet) and create variants of the text descriptions in VL datasets via bidirectional sub-graph sequentialization. For better commonsense evaluation, we further propose the first retrieval-based commonsense diagnostic benchmark. By conducting extensive experiments on representative VL models, we demonstrate that our DANCE technique significantly improves commonsense ability while maintaining performance on vanilla retrieval tasks. The code and data are available at https://github.com/pleaseconnectwifi/DANCE
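To illustrate the linearization step, the hypothetical sketch below turns ConceptNet-style triples into forward and reverse sentences; the templates are placeholders, not the paper's exact ones.

```python
# A minimal sketch of bidirectional triple linearization in the spirit of DANCE;
# the templates below are illustrative placeholders.
TEMPLATES = {
    "HasProperty": ("{h} is {t}", "{t} is a property of {h}"),
    "IsA":         ("{h} is a kind of {t}", "{t} includes {h}"),
    "UsedFor":     ("{h} is used for {t}", "{t} is what {h} is used for"),
}

def linearize(head: str, relation: str, tail: str) -> list[str]:
    """Turn a (head, relation, tail) edge into forward and reverse sentences."""
    fwd, rev = TEMPLATES[relation]
    return [fwd.format(h=head, t=tail), rev.format(h=head, t=tail)]

# e.g. linearize("lemon", "HasProperty", "sour")
# -> ["lemon is sour", "sour is a property of lemon"]
```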
The ability to associate touch with sight is essential for tasks that require physically interacting with objects in the world. We propose a dataset with paired visual and tactile data called Touch and Go, in which human data collectors probe objects in natural environments using tactile sensors, while simultaneously recording egocentric video. In contrast to previous efforts, which have largely been confined to lab settings or simulated environments, our dataset spans a large number of "in the wild" objects and scenes. To demonstrate our dataset's effectiveness, we successfully apply it to a variety of tasks: 1) self-supervised visuo-tactile feature learning, 2) tactile-driven image stylization, i.e., making the visual appearance of an object more consistent with a given tactile signal, and 3) predicting future frames of a tactile signal from visuo-tactile inputs.
Image super-resolution is a common task on mobile and IoT devices, where one often needs to upscale and enhance low-resolution images and video frames. While numerous solutions have been proposed for this problem in the past, they are usually not compatible with low-power mobile NPUs, which have many computational and memory constraints. In this Mobile AI challenge, we address this problem and task the participants with designing an efficient quantized image super-resolution solution that can demonstrate real-time performance on mobile NPUs. The participants were provided with the DIV2K dataset and trained INT8 models to perform high-quality 3X image upscaling. The runtime of all models was evaluated on the Synaptics VS680 Smart Home board with a dedicated edge NPU capable of accelerating quantized neural networks. All proposed solutions are fully compatible with the above NPU, demonstrating rates of up to 60 FPS when reconstructing Full HD resolution images. A detailed description of all models developed in the challenge is provided in this paper.
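For context, a typical post-training INT8 quantization pipeline with TensorFlow Lite looks roughly like the following; the model path and calibration data are placeholders, and individual challenge entries may have used different tooling.

```python
# A sketch of post-training INT8 quantization with TensorFlow Lite;
# the saved-model path and calibration patches are placeholders.
import tensorflow as tf

def rep_dataset():
    # Yield a few low-resolution calibration inputs (placeholder shapes:
    # 640x360 inputs for a 3X upscale to Full HD).
    for _ in range(100):
        yield [tf.random.uniform([1, 360, 640, 3], dtype=tf.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("sr_model/")  # placeholder path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = rep_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8   # fully integer I/O for the edge NPU
converter.inference_output_type = tf.int8
open("sr_model_int8.tflite", "wb").write(converter.convert())
```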
The ultimate goal of continuous sign language recognition (CSLR) is to facilitate communication between hearing-impaired people and others, which requires a certain degree of real-time performance and deployability from the model. However, previous research on CSLR has paid little attention to real-time performance and deployability. To improve both, this paper proposes a zero-parameter, zero-computation temporal superposition crossover module (TSCM) and combines it with 2D convolution to form a "TSCM+2D convolution" hybrid convolution, which gives 2D convolution strong spatial-temporal modelling capability with zero added parameters and lower deployment cost than other spatial-temporal convolutions. The overall TSCM-based CSLR model is built on the improved ResBlockT network proposed in this paper: the "TSCM+2D convolution" hybrid convolution is applied to the ResBlock of the ResNet network to form the new ResBlockT, and random gradient stop and multi-level CTC loss are introduced to train the model, which reduces the final recognition WER while reducing training memory usage and extends the ResNet network from image classification to video recognition. In addition, this study is the first in CSLR to use only 2D convolution to extract temporal-spatial features of sign language video for end-to-end recognition. Experiments on two large-scale continuous sign language datasets demonstrate the effectiveness of the proposed method, achieving highly competitive results.
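Although the paper's exact crossover pattern is not reproduced here, the sketch below shows the general form of a zero-parameter temporal mixing operation that lets a plain 2D convolution see neighbouring frames.

```python
# A zero-parameter temporal mixing op in the spirit of TSCM; the actual
# superposition/crossover pattern in the paper may differ from this sketch.
import torch

def temporal_crossover(x: torch.Tensor, fold: int = 8) -> torch.Tensor:
    """x: (N, T, C, H, W). Swap a channel fold between adjacent frames so the
    following 2D convolution sees information from neighbouring time steps."""
    n, t, c, h, w = x.shape
    k = c // fold
    out = x.clone()
    out[:, 1:, :k] = x[:, :-1, :k]          # carry features forward in time
    out[:, :-1, k:2 * k] = x[:, 1:, k:2 * k]  # carry features backward in time
    return out  # reshape to (N*T, C, H, W) before applying the 2D conv
```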
Time series data arise in various applications, such as intelligent transportation and environmental monitoring. One of the fundamental problems in time series analysis is time series forecasting. Despite the recent success of deep time series forecasting methods, they still require sufficient observations of historical values to make accurate predictions. In other words, the ratio of the output length (or forecasting horizon) to the sum of the input and output lengths must be sufficiently low (e.g., 0.3). As this ratio increases (e.g., to 0.8), the uncertainty of the forecast grows significantly. In this paper, we show both theoretically and empirically that retrieving related time series as references can effectively reduce this uncertainty. In our theoretical analysis, we first quantify the uncertainty and show its connection to the mean squared error (MSE). We then prove that a model with references is easier to learn than one without, since the retrieved references can lower the uncertainty. To empirically demonstrate the effectiveness of retrieval-based time series forecasting, we introduce a simple yet effective two-stage method, called ReTime, consisting of relational retrieval and content synthesis. We also show that it can be easily adapted to spatial-temporal time series and time series imputation settings. Finally, we evaluate ReTime on real-world datasets to demonstrate its effectiveness.
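A minimal sketch of the two-stage retrieve-then-synthesize idea follows; the distance measure and the placeholder synthesis step are illustrative only, since the paper's content synthesis is a learned network.

```python
# Illustrative two-stage retrieve-then-synthesize sketch; not the paper's
# exact design (its synthesis stage is learned, not an average).
import numpy as np

def retrieve_references(query: np.ndarray, pool: np.ndarray, k: int = 3) -> np.ndarray:
    """Stage 1: relational retrieval -- find the k pool series whose observed
    part is closest to the query (Euclidean distance here for brevity)."""
    d = np.linalg.norm(pool[:, : len(query)] - query, axis=1)
    return pool[np.argsort(d)[:k]]

def synthesize(query: np.ndarray, references: np.ndarray, horizon: int) -> np.ndarray:
    """Stage 2: content synthesis -- as a placeholder, average the
    references' future segments over the forecasting horizon."""
    return references[:, len(query): len(query) + horizon].mean(axis=0)
```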
Recently, score-based diffusion models have shown satisfactory performance in MRI reconstruction. Most of these methods require a large amount of fully sampled MRI data as a training set, which is sometimes difficult to obtain in practice. This paper proposes a fully-sampled-data-free score-based diffusion model for MRI reconstruction, which learns the fully sampled MR image prior from undersampled data in a self-supervised manner. Specifically, we first infer the fully sampled MR image distribution from undersampled data via Bayesian deep learning, then perturb the data distribution and approximate the gradient of its log probability density by training a score function. Using the learned score function as a prior, we can reconstruct the MR image by performing conditional Langevin Markov chain Monte Carlo (MCMC) sampling. Experiments on public datasets show that the proposed method outperforms existing self-supervised MRI reconstruction methods and achieves performance comparable to conventional score-based diffusion methods trained on fully sampled data.
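For reference, a generic conditional Langevin sampling loop of the kind the abstract describes is sketched below; the learned score (the prior) and the data-consistency gradient from the undersampled measurements are assumed to be given.

```python
# Generic conditional Langevin sampling sketch; `score_fn` approximates the
# prior score grad_x log p(x) and `data_grad` the likelihood gradient
# grad_x log p(y|x) from the undersampled measurements (both assumed given).
import torch

def langevin_reconstruct(x, score_fn, data_grad, steps=200, eps=1e-5):
    for _ in range(steps):
        z = torch.randn_like(x)
        grad = score_fn(x) + data_grad(x)          # prior score + data consistency
        x = x + 0.5 * eps * grad + (eps ** 0.5) * z  # Langevin update
    return x
```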
Deep neural networks can capture the complex interaction history between queries and documents thanks to their many complicated nonlinear units, allowing them to provide correct search recommendations. In real-world scenarios, however, service providers often face more complex obstacles, such as deployment cost constraints and fairness requirements. Knowledge distillation, which transfers the knowledge of a well-trained complex model (the teacher) to a simple model (the student), has been proposed to alleviate the former concern, but the best current distillation methods focus only on making the student model imitate the teacher model's predictions. To better facilitate the application of deep models, we propose a fair information retrieval framework based on knowledge distillation. This framework can improve the exposure-based fairness of models while considerably decreasing model size. Our extensive experiments on three huge datasets show that the proposed framework can reduce the model size to a minimum of 1% of its original size while maintaining its black-box state. It also improves fairness performance by 15%~46% while keeping a high level of recommendation effectiveness.
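For background, a standard distillation loss of the kind such frameworks build on is sketched below; the exposure-fairness terms of the proposed framework are not shown here.

```python
# Standard knowledge-distillation loss (Hinton-style); the fairness-aware
# terms of the proposed framework are not part of this sketch.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                     # match softened teacher predictions
    hard = F.cross_entropy(student_logits, labels)  # fit the ground-truth labels
    return alpha * soft + (1 - alpha) * hard
```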
Network alignment, or the task of finding corresponding nodes in different networks, is an important problem in many application domains. We propose CAPER, a multilevel alignment framework that coarsens the input graphs, aligns the coarsened graphs, projects the alignment solutions to finer levels, and refines the alignment solutions. We show that CAPER can improve upon many different existing network alignment algorithms by enforcing alignment consistency across multiple graph resolutions: nodes matched at finer levels should also be matched at coarser levels. CAPER can also accelerate slower network alignment methods by allowing them to run on smaller coarsened versions of the input graphs, at the modest cost of linear-time coarsening and refinement steps. Experiments show that CAPER can improve upon different network alignment methods by a large margin, yielding on average a 33% improvement in accuracy and/or an order-of-magnitude speedup in runtime.
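The multilevel loop can be sketched as follows; the helper functions stand in for whatever coarsening method, base aligner, and refinement procedure CAPER is instantiated with.

```python
# Schematic coarsen-align-project-refine loop; `coarsen`, `base_align`,
# `project`, and `refine` are placeholders for the concrete procedures
# CAPER is instantiated with.
def multilevel_align(G1, G2, coarsen, base_align, project, refine, levels=3):
    # Coarsen both graphs `levels` times, remembering every level.
    pyramid = [(G1, G2)]
    for _ in range(levels):
        G1, G2 = coarsen(G1), coarsen(G2)
        pyramid.append((G1, G2))
    # Align at the coarsest level, then project and refine back down.
    alignment = base_align(*pyramid[-1])
    for fine_G1, fine_G2 in reversed(pyramid[:-1]):
        alignment = refine(project(alignment, fine_G1, fine_G2), fine_G1, fine_G2)
    return alignment
```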
Contrastive learning is an effective unsupervised method in graph representation learning, and its key component lies in the construction of positive and negative samples. Previous methods usually utilize the proximity of nodes in the graph as the principle. Recently, data-augmentation-based contrastive learning methods have advanced to show great power in the visual domain, and some works have extended this approach from images to graphs. However, unlike data augmentation on images, data augmentation on graphs is far less intuitive and much harder to provide high-quality contrastive samples, which leaves much room for improvement. In this work, by introducing an adversarial graph view for data augmentation, we propose a simple but effective method, Adversarial Graph Contrastive Learning (ARIEL), to extract informative contrastive samples within reasonable constraints. We develop a new technique called information regularization for stable training and use subgraph sampling for scalability. We generalize the method from node-level contrastive learning to the graph level by treating each graph instance as a super-node. ARIEL consistently outperforms current graph contrastive learning methods on both node-level and graph-level classification tasks on real-world datasets. We further demonstrate that ARIEL is more robust in the face of adversarial attacks.
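A sketch of the adversarial-view idea follows: perturb one view in the direction that increases the contrastive loss, subject to a small budget. The encoder and loss below are placeholders, and ARIEL's full method additionally constrains edge perturbations and applies information regularization.

```python
# PGD-style single-step sketch of an adversarial feature view; `encoder` and
# `contrastive_loss` are placeholders, and ARIEL's full method also perturbs
# edges and adds information regularization, which are not shown here.
import torch

def adversarial_view(features, adj, encoder, contrastive_loss, eps=0.01):
    delta = torch.zeros_like(features, requires_grad=True)
    loss = contrastive_loss(encoder(features, adj),
                            encoder(features + delta, adj))
    loss.backward()
    # One ascent step on the view, clipped to a small budget.
    return features + eps * delta.grad.sign()
```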