It has been observed in practice that applying pruning-at-initialization methods to neural networks and training the sparsified networks can not only retain the test performance of the original dense models, but sometimes even slightly improve generalization. A theoretical understanding of these experimental observations has yet to be developed. This work makes the first attempt to study how different pruning fractions affect a model's gradient descent dynamics and generalization. Specifically, it considers a classification task for overparameterized two-layer neural networks, where the network is randomly pruned at initialization with different pruning rates. It is shown that as long as the pruning fraction is below a certain threshold, gradient descent can drive the training loss toward zero and the network exhibits good generalization performance. More surprisingly, the generalization bound improves as the pruning fraction grows. To complement this positive result, this work further shows a negative result: there exists a large pruning fraction such that, while gradient descent is still able to drive the training loss toward zero (by memorizing noise), the generalization performance is no better than random guessing. This further suggests that pruning can change the feature learning process, which leads to the performance drop of the pruned neural network. To our knowledge, this is the \textbf{first} generalization result for pruned neural networks, suggesting that pruning can improve a neural network's generalization.
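To make the setting concrete, here is a minimal sketch of random pruning at initialization for a two-layer ReLU network, assuming (this is our assumption, not necessarily the paper's exact model) that each first-layer weight is kept independently with probability 1 − p, the second layer is fixed, and training uses full-batch gradient descent on the logistic loss. The function names `init_pruned_two_layer`, `forward` and `gd_step` are hypothetical. The key point is that the gradient is multiplied by the mask, so pruned entries stay at zero throughout training.

```python
import numpy as np


def init_pruned_two_layer(d, m, prune_fraction, seed=0):
    """Random first-layer weights; each entry is kept independently w.p. 1 - prune_fraction."""
    if not 0.0 <= prune_fraction < 1.0:
        raise ValueError("prune_fraction must be in [0, 1)")
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 1.0 / np.sqrt(d), size=(m, d))
    mask = (rng.random((m, d)) >= prune_fraction).astype(W.dtype)
    a = rng.choice([-1.0, 1.0], size=m) / m  # second layer, fixed during training (assumption)
    return W * mask, mask, a


def forward(W, a, X):
    # f(x) = sum_j a_j * relu(<w_j, x>) for each row x of X
    return np.maximum(X @ W.T, 0.0) @ a


def gd_step(W, mask, a, X, y, lr=0.1):
    """One full-batch gradient step on the logistic loss, restricted to unpruned weights."""
    margins = y * forward(W, a, X)
    # d/df log(1 + exp(-y f)) = -y / (1 + exp(y f)), computed stably via logaddexp
    g = -y * np.exp(-np.logaddexp(0.0, margins))               # shape (n,)
    act = (X @ W.T > 0).astype(W.dtype)                        # ReLU derivative, shape (n, m)
    grad_W = ((g[:, None] * act) * a[None, :]).T @ X / len(y)  # shape (m, d)
    return W - lr * grad_W * mask                              # pruned entries never move
```

With `prune_fraction=0.7`, roughly 30% of the first-layer entries survive; repeated calls to `gd_step` only ever update those surviving entries.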
Open Information Extraction (OIE) methods extract a large number of OIE triples (noun phrase, relation phrase, noun phrase) from text, which compose large Open Knowledge Bases (OKBs). However, noun phrases (NPs) and relation phrases (RPs) in OKBs are not canonicalized and often appear in different paraphrased textual variants, which leads to redundant and ambiguous facts. To address this problem, there are two related tasks: OKB canonicalization (i.e., converting NPs and RPs to a canonicalized form) and OKB linking (i.e., linking NPs and RPs with their corresponding entities and relations in a curated Knowledge Base, e.g., DBpedia). These two tasks are tightly coupled, and one task can benefit significantly from the other. However, they have so far been studied in isolation. In this paper, we explore the task of joint OKB canonicalization and linking for the first time, and propose a novel framework, JOCL, based on a factor graph model that makes the two tasks reinforce each other. JOCL is flexible enough to combine different signals from both tasks, and can be extended to incorporate new signals. A thorough experimental study over two large-scale OIE triple data sets shows that our framework outperforms all the baseline methods on OKB canonicalization in terms of average F1 and on OKB linking in terms of accuracy.
Mainstream image captioning models are usually two-stage captioners, i.e., they compute object features with a pre-trained detector and feed them into a language model to generate text descriptions. However, this introduces a task-based information gap that degrades performance, since object features learned for the detection task are a suboptimal representation and cannot provide all the information needed for subsequent text generation. Besides, object features are usually taken from the last layer, which loses the local details of input images. In this paper, we propose a novel One-Stage Image Captioner (OSIC) with dynamic multi-sight learning, which directly transforms an input image into descriptive sentences in one stage. As a result, the task-based information gap can be greatly reduced. To obtain rich features, we use the Swin Transformer to compute multi-level features, and then feed them into a novel dynamic multi-sight embedding module to exploit both the global structure and the local texture of input images. To enhance the encoder's global modeling for captioning, we propose a new dual-dimensional refining module that non-locally models the interactions among the embedded features. Finally, OSIC can obtain rich and useful information to improve image captioning. Extensive comparisons on the benchmark MS-COCO dataset verify the superior performance of our method.
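Temporal action localization aims to predict the boundary and category of each action instance in untrimmed long videos. Most previous methods, whether anchor-based or proposal-based, ignore global-local context interactions across the entire video sequence. Moreover, their multi-stage designs cannot generate action boundaries and categories directly. To address these issues, this paper proposes a novel end-to-end model called the Adaptive Perception transformer (AdaPerFormer for short). Specifically, AdaPerFormer explores a dual-branch multi-head self-attention mechanism. One branch takes care of global perception attention, which models the entire video sequence and aggregates globally relevant context, while the other branch concentrates on a local convolutional shift that aggregates intra-frame and inter-frame information through our bidirectional shift operation. The end-to-end design yields the boundaries and categories of video actions without extra steps. Extensive experiments together with ablation studies are provided to reveal the effectiveness of our design. Our method achieves state-of-the-art accuracy on the THUMOS14 dataset (in terms of mAP@0.5, 42.6% mAP@0.7, and 62.7% mAP@Avg) and obtains competitive performance on the ActivityNet-1.3 dataset with an average mAP of 36.1%. Code and models are available at https://github.com/soupero/adaperformer.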
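The mAP@tIoU figures above count a predicted segment as correct only when its temporal IoU with a not-yet-matched ground-truth segment reaches the threshold. Below is a minimal single-class sketch of that metric; the greedy matching by descending score and the non-interpolated AP are simplifying assumptions, and the official THUMOS14 and ActivityNet evaluation toolkits differ in details. Averaging over tIoU thresholds 0.3 to 0.7 follows the convention commonly used for THUMOS14 "mAP@Avg"; the helper names are hypothetical.

```python
def temporal_iou(a, b):
    """IoU of two 1-D segments given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(preds, gts, thr):
    """Non-interpolated AP for one class at one tIoU threshold.

    preds: list of (start, end, score); gts: list of (start, end).
    Predictions are processed by descending score and greedily matched
    to the best-overlapping unmatched ground truth.
    """
    preds = sorted(preds, key=lambda p: p[2], reverse=True)
    matched = [False] * len(gts)
    tp, ap = 0, 0.0
    for rank, (s, e, _) in enumerate(preds, start=1):
        best_iou, best_j = 0.0, -1
        for j, g in enumerate(gts):
            if not matched[j]:
                iou = temporal_iou((s, e), g)
                if iou > best_iou:
                    best_iou, best_j = iou, j
        if best_j >= 0 and best_iou >= thr:
            matched[best_j] = True
            tp += 1
            ap += tp / rank  # precision at this true-positive rank
    return ap / len(gts) if gts else 0.0


def ap_averaged_over_tiou(preds, gts, thresholds=(0.3, 0.4, 0.5, 0.6, 0.7)):
    """Average the single-class AP over several tIoU thresholds."""
    return sum(average_precision(preds, gts, t) for t in thresholds) / len(thresholds)
```

A full mAP would additionally average this quantity over all action classes.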
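Video object detection (VID) is challenging because of the high variation of object appearance as well as the diverse deterioration in some frames. On the positive side, detection in a given frame of a video, compared with a still image, can draw support from other frames. Hence, how to aggregate features across different frames is pivotal to the VID problem. Most existing aggregation algorithms are customized for two-stage detectors. However, detectors in this category are usually computationally expensive due to their two-stage nature. This work proposes a simple yet effective strategy to address the above concerns, which adds marginal overhead while yielding significant gains in accuracy. Concretely, unlike the traditional two-stage pipeline, we advocate placing region-level selection after one-stage detection to avoid processing massive numbers of low-quality candidates. Besides, a new module is constructed to evaluate the relationships between a target frame and its reference frames and to guide the aggregation. Extensive experiments and ablation studies are conducted to verify the efficacy of our design and reveal its superiority over other state-of-the-art VID approaches. Our YOLOX-based model achieves promising performance (e.g., 87.5% AP50 at 30 FPS on the ImageNet VID dataset on a single 2080Ti GPU), making it attractive for large-scale or real-time applications. The implementation is simple, and the demo code and models have been made available at https://github.com/yuhengsss/yolov.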
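The two ideas in the abstract, selecting only a few high-confidence candidates after one-stage detection and then aggregating reference-frame features according to their relation to the target, can be illustrated with a generic sketch. This is not the paper's module: the top-k selection by score and the cosine-similarity softmax attention (with a hypothetical `temperature` parameter) are assumptions chosen for illustration, and the function names are hypothetical.

```python
import numpy as np


def select_top_k(features, scores, k):
    """Keep the k highest-scoring candidates (feature rows) from a one-stage detector."""
    idx = np.argsort(scores)[::-1][:k]
    return features[idx], scores[idx]


def aggregate(target, refs, temperature=0.1):
    """Cosine-similarity attention over reference features.

    target: (d,) feature of a candidate in the current frame.
    refs:   (k, d) features of candidates selected from reference frames.
    Returns the aggregated (d,) feature and the (k,) attention weights.
    """
    t = target / np.linalg.norm(target)
    r = refs / np.linalg.norm(refs, axis=1, keepdims=True)
    logits = (r @ t) / temperature
    w = np.exp(logits - logits.max())  # numerically stable softmax
    w /= w.sum()
    return w @ refs, w
```

Because only k candidates per frame enter the attention, the aggregation cost scales with k rather than with the full set of dense predictions, which is the motivation the abstract gives for selecting regions after one-stage detection.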
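Designing and generating new data under targeted properties has attracted attention in various critical applications such as molecule design, image editing, and speech synthesis. Traditional hand-crafted approaches heavily rely on expert experience and intensive human effort, yet still suffer from insufficient scientific knowledge and low throughput, which prevents effective and efficient data generation. Recently, advances in deep learning have produced expressive methods that can learn the underlying representations and properties of data. This capability provides new opportunities for figuring out the mutual relationship between the structural patterns and functional properties of data, and for leveraging that relationship to generate structural data with desired properties. This article provides a systematic review of this promising research area, commonly known as controllable deep data generation. First, the potential challenges are raised and preliminaries are provided. Then, controllable deep data generation is formally defined, a taxonomy of various techniques is proposed, and the evaluation metrics in this specific domain are summarized. After that, exciting applications of controllable deep data generation are introduced, and existing works are experimentally analyzed and compared. Finally, promising future directions of controllable deep data generation are highlighted and five potential challenges are identified.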
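Conventional knowledge distillation (KD) methods for object detection mainly concentrate on homogeneous teacher-student detectors. However, the design of a lightweight detector for deployment often differs significantly from that of a high-capacity detector. Thus, we investigate KD between heterogeneous teacher-student pairs to broaden its applicability. We observe that the core difficulty of heterogeneous KD (hetero-KD) is the significant semantic gap between the backbone features of heterogeneous detectors, caused by their different optimization manners. Conventional homogeneous KD (homo-KD) methods suffer from this gap and struggle to achieve satisfactory performance in hetero-KD directly. In this paper, we propose the HEterogeneous Assistant Distillation (HEAD) framework, which leverages heterogeneous detection heads as assistants to guide the optimization of the student detector and reduce this gap. In HEAD, the assistant is an additional detection head, attached to the student backbone, whose architecture is homogeneous with the teacher head. Thus, hetero-KD is transformed into homo-KD, allowing efficient knowledge transfer from the teacher to the student. Moreover, we extend HEAD into a Teacher-Free HEAD (TF-HEAD) framework for when a well-trained teacher detector is unavailable. Our method achieves significant improvements over current detection KD methods. For example, on the MS-COCO dataset, TF-HEAD helps R18 RetinaNet achieve 33.9 mAP (+2.2), while HEAD further pushes the limit to 36.2 mAP (+4.5).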
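In this work, we propose PolarBEV for vision-based uneven BEV representation learning. To accommodate the foreshortening effect of camera imaging, we rasterize the BEV space both angularly and radially, and introduce polar embedding decomposition to model the associations among polar grids. The polar grids are rearranged into an array-like regular representation for efficient processing. Besides, to determine the 2D-to-3D correspondence, we iteratively update the BEV surface based on a hypothetical plane and adopt height-based feature transformation. PolarBEV keeps real-time inference speed on a single 2080Ti GPU and outperforms other methods on both BEV semantic segmentation and BEV instance segmentation. Thorough ablations are presented to validate the design. The code will be released at \url{https://github.com/superz-liu/polarbev}.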
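The sketch below illustrates what rasterizing BEV space "angularly and radially" means and why the result is uneven: each ego-frame point maps to a (radial, angular) cell, the cells form a regular 2-D array, and a cell's ground area grows with distance, so nearby regions get finer resolution. The grid sizes (`max_range=50.0` m, 100 radial bins, 128 angular bins), the uniform radial spacing, and the function names are hypothetical choices for illustration, not PolarBEV's actual configuration.

```python
import numpy as np


def polar_bin_indices(x, y, max_range=50.0, n_radial=100, n_angular=128):
    """Map ego-frame BEV coordinates (metres) to (radial, angular) cell indices.

    Radial bins are uniform in distance; angular bins are uniform over [0, 2*pi).
    Points at or beyond max_range get index -1 in both outputs.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.hypot(x, y)
    theta = np.mod(np.arctan2(y, x), 2 * np.pi)             # angle in [0, 2*pi)
    r_idx = np.floor(r * (n_radial / max_range)).astype(int)
    a_idx = np.floor(theta / (2 * np.pi) * n_angular).astype(int)
    a_idx = np.minimum(a_idx, n_angular - 1)                 # guard against rounding to 2*pi
    valid = r < max_range
    return np.where(valid, r_idx, -1), np.where(valid, a_idx, -1)


def radial_cell_areas(max_range=50.0, n_radial=100, n_angular=128):
    """Area (m^2) of one polar cell in each radial ring: (dtheta / 2) * (r_out^2 - r_in^2)."""
    edges = np.linspace(0.0, max_range, n_radial + 1)
    dtheta = 2 * np.pi / n_angular
    return 0.5 * dtheta * (edges[1:] ** 2 - edges[:-1] ** 2)
```

Because `(r_idx, a_idx)` index a dense `(n_radial, n_angular)` array, ordinary 2-D convolutions can run on the polar grid, which is one way to read the "array-like regular representation" above.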
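A good text-to-image model should not only generate high-quality images, but also ensure consistency between the text and the generated image. Previous models fail to handle both aspects well at the same time. This paper proposes a Gradual Refinement Generative Adversarial Network (GR-GAN) to alleviate this problem efficiently. A GRG module is designed to generate images from low resolution to high resolution with corresponding text constraints, moving from a coarse-grained (sentence) stage to a fine-grained (word) stage, and an ITM module is designed to provide image-text matching losses at both the sentence-image level and the word-region level for the corresponding stages. We also introduce a new metric, Cross-Model Distance (CMD), for evaluating image quality and image-text consistency simultaneously. Experimental results show that GR-GAN significantly outperforms previous models and achieves new state-of-the-art results on both FID and CMD. A detailed analysis demonstrates the efficiency of the different generation stages in GR-GAN.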
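Off-policy evaluation and learning (OPE/L) use offline observational data to make better decisions, which is crucial in applications where online experimentation is limited. However, because it depends entirely on logged data, OPE/L is sensitive to environment distribution shifts, i.e., discrepancies between the data-generating environment and the one where policies are deployed. \citet{si2020distributional} proposed distributionally robust OPE/L (DROPE/L) to address this, but the proposal relies on inverse-propensity weighting, whose estimation error and regret deteriorate if propensities are nonparametrically estimated and whose variance is suboptimal even if they are not. For standard, non-robust OPE/L, this is solved by doubly robust (DR) methods, but they do not naturally extend to the more complex DROPE/L, which involves a worst-case expectation. In this paper, we propose the first DR algorithms for DROPE/L with KL-divergence uncertainty sets. For evaluation, we propose Localized Doubly Robust DROPE (LDR$^2$OPE) and show that it achieves semiparametric efficiency under weak product rate conditions. Thanks to a localization technique, LDR$^2$OPE only requires fitting a small number of regressions, just like DR methods for standard OPE. For learning, we propose Continuum Doubly Robust DROPL (CDR$^2$OPL) and show that, under a product rate condition involving a continuum of regressions, it enjoys a fast regret rate of $\mathcal{O}\left(N^{-1/2}\right)$ even when unknown propensities are nonparametrically estimated. We empirically validate our algorithms in simulations and further extend our results to general $f$-divergence uncertainty sets.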
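The "worst-case expectation" over a KL-divergence uncertainty set has a well-known one-dimensional convex dual for bounded outcomes: $\inf_{Q:\,\mathrm{KL}(Q\|P)\le\delta} \mathbb{E}_Q[Y] = \sup_{\alpha>0}\{-\alpha\log\mathbb{E}_P[e^{-Y/\alpha}] - \alpha\delta\}$. The sketch below evaluates this dual on the empirical distribution of reward samples. It is only a naive plug-in illustration of the robust objective, assuming i.i.d. rewards observed under the target policy; it uses no propensities or regressions and is not the paper's LDR$^2$OPE estimator. The grid search over α and the function name are assumptions.

```python
import numpy as np


def kl_worst_case_mean(y, delta, alphas=None):
    """Worst-case mean of y over distributions Q with KL(Q || P_n) <= delta.

    P_n is the empirical distribution of the samples y. Uses the convex dual
        inf_Q E_Q[Y] = sup_{alpha > 0} { -alpha * log E_{P_n}[exp(-Y / alpha)] - alpha * delta },
    approximated by maximising over a log-spaced grid of alpha values.
    """
    y = np.asarray(y, dtype=float)
    if alphas is None:
        alphas = np.logspace(-3, 3, 2001)
    z = -y[None, :] / alphas[:, None]                                   # (len(alphas), n)
    zmax = z.max(axis=1, keepdims=True)
    log_mean_exp = zmax[:, 0] + np.log(np.exp(z - zmax).mean(axis=1))  # stable log-mean-exp
    objective = -alphas * log_mean_exp - alphas * delta
    return objective.max()
```

With `delta=0` the value approaches the ordinary sample mean, and it decreases monotonically as `delta` grows, which is the pessimism a distributionally robust evaluation is meant to encode.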