Image-text retrieval (ITR) is a challenging task in multimodal information processing because of the semantic gap between modalities. In recent years, researchers have made great progress in aligning images and text accurately. However, existing works mainly focus on fine-grained alignment between image regions and sentence fragments, ignoring the guidance that global context can provide. Integrating local fine-grained information with global context provides more semantic clues for retrieval. In this paper, we propose a novel Hierarchical Graph Alignment Network (HGAN) for image-text retrieval. First, to capture comprehensive multimodal features, we construct feature graphs for the image and text modalities respectively. Then, a multi-granularity shared space is established with a designed Multi-granularity Feature Aggregation and Rearrangement (MFAR) module, which strengthens the semantic correspondence between local and global information and yields more accurate feature representations for both modalities. Finally, the image and text features are further refined through three-level similarity functions to achieve hierarchical alignment. To validate the proposed model, we perform extensive experiments on the MS-COCO and Flickr30K datasets. Experimental results show that HGAN outperforms state-of-the-art methods on both datasets, demonstrating its effectiveness.
Recent works on the Lottery Ticket Hypothesis have shown that pre-trained language models (PLMs) contain smaller matching subnetworks (winning tickets) that can reach accuracy comparable to the original models. However, these tickets have been shown to lack robustness to adversarial examples, performing even worse than their PLM counterparts. To address this problem, we propose a novel method based on learning binary weight masks to identify robust tickets hidden in the original PLMs. Since the loss is not differentiable with respect to the binary mask, we assign the hard concrete distribution to the masks and encourage their sparsity using a smooth approximation of L0 regularization. Furthermore, we design an adversarial loss objective to guide the search for robust tickets and ensure that the tickets perform well in both accuracy and robustness. Experimental results show a significant improvement of the proposed method over previous work on adversarial robustness evaluation.
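The hard concrete relaxation mentioned above can be sketched as follows. This is a minimal illustration of the general technique, not the authors' implementation: the stretch limits `GAMMA` and `ZETA`, the temperature `BETA`, and all function names are assumptions chosen for the example, and the adversarial loss term is omitted.

```python
import math
import random

# Assumed hyperparameters (commonly used defaults, not taken from the abstract).
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_gate(log_alpha, rng):
    """Draw one relaxed mask value in [0, 1] from a hard concrete distribution."""
    # Clip the uniform noise away from 0 and 1 so the logarithms stay finite.
    u = min(max(rng.random(), 1e-6), 1.0 - 1e-6)
    s = sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / BETA)
    stretched = s * (ZETA - GAMMA) + GAMMA  # stretch to (GAMMA, ZETA)
    return min(1.0, max(0.0, stretched))    # hard-clip back to [0, 1]


def prob_nonzero(log_alpha):
    """Probability that a gate is non-zero; differentiable in log_alpha."""
    return sigmoid(log_alpha - BETA * math.log(-GAMMA / ZETA))


def expected_l0(log_alphas):
    """Smooth surrogate for the L0 norm: expected number of active gates."""
    return sum(prob_nonzero(a) for a in log_alphas)
```

In a training loop, each weight would be multiplied by its sampled gate, and `expected_l0` would be added to the task loss with a sparsity coefficient.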
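Deep segmentation models often risk failure when test images come from unseen distributions. Improving model robustness against these risks is crucial for the large-scale clinical application of deep models. In this study, inspired by the human learning cycle, we propose a novel online reflective learning framework (RefSeg) to improve segmentation robustness. Based on the concept of reflection-on-action, RefSeg first drives the deep model to take action and produce a semantic segmentation. RefSeg then triggers the model to reflect on itself. Because it is challenging to make deep models aware of their segmentation failures at test time, RefSeg synthesizes a realistic proxy image from the semantic mask to help the model build intuitive and effective reflections. This proxy translates and emphasizes the segmentation flaws. Maximizing the structural similarity between the raw input and the proxy closes the reflection-on-action loop and improves segmentation robustness. RefSeg runs in the testing phase and is applicable to segmentation models in general. Extensive validation on three medical image segmentation tasks, using a public cardiac MR dataset and two large in-house ultrasound datasets, shows that RefSeg remarkably improves model robustness and achieves state-of-the-art performance against strong competitors.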
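Ultrasound (US) is widely used for its advantages of real-time imaging, freedom from radiation, and portability. In clinical practice, analysis and diagnosis often rely on US sequences rather than a single image to obtain dynamic anatomical information. This is challenging for novices to learn, because practicing with enough videos from patients is clinically infeasible. In this paper, we propose a novel framework to synthesize high-fidelity US videos. Specifically, the synthesized videos are generated by animating source content images based on the motion of given driving videos. Our highlights are three-fold. First, leveraging the advantages of self-supervised and fully supervised learning, our proposed system is trained in a weakly supervised manner for keypoint detection. These keypoints then provide vital information for handling the complex, highly dynamic motions in US videos. Second, we decouple content and texture learning using dual decoders to effectively reduce the difficulty of model learning. Finally, we adopt an adversarial training strategy with GAN losses to further improve the sharpness of the generated videos, narrowing the gap between real and synthesized videos. We validate our method on a large in-house pelvic dataset with highly dynamic motion. Extensive evaluation metrics and a user study demonstrate the effectiveness of our proposed method.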
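In this paper, we consider the contextual multi-armed bandit problem with linear payoffs under a risk-averse criterion. At each round, contexts are revealed for each arm, and the decision maker chooses one arm to pull and receives the corresponding reward. In particular, we consider mean-variance as the risk criterion, and the best arm is the one with the largest mean-variance reward. We apply the Thompson Sampling algorithm to the disjoint model and provide a comprehensive regret analysis for a variant of the proposed algorithm. For $T$ rounds, $K$ actions, and $d$-dimensional feature vectors, we prove a regret bound of $O\left((1+\rho+\frac{1}{\rho})\, d \ln T \ln\frac{K}{\delta} \sqrt{d K T^{1+2\epsilon} \ln\frac{K}{\delta} \frac{1}{\epsilon}}\right)$ that holds with probability $1-\delta$ under the mean-variance criterion with risk tolerance $\rho$, for any $0<\epsilon<\frac{1}{2}$ and $0<\delta<1$. The empirical performance of our proposed algorithms is demonstrated on a portfolio selection problem.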
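Diffusion probabilistic models (DPMs) are emerging powerful generative models. Despite their high-quality generation performance, DPMs still suffer from slow sampling, as they generally need hundreds or thousands of sequential function evaluations (steps) of large neural networks to draw a sample. Sampling from DPMs can alternatively be viewed as solving the corresponding diffusion ordinary differential equations (ODEs). In this work, we propose an exact formulation of the solution of diffusion ODEs. The formulation analytically computes the linear part of the solution, rather than leaving all terms to black-box ODE solvers as in previous works. By applying a change of variables, the solution can be equivalently simplified to an exponentially weighted integral of the neural network. Based on our formulation, we propose DPM-Solver, a fast dedicated high-order solver for diffusion ODEs with a convergence order guarantee. DPM-Solver is suitable for both discrete-time and continuous-time DPMs without any further training. Experimental results show that DPM-Solver can generate high-quality samples in only 10 to 20 function evaluations on various datasets. We achieve 4.70 FID in 10 function evaluations and 2.87 FID in 20 function evaluations on the CIFAR10 dataset, and a $4\sim 16\times$ speedup compared with previous state-of-the-art training-free samplers on various datasets.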
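For orientation, the exponentially weighted integral can be written out as below. The notation is an assumption, since the abstract does not define it: $\alpha_t$ and $\sigma_t$ are the signal and noise scales of the forward process, $\lambda_t = \log(\alpha_t/\sigma_t)$ is the variable introduced by the change of variables, and $\hat{\epsilon}_\theta$ is the noise-prediction network expressed as a function of $\lambda$. The first line is the solution from time $s$ to time $t$; the second is the first-order step obtained by holding the network output fixed over the interval, with $h = \lambda_t - \lambda_s$.

```latex
\begin{align}
x_t &= \frac{\alpha_t}{\alpha_s}\, x_s
      - \alpha_t \int_{\lambda_s}^{\lambda_t} e^{-\lambda}\,
        \hat{\epsilon}_\theta(\hat{x}_\lambda, \lambda)\, \mathrm{d}\lambda \\
x_t &\approx \frac{\alpha_t}{\alpha_s}\, x_s
      - \sigma_t \left(e^{h} - 1\right) \epsilon_\theta(x_s, s)
\end{align}
```

The linear term $\frac{\alpha_t}{\alpha_s} x_s$ is computed exactly, so only the integral over the network output has to be approximated.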
Gradient estimation -- approximating the gradient of an expectation with respect to the parameters of a distribution -- is central to the solution of many machine learning problems. However, when the distribution is discrete, most common gradient estimators suffer from excessive variance. To improve the quality of gradient estimation, we introduce a variance reduction technique based on Stein operators for discrete distributions. We then use this technique to build flexible control variates for the REINFORCE leave-one-out estimator. Our control variates can be adapted online to minimize variance and do not require extra evaluations of the target function. In benchmark generative modeling tasks such as training binary variational autoencoders, our gradient estimator achieves substantially lower variance than state-of-the-art estimators with the same number of function evaluations.
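As a point of reference, the plain REINFORCE leave-one-out estimator that the control variates build on can be sketched for a single Bernoulli variable. This is only the baseline estimator under assumed names (`rloo_gradient`, `exact_gradient`) and an assumed logit parameterization; the Stein-operator control variates themselves are not implemented here.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rloo_gradient(f, theta, num_samples, rng):
    """Estimate d/dtheta E[f(x)] for x ~ Bernoulli(sigmoid(theta)).

    Each sample uses the mean of f over the *other* samples as its baseline,
    which keeps the estimator unbiased while reducing variance.
    """
    if num_samples < 2:
        raise ValueError("leave-one-out needs at least two samples")
    p = sigmoid(theta)
    xs = [1 if rng.random() < p else 0 for _ in range(num_samples)]
    fs = [f(x) for x in xs]
    total = sum(fs)
    grad = 0.0
    for x, fx in zip(xs, fs):
        baseline = (total - fx) / (num_samples - 1)
        score = x - p  # d/dtheta log p_theta(x) under the logit parameterization
        grad += (fx - baseline) * score
    return grad / num_samples


def exact_gradient(f, theta):
    """Closed-form gradient, available here because x takes only two values."""
    p = sigmoid(theta)
    return p * (1.0 - p) * (f(1) - f(0))
```

Averaging many calls to `rloo_gradient` should approach `exact_gradient` for the same `f` and `theta`, which gives a simple check on unbiasedness.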
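Recent years have witnessed an upsurge of interest in employing flexible machine learning models for instrumental variable (IV) regression, but the development of uncertainty quantification methodology is still lacking. In this work, we present a novel quasi-Bayesian procedure for IV regression, building on the recently developed kernelized IV models and the dual/minimax formulation of IV regression. We analyze the frequentist behavior of the proposed method by establishing minimax optimal contraction rates in $L_2$ and Sobolev norms and by discussing the frequentist validity of credible balls. We further derive a scalable inference algorithm that can be extended to work with wide neural network models. Empirical evaluation shows that our method produces informative uncertainty estimates on complex high-dimensional problems.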
Visual Question Answering (VQA) requires a fine-grained and simultaneous understanding of both the visual content of images and the textual content of questions. Therefore, designing an effective 'co-attention' model to associate key words in questions with key objects in images is central to VQA performance. So far, most successful attempts at co-attention learning have been achieved by using shallow models, and deep co-attention models show little improvement over their shallow counterparts. In this paper, we propose a deep Modular Co-Attention Network (MCAN) that consists of Modular Co-Attention (MCA) layers cascaded in depth. Each MCA layer models the self-attention of questions and images, as well as the guided-attention of images jointly using a modular composition of two basic attention units. We quantitatively and qualitatively evaluate MCAN on the benchmark VQA-v2 dataset and conduct extensive ablation studies to explore the reasons behind MCAN's effectiveness. Experimental results demonstrate that MCAN significantly outperforms the previous state-of-the-art. Our best single model delivers 70.63% overall accuracy on the test-dev set. Code is available at https://github.com/MILVLG/mcan-vqa.
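The two basic attention units can be illustrated with plain scaled dot-product attention. This is a simplified sketch under assumptions, not the code from the linked repository: the function names are hypothetical, and the learned projections, multiple heads, and feed-forward sublayers that a full layer would include are left out.

```python
import math


def softmax(row):
    m = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of feature vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def self_attention(x):
    """Each feature in x attends to all features in x."""
    return attention(x, x, x)


def guided_attention(x, y):
    """Features in x (e.g. image regions) attend to features in y (e.g. question words)."""
    return attention(x, y, y)
```

Stacking such units in depth, with self-attention on each modality and guided attention from questions to images, follows the cascaded layout the abstract describes.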
To facilitate research on text generation, this paper presents a comprehensive and unified library, TextBox 2.0, focusing on the use of pre-trained language models (PLMs). To be comprehensive, our library covers $13$ common text generation tasks and their corresponding $83$ datasets and further incorporates $45$ PLMs covering general, translation, Chinese, dialogue, controllable, distilled, prompting, and lightweight PLMs. We also implement $4$ efficient training strategies and provide $4$ generation objectives for pre-training new PLMs from scratch. To be unified, we design the interfaces to support the entire research pipeline (from data loading to training and evaluation), ensuring that each step can be fulfilled in a unified way. Despite the rich functionality, it is easy to use our library, either through the friendly Python API or command line. To validate the effectiveness of our library, we conduct extensive experiments and exemplify four types of research scenarios. The project is released at the link: https://github.com/RUCAIBox/TextBox.