Code-switching refers to alternating between languages within a conversation. Training end-to-end (E2E) automatic speech recognition (ASR) systems for code-switching is a challenging problem: data scarcity is aggravated by the increased language context confusion that comes with the presence of multiple languages. In this paper, we propose a language-related attention mechanism to reduce multilingual context confusion for an E2E code-switching ASR model, based on the Equivalence Constraint (EC) theory. The linguistic theory requires that any monolingual fragment occurring in a code-switching sentence must also be able to occur in a monolingual sentence, which builds a bridge between monolingual data and code-switching data. By computing respective attention for each language, our method can efficiently transfer language knowledge from abundant monolingual data. We evaluate our method on the ASRU 2019 Mandarin-English code-switching challenge dataset. Compared with the baseline model, the proposed method achieves an 11.37% relative mix error rate reduction.
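The per-language attention idea above can be illustrated with a minimal numpy sketch: separate attention distributions are computed over the encoder states of each language, then the per-language contexts are combined. The function name, single-query shape, and masking scheme below are hypothetical simplifications for illustration, not the paper's parameterization.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def language_related_attention(query, enc, lang_masks):
    """Compute separate attention for each language over shared encoder
    states, then sum the per-language contexts.

    query: (d,) decoder state; enc: (T, d) encoder states;
    lang_masks: list of boolean (T,) vectors, one per language, marking
    which frames belong to that language (each must contain a True).
    """
    d = enc.shape[-1]
    contexts = []
    for mask in lang_masks:
        scores = enc @ query / np.sqrt(d)
        scores = np.where(mask, scores, -np.inf)  # restrict to one language
        contexts.append(softmax(scores) @ enc)
    return np.sum(contexts, axis=0)
```

Restricting each attention distribution to one language's frames is what lets monolingual knowledge transfer per language rather than mixing across the switch points.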
End-to-end models are becoming popular approaches for mispronunciation detection and diagnosis (MDD). A streaming MDD framework, which many practical applications require, remains a challenge. This paper proposes a streaming end-to-end MDD framework named CCA-MDD. CCA-MDD supports online processing and is able to run in real time. Its encoder comprises a streaming acoustic encoder based on a Conv-Transformer network and an improved cross-attention module named coupled cross-attention (CCA). The coupled cross-attention integrates the encoded acoustic features with pre-encoded linguistic features. An ensemble of decoders trained with multi-task learning is applied for the final MDD decision. Experiments on publicly available corpora show that CCA-MDD achieves performance comparable to published offline end-to-end MDD models.
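The integration step — acoustic frames attending over pre-encoded linguistic features — can be sketched as plain scaled dot-product cross-attention. The learned projections and the coupling details of CCA are omitted; shapes and names are illustrative assumptions.

```python
import numpy as np

def cross_attention(acoustic, linguistic):
    """Single-head scaled dot-product cross-attention: each acoustic frame
    attends over the pre-encoded linguistic features.

    acoustic: (T, d) streaming acoustic encoder outputs;
    linguistic: (U, d) pre-encoded linguistic features; returns (T, d).
    """
    d = acoustic.shape[-1]
    scores = acoustic @ linguistic.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)            # softmax over U
    return w @ linguistic
```

Because the linguistic features can be encoded ahead of time, only the acoustic side needs to be processed frame-by-frame, which is what makes such a design compatible with streaming.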
Recent studies have shown that CLIP has achieved remarkable success in zero-shot inference, while its fine-tuning performance is not satisfactory. In this paper, we identify that fine-tuning performance is significantly impacted by hyper-parameter choices. We examine various key hyper-parameters and empirically evaluate their impact on fine-tuning CLIP for classification tasks through a comprehensive study. We find that the fine-tuning performance of CLIP is substantially underestimated. Equipped with hyper-parameter refinement, we demonstrate that CLIP itself is better or at least competitive in fine-tuning compared with large-scale supervised pre-training approaches or recent works that use CLIP as prediction targets in Masked Image Modeling. Specifically, CLIP ViT-Base/16 and CLIP ViT-Large/14 can achieve 85.7% and 88.0% fine-tuning Top-1 accuracy on the ImageNet-1K dataset, respectively. These observations challenge the conventional conclusion that CLIP is not suitable for fine-tuning, and motivate us to rethink recently proposed improvements based on CLIP. We will release our code publicly at \url{https://github.com/LightDXY/FT-CLIP}.
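One hyper-parameter commonly examined when fine-tuning vision transformers is layer-wise learning-rate decay: later layers get the full rate, earlier layers geometrically smaller ones. The helper below is a generic sketch of that schedule; the decay value shown is illustrative, not the paper's setting.

```python
def layerwise_lr(base_lr, num_layers, decay=0.65):
    """Return per-depth learning rates for layers 0..num_layers (index 0
    is closest to the input, e.g. the patch embedding): the deepest layer
    gets base_lr, and each earlier layer is scaled down by `decay`."""
    return [base_lr * decay ** (num_layers - i) for i in range(num_layers + 1)]
```

Freezing-like behavior for early layers (tiny rates) while fully adapting the head is one reason this single hyper-parameter can swing fine-tuning accuracy substantially.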
Traffic forecasting is an important application of spatiotemporal series prediction. Among different methods, graph neural networks have so far achieved the most promising results, and learning relations between graph nodes then becomes a crucial task. However, the room for improvement is very limited when these relations are learned in a node-to-node manner. The challenge stems from (1) obscure temporal dependencies between different stations, (2) difficulties in defining variables beyond the node level, and (3) no ready-made method to validate the learned relations. To confront these challenges, we define legitimate traffic causal variables to discover the causal relations inside the traffic network, which are carefully checked with statistical tools and case analysis. We then present a novel model named Graph Spatial-Temporal Network Based on Causal Insight (GT-CausIn), where prior learned causal information is integrated with graph diffusion layers and temporal convolutional network (TCN) layers. Experiments are carried out on two real-world traffic datasets, PEMS-BAY and METR-LA, and show that GT-CausIn significantly outperforms state-of-the-art models on mid-term and long-term prediction.
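The TCN layers mentioned above are built from dilated causal 1-D convolutions, where the output at time t depends only on inputs at t, t-d, t-2d, and so on. A minimal sketch (a single channel, zero left-padding to preserve length; real TCN layers stack these with residual connections):

```python
import numpy as np

def causal_dilated_conv(x, kernel, dilation=1):
    """Dilated causal 1-D convolution of a single channel.

    x: (T,) input series; kernel: length-k filter taps; output[t] =
    sum_j kernel[j] * x[t - j*dilation], with zeros assumed before t=0.
    """
    k = len(kernel)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), np.asarray(x, dtype=float)])
    return np.array([
        sum(kernel[j] * xp[t + pad - j * dilation] for j in range(k))
        for t in range(len(x))
    ])
```

Causality matters for forecasting: the convolution never peeks at future time steps, and growing the dilation across layers enlarges the receptive field exponentially at linear cost.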
Person re-identification plays a significant role in realistic scenarios due to its various applications in public security and video surveillance. Recently, leveraging supervised or semi-/unsupervised learning paradigms, which benefit from large-scale datasets and strong computing performance, has achieved competitive performance on specific target domains. However, when Re-ID models are directly deployed in a new domain without target samples, they always suffer from considerable performance degradation and poor domain generalization. To address this challenge, we propose a Deep Multimodal Fusion network that elaborates rich semantic knowledge to assist representation learning during pre-training. Importantly, a multimodal fusion strategy is introduced to translate the features of different modalities into a common space, which can significantly boost the generalization capability of the Re-ID model. In the fine-tuning stage, a realistic dataset is adopted to fine-tune the pre-trained model for better distribution alignment with real-world data. Comprehensive experiments on benchmarks demonstrate that our method can significantly outperform previous domain generalization or meta-learning methods by a clear margin. Our source code will also be publicly available at https://github.com/JeremyXSC/DMF.
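Translating per-modality features into a common space could, in its simplest form, look like the sketch below. The projection matrices and the averaging operator are illustrative stand-ins, not the paper's learned fusion strategy.

```python
import numpy as np

def fuse_modalities(feats, proj):
    """Project each modality's features into a shared space and average.

    feats: dict of modality name -> (n, d_m) feature array;
    proj: dict of the same names -> (d_m, d_common) projection matrix.
    Returns an (n, d_common) fused representation.
    """
    projected = [f @ proj[name] for name, f in feats.items()]
    return np.mean(projected, axis=0)
```

In practice the projections would be learned jointly with the Re-ID objective so that corresponding samples from different modalities land near each other in the common space.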
Out-of-distribution (OOD) detection plays a crucial role in AI safety, and the task poses great challenges. It has been observed that deep neural network classifiers tend to assign out-of-distribution inputs to in-distribution classes with high confidence. Existing works attempt to solve the problem by explicitly imposing uncertainty on the classifier when OOD inputs are exposed to it during training. In this paper, we propose an alternative probabilistic paradigm that is both practically useful and theoretically viable for the OOD detection task. In particular, we impose statistical independence between inlier and outlier data during training, so that the inlier data reveal little information about the OOD data to the deep estimator. Specifically, we estimate the statistical dependence between inlier and outlier data via the Hilbert-Schmidt Independence Criterion (HSIC) and penalize this metric during training. We also relate our method to a novel statistical test at inference time, along with our principled motivation. Empirical results show that our method is effective and robust for OOD detection on various benchmarks. Compared with SOTA models, our method achieves significant improvements on the FPR95, AUROC, and AUPR metrics. Code is available at \url{https://github.com/jylins/hone}.
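The HSIC quantity that is penalized can be computed from kernel matrices of the two batches. Below is a minimal numpy sketch of the biased empirical estimator with Gaussian kernels; the batch pairing and the bandwidth value are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def _gaussian_kernel(x, sigma=1.0):
    # Pairwise squared Euclidean distances, then the RBF kernel matrix.
    sq = np.sum(x ** 2, axis=1, keepdims=True)
    d2 = sq + sq.T - 2.0 * x @ x.T
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC estimate between paired batches x: (n, dx)
    and y: (n, dy). Values near zero suggest statistical independence,
    so adding this term to the loss pushes inlier and outlier features
    toward independence."""
    n = x.shape[0]
    k = _gaussian_kernel(x, sigma)
    l = _gaussian_kernel(y, sigma)
    h = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return float(np.trace(k @ h @ l @ h)) / (n - 1) ** 2
```

A constant (uninformative) second batch yields exactly zero, while a batch perfectly dependent on the first yields a clearly positive value, which is the behavior a penalty term needs.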
This paper presents a simple yet effective framework, MaskCLIP, which incorporates a newly proposed masked self-distillation into contrastive language-image pretraining. The core idea of masked self-distillation is to distill the representation of a full image into the representation predicted from a masked image. Such incorporation enjoys two vital benefits. First, masked self-distillation targets local patch representation learning, which is complementary to vision-language contrastive learning that focuses on text-related representation. Second, masked self-distillation is also consistent with vision-language contrastive learning in terms of the training objective: the visual encoder is used for feature alignment, and is thus able to learn local semantics with indirect supervision from the language. We provide specially designed experiments with comprehensive analysis to validate the two benefits. Empirically, we show that when MaskCLIP is applied to various challenging downstream tasks, it achieves superior results in linear probing, fine-tuning, and zero-shot performance with the guidance of the language encoder.
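The masked self-distillation objective — matching the masked-image prediction to the full-image representation — can be sketched as a negative cosine similarity between the two sets of patch features. This is an illustrative stand-in, not necessarily the paper's exact loss; in training, the full-image branch would typically be treated as a fixed (stop-gradient) target.

```python
import numpy as np

def masked_distill_loss(full_feats, masked_preds):
    """Mean negative cosine similarity between the full-image patch
    features (the distillation target) and the features predicted from
    the masked image. Shapes: (num_patches, dim). Minimum is -1 when
    the prediction matches the target direction exactly."""
    f = full_feats / np.linalg.norm(full_feats, axis=-1, keepdims=True)
    p = masked_preds / np.linalg.norm(masked_preds, axis=-1, keepdims=True)
    return float(-np.mean(np.sum(f * p, axis=-1)))
```

Because the target comes from the model's own full-image branch rather than raw pixels, the supervision stays at the semantic level, which is the stated point of the design.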
Visual question answering is an important task in natural language and vision understanding. However, in most public visual question answering datasets such as VQA and CLEVR, the questions are human-generated and specific to a given image, such as "What color are her eyes?". Such human-generated, crowdsourced questions are relatively simple and sometimes biased toward certain entities or attributes. In this paper, we introduce a new image-based question answering dataset, ChiQA. It contains real-world queries issued by internet users, each combined with several related open-domain images. A system should determine whether an image can answer the question. Unlike previous VQA datasets, the questions are real-world, image-independent queries, which are more diverse and less biased. Compared with previous image-retrieval or image-captioning datasets, ChiQA measures not only relevance but also answerability, which requires more fine-grained vision and language reasoning. ChiQA contains more than 40K questions and more than 200K question-image pairs. A three-level 2/1/0 label is assigned to each pair, indicating a perfect answer, a partial answer, or irrelevance. Data analysis shows that ChiQA requires a deep understanding of both language and vision, including grounding, comparison, and reading. We evaluate several state-of-the-art vision-language models such as ALBEF, showing that there is still large room for improvement on ChiQA.
We propose bootstrapped masked autoencoders (BootMAE), a new approach for vision BERT pretraining. BootMAE improves the original masked autoencoder (MAE) with two core designs: 1) a momentum encoder that provides online features as an extra BERT prediction target; and 2) a target-aware decoder that tries to reduce the pressure on the encoder to memorize target-specific information. The first design is motivated by the observation that using a pretrained MAE to extract features as the BERT prediction target for the masked tokens achieves better pretraining performance. We therefore add a momentum encoder in parallel with the original MAE encoder, which bootstraps the pretraining performance by using its own representations as the BERT prediction target. In the second design, we convey target-specific information (e.g., the pixel values of unmasked patches) directly to the decoder, to reduce the pressure on the encoder to memorize it. The encoder thus focuses on semantic modeling, which is the goal of BERT pretraining, and does not need to waste its capacity on memorizing information about the unmasked tokens related to the prediction target. Through extensive experiments, our BootMAE achieves $84.2\%$ Top-1 accuracy on ImageNet-1K, a $+0.8\%$ improvement over MAE under the same pretraining epochs. BootMAE also gains $+1.0$ mIoU on ADE20K semantic segmentation and $+1.3$ box AP, $+1.4$ mask AP improvements on object detection and segmentation on the COCO dataset. Code is released at https://github.com/LightDXY/BootMAE.
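A momentum encoder of the kind used in the first design is typically maintained as an exponential moving average (EMA) of the online encoder's weights, so that its features provide a slowly evolving prediction target. A minimal sketch; the parameter lists and momentum value are illustrative:

```python
import numpy as np

def ema_update(momentum_params, online_params, m=0.999):
    """One momentum-encoder step: p_m <- m * p_m + (1 - m) * p_o, applied
    elementwise to each parameter tensor. The momentum copy drifts slowly
    toward the online encoder; its representations can then serve as the
    extra BERT prediction target."""
    return [m * pm + (1.0 - m) * po
            for pm, po in zip(momentum_params, online_params)]
```

The large momentum keeps the target representations stable across steps, which is what lets the model "bootstrap" on its own features without the target collapsing or chasing noise.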
Prior works have proposed several strategies for reducing the computational cost of the self-attention mechanism. Many of them consider decomposing the self-attention procedure into regional and local feature extraction procedures, each of which incurs a much smaller computational complexity. However, regional information is typically obtained only at the cost of undesirable information loss due to downsampling. In this paper, we propose a novel Transformer architecture, named Dual Vision Transformer (Dual-ViT), that aims to mitigate the cost issue. The new architecture incorporates a critical semantic pathway that can more efficiently compress token vectors into global semantics with a reduced order of complexity. Such compressed global semantics then serve as useful prior information for learning finer pixel-level details through another constructed pixel pathway. The semantic pathway and the pixel pathway are integrated and trained jointly, spreading enhanced self-attention information through both pathways in parallel. Dual-ViT is thereby able to reduce the computational complexity without compromising much accuracy. We empirically demonstrate that Dual-ViT provides superior accuracy to SOTA Transformer architectures with reduced training complexity. The source code is available at \url{https://github.com/YehLi/ImageNetModel}.
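The semantic pathway's token compression can be caricatured by pooling groups of tokens into a few global semantic tokens. The learned compression in Dual-ViT is more sophisticated than this hypothetical sketch, but it shows where the reduced attention cost comes from: attending N tokens to a handful of semantic tokens costs O(N * num_semantic) instead of O(N^2).

```python
import numpy as np

def compress_tokens(tokens, num_semantic):
    """Compress N token vectors (N, d) into (num_semantic, d) global
    semantic tokens by average-pooling contiguous groups of tokens --
    a crude stand-in for a learned semantic pathway."""
    n, d = tokens.shape
    groups = np.array_split(np.arange(n), num_semantic)
    return np.stack([tokens[g].mean(axis=0) for g in groups])
```

The pixel pathway can then cross-attend from the full token set to these few semantic tokens, recovering global context cheaply while keeping pixel-level detail in the full-resolution stream.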