Multi-exit models (MEMs) use an early-exit strategy to improve the accuracy and efficiency of deep neural networks (DNNs) by allowing samples to exit the network before the last layer. However, the effectiveness of MEMs in the presence of distribution shifts remains largely unexplored. Our work examines how distribution shifts generated by common image corruptions affect the accuracy and efficiency of MEMs. We find that under common corruptions, early-exiting at the first correct exit reduces the inference cost and provides a significant boost in accuracy (about 10%) over exiting at the last layer. However, with realistic early-exit strategies, which do not assume knowledge of the correct exits, MEMs still reduce inference cost but provide only a marginal improvement in accuracy (about 1%) compared to exiting at the last layer. Moreover, the presence of distribution shift widens the gap between an MEM's maximum classification accuracy and that of realistic early-exit strategies by 5% on average compared with the gap on in-distribution data. Our empirical analysis shows that the lack of calibration caused by a distribution shift makes such early-exit strategies more prone to exiting early and increases misclassification rates. Furthermore, the lack of calibration increases the inconsistency of the model's predictions across exits, leading to both inefficient inference and more misclassifications compared with evaluation on in-distribution data. Finally, we propose two metrics, underthinking and overthinking, that quantify the different behavior of practical early-exit strategies under distribution shifts and provide insights into improving the practical utility of MEMs.
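To make the early-exit mechanism concrete, here is a minimal sketch of confidence-thresholded early exiting, where a sample leaves the network at the first exit head whose softmax confidence crosses a threshold. This is an illustrative toy, not the authors' implementation; the module shapes and the 0.9 threshold are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiExitNet(nn.Module):
    """Toy multi-exit model: a stack of blocks, each followed by its own classifier head."""

    def __init__(self, dim=32, num_classes=10, num_blocks=3):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(num_blocks)]
        )
        self.exits = nn.ModuleList(
            [nn.Linear(dim, num_classes) for _ in range(num_blocks)]
        )

def confidence_early_exit(model, x, threshold=0.9):
    """Run blocks one at a time and stop at the first exit whose max softmax
    probability exceeds `threshold` (an illustrative value)."""
    with torch.no_grad():
        h = x
        for i, (block, head) in enumerate(zip(model.blocks, model.exits)):
            h = block(h)
            prob = F.softmax(head(h), dim=-1)
            conf, pred = prob.max(dim=-1)
            if conf.item() >= threshold or i == len(model.blocks) - 1:
                return i, pred.item()

model = MultiExitNet()
exit_idx, label = confidence_early_exit(model, torch.randn(1, 32))
print(f"exited at head {exit_idx} with predicted class {label}")
```

A poorly calibrated model under distribution shift produces overconfident scores at early heads, which is exactly the failure mode the abstract describes: this kind of threshold rule exits too early and misclassifies more often.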
Continuous pseudo-labeling (PL) algorithms such as slimIPL have recently emerged as a powerful strategy for semi-supervised learning in speech recognition. In contrast with earlier strategies that alternated between training a model and generating pseudo-labels (PLs) with it, here PLs are generated in an end-to-end manner as training proceeds, improving training speed and the accuracy of the final model. PL shares a common theme with teacher-student models such as distillation, in that a teacher model generates targets that the student model being trained must mimic. Interestingly, however, PL strategies generally use hard labels, whereas distillation uses the distribution over labels as the target to mimic. Inspired by distillation, we expect that specifying the whole distribution (i.e., soft labels) over sequences as the target for unlabeled data, instead of a single best-pass pseudo-labeled transcript (hard labels), should improve PL performance and convergence. Surprisingly, we find that soft-label targets can lead to training divergence, with the model collapsing to a degenerate token distribution per frame. We hypothesize that this does not happen with hard labels because the training loss on hard labels imposes sequence-level consistency that keeps the model from collapsing to the degenerate solution. In this paper, we present several experiments that support this hypothesis and explore several regularization approaches that can ameliorate the degenerate collapse when using soft labels. These approaches bring the accuracy of soft labels closer to that of hard labels, and while they do not yet outperform them, they serve as a useful framework for further improvements.
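As a rough illustration of the hard-label versus soft-label distinction, the snippet below contrasts a pseudo-label loss on the teacher's greedy per-frame hypothesis with a distillation-style KL loss on the teacher's full per-frame distribution. This is a simplified frame-wise sketch under assumed shapes, not the slimIPL recipe (which decodes a transcript and trains with a sequence loss such as CTC).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_frames, vocab = 50, 30
teacher_logits = torch.randn(num_frames, vocab)   # stand-in for the teacher's outputs
student_logits = torch.randn(num_frames, vocab, requires_grad=True)

# Hard pseudo-labels: commit to the teacher's single best hypothesis per frame
# (in a real CTC system this would be the decoded transcript fed to the CTC loss).
hard_targets = teacher_logits.argmax(dim=-1)
hard_loss = F.cross_entropy(student_logits, hard_targets)

# Soft pseudo-labels: mimic the teacher's full distribution over labels per frame,
# as in distillation. This is the variant the paper finds can collapse without
# additional regularization.
soft_targets = F.softmax(teacher_logits, dim=-1)
soft_loss = F.kl_div(F.log_softmax(student_logits, dim=-1), soft_targets,
                     reduction="batchmean")

print(hard_loss.item(), soft_loss.item())
```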
Transformers have become increasingly popular in a wide range of applications, including natural language processing (NLP), computer vision, and speech recognition, because of their powerful representational capacity. However, harnessing this representational capacity effectively requires a large amount of data, strong regularization, or both, to mitigate overfitting. Recently, the power of Transformers has been unlocked by self-supervised pretraining strategies based on masked autoencoders, which rely on reconstructing masked inputs, either directly or contrastively from unmasked content. Such pretraining strategies, used by the BERT model in NLP, the wav2vec models in speech, and more recently the MAE model in vision, force the model to learn the relationships between the content of different parts of the input using autoencoding-related objectives. In this paper, we propose a novel but surprisingly simple alternative: predicting the positions of content without providing the positional information for it. Doing so requires the Transformer to understand the positional relationships between different parts of the input from their content alone. This amounts to an efficient implementation in which the pretext task is a classification problem over all possible positions for each input token. We conduct experiments on vision and speech benchmarks, where our approach improves over strong supervised training baselines and is comparable to modern unsupervised/self-supervised pretraining methods. Our method also enables Transformers trained without positional embeddings to outperform those trained with full positional information.
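A minimal sketch of the position-prediction pretext task described above (an illustrative toy, not the paper's exact implementation): content embeddings are fed to a Transformer encoder without positional embeddings, and each token is classified into one of the possible positions. All shapes and hyperparameters here are assumptions.

```python
import torch
import torch.nn as nn

class PositionPredictionPretext(nn.Module):
    """Predict each token's position from content alone (no positional embeddings)."""

    def __init__(self, dim=64, num_positions=16, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.position_head = nn.Linear(dim, num_positions)

    def forward(self, content_embeddings):
        # content_embeddings: (batch, seq_len, dim), with no position information added
        h = self.encoder(content_embeddings)
        return self.position_head(h)           # (batch, seq_len, num_positions)

model = PositionPredictionPretext()
tokens = torch.randn(8, 16, 64)                # toy patch/frame embeddings
logits = model(tokens)
# The pretext loss is a classification over all possible positions per token.
targets = torch.arange(16).expand(8, 16)
loss = nn.functional.cross_entropy(logits.reshape(-1, 16), targets.reshape(-1))
print(loss.item())
```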
Self-attention mechanisms model long-range context by using pairwise attention between all input tokens. In doing so, they assume a fixed attention granularity defined by the individual tokens (e.g., text characters or image pixels), which may not be optimal for modeling complex dependencies at higher levels. In this paper, we propose ContextPool to address this problem by adapting the attention granularity for each token. Inspired by the success of ConvNets that are coupled with pooling to capture long-range dependencies, we learn to pool neighboring features for each token before computing attention in a given attention layer. The pooling weights and support size are determined adaptively, allowing the pooled features to encode meaningful context at varying scales. We show that ContextPool makes attention models more expressive, often achieving strong performance with fewer layers and thus significantly reduced cost. Experiments validate that our ContextPool module, when plugged into Transformer models, matches or surpasses state-of-the-art performance with less compute on several language and image benchmarks, outperforms recent works with learned context sizes or sparse attention patterns, and is also applicable to ConvNets for efficient feature learning.
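The following sketch illustrates the general idea of pooling neighboring token features with learned, per-token weights before attention. It is a simplified assumption of the mechanism, not the released ContextPool code; the fixed window size, layer shapes, and weighting scheme are all illustrative (the paper also adapts the support size, which is omitted here).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextPoolSketch(nn.Module):
    """Per-token learned pooling over a local window, applied before self-attention."""

    def __init__(self, dim=64, window=5, num_heads=4):
        super().__init__()
        self.window = window
        # Predicts, for every token, a weight for each position in its local window.
        self.weight_proj = nn.Linear(dim, window)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                       # x: (batch, seq_len, dim)
        pad = self.window // 2
        padded = F.pad(x, (0, 0, pad, pad))     # pad along the sequence dimension
        # Gather each token's local neighborhood: (batch, seq_len, window, dim)
        neighborhoods = padded.unfold(1, self.window, 1).permute(0, 1, 3, 2)
        # Adaptive pooling weights per token, normalized over the window.
        w = F.softmax(self.weight_proj(x), dim=-1).unsqueeze(-1)
        pooled = (w * neighborhoods).sum(dim=2)  # (batch, seq_len, dim)
        out, _ = self.attn(pooled, pooled, pooled)
        return out

x = torch.randn(2, 10, 64)
print(ContextPoolSketch()(x).shape)             # torch.Size([2, 10, 64])
```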
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms. Our model achieves a mean opinion score (MOS) of 4.53 comparable to a MOS of 4.58 for professionally recorded speech. To validate our design choices, we present ablation studies of key components of our system and evaluate the impact of using mel spectrograms as the conditioning input to WaveNet instead of linguistic, duration, and F0 features. We further show that using this compact acoustic intermediate representation allows for a significant reduction in the size of the WaveNet architecture.
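At a very high level, the two-stage structure described above can be sketched as follows. This is a toy skeleton with made-up module shapes, not the Tacotron 2 implementation (the real system uses an attention-based recurrent decoder that emits a variable number of mel frames, and a WaveNet vocoder rather than a linear upsampler).

```python
import torch
import torch.nn as nn

class TextToMel(nn.Module):
    """Stand-in for the recurrent seq-to-seq feature prediction network."""

    def __init__(self, vocab=40, dim=128, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.to_mel = nn.Linear(dim, n_mels)

    def forward(self, chars):                   # (batch, text_len)
        h, _ = self.encoder(self.embed(chars))
        return self.to_mel(h)                   # (batch, frames, n_mels)

class MelVocoder(nn.Module):
    """Stand-in for the vocoder that maps mel frames to a time-domain waveform."""

    def __init__(self, n_mels=80, samples_per_frame=256):
        super().__init__()
        self.upsample = nn.Linear(n_mels, samples_per_frame)

    def forward(self, mels):                    # (batch, frames, n_mels)
        return self.upsample(mels).flatten(1)   # (batch, frames * samples_per_frame)

chars = torch.randint(0, 40, (1, 20))
waveform = MelVocoder()(TextToMel()(chars))
print(waveform.shape)
```

The point of the mel-spectrogram interface is that the vocoder only needs this compact acoustic representation as conditioning, rather than linguistic, duration, and F0 features.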
We introduce a new neural architecture to learn the conditional probability of an output sequence with elements that are discrete tokens corresponding to positions in an input sequence. Such problems cannot be trivially addressed by existing approaches such as sequence-to-sequence [1] and Neural Turing Machines [2], because the number of target classes at each step of the output depends on the length of the input, which is variable. Problems such as sorting variable-sized sequences and various combinatorial optimization problems belong to this class. Our model solves the problem of variable-size output dictionaries using a recently proposed mechanism of neural attention. It differs from previous attention attempts in that, instead of using attention to blend hidden units of an encoder into a context vector at each decoder step, it uses attention as a pointer to select a member of the input sequence as the output. We call this architecture a Pointer Net (Ptr-Net). We show that Ptr-Nets can be used to learn approximate solutions to three challenging geometric problems: finding planar convex hulls, computing Delaunay triangulations, and the planar Travelling Salesman Problem, using training examples alone. Ptr-Nets not only improve over sequence-to-sequence with input attention, but also allow us to generalize to variable-size output dictionaries. We show that the learnt models generalize beyond the maximum lengths they were trained on. We hope our results on these tasks will encourage a broader exploration of neural learning for discrete problems.
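The core idea, using the attention distribution itself as the output pointer rather than as blending weights for a context vector, can be sketched as a single decoding step. The shapes and parameter names below are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PointerAttention(nn.Module):
    """One pointer step: additive attention scores over encoder states are the output."""

    def __init__(self, dim=64):
        super().__init__()
        self.w_enc = nn.Linear(dim, dim, bias=False)
        self.w_dec = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, 1, bias=False)

    def forward(self, encoder_states, decoder_state):
        # encoder_states: (batch, input_len, dim), decoder_state: (batch, dim)
        scores = self.v(torch.tanh(
            self.w_enc(encoder_states) + self.w_dec(decoder_state).unsqueeze(1)
        )).squeeze(-1)                           # (batch, input_len)
        # Instead of blending encoder states into a context vector, the softmax over
        # scores is itself the distribution over input positions to "point" at.
        return torch.softmax(scores, dim=-1)

ptr = PointerAttention()
probs = ptr(torch.randn(2, 7, 64), torch.randn(2, 64))
print(probs.argmax(dim=-1))                      # index of the selected input element
```

Because the output vocabulary is simply the set of input positions, the same model handles inputs (and hence output dictionaries) of any length.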
Recurrent Neural Networks can be trained to produce sequences of tokens given some input, as exemplified by recent results in machine translation and image captioning. The current approach to training them consists of maximizing the likelihood of each token in the sequence given the current (recurrent) state and the previous token. At inference, the unknown previous token is then replaced by a token generated by the model itself. This discrepancy between training and inference can yield errors that can accumulate quickly along the generated sequence. We propose a curriculum learning strategy to gently change the training process from a fully guided scheme using the true previous token, towards a less guided scheme which mostly uses the generated token instead. Experiments on several sequence prediction tasks show that this approach yields significant improvements. Moreover, it was used successfully in our winning entry to the MSCOCO image captioning challenge, 2015.
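The curriculum can be illustrated with a coin flip per step that chooses between the ground-truth previous token and the model's own previous prediction, with the ground-truth probability annealed over training. The linear decay schedule and module shapes below are illustrative assumptions, not the paper's specific settings.

```python
import random
import torch
import torch.nn as nn

def scheduled_sampling_step(decoder_cell, embed, proj, hidden, targets, epoch,
                            decay=0.05):
    """One decoding pass where each previous token is the gold token with
    probability p (annealed over epochs) and the model's own prediction otherwise."""
    p_ground_truth = max(0.0, 1.0 - decay * epoch)   # illustrative linear decay
    prev_token = targets[:, 0]
    logits_per_step = []
    for t in range(1, targets.size(1)):
        hidden = decoder_cell(embed(prev_token), hidden)
        logits = proj(hidden)
        logits_per_step.append(logits)
        if random.random() < p_ground_truth:
            prev_token = targets[:, t]               # guided: use the true token
        else:
            prev_token = logits.argmax(dim=-1)       # free-running: use the model's token
    return torch.stack(logits_per_step, dim=1)       # (batch, seq_len - 1, vocab)

vocab, dim = 50, 32
cell, emb, proj = nn.GRUCell(dim, dim), nn.Embedding(vocab, dim), nn.Linear(dim, vocab)
targets = torch.randint(0, vocab, (4, 12))
logits = scheduled_sampling_step(cell, emb, proj, torch.zeros(4, dim), targets, epoch=3)
print(logits.shape)
```

Early in training the decoder is almost fully guided; as the probability decays, it increasingly conditions on its own outputs, matching the conditions it will face at inference.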
Entity matching in Customer 360 is the task of determining if multiple records represent the same real world entity. Entities are typically people, organizations, locations, and events represented as attributed nodes in a graph, though they can also be represented as records in relational data. While probabilistic matching engines and artificial neural network models exist for this task, explaining entity matching has received less attention. In this demo, we present our Explainable Entity Matching (xEM) system and discuss the different AI/ML considerations that went into its implementation.
Dose-volume histogram (DVH) metrics are widely accepted evaluation criteria in the clinic. However, incorporating these metrics into deep learning dose prediction models is challenging due to their non-convexity and non-differentiability. We propose a novel moment-based loss function for predicting 3D dose distributions for challenging conventional lung intensity-modulated radiation therapy (IMRT) plans. The moment-based loss function is convex and differentiable and can easily incorporate DVH metrics into any deep learning framework without computational overhead. The moments can also be customized to reflect clinical priorities in 3D dose prediction; for example, using higher-order moments allows better prediction in high-dose regions for serial structures. We used a large dataset of 360 conventional lung patients (240 for training, 50 for validation, and 70 for testing) treated with 2 Gy x 30 fractions, with clinically approved treatment plans from our institution, to train a deep learning (DL) model. We trained a UNet-like CNN architecture using computed tomography (CT) images, planning target volume (PTV), and organ-at-risk (OAR) contours as inputs to infer the corresponding voxel-wise 3D dose distribution. We evaluated three different loss functions: (1) the popular mean absolute error (MAE) loss, (2) the recently developed MAE + DVH loss, and (3) the proposed MAE + moment loss. The quality of the predictions was compared using different DVH metrics as well as the dose score and DVH score recently introduced by the AAPM knowledge-based planning grand challenge. The model with the (MAE + moment) loss function outperformed the model with MAE loss by significantly improving the DVH score (11%, p < 0.01) at a similar computational cost. It also outperformed the model trained with (MAE + DVH) loss by significantly improving both the computational cost (48%) and the DVH score (8%, p < 0.01).
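A minimal sketch of the moment term described above, under the assumed form that the p-th moment of the dose within a structure mask is (mean(d^p))^(1/p) and that the loss compares these moments between prediction and ground truth. The structure masks, moment orders, and lack of per-structure weighting are illustrative; in training this term would be added to the MAE loss.

```python
import torch

def structure_moment(dose, mask, p):
    """p-th moment of the dose inside a binary structure mask: (mean(d^p))^(1/p)."""
    voxels = dose[mask.bool()]
    return voxels.pow(p).mean().pow(1.0 / p)

def moment_loss(pred_dose, true_dose, masks, orders=(1, 2, 10)):
    """Sum of squared differences of per-structure dose moments.
    Higher orders (e.g. 10) emphasize the high-dose region of serial structures."""
    loss = 0.0
    for mask in masks:
        for p in orders:
            loss = loss + (structure_moment(pred_dose, mask, p)
                           - structure_moment(true_dose, mask, p)) ** 2
    return loss

# Toy 3D volumes: predicted and clinical dose plus two illustrative structure masks.
pred = torch.rand(32, 32, 32) * 60.0
true = torch.rand(32, 32, 32) * 60.0
masks = [torch.zeros(32, 32, 32), torch.zeros(32, 32, 32)]
masks[0][8:16, 8:16, 8:16] = 1     # stand-in OAR
masks[1][16:24, 16:24, 16:24] = 1  # stand-in PTV
print(moment_loss(pred, true, masks).item())
```

Unlike thresholded DVH metrics, every term here is smooth in the predicted dose, which is what makes the objective convex and differentiable.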
Spiculation/lobulation on the surface of a lung nodule is a good predictor of lung cancer malignancy and is therefore routinely assessed by radiologists as part of the standardized Lung-RADS clinical scoring criteria. Given the 3D geometry of the nodule and the 2D slice-by-slice assessment performed by radiologists, manual spiculation/lobulation annotation is a tedious task, and consequently no public dataset exists to date to probe the importance of these clinically reported features in SOTA malignancy prediction algorithms. As part of this paper, we release a large-scale Clinically-Interpretable Radiomics Dataset, CIRDataset, containing 956 radiologist QA/QC'ed spiculation/lobulation annotations on segmented lung nodules from two public datasets, LIDC-IDRI (N = 883) and LUNGx (N = 73). We also present an end-to-end deep learning model based on a multi-class Voxel2Mesh extension to segment nodules (while preserving spikes), classify spikes (sharp/spiculation and curved/lobulation), and perform malignancy prediction. Previous methods have performed malignancy prediction on the LIDC and LUNGx datasets, but without attribution to any clinically reported/actionable features (general attribution schemes suffer from known hyperparameter sensitivity issues). With the release of this comprehensively annotated CIRDataset and end-to-end deep learning baseline, we hope that malignancy prediction methods can validate their explanations, benchmark against our baseline, and provide clinically actionable insights. The dataset, code, pretrained models, and docker containers are available at https://github.com/nadeemlab/cir.