Knowledge Distillation (KD) has been extensively used for natural language understanding (NLU) tasks to improve a small model's (the student's) generalization by transferring knowledge from a larger model (the teacher). Although KD methods achieve state-of-the-art performance in numerous settings, they suffer from several problems limiting their performance. It is shown in the literature that the capacity gap between the teacher and the student networks can make KD ineffective. Additionally, existing KD techniques do not mitigate the noise in the teacher's output: modeling the noisy behaviour of the teacher can distract the student from learning more useful features. We propose a new KD method that addresses these problems and facilitates training compared to previous techniques. Inspired by continuation optimization, we design a training procedure that optimizes the highly non-convex KD objective by starting with a smoothed version of this objective and making it more complex as the training proceeds. Our method (Continuation-KD) achieves state-of-the-art performance across various compact architectures on NLU (GLUE benchmark) and computer vision tasks (CIFAR-10 and CIFAR-100).
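The continuation schedule itself is not spelled out in the abstract; the following is a minimal PyTorch sketch of the general idea, where the linear ramp `beta`, the temperature, and the convex mix of the cross-entropy and KD terms are illustrative assumptions rather than the authors' exact formulation.

```python
import torch.nn.functional as F

def continuation_kd_loss(student_logits, teacher_logits, labels,
                         step, total_steps, T=2.0):
    # Start from a smooth objective (plain cross-entropy on hard labels) and
    # gradually shift toward the harder KD objective as training proceeds.
    beta = min(1.0, step / float(total_steps))     # assumed linear 0 -> 1 schedule
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return (1.0 - beta) * ce + beta * kd
```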
Knowledge distillation (KD) has gained a lot of attention in the field of model compression for edge devices thanks to its effectiveness in compressing large powerful networks into smaller lower-capacity models. Online distillation, in which both the teacher and the student are learning collaboratively, has also gained much interest due to its ability to improve the performance of the networks involved. The Kullback-Leibler (KL) divergence is typically used to ensure proper knowledge transfer between the teacher and student. However, most online KD techniques present bottlenecks under a large network capacity gap: when the models are trained cooperatively and simultaneously, the KL divergence becomes unable to properly align the teacher's and student's distributions. Alongside accuracy, critical edge-device applications need well-calibrated compact networks, and confidence calibration provides a sensible way of obtaining trustworthy predictions. We propose BD-KD: Balancing of Divergences for online Knowledge Distillation. We show that adaptively balancing between the reverse and forward divergences shifts the focus of the training strategy to the compact student network without limiting the teacher network's learning process. We demonstrate that, by performing this balancing at the level of the student's distillation loss, we improve both the accuracy and the calibration of the compact student network. We conducted extensive experiments using a variety of network architectures and show improvements on multiple datasets including CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet. We illustrate the effectiveness of our approach through comprehensive comparisons and ablations with current state-of-the-art online and offline KD techniques.
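The abstract does not give the adaptive balancing rule, so the PyTorch sketch below uses a fixed mixing weight `alpha` between the forward and reverse KL terms of the student's loss; the weight, temperature, and loss scaling are assumptions.

```python
import torch.nn.functional as F

def bd_kd_student_loss(student_logits, teacher_logits, labels, alpha=0.5, T=4.0):
    # Student-side loss mixing forward KL(teacher || student) and reverse
    # KL(student || teacher); `alpha` stands in for the adaptive balancing.
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    log_p_t = F.log_softmax(teacher_logits.detach() / T, dim=-1)
    p_s, p_t = log_p_s.exp(), log_p_t.exp()
    fwd = (p_t * (log_p_t - log_p_s)).sum(dim=-1).mean()   # KL(teacher || student)
    rev = (p_s * (log_p_s - log_p_t)).sum(dim=-1).mean()   # KL(student || teacher)
    ce = F.cross_entropy(student_logits, labels)
    return ce + (T * T) * (alpha * fwd + (1.0 - alpha) * rev)
```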
Knowledge distillation in machine learning is the process of transferring knowledge from a large model, called the teacher, to a smaller model, called the student. It is one of the techniques for compressing a large network (the teacher) into a smaller network (the student) that can be deployed on small devices such as mobile phones. When the gap in network size between the teacher and the student increases, the performance of the student network degrades. To address this problem, an intermediate model, called the teacher assistant, is employed between the teacher model and the student model, which in turn bridges the gap between the teacher and the student. In this study, we show that the student model (the smaller model) can be improved further by using multiple teacher assistant models. We combine these multiple teacher assistant models through weighted ensemble learning, using a differential evolution optimization algorithm to generate the weight values.
Despite the fact that deep neural networks are powerful models and achieve appealing results on many tasks, they are too large to be deployed on edge devices like smartphones or embedded sensor nodes. There have been efforts to compress these networks, and a popular method is knowledge distillation, where a large (teacher) pre-trained network is used to train a smaller (student) network. However, in this paper, we show that the student network performance degrades when the gap between student and teacher is large. Given a fixed student network, one cannot employ an arbitrarily large teacher, or in other words, a teacher can effectively transfer its knowledge to students up to a certain size, not smaller. To alleviate this shortcoming, we introduce multi-step knowledge distillation, which employs an intermediate-sized network (teacher assistant) to bridge the gap between the student and the teacher. Moreover, we study the effect of teacher assistant size and extend the framework to multi-step distillation. Theoretical analysis and extensive experiments on CIFAR-10,100 and ImageNet datasets and on CNN and ResNet architectures substantiate the effectiveness of our proposed approach.
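A minimal PyTorch sketch of the multi-step idea: a standard (Hinton-style) distillation stage applied twice, first from the teacher to the intermediate-sized assistant and then from the assistant to the student. The loss weights, temperature, and training-loop details are assumptions.

```python
import torch
import torch.nn.functional as F

def distill(student, teacher, loader, optimizer, epochs=1, T=4.0, lam=0.9):
    # One standard distillation stage: cross-entropy on hard labels plus
    # temperature-scaled KL to the frozen teacher's soft predictions.
    teacher.eval()
    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():
                t_logits = teacher(x)
            s_logits = student(x)
            loss = (1 - lam) * F.cross_entropy(s_logits, y) \
                 + lam * (T * T) * F.kl_div(
                       F.log_softmax(s_logits / T, dim=-1),
                       F.softmax(t_logits / T, dim=-1),
                       reduction="batchmean")
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student

# Multi-step distillation: teacher -> assistant, then assistant -> student.
# distill(assistant, teacher, loader, opt_assistant)
# distill(student, assistant, loader, opt_student)
```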
Most existing distillation methods ignore the flexible role of the temperature in the loss function and fix it as a hyper-parameter that can be decided by an inefficient grid search. In general, the temperature controls the discrepancy between two distributions and can faithfully determine the difficulty level of the distillation task. Keeping a constant temperature, i.e., a fixed level of task difficulty, is usually sub-optimal for a growing student during its progressive learning stages. In this paper, we propose a simple curriculum-based technique, termed Curriculum Temperature for Knowledge Distillation (CTKD), which controls the task difficulty level during the student's learning career through a dynamic and learnable temperature. Specifically, following an easy-to-hard curriculum, we gradually increase the distillation loss w.r.t. the temperature, leading to increased distillation difficulty in an adversarial manner. As an easy-to-use plug-in technique, CTKD can be seamlessly integrated into existing knowledge distillation frameworks and brings general improvements at a negligible additional computation cost. Extensive experiments on CIFAR-100, ImageNet-2012, and MS-COCO demonstrate the effectiveness of our method. Our code is available at https://github.com/zhengli97/CTKD.
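One common way to make a temperature both learnable and adversarial is a gradient-reversal layer; the sketch below illustrates that mechanism in PyTorch, with an assumed cosine ramp for the curriculum rather than the paper's exact schedule.

```python
import math
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; negated, scaled gradient in the backward
    # pass, so the temperature is updated to *increase* the distillation loss.
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def ctkd_loss(student_logits, teacher_logits, temperature, step, total_steps):
    # Easy-to-hard curriculum: ramp the adversarial strength over training.
    lam = 0.5 * (1.0 - math.cos(math.pi * min(1.0, step / total_steps)))
    T = GradReverse.apply(temperature, lam)
    return (T * T) * F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    )

# temperature = torch.nn.Parameter(torch.tensor(4.0))  # optimized jointly with the student
```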
The performance of a distillation-based compressed network is governed by the quality of distillation. Sub-optimal distillation of a large network (the teacher) into a smaller one (the student) is largely attributed to the gap in the learning capacities of the given teacher-student pair. While it is hard to distill all of the teacher's knowledge, the quality of distillation can be controlled to a large extent to achieve better performance. Our experiments show that distillation quality is mainly constrained by the quality of the teacher's responses, which in turn is affected by the presence of similarity information in those responses. A well-trained, high-capacity teacher loses similarity information between classes in the process of learning fine-grained discriminative features. The absence of similarity information causes the distillation process to degrade from one-example-to-many-classes learning to one-example-to-one-class learning, thereby restricting the flow of the teacher's diverse knowledge. Rather than focusing only on the knowledge distillation process, under the implicit assumption that only what has been learned can be distilled, we carefully scrutinize the knowledge acquisition process as well. We argue that, for a given teacher-student pair, distillation quality can be improved by finding a sweet spot between batch size and number of epochs while training the teacher, and we discuss the steps to find this sweet spot for better distillation. We also propose a distillation hypothesis to distinguish whether the distillation process behaves as knowledge distillation or as a regularization effect. We conduct all our experiments on three different datasets.
Language model pre-training, such as BERT, has significantly improved the performance of many natural language processing tasks. However, pre-trained language models are usually computationally expensive, so it is difficult to efficiently execute them on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method that is specially designed for knowledge distillation (KD) of Transformer-based models. By leveraging this new KD method, the rich knowledge encoded in a large "teacher" BERT can be effectively transferred to a small "student" TinyBERT. Then, we introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pre-training and task-specific learning stages. This framework ensures that TinyBERT can capture the general-domain as well as the task-specific knowledge in BERT. TinyBERT_4, with 4 layers, is empirically effective and achieves more than 96.8% of the performance of its teacher BERT_BASE on the GLUE benchmark, while being 7.5x smaller and 9.4x faster at inference. TinyBERT_4 is also significantly better than 4-layer state-of-the-art baselines on BERT distillation, with only ~28% of their parameters and ~31% of their inference time. Moreover, TinyBERT_6, with 6 layers, performs on par with its teacher BERT_BASE.
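As a rough PyTorch sketch of layer-to-layer Transformer distillation, the snippet below matches a student layer to its mapped teacher layer via MSE on projected hidden states and on attention matrices; the embedding- and prediction-layer losses and the exact layer mapping are omitted, and the projection and loss weights are assumptions.

```python
import torch
import torch.nn.functional as F

def transformer_layer_distill_loss(student_hidden, teacher_hidden,
                                   student_attn, teacher_attn, proj):
    # Align one student layer with its mapped teacher layer: MSE between the
    # linearly projected hidden states and MSE between attention matrices.
    hidden_loss = F.mse_loss(proj(student_hidden), teacher_hidden)
    attn_loss = F.mse_loss(student_attn, teacher_attn)
    return hidden_loss + attn_loss

# proj = torch.nn.Linear(student_dim, teacher_dim)  # maps student width to teacher width
```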
Contrastive learning has been shown to be suitable for learning sentence embeddings and can significantly improve semantic textual similarity (STS) tasks. Recently, large contrastive learning models, such as Sentence-T5, tend to learn even more powerful sentence embeddings. Although effective, such large models are difficult to serve online due to computational resource or latency constraints. To address this problem, knowledge distillation (KD) is usually adopted, which can compress a large "teacher" model into a small "student" model but generally incurs some performance loss. Here we propose an enhanced KD framework termed Distill-Contrast (DisCo). The proposed DisCo framework first utilizes KD to transfer the capability of a large sentence embedding model to a small student model on large unlabeled data, and then fine-tunes the student model with contrastive learning on labeled training data. For the KD process in DisCo, we further propose Contrastive Knowledge Distillation (CKD) to enhance the consistency among teacher model training, KD, and student model fine-tuning, which may improve performance in a way similar to prompt learning. Extensive experiments on seven STS benchmarks show that student models trained with the proposed DisCo and CKD suffer little or even no performance loss and consistently outperform the corresponding counterparts of the same parameter size. Surprisingly, our 110M student model can even outperform the latest state-of-the-art (SOTA) model, Sentence-T5 (11B), with only 1% of its parameters.
Knowledge distillation is an effective and stable method for model compression via knowledge transfer. Conventional knowledge distillation (KD) transfers knowledge from a large, well-trained teacher network to a small student network, which is a one-way process. Recently, deep mutual learning (DML) has been proposed to help student networks learn collaboratively and simultaneously. However, to the best of our knowledge, KD and DML have never been jointly explored in a unified framework to solve the knowledge distillation problem. In this paper, we observe that the teacher model in KD provides more trustworthy supervision signals, while the student in DML captures more similar behaviour to the teacher. Based on these observations, we first propose to combine KD with DML in a unified framework. Furthermore, we propose a semi-online knowledge distillation (SOKD) method that effectively improves the performance of both the student and the teacher. In this method, we introduce the peer-teaching training fashion from DML to alleviate the student's imitation difficulty, and also leverage the supervision signals provided by the well-trained teacher in KD. In addition, we show that our framework can be easily extended to feature-based distillation methods. Extensive experiments on the CIFAR-100 and ImageNet datasets demonstrate that the proposed method achieves state-of-the-art performance.
Beyond standard supervised learning with hard labels, auxiliary losses are often used in many supervised learning settings to improve a model's generalization. For example, knowledge distillation adds a second, teacher-mimicking loss to the training of a model, where the teacher may be a pretrained model that outputs a richer distribution than the labels. Similarly, in settings where labeled data is limited, weak labeling information is used in the form of labeling functions; auxiliary losses are introduced here with respect to the labeling functions, which may be noisy, rule-based approximations of the true labels. We address the problem of learning to combine these losses in a principled manner. We introduce AMAL, which uses meta-learning on a validation metric to learn instance-specific weights for an optimal mixing of the losses. Experiments across a number of knowledge distillation and rule-denoising domains show that AMAL yields noticeable gains over competitive baselines in those domains. We analyze our method empirically and share insights into the mechanisms by which it provides performance gains.
Knowledge distillation is the process of transferring "knowledge" from a large model (the teacher) to a more compact one (the student), and is often used in the context of model compression. When both models share the same architecture, this process is called self-distillation. Several anecdotes suggest that a self-distilled student can outperform the teacher on held-out data. In this work, we systematically study this phenomenon across a number of settings. We first show that even with a highly accurate teacher, self-distillation allows the student to surpass the teacher in all cases. Second, we revisit existing theoretical explanations of (self-)distillation and identify contradicting examples, revealing possible drawbacks of these explanations. Finally, we provide an alternative explanation for the dynamics of self-distillation through the lens of loss-landscape geometry. Our extensive experiments show that self-distillation leads to flatter minima, and thereby to better generalization.
In the past few years, transformer-based pre-trained language models have achieved astounding success in both industry and academia. However, their large model size and high run-time latency are serious impediments to applying them in practice, especially on mobile phones and Internet-of-Things (IoT) devices. To compress such models, a large body of literature has recently grown up around the topic of knowledge distillation (KD). Nevertheless, how KD works in transformer-based models is still unclear. We tease apart the components of KD and propose a unified KD framework. Through this framework, we run systematic and extensive experiments spanning more than 23,000 GPU hours, analyzing distillation from the perspectives of knowledge type, matching strategy, width-depth trade-off, initialization, model size, and more, and obtain relatively significant improvements over the previous state of the art (SOTA) in distilling pre-trained language models. Finally, we provide a best-practice guideline for KD on transformer-based models.
Knowledge distillation (KD) is an effective tool for compressing deep classification models for edge devices. However, the performance of KD is affected by the large capacity gap between the teacher and student networks. Recent methods have resorted to a multi-teacher-assistant (TA) setting for KD, which sequentially reduces the size of the teacher model to relatively bridge the size gap between these models. This paper proposes a new technique, termed curriculum expert selection for knowledge distillation, to efficiently enhance the learning of a compact student under the capacity-gap problem. The technique is built on the hypothesis that the student network should be gradually guided using a stratified teaching curriculum, as it can learn easy (hard) data samples better from a lower (higher) capacity teacher network. Specifically, our method is a gradual TA-based KD technique that selects a single teacher per input image, following a curriculum driven by how difficult the image is to classify. In this work, we empirically verify our hypothesis and conduct rigorous experiments on the CIFAR-10, CIFAR-100, CINIC-10, and ImageNet datasets, showing improved accuracy on VGG-like models, ResNets, and WideResNet architectures.
In the context of multi-modal knowledge distillation research, existing methods mainly focus on learning only the teacher's final output. As a result, a deep gap remains between the teacher network and the student network. It is necessary to force the student network to learn the modality-relation information of the teacher network. To effectively exploit the knowledge transferred from teacher to student, a new modality-relation distillation paradigm is adopted that models the relation information among different modalities, i.e., by learning the teacher's modality-level Gram matrix.
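A minimal PyTorch sketch of matching a modality-level Gram matrix: per-modality embeddings are stacked, pairwise inner products are computed for teacher and student, and the two relation matrices are aligned; the normalization and the MSE choice are assumptions.

```python
import torch
import torch.nn.functional as F

def modality_gram(features):
    # `features` is a list of per-modality embeddings, each of shape (batch, dim).
    # Stack to (batch, n_modalities, dim) and compute pairwise inner products.
    f = F.normalize(torch.stack(features, dim=1), dim=-1)
    return f @ f.transpose(1, 2)                   # (batch, n_mod, n_mod)

def modality_relation_loss(student_feats, teacher_feats):
    # Force the student's inter-modality relation matrix to match the teacher's.
    return F.mse_loss(modality_gram(student_feats), modality_gram(teacher_feats))
```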
Knowledge distillation (KD) is an efficient approach to transfer the knowledge from a large "teacher" network to a smaller "student" network. Traditional KD methods require a large number of labeled training samples and a white-box teacher (whose parameters are accessible) to train a good student. However, these resources are not always available in real-world applications. The distillation process often happens at an external party's side where we do not have access to much data, and the teacher does not disclose its parameters due to security and privacy concerns. To overcome these challenges, we propose a black-box few-shot KD method to train the student with few unlabeled training samples and a black-box teacher. Our main idea is to expand the training set by generating a diverse distribution of synthetic images using mixup and a conditional variational autoencoder. These synthetic images, along with their labels obtained from the teacher, are used to train the student. We conduct extensive experiments to show that our method significantly outperforms recent state-of-the-art few/zero-shot KD methods on image classification tasks. The code and models are available at: https://github.com/nphdang/fs-bbt
One of the most efficient methods for model compression is hint distillation, where the student model is injected with information (hints) from several different layers of the teacher model. Although the selection of hint points can drastically alter the compression performance, conventional distillation approaches overlook this fact and use the same hint points as in the early studies. Therefore, we propose a clustering-based hint selection methodology, where the layers of the teacher model are clustered with respect to several metrics and the cluster centers are used as the hint points. Once applied to a chosen teacher network, our method is applicable to any student network. The proposed approach is validated on the CIFAR-100 and ImageNet datasets, using various teacher-student pairs and numerous hint distillation methods. Our results show that the hint points selected by our algorithm result in superior compression performance compared to state-of-the-art knowledge distillation algorithms on the same student models and datasets.
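The sketch below illustrates one plausible Python instantiation of the clustering-based selection: each teacher layer is summarized by a fixed-length descriptor, layers are clustered with k-means, and the layer nearest each cluster center becomes a hint point. The descriptor and the use of k-means are assumptions standing in for the metrics mentioned in the abstract.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_hint_layers(layer_stats, n_hints=4, seed=0):
    # layer_stats: (n_layers, d) array, one fixed-length descriptor per teacher
    # layer (e.g., pooled activation statistics). Cluster the layers and, for
    # each cluster, keep the layer closest to the cluster center as a hint point.
    km = KMeans(n_clusters=n_hints, n_init=10, random_state=seed).fit(layer_stats)
    hints = []
    for c in range(n_hints):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(layer_stats[members] - km.cluster_centers_[c], axis=1)
        hints.append(int(members[dists.argmin()]))
    return sorted(hints)
```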
Knowledge distillation was initially introduced to utilize additional supervision from a single teacher model for student model training. To boost the student performance, some recent variants attempt to exploit multiple teachers as diverse knowledge sources. However, existing studies mainly integrate knowledge from multiple sources by averaging the teacher predictions or combining them with other label-free strategies, which may mislead the student in the presence of low-quality teacher predictions. To address this problem, we propose Confidence-Aware Multi-teacher Knowledge Distillation (CA-MKD), which adaptively assigns sample-wise reliability to each teacher prediction with the help of ground-truth labels, assigning large weights to teacher predictions that are close to the one-hot labels. Besides, CA-MKD incorporates intermediate layers to further improve student performance. Extensive experiments show that our CA-MKD consistently outperforms all state-of-the-art methods across various teacher-student architectures.
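The confidence weighting can be sketched as a per-sample softmax over the negative cross-entropy of each teacher's prediction against the ground-truth label; the PyTorch snippet below is one simple instantiation of that idea and omits the intermediate-layer term.

```python
import torch
import torch.nn.functional as F

def ca_mkd_weights(teacher_logits_list, labels):
    # Per-sample teacher reliability: teachers whose predictions are closer to
    # the one-hot ground truth (lower cross-entropy) receive larger weights.
    ce = torch.stack(
        [F.cross_entropy(t, labels, reduction="none") for t in teacher_logits_list]
    )                                               # (n_teachers, batch)
    return F.softmax(-ce, dim=0)

def ca_mkd_loss(student_logits, teacher_logits_list, labels, T=4.0):
    w = ca_mkd_weights(teacher_logits_list, labels)
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    kd = 0.0
    for i, t in enumerate(teacher_logits_list):
        per_sample = F.kl_div(log_p_s, F.softmax(t / T, dim=-1),
                              reduction="none").sum(dim=-1)   # (batch,)
        kd = kd + (w[i] * per_sample).mean()
    return F.cross_entropy(student_logits, labels) + (T * T) * kd
```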
Figure 1. An illustration of standard knowledge distillation. Despite widespread use, an understanding of when the student can learn from the teacher is missing.
Knowledge distillation (KD) has been widely developed and enhances various tasks. The classical KD method adds the KD loss to the original cross-entropy (CE) loss. We try to decompose the KD loss to explore its relation with the CE loss. Surprisingly, we find that it can be regarded as a combination of the CE loss and an extra loss that has an identical form to the CE loss. However, we notice that the extra loss forces the student to learn the relative probabilities of the teacher's absolute probability. Moreover, the sums of the two probabilities are different, making it hard to optimize. To address this problem, we modify the formulation and propose a distributed loss. In addition, we take the teacher's target output as a soft target and propose a soft loss. Combining the soft loss and the distributed loss, we propose the new KD loss (NKD). Furthermore, we smooth the student's target output to treat it as a soft target for training without a teacher, and propose a teacher-free new KD loss (TF-NKD). Our method achieves state-of-the-art performance on CIFAR-100 and ImageNet. For example, with ResNet-34 as the teacher, we boost the ImageNet top-1 accuracy of ResNet-18 from 69.90% to 71.96%. For training without a teacher, MobileNet, ResNet-18, and SwinTransformer-Tiny achieve 70.04%, 70.76%, and 81.48% top-1 accuracy, which is 0.83%, 0.86%, and 0.30% higher than their baselines, respectively. The code is available at https://github.com/yzd-v/cls_kd.
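A heavily hedged PyTorch sketch of the decomposition described above: a "soft" term driven by the teacher's target-class probability plus a "distributed" term on the normalized non-target distributions. The normalization details, temperature, and weight are assumptions rather than the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def nkd_style_loss(student_logits, teacher_logits, labels, T=1.0, gamma=1.5):
    p_s = F.softmax(student_logits, dim=-1)
    p_t = F.softmax(teacher_logits, dim=-1)
    tgt = F.one_hot(labels, num_classes=p_s.size(-1)).bool()

    # Soft loss: the teacher's target-class probability acts as a soft label
    # for the student's target-class log-probability.
    soft = -(p_t[tgt] * torch.log(p_s[tgt].clamp_min(1e-8))).mean()

    # Distributed loss: cross-entropy between the *normalized* non-target
    # distributions (each renormalized to sum to one), with temperature T.
    p_s_non = F.softmax(student_logits.masked_fill(tgt, float("-inf")) / T, dim=-1)
    p_t_non = F.softmax(teacher_logits.masked_fill(tgt, float("-inf")) / T, dim=-1)
    dist = -(p_t_non * torch.log(p_s_non.clamp_min(1e-8))).sum(dim=-1).mean()

    return soft + gamma * (T * T) * dist
```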
State-of-the-art results in deep learning have been improving steadily, in good part due to the use of larger models. However, widespread use is constrained by device hardware limitations, resulting in a substantial gap between state-of-the-art models and those that can be effectively deployed on small devices. While knowledge distillation (KD) theoretically enables small student models to emulate larger teacher models, in practice selecting a good student architecture requires considerable human expertise. Neural architecture search (NAS) appears as a natural solution to this problem, but most approaches are inefficient, as most of the computation is spent comparing architectures sampled from the same distribution, with negligible performance differences. In this paper, we propose instead to search for a family of student architectures sharing the property of learning well from a given teacher. Our approach, AutoKD, powered by Bayesian optimization, explores a flexible graph-based search space that allows us to automatically learn the optimal student architecture distribution and KD parameters, while being far more efficient than the existing state of the art. We evaluate our method on three datasets; on large images specifically, we reach the teacher's performance while using 3x less memory and 10x fewer parameters. Finally, although AutoKD uses the traditional KD loss, it outperforms more advanced KD variants that use hand-designed students.