Pre-trained language models for programming languages have shown powerful abilities in processing many Software Engineering (SE) tasks, e.g., program synthesis, code completion, and code search. However, it remains unclear what lies behind their success. Recent studies have examined how pre-trained models can effectively learn syntax information based on Abstract Syntax Trees (ASTs). In this paper, we investigate what role the self-attention mechanism plays in understanding code syntax and semantics, based on ASTs and static analysis. We focus on a well-known representative code model, CodeBERT, and study how it learns code syntax and semantics through the self-attention mechanism and Masked Language Modelling (MLM) at the token level. We propose a group of probing tasks to analyze CodeBERT. Based on ASTs and static analysis, we establish the relationships among code tokens. First, our results show that CodeBERT can acquire syntax and semantic knowledge through self-attention and MLM. Second, we demonstrate that the self-attention mechanism pays more attention to dependence-relationship tokens than to other tokens. Different attention heads play different roles in learning code semantics, and we show that some of them are weak at encoding code semantics. Different layers have different competencies in representing different code properties: deep CodeBERT layers can encode semantic information that requires complex inference over the code context. More importantly, we show that our analysis is actionable and leverage our conclusions to improve CodeBERT. We show an alternative approach for pre-training models that makes full use of the current pre-training strategy, i.e., MLM, to learn code syntax and semantics, instead of combining features from different code data formats, e.g., data flow, run-time states, and program outputs.
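A minimal probing sketch of the kind of analysis described above (not the paper's released code): given one layer's attention tensor and a set of token pairs related by AST or dependence edges, compare the attention each head assigns to related versus unrelated pairs. All names and shapes are illustrative assumptions.

```python
import numpy as np

def head_attention_on_pairs(attn, related_pairs):
    """attn: [num_heads, seq_len, seq_len] softmax attention weights for one layer.
    related_pairs: set of (i, j) token-index pairs linked by AST/dependence relations.
    Returns per-head mean attention over related and over all other pairs."""
    num_heads, seq_len, _ = attn.shape
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i, j in related_pairs:
        mask[i, j] = True
    related = attn[:, mask].mean(axis=1)     # mean attention on related pairs, per head
    unrelated = attn[:, ~mask].mean(axis=1)  # mean attention on all other pairs, per head
    return related, unrelated
```

Heads for which the related-pair attention clearly exceeds the unrelated-pair attention are the ones that appear to encode the syntactic or semantic relation under study.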
When modeling programming languages with deep learning techniques, neural networks with tree or graph structures are widely adopted to capture the rich structural information in program Abstract Syntax Trees (ASTs). However, long-range/global dependencies are pervasive in programs, and most of these neural architectures fail to capture them. In this paper, we propose Tree-Transformer, a novel recursive tree-structured neural network designed to overcome the above limitations. Tree-Transformer leverages two multi-head attention units to model the dependencies between sibling node pairs and between parent-child node pairs. Furthermore, we propose a bidirectional propagation strategy that allows node information to flow in two directions along the tree: bottom-up and top-down. By combining bottom-up and top-down propagation, Tree-Transformer can simultaneously learn global context and meaningful node features. Extensive experimental results show that Tree-Transformer outperforms existing tree-based and graph-based neural networks on program-related tasks with both tree-level and node-level prediction, indicating that Tree-Transformer performs well in learning both tree-level and node-level representations.
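A minimal PyTorch sketch of the core idea (not the authors' implementation): one multi-head attention unit over sibling nodes and one over the parent-child pair, combined into a bottom-up update for the parent node. The class name and the combination rule are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TreeAttentionCell(nn.Module):
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.sibling_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.parent_child_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.out = nn.Linear(2 * dim, dim)

    def forward(self, parent, children):
        # parent: [1, 1, dim]; children: [1, num_children, dim]
        sib, _ = self.sibling_attn(children, children, children)    # siblings attend to each other
        pc, _ = self.parent_child_attn(parent, children, children)  # parent attends to its children
        fused = torch.cat([sib.mean(dim=1, keepdim=True), pc], dim=-1)
        return self.out(fused)                                       # updated parent representation
```

Bottom-up propagation would apply this cell from leaves to root; a symmetric top-down pass would push the root context back down to the nodes.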
GitHub commits, which record code changes together with natural language messages describing them, play a crucial role in helping software developers understand software evolution. To promote development in the open-source software community, we collect a commit benchmark comprising 7.9 million commits across 7 programming languages. Based on this benchmark, we present CommitBART, a large pre-trained encoder-decoder Transformer model for GitHub commits. The model is pre-trained with six pre-training tasks in three categories (i.e., denoising objectives, cross-modal generation, and contrastive learning) to learn commit-fragment representations. Furthermore, we unify a "commit intelligence" framework with one understanding task and three generation tasks for commits. Comprehensive experiments on these tasks demonstrate that CommitBART significantly outperforms previous pre-trained models for code. Further analysis also reveals that each pre-training task enhances model performance. We encourage follow-up researchers to contribute more commit-related downstream tasks to our framework in the future.
Recently, adversarial machine learning attacks have posed serious security threats to practical audio signal classification systems, including speech recognition, speaker recognition, and music copyright detection. Previous studies have mainly focused on ensuring the effectiveness of attacks against audio classifiers by generating small, noise-like perturbations on the original signal. It remains unclear whether an attacker can create audio perturbations that, beyond their attack effectiveness, are also perceived favorably by humans. This is particularly important for music signals, which are carefully crafted with pleasant audio characteristics. In this work, we formulate adversarial attacks on music signals as a new perception-aware attack framework that integrates human studies into the adversarial attack design. Specifically, we conduct a human study to quantify human perception of changes to music signals. We invite human participants to rate pairs of original and perturbed music signals, and reverse-engineer the human perception process via regression analysis to predict the human-perceived deviation for a given signal. The perception-aware attack is then formulated as an optimization problem that finds the optimal perturbed signal minimizing the perceptual deviation predicted by the regression-based human perception model. We use the perception-aware framework to design realistic adversarial music attacks against a YouTube copyright detector. Experiments show that the perception-aware attack produces adversarial music with perceptual quality significantly superior to prior work.
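A hedged sketch of the perception-aware objective described above (all callables are placeholders, not the paper's code): find a perturbation that fools the detector while minimizing the perceptual deviation predicted by a regression model fit to the human-study ratings.

```python
import torch

def perception_aware_attack(x, detector, perception_model, steps=500, lr=1e-3, lam=10.0):
    """x: original music signal tensor; detector(x) returns a differentiable copyright-match
    score in [0, 1]; perception_model(x, x_adv) predicts the human-perceived deviation."""
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        x_adv = x + delta
        # penalty formulation: predicted perceptual deviation plus the attack term
        loss = perception_model(x, x_adv) + lam * detector(x_adv)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (x + delta).detach()
```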
Federated learning (FL) provides an efficient decentralized machine learning framework in which the training data remains distributed at remote clients in a network. While FL enables a privacy-preserving mobile edge computing framework using IoT devices, recent studies have shown that this approach is susceptible to poisoning attacks from remote clients. To address poisoning attacks on FL, we provide a two-phase defense algorithm called Local Malicious Factor (LoMar). In Phase I, LoMar scores model updates from each remote client by measuring the relative distribution over their neighbors using a kernel density estimation method. In Phase II, an optimal threshold is approximated to distinguish malicious from clean updates from a statistical perspective. Comprehensive experiments on four real-world datasets have been conducted, and the results show that our defense strategy can effectively protect the FL system. Specifically, the defense performance on the Amazon dataset under a label-flipping attack indicates that, compared with FG+Krum, LoMar increases the target label testing accuracy from 96.0% to 98.8%, and the overall average testing accuracy from 90.1% to 97.0%.
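A rough sketch of the Phase-I scoring idea (not the authors' code): score each client's flattened update by its kernel density estimate relative to its k nearest neighbours; a low relative density flags a likely malicious update. It assumes the update vector is low-dimensional (e.g., only output-layer parameters) and k larger than that dimension, so the neighbourhood covariance is non-singular.

```python
import numpy as np
from scipy.stats import gaussian_kde

def lomar_scores(updates, k=5):
    """updates: [num_clients, dim] flattened model updates (dim assumed < k).
    Returns one density-based score per client; lower means more suspicious."""
    updates = np.asarray(updates)
    dists = np.linalg.norm(updates[:, None, :] - updates[None, :, :], axis=-1)
    scores = []
    for i in range(len(updates)):
        neighbours = np.argsort(dists[i])[1:k + 1]    # k nearest neighbours, excluding self
        kde = gaussian_kde(updates[neighbours].T)     # density estimated from the neighbourhood
        scores.append(float(kde(updates[i])))         # relative density of client i's update
    return np.array(scores)
```

Phase II would then pick a threshold on these scores to separate clean from malicious updates before aggregation.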
Code search aims to retrieve relevant code snippets based on natural language queries, in order to improve software productivity and quality. However, automatic code search is challenging due to the semantic gap between source code and queries. Most existing approaches mainly consider the sequential information of embeddings, while the structural information behind the text is not fully considered. In this paper, we design a novel neural network framework named GraphSearchNet to enable effective and accurate source code search by jointly learning the rich semantics of source code and queries. Specifically, we propose to encode both source code and queries into two graphs, with a bidirectional GGNN (BiGGNN) to capture the local structural information of the graphs. Furthermore, we enhance BiGGNN by utilizing an effective multi-head attention to supplement the global dependencies that BiGGNN misses. Extensive experiments on Java and Python datasets illustrate that GraphSearchNet outperforms current state-of-the-art works.
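A hedged sketch of the global-dependency enhancement described above (not the released implementation): node states produced by the BiGGNN message passing are passed through multi-head self-attention and pooled into a single graph embedding used for similarity search. Class and argument names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GlobalEnhancer(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, node_states):
        # node_states: [batch, num_nodes, dim] output of the BiGGNN message passing
        global_states, _ = self.attn(node_states, node_states, node_states)
        return global_states.mean(dim=1)  # pooled graph embedding for code/query matching
```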
Tracking is one of the most time-consuming aspects of event reconstruction at the Large Hadron Collider (LHC) and its high-luminosity upgrade (HL-LHC). Innovative detector technologies extend tracking to four dimensions by including timing in the pattern recognition and parameter estimation. However, present and future hardware already provides additional information that is largely unused by existing track seeding algorithms. The shape of clusters provides an additional dimension for track seeding that can significantly reduce the combinatorial challenge of track finding. We use neural networks to show that cluster shapes can significantly reduce the rate of fake combinatorial backgrounds while maintaining high efficiency. We demonstrate this using information from cluster singlets, doublets, and triplets. Numerical results are presented using simulation from the TrackML challenge.
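A minimal sketch of the kind of classifier implied above (an assumed architecture, not the paper's exact network): a small feed-forward network that augments hit coordinates with cluster-shape features to reject fake doublet seeds. Feature counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DoubletClassifier(nn.Module):
    def __init__(self, coord_feats=6, shape_feats=8, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(coord_feats + shape_feats, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # score that the hit pair belongs to a real track
        )

    def forward(self, coords, shapes):
        # coords: [batch, 6] (x, y, z of the two hits); shapes: [batch, 8] cluster-shape variables
        return torch.sigmoid(self.net(torch.cat([coords, shapes], dim=-1)))
```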
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes the image and point-cloud tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly, by encoding the 3D points into multi-modal features. The core design of CMT is quite simple while its performance is impressive. CMT obtains 73.0% NDS on the nuScenes benchmark. Moreover, CMT remains strongly robust even when the LiDAR input is missing. Code will be released at https://github.com/junjie18/CMT.
Dataset distillation has emerged as a prominent technique to improve data efficiency when training machine learning models. It encapsulates the knowledge from a large dataset into a smaller synthetic dataset. A model trained on this smaller distilled dataset can attain comparable performance to a model trained on the original training dataset. However, the existing dataset distillation techniques mainly aim at achieving the best trade-off between resource usage efficiency and model utility. The security risks stemming from them have not been explored. This study performs the first backdoor attack against the models trained on the data distilled by dataset distillation models in the image domain. Concretely, we inject triggers into the synthetic data during the distillation procedure rather than during the model training stage, where all previous attacks are performed. We propose two types of backdoor attacks, namely NAIVEATTACK and DOORPING. NAIVEATTACK simply adds triggers to the raw data at the initial distillation phase, while DOORPING iteratively updates the triggers during the entire distillation procedure. We conduct extensive evaluations on multiple datasets, architectures, and dataset distillation techniques. Empirical evaluation shows that NAIVEATTACK achieves decent attack success rate (ASR) scores in some cases, while DOORPING reaches higher ASR scores (close to 1.0) in all cases. Furthermore, we conduct a comprehensive ablation study to analyze the factors that may affect the attack performance. Finally, we evaluate multiple defense mechanisms against our backdoor attacks and show that our attacks can practically circumvent these defense mechanisms.
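A simplified sketch of the NAIVEATTACK idea described above (trigger size, value, and placement are illustrative assumptions): a fixed patch is stamped onto a fraction of the original images before distillation begins, so the backdoor is carried into the synthetic dataset rather than being injected during model training.

```python
import numpy as np

def naive_attack(images, labels, target_label, poison_ratio=0.1, patch=4):
    """images: [n, h, w, c] float array in [0, 1]; labels: [n] int array.
    Returns a poisoned copy of the dataset to be fed into the distillation procedure."""
    images, labels = images.copy(), labels.copy()
    n = len(images)
    idx = np.random.choice(n, size=int(poison_ratio * n), replace=False)
    images[idx, -patch:, -patch:, :] = 1.0   # white square trigger in the bottom-right corner
    labels[idx] = target_label               # poisoned samples are relabelled to the target class
    return images, labels
```

DOORPING, by contrast, would re-optimize the trigger at every distillation iteration rather than fixing it up front.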
Few Shot Instance Segmentation (FSIS) requires models to detect and segment novel classes with only a few support examples. In this work, we explore a simple yet unified solution for FSIS as well as its incremental variants, and introduce a new framework named Reference Twice (RefT) to fully explore the relationship between support and query features based on a Transformer-like framework. Our key insights are twofold: First, with the aid of support masks, we can generate dynamic class centers more appropriately to re-weight query features. Second, we find that support object queries have already encoded key factors after base training. In this way, the query features can be enhanced twice from two aspects, i.e., feature-level and instance-level. In particular, we first design a mask-based dynamic weighting module to enhance support features and then propose to link object queries for better calibration via cross-attention. After the above steps, the novel classes can be improved significantly over our strong baseline. Additionally, our new framework can be easily extended to incremental FSIS with minor modification. When benchmarking results on the COCO dataset for the FSIS, gFSIS, and iFSIS settings, our method achieves competitive performance compared to existing approaches across different shots, e.g., we boost nAP by a noticeable +8.2/+9.4 over the current state-of-the-art FSIS method for 10/30-shot. We further demonstrate the superiority of our approach on Few Shot Object Detection. Code and model will be available.
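A hedged sketch of the first insight above (masked pooling of support features into a dynamic class center that re-weights query features); the shapes and the re-weighting rule are illustrative assumptions, not the released RefT code.

```python
import torch

def reweight_query(support_feats, support_masks, query_feats):
    """support_feats: [k, c, h, w]; support_masks: [k, 1, h, w] binary masks;
    query_feats: [c, H, W]. Returns query features re-weighted by the class center."""
    masked = support_feats * support_masks
    # masked average pooling over the support set -> one dynamic class center [c]
    center = masked.sum(dim=(0, 2, 3)) / support_masks.sum().clamp(min=1.0)
    weights = torch.sigmoid(center)               # channel-wise weights from the class center
    return query_feats * weights.view(-1, 1, 1)   # feature-level enhancement of the query
```

The instance-level enhancement would then link support object queries to query-image object queries via cross-attention, as described above.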