For professional users in many technical fields, such as medicine, remote sensing, precision engineering, and scientific research, lossless and near-lossless image compression is of crucial importance. However, despite the rapidly growing research interest in learning-based image compression, no published method offers both lossless and near-lossless modes. In this paper, we propose a unified and powerful deep lossy plus residual (DLPR) coding framework for both lossless and near-lossless image compression. In the lossless mode, the DLPR coding system first performs lossy compression and then lossless coding of the residuals. We solve the joint lossy and residual compression problem in the approach of VAEs, and add an autoregressive context model of the residuals to enhance lossless compression performance. In the near-lossless mode, we quantize the original residuals to satisfy a given $\ell_\infty$ error bound, and propose a scalable near-lossless compression scheme that works for variable $\ell_\infty$ bounds instead of training multiple networks. To speed up the DLPR coding, we increase the degree of algorithm parallelization with a novel design of the coding context, and accelerate the entropy coding with adaptive residual intervals. Experimental results demonstrate that the DLPR coding system achieves both state-of-the-art lossless and near-lossless image compression performance with competitive coding speed.
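The $\ell_\infty$-bounded residual quantization of the near-lossless mode can be sketched with the classical uniform near-lossless quantizer (a minimal NumPy illustration, not necessarily the paper's exact implementation):

```python
import numpy as np

def quantize_residual(r, tau):
    """Uniform near-lossless quantizer: maps each residual to a multiple
    of (2*tau + 1) so that |r - r_hat| <= tau everywhere."""
    step = 2 * tau + 1
    q = np.sign(r) * ((np.abs(r) + tau) // step)
    return q * step

def near_lossless_reconstruction(x, x_lossy, tau):
    """Quantize the residual between the original image x and its lossy
    reconstruction, then add it back (pixel-range clipping omitted)."""
    r = x.astype(np.int64) - x_lossy.astype(np.int64)
    return x_lossy.astype(np.int64) + quantize_residual(r, tau)
```

With `tau = 0` the quantizer is the identity and coding becomes lossless, which matches the framework's unified treatment of both modes.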
translated by Google Translate
The wide application of deep neural networks (DNNs) demands increasing attention to their real-world robustness, i.e., whether a DNN resists black-box adversarial attacks, among which score-based query attacks (SQAs) are the most threatening due to their practicality and effectiveness: the attacker only needs dozens of queries on the model outputs to severely hurt a victim network. Defending against SQAs requires a slight but artful variation of the outputs, because of the service purpose for users, who share the same output information with attackers. In this paper, we propose a real-world defense called Unifying Gradients (UniG) to unify the gradients of different data, so that attackers can only probe a much weaker attack direction that is similar across different samples. Since such a universal attack perturbation has been validated as less aggressive than input-specific perturbations, UniG protects real-world DNNs by indicating to attackers a twisted and less informative attack direction. To enhance UniG's practical significance in real-world applications, we implement it as a Hadamard product module that is computationally efficient and readily plugged into any model. According to extensive experiments on 5 SQAs and 4 defense baselines, UniG significantly improves real-world robustness without hurting clean accuracy on CIFAR-10 and ImageNet. For example, UniG maintains a CIFAR-10 model at 77.80% accuracy under the 2500-query Square attack, while the state-of-the-art adversarially trained model only has 67.34% on CIFAR-10. Simultaneously, UniG greatly surpasses all compared baselines in clean accuracy and in the degree of output modification. The code will be released.
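The Hadamard product module amounts to elementwise rescaling of an intermediate feature map by one weight tensor shared across all inputs; a minimal NumPy sketch (the shapes, the ones-initialization, and the training objective for `w` are illustrative assumptions):

```python
import numpy as np

class HadamardModule:
    """Elementwise (Hadamard) rescaling of a feature map by one weight
    tensor shared across all inputs, so different samples expose similar,
    unified gradient directions to a query-based attacker.
    Initialized at ones, i.e., an identity mapping."""

    def __init__(self, feature_shape):
        self.w = np.ones(feature_shape)

    def forward(self, feats):
        # feats: (batch, *feature_shape); w broadcasts over the batch.
        return feats * self.w

    def grad_w(self, feats, upstream):
        # Gradient of the Hadamard product w.r.t. w, summed over the batch.
        return (feats * upstream).sum(axis=0)
```

Because the module starts as an identity and only rescales features, it adds negligible compute and can be inserted after any intermediate layer, which matches the plug-in property the abstract describes.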
The score-based query attacks (SQAs) pose practical threats to deep neural networks by crafting adversarial perturbations within dozens of queries, only using the model's output scores. Nonetheless, we note that if the loss trend of the outputs is slightly perturbed, SQAs could be easily misled and thereby become much less effective. Following this idea, we propose a novel defense, namely Adversarial Attack on Attackers (AAA), to confound SQAs towards incorrect attack directions by slightly modifying the output logits. In this way, (1) SQAs are prevented regardless of the model's worst-case robustness; (2) the original model predictions are hardly changed, i.e., no degradation on clean accuracy; (3) the calibration of confidence scores can be improved simultaneously. Extensive experiments are provided to verify the above advantages. For example, by setting $\ell_\infty=8/255$ on CIFAR-10, our proposed AAA helps WideResNet-28 secure 80.59% accuracy under Square attack (2500 queries), while the best prior defense (i.e., adversarial training) only attains 67.44%. Since AAA attacks SQA's general greedy strategy, such advantages of AAA over 8 defenses can be consistently observed on 8 CIFAR-10/ImageNet models under 6 SQAs, using different attack targets, bounds, norms, losses, and strategies. Moreover, AAA calibrates better without hurting the accuracy. Our code is available at https://github.com/Sizhe-Chen/AAA.
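AAA's core trick, locally reversing the loss trend an attacker observes while leaving the prediction unchanged, can be caricatured with a margin-based toy (the periodic reversal below is a simplification for illustration, not the paper's actual logit optimization):

```python
import numpy as np

def aaa_margin_reverse(logits, period=1.0):
    """Return modified logits whose top-1 margin decreases as the true
    margin increases within each period, so a greedy score-based attacker
    is steered the wrong way, while the argmax (prediction) is unchanged."""
    z = np.asarray(logits, dtype=float).copy()
    top = np.argmax(z)
    second = np.max(np.delete(z, top))
    m = z[top] - second                   # true margin, >= 0
    base = np.floor(m / period) * period
    m_new = base + (period - (m - base))  # reverse the trend within a period
    z[top] = second + m_new               # still > second, same prediction
    return z
```

An attacker nudging an input to shrink the true margin would see the observed margin grow, so the greedy query attack walks away from the decision boundary.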
The accelerating development of autonomous driving technology places greater demands on obtaining large amounts of high-quality data. Labeled, representative real-world data are the fuel for training deep learning networks and are critical for improving self-driving perception algorithms. In this paper, we introduce PandaSet, the first dataset produced by a complete, high-precision autonomous vehicle sensor kit with a no-cost commercial license. The dataset was collected using one 360° mechanical spinning LiDAR, one forward-facing long-range LiDAR, and 6 cameras. The dataset contains more than 100 scenes, each 8 seconds long, and provides 28 types of labels for object classification and 37 types of labels for semantic segmentation. We provide baselines for LiDAR-only 3D object detection, LiDAR-camera fusion 3D object detection, and LiDAR point cloud segmentation. For more details about PandaSet and the development kit, see https://scale.com/open-datasets/pandaset.
Single-step adversarial training (AT) has received wide attention as it proves to be both efficient and robust. However, it suffers from a serious problem of catastrophic overfitting, i.e., the robust accuracy against projected gradient descent (PGD) attack suddenly drops to 0% during training. In this paper, we approach this problem from a novel perspective of optimization and first reveal the close link between the fast-growing gradient of each sample and overfitting, which can also be applied to understand the robust overfitting phenomenon in multi-step AT. To control the growth of the gradient during training, we propose a new method, Subspace Adversarial Training (Sub-AT), which constrains AT in a carefully extracted subspace. It successfully resolves both kinds of overfitting and hence significantly boosts robustness. In the subspace, we also allow single-step AT with larger steps and a larger radius, which further improves the robustness performance. As a result, we achieve state-of-the-art single-step AT performance: our pure single-step AT can reach over $\mathbf{51}\%$ robust accuracy against the strong PGD-50 attack with radius $8/255$ on CIFAR-10, even surpassing standard multi-step PGD-10 AT, with huge computational advantages. The code is released at https://github.com/nblt/Sub-AT.
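Constraining training to an extracted subspace can be sketched in two steps: a PCA over sampled parameter checkpoints yields a basis, and every gradient is projected onto that basis before the update (a simplified illustration of the idea; the paper's extraction procedure differs in detail):

```python
import numpy as np

def extract_subspace(checkpoints, dim):
    """PCA over flattened parameter checkpoints: returns an orthonormal
    basis (dim x n_params) spanning the dominant trajectory directions."""
    W = np.stack([c.ravel() for c in checkpoints])   # (n_ckpt, n_params)
    W = W - W.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(W, full_matrices=False)
    return vt[:dim]

def project_gradient(grad, basis):
    """Project a gradient onto the subspace so the update never leaves it."""
    coords = basis @ grad.ravel()
    return (basis.T @ coords).reshape(grad.shape)
```

Gradient components orthogonal to the trajectory subspace, which are the ones associated with the fast-growing per-sample gradients, are discarded by the projection.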
For deep learning methods of real-world image super-resolution, the most critical issue is whether the paired low- and high-resolution images used for training accurately reflect the sampling process of real cameras. Low- and high-resolution (LR$\sim$HR) image pairs synthesized by existing degradation models (e.g., bicubic downsampling) deviate from those in reality; thus a super-resolution CNN trained on such synthesized LR$\sim$HR image pairs does not perform well when applied to real images. To address this problem, we propose a new data acquisition process to shoot a large set of LR$\sim$HR image pairs using real cameras. The images are displayed on an ultra-high-quality screen and captured at different resolutions. The resulting LR$\sim$HR image pairs can be aligned with very high sub-pixel precision by a novel spatial-frequency dual-domain registration method, and hence they provide high-quality training data for the learning task of super-resolution. Moreover, the captured HR image and the original digital image provide dual references to strengthen learning. Experimental results show that a super-resolution CNN trained on our LR$\sim$HR dataset delivers higher image quality than those trained on other datasets in the literature.
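Frequency-domain registration of captured image pairs typically builds on phase correlation; the sketch below recovers integer circular shifts (the paper's dual-domain method reaches sub-pixel precision, so this is only the frequency-domain core under simplifying assumptions):

```python
import numpy as np

def phase_correlation_shift(a, b):
    """Estimate the shift (dy, dx) such that b == np.roll(a, (dy, dx),
    axis=(0, 1)), via the normalized cross-power spectrum."""
    Fa, Fb = np.fft.fft2(a), np.fft.fft2(b)
    cross = np.conj(Fa) * Fb
    cross /= np.abs(cross) + 1e-12        # keep only phase information
    corr = np.fft.ifft2(cross).real       # a delta peak at the shift
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    h, w = a.shape
    # Map the peak location to signed shifts.
    if dy > h // 2:
        dy -= h
    if dx > w // 2:
        dx -= w
    return int(dy), int(dx)
```

Sub-pixel refinements typically fit a model around the correlation peak or interpolate the phase, which is where a dual-domain method would go beyond this sketch.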
Audio-visual approaches involving visual inputs have laid the foundation for recent progress in speech separation. However, the optimization of the concurrent usage of auditory and visual inputs is still an active research area. Inspired by the cortico-thalamo-cortical circuit, in which the sensory processing mechanisms of different modalities modulate one another via the non-lemniscal sensory thalamus, we propose a novel cortico-thalamo-cortical neural network (CTCNet) for audio-visual speech separation (AVSS). First, the CTCNet learns hierarchical auditory and visual representations in a bottom-up manner in separate auditory and visual subnetworks, mimicking the functions of the auditory and visual cortical areas. Then, inspired by the large number of connections between cortical regions and the thalamus, the model fuses the auditory and visual information in a thalamic subnetwork through top-down connections. Finally, the model transmits this fused information back to the auditory and visual subnetworks, and the above process is repeated several times. The results of experiments on three speech separation benchmark datasets show that CTCNet remarkably outperforms existing AVSS methods with considerably fewer parameters. These results suggest that mimicking the anatomical connectome of the mammalian brain has great potential for advancing the development of deep neural networks. Project repo is https://github.com/JusperLee/CTCNet.
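The repeated fuse-and-feedback loop can be caricatured in a few lines (all shapes, operators, and the cycle count below are toy assumptions, not CTCNet's actual subnetworks):

```python
import numpy as np

def ctc_fusion(audio, video, Wf, Wa, Wv, cycles=3):
    """Toy cortico-thalamo-cortical loop: fuse both modalities in a
    'thalamic' bottleneck, feed the fused code back to each unimodal
    branch, and repeat for a few cycles."""
    a, v = audio, video
    for _ in range(cycles):
        fused = np.tanh(Wf @ np.concatenate([a, v]))  # thalamic fusion
        a = np.tanh(a + Wa @ fused)                   # top-down feedback
        v = np.tanh(v + Wv @ fused)
    return a, v
```

The key structural point the abstract makes is the repetition: fusion is not a one-shot concatenation but an iterated exchange between the unimodal branches and the shared bottleneck.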
This is a brief technical report of our proposed method for the Multiple-Object Tracking (MOT) Challenge in Complex Environments. In this paper, we treat the MOT task as a two-stage task comprising human detection and trajectory matching. Specifically, we design an improved human detector and associate most of the detections to guarantee the integrity of the motion trajectories. We also propose a location-wise matching matrix to obtain more accurate trajectory matching. Without any model merging, our method achieves 66.672 HOTA and 93.971 MOTA on the DanceTrack challenge dataset.
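Trajectory matching reduces to assigning detections to tracks under an affinity matrix; a minimal greedy IoU matcher is sketched below (the report does not spell out its location-wise matching matrix, so this is an illustrative stand-in):

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def greedy_match(tracks, dets, iou_thresh=0.3):
    """Pair tracks with detections by repeatedly taking the highest
    remaining IoU above the threshold."""
    scores = np.array([[iou(t, d) for d in dets] for t in tracks])
    matches = []
    while scores.size and scores.max() > iou_thresh:
        ti, di = np.unravel_index(np.argmax(scores), scores.shape)
        matches.append((int(ti), int(di)))
        scores[ti, :] = -1.0
        scores[:, di] = -1.0
    return matches
```

A location-wise scheme would reweight these affinities by where each box sits in the frame; Hungarian assignment is the usual globally optimal alternative to the greedy loop.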
Adversarial attacks can easily fool object recognition systems based on deep neural networks (DNNs). Although many defense methods have been proposed in recent years, most of them can still be adaptively evaded. One reason for the weak adversarial robustness may be that DNNs are only supervised by category labels and do not have part-based inductive bias like the recognition process of humans. Inspired by a well-known theory in cognitive psychology -- recognition-by-components, we propose a novel object recognition model ROCK (Recognizing Object by Components with human prior Knowledge). It first segments parts of objects from images, then scores part segmentation results with predefined human prior knowledge, and finally outputs prediction based on the scores. The first stage of ROCK corresponds to the process of decomposing objects into parts in human vision. The second stage corresponds to the decision process of the human brain. ROCK shows better robustness than classical recognition models across various attack settings. These results encourage researchers to rethink the rationality of currently widely-used DNN-based object recognition models and explore the potential of part-based models, once important but recently ignored, for improving robustness.
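ROCK's scoring stage can be illustrated with a toy: each class is scored by how well the detected parts match a prior inventory of its components (the part lists and the averaging rule here are invented for illustration, not the paper's knowledge base):

```python
# Hypothetical prior knowledge: the parts each class is composed of.
CLASS_PARTS = {
    "bird": ["head", "wing", "tail"],
    "dog":  ["head", "leg", "tail"],
}

def score_classes(part_confidences):
    """Score each class by the mean confidence of the parts that the
    prior says the class should contain (missing parts count as 0)."""
    return {
        cls: sum(part_confidences.get(p, 0.0) for p in parts) / len(parts)
        for cls, parts in CLASS_PARTS.items()
    }
```

An adversarial perturbation then has to corrupt several part segmentations consistently, rather than a single category logit, which is one intuition for the improved robustness.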
It is well believed that the higher the uncertainty of a word in a caption, the more inter-correlated context information is required to determine it. However, current image captioning methods usually generate all words of a sentence sequentially and treat them equally. In this paper, we propose an uncertainty-aware image captioning framework, which iteratively inserts discontinuous candidate words in parallel between existing words, from easy to difficult, until convergence. We hypothesize that high-uncertainty words in a sentence need more prior information to make a correct decision and should be produced at a later stage. The resulting non-autoregressive hierarchy makes the caption generation explainable and intuitive. Specifically, we utilize an image-conditioned bag-of-words model to measure the word uncertainty and apply a dynamic programming algorithm to construct the training pairs. During inference, we devise an uncertainty-adaptive parallel beam search technique that yields an empirically logarithmic time complexity. Extensive experiments on the MS COCO benchmark reveal that our approach outperforms the strong baseline and related methods in both captioning quality and decoding speed.
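The easy-to-difficult parallel insertion schedule can be illustrated with a toy scheduler that commits low-uncertainty words in early rounds (the uncertainties and target positions are given directly here; in the paper they come from the image-conditioned bag-of-words model and dynamic programming):

```python
def insertion_schedule(words, num_rounds=3):
    """words: list of (token, position, uncertainty) with uncertainty in
    [0, 1]. In each round, every remaining word whose uncertainty falls
    under the round's threshold is inserted in parallel. Returns the
    per-round batches and the final caption (tokens ordered by their
    target positions)."""
    rounds = []
    remaining = sorted(words, key=lambda w: w[2])
    for r in range(num_rounds):
        thresh = (r + 1) / num_rounds
        batch = [w for w in remaining if w[2] <= thresh]
        remaining = [w for w in remaining if w[2] > thresh]
        rounds.append([w[0] for w in batch])
    caption = " ".join(w[0] for w in sorted(words, key=lambda w: w[1]))
    return rounds, caption
```

Because each round fills many positions at once, the number of decoding steps grows with the number of rounds rather than the caption length, which is the intuition behind the empirically logarithmic decoding time.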