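Improving the accuracy of deep neural networks (DNNs) on out-of-distribution (OOD) data is critical to the acceptance of deep learning (DL) in real-world applications. It has been observed that in-distribution (ID) and OOD accuracies follow a linear trend, and models that outperform this baseline are exceptionally rare (and are referred to as "effectively robust"). Recently, some promising approaches have been developed to improve OOD robustness: model pruning, data augmentation, and ensembling or zero-shot evaluation of large pretrained models. However, there is still no clear understanding of the conditions on OOD data and model properties that are required to observe effective robustness. We approach this issue by conducting a comprehensive empirical study of diverse approaches that are known to impact OOD robustness, on a broad range of natural and synthetic distribution shifts of CIFAR-10 and ImageNet. In particular, we view the "effective robustness puzzle" through a Fourier lens and ask how the spectral properties of both models and OOD data influence the corresponding effective robustness. We find that this Fourier lens offers some insight into why certain robust models, particularly those from the CLIP family, achieve OOD robustness. However, our analysis also makes clear that no known metric is consistently the best explanation (or even a strong explanation) of OOD robustness. Thus, to aid future research into the OOD puzzle, we address the gap in publicly available models with effective robustness by introducing a set of pretrained models with varying levels of OOD robustness.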
translated by 谷歌翻译
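The "effectively robust" notion in the abstract above can be illustrated with a small sketch: fit a line to the (ID accuracy, OOD accuracy) pairs of baseline models, then measure how far a new model sits above that line. The accuracy values below are made up for illustration, and the plain least-squares fit on raw accuracies is an assumption of this sketch (the paper may fit the trend in a transformed space).

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def effective_robustness(id_acc, ood_acc, slope, intercept):
    """OOD accuracy minus the OOD accuracy predicted by the baseline trend."""
    return ood_acc - (slope * id_acc + intercept)


# Hypothetical baseline models (illustrative numbers, not results from the paper).
baseline_id = [0.80, 0.85, 0.90, 0.95]
baseline_ood = [0.50, 0.55, 0.60, 0.65]
slope, intercept = fit_line(baseline_id, baseline_ood)

# A hypothetical model with 90% ID accuracy and 70% OOD accuracy.
gain = effective_robustness(0.90, 0.70, slope, intercept)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, effective robustness={gain:.2f}")
```

A positive value means the model lies above the baseline trend; a value near zero means its OOD accuracy is what its ID accuracy would already predict.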
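The CO2 capture efficiency of solvent-based carbon capture systems (CCSs) depends critically on the gas-solvent interfacial area (IA), making maximization of the IA a foundational challenge in CCS design. While the IA associated with a particular CCS design can be estimated via a computational fluid dynamics (CFD) simulation, using CFD to derive the IAs associated with many CCS designs is prohibitively costly. Fortunately, previous work such as Deep Fluids (DF) (Kim et al., 2019) shows that large simulation speedups are achievable by replacing CFD simulators with neural network (NN) surrogates that faithfully mimic the CFD simulation process. This raises the possibility of a fast, accurate replacement for a CFD simulator, and therefore of efficient approximation of the IAs required for CCS design optimization. Thus, here, we build on the DF approach to develop surrogates that can be successfully applied to our complex carbon-capture CFD simulations. Our optimized DF-style surrogates produce large speedups (4000x) while obtaining IA relative errors as low as 4% on unseen CCS configurations that lie within the range of the training configurations. This hints at the promise of NN surrogates for our CCS design optimization problem. Nonetheless, DF has inherent limitations with respect to CCS design (e.g., limited transferability of trained models to new CCS packings). We conclude with ideas to address these challenges.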
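Successful adoption of deep learning (DL) in the wild requires models to be: (1) compact, (2) accurate, and (3) robust to distribution shifts. Unfortunately, efforts to meet these requirements simultaneously have mostly been unsuccessful. This raises an important question: is the inability to create Compact, Accurate, and Robust Deep neural networks (CARDs) fundamental? To answer this question, we perform a large-scale analysis of popular model compression techniques, which uncovers several intriguing patterns. Notably, in contrast to traditional pruning approaches (e.g., fine-tuning and gradual pruning), we find that "lottery ticket-style" approaches can, surprisingly, be used to produce CARDs, including binary CARDs. Specifically, we are able to create extremely compact CARDs that, compared to their larger counterparts, have similar test accuracy and matching (or better) robustness, simply by pruning and (optionally) quantizing. Leveraging the compactness of CARDs, we develop a simple domain-adaptive test-time ensembling approach (CARD-Decks) that uses a gating module to dynamically select appropriate CARDs based on their spectral similarity with test samples. The proposed approach builds a "winning hand" of CARDs that reaches strong CIFAR-10-C accuracies (96.8% standard and 92.75% robust) and CIFAR-100-C accuracies (80.6% standard and 71.3% robust) with better memory usage than non-compressed baselines (pretrained CARDs and CARD-Decks are available at https://github.com/robustbench/robustbench). Finally, we provide theoretical support for our empirical findings.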
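The gating idea in the abstract above, picking a model according to how spectrally similar a test sample is to some reference, can be sketched roughly as follows. Everything here is an assumption made for illustration: the 1-D signals, the naive DFT, the cosine similarity measure, and the names `card_clean` and `card_noise` are hypothetical and are not taken from the paper or its code release.

```python
import cmath
import math


def amplitude_spectrum(signal):
    """Magnitudes of the discrete Fourier transform (naive O(n^2) DFT)."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)))
        for k in range(n)
    ]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_model(sample, reference_spectra):
    """Return the name whose reference spectrum best matches the sample's spectrum."""
    spectrum = amplitude_spectrum(sample)
    return max(
        reference_spectra,
        key=lambda name: cosine_similarity(spectrum, reference_spectra[name]),
    )


# Hypothetical reference signals standing in for two training domains.
n = 16
low_freq = [math.sin(2 * math.pi * 1 * t / n) for t in range(n)]
high_freq = [math.sin(2 * math.pi * 7 * t / n) for t in range(n)]
reference_spectra = {
    "card_clean": amplitude_spectrum(low_freq),
    "card_noise": amplitude_spectrum(high_freq),
}

# A phase-shifted, rescaled low-frequency test sample should match "card_clean".
test_sample = [0.9 * math.sin(2 * math.pi * t / n + 0.3) for t in range(n)]
chosen = select_model(test_sample, reference_spectra)
print(chosen)
```

Comparing amplitude spectra rather than raw samples makes the choice insensitive to phase shifts, which is one plausible reason to gate in the frequency domain.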
Remote sensing imagery provides comprehensive views of the Earth, where different sensors collect complementary data at different spatial scales. Large, pretrained models are commonly finetuned with imagery that is heavily augmented to mimic different conditions and scales, and the resulting models are used for various tasks with imagery from a range of spatial scales. Such models overlook scale-specific information in the data. In this paper, we present Scale-MAE, a pretraining method that explicitly learns relationships between data at different, known scales throughout the pretraining process. Scale-MAE pretrains a network by masking an input image at a known input scale, where the area of the Earth covered by the image, not the image resolution, determines the scale of the ViT positional encoding. Scale-MAE encodes the masked image with a standard ViT backbone, and then decodes the masked image through a bandpass filter to reconstruct low/high frequency images at lower/higher scales. We find that tasking the network with reconstructing both low and high frequency images leads to robust multiscale representations for remote sensing imagery. Scale-MAE achieves an average $5.0\%$ improvement in non-parametric kNN classification across eight remote sensing datasets compared to the current state of the art, and obtains a $0.9$ mIoU to $3.8$ mIoU improvement on the SpaceNet building segmentation transfer task for a range of evaluation scales.
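To make the low/high frequency reconstruction targets mentioned above more concrete, here is a minimal sketch of splitting an image into a low-frequency part and a high-frequency residual. The box blur used as the low-pass filter is an assumption chosen for brevity; it is not the decoder or filter described in the Scale-MAE paper, and the 4x4 checkerboard is a toy input.

```python
def box_blur(image, radius=1):
    """Low-pass filter: average each pixel with its neighbours (edges are clipped)."""
    height, width = len(image), len(image[0])
    blurred = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            window = [
                image[rr][cc]
                for rr in range(max(0, r - radius), min(height, r + radius + 1))
                for cc in range(max(0, c - radius), min(width, c + radius + 1))
            ]
            blurred[r][c] = sum(window) / len(window)
    return blurred


def split_frequencies(image, radius=1):
    """Return (low, high) where low is the blurred image and high = image - low."""
    low = box_blur(image, radius)
    high = [
        [pixel - low_pixel for pixel, low_pixel in zip(row, low_row)]
        for row, low_row in zip(image, low)
    ]
    return low, high


# Toy 4x4 checkerboard: almost all of its energy is high-frequency detail.
image = [[float((r + c) % 2) for c in range(4)] for r in range(4)]
low, high = split_frequencies(image)

# A constant image has no high-frequency content at all.
flat = [[0.5] * 4 for _ in range(4)]
flat_low, flat_high = split_frequencies(flat)
```

By construction the two parts sum back to the original image, so a model asked to reconstruct both has to account for coarse structure and fine detail separately.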
Traditional screening practices for anxiety and depression pose an impediment to monitoring and treating these conditions effectively. However, recent advances in NLP and speech modelling allow textual, acoustic, and hand-crafted language-based features to jointly form the basis of future mental health screening and condition detection. Speech is a rich and readily available source of insight into an individual's cognitive state, and by leveraging different aspects of speech, we can develop new digital biomarkers for depression and anxiety. To this end, we propose a multi-modal system for the screening of depression and anxiety from self-administered speech tasks. The proposed model integrates deep-learned features from audio and text, as well as hand-crafted features that are informed by clinically validated domain knowledge. We find that augmenting hand-crafted features with deep-learned features improves our overall classification F1 score, compared to a baseline of hand-crafted features alone, from 0.58 to 0.63 for depression and from 0.54 to 0.57 for anxiety. The findings of our work suggest that speech-based biomarkers for depression and anxiety hold significant promise for the future of digital health.
This paper addresses kinodynamic motion planning for non-holonomic robots in dynamic environments with both static and dynamic obstacles -- a challenging problem that still lacks a universal solution. One promising approach is to decompose the problem into smaller sub-problems and combine the local solutions into a global one. The crux of any planning method for non-holonomic robots is the generation of motion primitives that solve the local planning sub-problems. In this work we introduce a novel learnable steering function (policy), which takes into account the kinodynamic constraints of the robot and both static and dynamic obstacles. This policy is efficiently trained via policy optimization. Empirically, we show that our steering function generalizes well to unseen problems. We then plug the trained policy into sampling-based and lattice-based planners, and evaluate the resulting POLAMP algorithm (Policy Optimization that Learns Adaptive Motion Primitives) in a range of challenging setups that involve a car-like robot operating in obstacle-rich parking-lot environments. We show that POLAMP is able to plan collision-free kinodynamic trajectories with success rates higher than 92% when 50 simultaneously moving obstacles populate the environment, outperforming state-of-the-art competitors.
Most deep-learning-based continuous sign language recognition (CSLR) models share a similar backbone consisting of a visual module, a sequential module, and an alignment module. However, due to limited training samples, a connectionist temporal classification loss may not train such CSLR backbones sufficiently. In this work, we propose three auxiliary tasks to enhance the CSLR backbones. The first task enhances the visual module, which is sensitive to the insufficient training problem, from the perspective of consistency. Specifically, since the information of sign languages is mainly included in signers' facial expressions and hand movements, a keypoint-guided spatial attention module is developed to enforce the visual module to focus on informative regions, i.e., spatial attention consistency. Second, noticing that both the output features of the visual and sequential modules represent the same sentence, to better exploit the backbone's power, a sentence embedding consistency constraint is imposed between the visual and sequential modules to enhance the representation power of both features. We name the CSLR model trained with the above auxiliary tasks as consistency-enhanced CSLR, which performs well on signer-dependent datasets in which all signers appear during both training and testing. To make it more robust for the signer-independent setting, a signer removal module based on feature disentanglement is further proposed to remove signer information from the backbone. Extensive ablation studies are conducted to validate the effectiveness of these auxiliary tasks. More remarkably, with a transformer-based backbone, our model achieves state-of-the-art or competitive performance on five benchmarks, PHOENIX-2014, PHOENIX-2014-T, PHOENIX-2014-SI, CSL, and CSL-Daily.
The xView2 competition and xBD dataset spurred significant advancements in overhead building damage detection, but the competition's pixel-level scoring can lead to reduced solution performance in areas with tight clusters of buildings or uninformative context. We seek to advance automatic building damage assessment for disaster relief by proposing an auxiliary challenge to the original xView2 competition. This new challenge involves a new dataset and metrics indicating solution performance when damage is more local and limited than in xBD. Our challenge measures a network's ability to identify individual buildings and their damage level without excessive reliance on the buildings' surroundings. Methods that succeed on this challenge will provide more fine-grained, precise damage information than original xView2 solutions. The performance of the best-performing xView2 networks dropped noticeably in our new limited/local damage detection task. The common causes of failure observed are that (1) building objects and their classifications are not separated well, and (2) when they are, the classification is strongly biased by surrounding buildings and other damage context. Thus, we release our augmented version of the dataset with additional object-level scoring metrics (https://gitlab.kitware.com/dennis.melamed/xfbd) to test independence and separability of building objects, alongside the pixel-level performance metrics of the original competition. We also experiment with new baseline models which improve independence and separability of building damage predictions. Our results indicate that building damage detection is not a fully solved problem, and we invite others to use and build on our dataset augmentations and metrics.
We investigate how humans perform the task of dubbing video content from one language into another, leveraging a novel corpus of 319.57 hours of video from 54 professionally produced titles. This is the first such large-scale study we are aware of. The results challenge a number of assumptions commonly made in both qualitative literature on human dubbing and machine-learning literature on automatic dubbing, arguing for the importance of vocal naturalness and translation quality over commonly emphasized isometric (character length) and lip-sync constraints, and for a more qualified view of the importance of isochronic (timing) constraints. We also find substantial influence of the source-side audio on human dubs through channels other than the words of the translation, pointing to the need for research on ways to preserve speech characteristics, as well as semantic transfer such as emphasis/emotion, in automatic dubbing systems.
Recent work has shown that fine-tuning large pre-trained language models on a collection of tasks described via instructions, a.k.a. instruction-tuning, improves their zero- and few-shot generalization to unseen tasks. However, there is a limited understanding of the performance trade-offs of different decisions made during the instruction-tuning process. These decisions include the scale and diversity of the instruction-tuning benchmark, different task sampling strategies, fine-tuning with and without demonstrations, training using specialized datasets for reasoning and dialogue, and finally, the fine-tuning objectives themselves. In this paper, we characterize the effect of instruction-tuning decisions on downstream task performance when scaling both model and benchmark sizes. To this end, we create OPT-IML Bench: a large benchmark for Instruction Meta-Learning (IML) of 2000 NLP tasks consolidated into task categories from 8 existing benchmarks, and prepare an evaluation framework to measure three types of model generalization: to tasks from fully held-out categories, to held-out tasks from seen categories, and to held-out instances from seen tasks. Through the lens of this framework, we first present insights about instruction-tuning decisions as applied to OPT-30B and further exploit these insights to train OPT-IML 30B and 175B, which are instruction-tuned versions of OPT. OPT-IML demonstrates all three generalization abilities at both scales on four different evaluation benchmarks with diverse tasks and input formats -- PromptSource, FLAN, Super-NaturalInstructions, and UnifiedSKG. Not only does it significantly outperform OPT on all benchmarks, but it is also highly competitive with existing models fine-tuned on each specific benchmark. We release OPT-IML at both scales, together with the OPT-IML Bench evaluation framework.