Considering the role of situational awareness in safety-critical automated systems, the perception of risk in driving scenes and its explainability are particularly important for autonomous and cooperative driving. Toward this goal, this paper proposes a new research direction of joint risk localization in driving scenes together with its explanation as a natural language description. Due to the lack of a standard benchmark, we collected a large-scale dataset, DRAMA (Driving Risk Assessment Mechanism with A captioning module), which consists of 17,785 interactive driving scenarios collected in Tokyo, Japan. Our DRAMA dataset accommodates video- and object-level questions on driving risks with associated important objects, targeting visual captioning as a free-form language description, and it provides closed- and open-ended responses to multi-level questions that can be used to evaluate a range of visual captioning capabilities in driving scenarios. We make this data available to the community for further research. Using DRAMA, we explore multiple facets of joint risk localization and captioning in interactive driving scenarios. In particular, we benchmark various multi-task prediction architectures and provide a detailed analysis of joint risk localization and risk captioning. The dataset is available at https://usa.honda-ri.com/drama
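To make the multi-task setup concrete, below is a minimal, hypothetical sketch of a joint risk-localization and captioning model of the kind such benchmarks compare: a shared visual encoder feeding both a bounding-box head for the important object and a recurrent caption decoder. Layer sizes, the vocabulary, and all module names are illustrative assumptions, not the architectures evaluated on DRAMA.

# Hypothetical sketch of a joint risk-localization + captioning model; the
# backbone, heads, and dimensions are assumptions, not the DRAMA baselines.
import torch
import torch.nn as nn

class JointRiskModel(nn.Module):
    def __init__(self, vocab_size=10000, feat_dim=512, hidden=256):
        super().__init__()
        # Shared visual backbone (stand-in for a pretrained CNN/ViT encoder).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        # Risk-localization head: predicts one important-object box (x, y, w, h).
        self.box_head = nn.Linear(feat_dim, 4)
        # Captioning head: a GRU decoder conditioned on the shared feature.
        self.embed = nn.Embedding(vocab_size, hidden)
        self.init_h = nn.Linear(feat_dim, hidden)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, frames, caption_tokens):
        feat = self.backbone(frames)                 # (B, feat_dim)
        box = self.box_head(feat)                    # (B, 4)
        h0 = self.init_h(feat).unsqueeze(0)          # (1, B, hidden)
        dec, _ = self.gru(self.embed(caption_tokens), h0)
        return box, self.out(dec)                    # box + per-token logits

model = JointRiskModel()
frames = torch.randn(2, 3, 128, 128)
tokens = torch.randint(0, 10000, (2, 12))
box, logits = model(frames, tokens)
print(box.shape, logits.shape)  # torch.Size([2, 4]) torch.Size([2, 12, 10000])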
In this paper, we perform an exhaustive evaluation of different representations to address the intent classification problem in a Spoken Language Understanding (SLU) setup. We benchmark three types of systems to perform the SLU intent detection task: 1) text-based, 2) lattice-based, and 3) a novel multimodal approach. Our work provides a comprehensive analysis of the achievable performance of different state-of-the-art SLU systems under different circumstances, e.g., automatically- vs. manually-generated transcripts. We evaluate the systems on the publicly available SLURP spoken language resource corpus. Our results indicate that using richer forms of Automatic Speech Recognition (ASR) outputs allows SLU systems to improve in comparison to the 1-best setup (4% relative improvement). However, crossmodal approaches, i.e., learning from acoustic and text embeddings, obtain performance similar to the oracle setup, and a relative improvement of 18% over the 1-best configuration. Thus, crossmodal architectures represent a good alternative to overcome the limitations of working purely with automatically generated textual data.
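As a rough illustration of the crossmodal setup, the sketch below fuses an utterance-level acoustic embedding with a text embedding of the transcript by simple concatenation before an intent classifier. The embedding dimensions, mean pooling, fusion strategy, and intent count are assumptions, not the benchmarked systems.

# Minimal sketch of a crossmodal (acoustic + text) intent classifier; all
# dimensions and the concatenation-based fusion are illustrative assumptions.
import torch
import torch.nn as nn

class CrossmodalIntentClassifier(nn.Module):
    def __init__(self, acoustic_dim=768, text_dim=768, n_intents=60, hidden=256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(acoustic_dim + text_dim, hidden), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(hidden, n_intents),
        )

    def forward(self, acoustic_seq, text_emb):
        # Mean-pool frame-level acoustic embeddings into one utterance vector.
        acoustic_utt = acoustic_seq.mean(dim=1)        # (B, acoustic_dim)
        fused = torch.cat([acoustic_utt, text_emb], dim=-1)
        return self.fuse(fused)                        # (B, n_intents)

clf = CrossmodalIntentClassifier()
acoustic = torch.randn(4, 200, 768)   # e.g., frame-level speech encoder outputs
text = torch.randn(4, 768)            # e.g., sentence embedding of the ASR 1-best
print(clf(acoustic, text).shape)      # torch.Size([4, 60])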
By transferring knowledge from large, diverse, task-agnostic datasets, modern machine learning models can solve specific downstream tasks either zero-shot or with small task-specific datasets to a high level of performance. While this capability has been demonstrated in other fields such as computer vision, natural language processing or speech recognition, it remains to be shown in robotics, where the generalization capabilities of the models are particularly critical due to the difficulty of collecting real-world robotic data. We argue that one of the keys to the success of such general robotic models lies with open-ended task-agnostic training, combined with high-capacity architectures that can absorb all of the diverse, robotic data. In this paper, we present a model class, dubbed Robotics Transformer, that exhibits promising scalable model properties. We verify our conclusions in a study of different model classes and their ability to generalize as a function of the data size, model size, and data diversity based on a large-scale data collection on real robots performing real-world tasks. The project's website and videos can be found at robotics-transformer.github.io
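A toy sketch of the general pattern, a transformer policy mapping image tokens to discretized action bins, is given below; it is not the Robotics Transformer itself, and the patch size, token count, and action discretization are assumptions.

# Illustrative transformer-policy sketch (not the Robotics Transformer): image
# patches in, per-dimension discretized action logits out. Sizes are assumptions.
import torch
import torch.nn as nn

class ToyRobotTransformer(nn.Module):
    def __init__(self, d_model=128, n_patches=64, n_action_dims=7, n_bins=256):
        super().__init__()
        self.patch_proj = nn.Linear(3 * 16 * 16, d_model)   # flat 16x16 RGB patches
        self.pos = nn.Parameter(torch.zeros(1, n_patches, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.action_head = nn.Linear(d_model, n_action_dims * n_bins)
        self.n_action_dims, self.n_bins = n_action_dims, n_bins

    def forward(self, patches):
        x = self.patch_proj(patches) + self.pos              # (B, n_patches, d)
        x = self.encoder(x).mean(dim=1)                      # pooled scene token
        logits = self.action_head(x)
        return logits.view(-1, self.n_action_dims, self.n_bins)

model = ToyRobotTransformer()
patches = torch.randn(2, 64, 3 * 16 * 16)   # a 128x128 image as 64 flat patches
print(model(patches).shape)                  # torch.Size([2, 7, 256])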
Biomedical image segmentation is one of the fastest growing fields which has seen extensive automation through the use of Artificial Intelligence. This has enabled widespread adoption of accurate techniques to expedite the screening and diagnostic processes which would otherwise take several days to finalize. In this paper, we present an end-to-end pipeline to segment lungs from chest X-ray images, training the neural network model on the Japanese Society of Radiological Technology (JSRT) dataset, using UNet to enable faster processing of initial screening for various lung disorders. The pipeline developed can be readily used by medical centers with just the provision of X-Ray images as input. The model will perform the preprocessing, and provide a segmented image as the final output. It is expected that this will drastically reduce the manual effort involved and lead to greater accessibility in resource-constrained locations.
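To illustrate the kind of pipeline described, here is a minimal U-Net-style sketch for lung segmentation, assuming a single-channel chest X-ray resized to 256x256; the actual network depth, preprocessing, and trained weights of the JSRT-trained pipeline are not reproduced.

# Minimal U-Net-style lung segmentation sketch; depth and channel counts are
# illustrative assumptions, not the paper's trained model.
import torch
import torch.nn as nn

def block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU())

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1, self.enc2 = block(1, 16), block(16, 32)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec1 = block(32, 16)               # skip connection doubles channels
        self.head = nn.Conv2d(16, 1, 1)         # per-pixel lung/background logit

    def forward(self, x):
        e1 = self.enc1(x)                       # (B, 16, H, W)
        e2 = self.enc2(self.pool(e1))           # (B, 32, H/2, W/2)
        d1 = self.dec1(torch.cat([self.up(e2), e1], dim=1))
        return self.head(d1)

xray = torch.rand(1, 1, 256, 256)               # normalized chest X-ray
mask_logits = TinyUNet()(xray)
lung_mask = torch.sigmoid(mask_logits) > 0.5    # binary segmentation map
print(lung_mask.shape)                           # torch.Size([1, 1, 256, 256])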
Autonomous driving requires efficient reasoning about the location and appearance of the different agents in the scene, which aids in downstream tasks such as object detection, object tracking, and path planning. The past few years have witnessed a surge in approaches that combine the different task-based modules of the classic self-driving stack into an End-to-End (E2E) trainable learning system. These approaches replace perception, prediction, and sensor fusion modules with a single contiguous module with a shared latent space embedding, from which one extracts a human-interpretable representation of the scene. One of the most popular representations is the Bird's-eye View (BEV), which expresses the location of different traffic participants in the ego vehicle frame from a top-down view. However, a BEV does not capture the chromatic appearance information of the participants. To overcome this limitation, we propose a novel representation that captures the appearance and occupancy information of various traffic participants from an array of monocular cameras covering a 360 deg field of view (FOV). We use a learned image embedding of all camera images to generate a BEV of the scene at any instant that captures both the appearance and occupancy of the scene, which can aid in downstream tasks such as object tracking and executing language-based commands. We test the efficacy of our approach on a synthetic dataset generated from CARLA. The code, data set, and results can be found at https://rebrand.ly/APP OCC-results.
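The sketch below illustrates, in heavily simplified form, producing a BEV grid carrying both occupancy and appearance (RGB) channels from per-camera embeddings; the direct MLP "lifting" is a stand-in assumption for the learned image-to-BEV mapping described above.

# Simplified sketch: surround-camera embeddings -> BEV grid with occupancy +
# RGB appearance channels. The MLP lifting and all sizes are assumptions.
import torch
import torch.nn as nn

class CamerasToBEV(nn.Module):
    def __init__(self, n_cams=6, feat_dim=256, bev_size=64):
        super().__init__()
        self.bev_size = bev_size
        # 4 output channels per BEV cell: occupancy + RGB appearance.
        self.lift = nn.Sequential(
            nn.Linear(n_cams * feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, bev_size * bev_size * 4),
        )

    def forward(self, cam_feats):                    # (B, n_cams, feat_dim)
        bev = self.lift(cam_feats.flatten(start_dim=1))
        bev = bev.view(-1, 4, self.bev_size, self.bev_size)
        occupancy = torch.sigmoid(bev[:, :1])        # (B, 1, H, W)
        appearance = torch.sigmoid(bev[:, 1:])       # (B, 3, H, W), RGB in [0, 1]
        return occupancy, appearance

model = CamerasToBEV()
cam_feats = torch.randn(2, 6, 256)   # one pooled embedding per surround camera
occ, rgb = model(cam_feats)
print(occ.shape, rgb.shape)          # (2, 1, 64, 64) and (2, 3, 64, 64)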
End-to-end speech recognition models trained using joint Connectionist Temporal Classification (CTC)-Attention loss have gained popularity recently. In these models, a non-autoregressive CTC decoder is often used at inference time due to its speed and simplicity. However, such models are hard to personalize because of their conditional independence assumption, which prevents output tokens from previous time steps from influencing future predictions. To tackle this, we propose a novel two-way approach that first biases the encoder with attention over a predefined list of rare long-tail and out-of-vocabulary (OOV) words and then uses dynamic boosting and a phone alignment network during decoding to further bias the subword predictions. We evaluate our approach on the open-source VoxPopuli and in-house medical datasets to showcase a 60% improvement in F1 score on domain-specific rare words over a strong CTC baseline.
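A small sketch of the dynamic-boosting idea follows: n-best CTC hypotheses are rescored with a bonus for every occurrence of a word from the predefined rare-word list. The bonus value and n-best rescoring formulation are assumptions; the encoder biasing and phone alignment network are not reproduced here.

# Toy rare-word boosting by n-best rescoring; the bonus value is an assumption.
def boost_hypotheses(nbest, rare_words, bonus=2.0):
    """nbest: list of (text, log_score); returns the list re-ranked with boosts."""
    rescored = []
    for text, score in nbest:
        words = text.lower().split()
        hits = sum(words.count(w.lower()) for w in rare_words)
        rescored.append((text, score + bonus * hits))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

nbest = [("the patient was given a spirin", -4.1),
         ("the patient was given aspirin", -4.6)]
print(boost_hypotheses(nbest, rare_words=["aspirin"])[0][0])
# -> "the patient was given aspirin"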
Visual-inertial navigation systems are powerful in their ability to accurately estimate the localization of mobile systems within complex environments that preclude the use of global navigation satellite systems. However, these navigation systems rely on accurate and up-to-date calibration of the sensors being used. As such, online estimators for these parameters are useful in resilient systems. This paper presents an extension to an existing Kalman filter-based framework for estimating and calibrating the extrinsic parameters of multi-camera IMU systems. In addition to extending the filter framework to include multiple camera sensors, the measurement model is reformulated to make use of measurement data typically made available by fiducial detection software. A secondary filtering layer is used to estimate time-translation parameters without closed-loop feedback of sensor data. Experimental calibration results, including with cameras having non-overlapping fields of view, are used to validate the stability and accuracy of the filter formulation in comparison to offline methods. Finally, the generalized filter code has been open-sourced and is available online.
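To illustrate the underlying idea of online calibration inside a filter, below is a toy one-dimensional Kalman-filter sketch that treats a single extrinsic offset as a static state refined from noisy fiducial-style measurements; the actual framework estimates full 6-DoF extrinsics per camera, and every number here is an illustrative assumption.

# Toy 1-D Kalman filter estimating a constant extrinsic offset online.
import numpy as np

def kalman_estimate_offset(measurements, meas_var=0.05, init_var=1.0):
    x, P = 0.0, init_var            # state: offset estimate and its covariance
    for z in measurements:
        # Process model: the extrinsic is constant, so prediction is identity.
        # Measurement model: z = x + noise.
        K = P / (P + meas_var)      # Kalman gain
        x = x + K * (z - x)         # correct with the innovation
        P = (1.0 - K) * P           # covariance shrinks as evidence accumulates
    return x, P

rng = np.random.default_rng(0)
true_offset = 0.12                   # e.g., meters between camera and IMU frames
z = true_offset + rng.normal(0.0, np.sqrt(0.05), size=200)
est, var = kalman_estimate_offset(z)
print(f"estimated offset ~ {est:.3f} (true {true_offset})")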
Concept induction, which is based on formal logical reasoning over description logics, has been used in ontology engineering to create ontology (TBox) axioms from base data (ABox) graphs. In this paper, we show that it can also be used to explain data differentials, for example in the context of Explainable AI (XAI), and we show that it can in fact be done in a way that is meaningful to a human observer. Our approach utilizes a large class hierarchy, curated from the Wikipedia category hierarchy, as background knowledge.
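As a loose illustration only (not description-logic reasoning), the toy sketch below ranks candidate classes from a hand-made stand-in for the Wikipedia-derived hierarchy by how cleanly they cover a positive instance set while excluding a negative one.

# Heuristic stand-in for concept induction: rank classes by coverage of
# positives minus coverage of negatives. Data and scoring are assumptions.
def rank_explanations(memberships, positives, negatives):
    scores = {}
    for cls, members in memberships.items():
        covered = sum(1 for x in positives if x in members)
        wrongly = sum(1 for x in negatives if x in members)
        scores[cls] = covered - wrongly        # simple coverage-minus-error score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

memberships = {
    "Bird": {"eagle", "sparrow", "penguin"},
    "FlyingAnimal": {"eagle", "sparrow", "bat"},
    "Mammal": {"bat", "dog"},
}
positives = {"eagle", "sparrow"}     # instances the classifier labels "can fly"
negatives = {"penguin", "dog"}
print(rank_explanations(memberships, positives, negatives)[0][0])  # FlyingAnimal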
Robotic vehicles use costmaps to plan collision-free paths. The cost associated with each cell in the map represents the perceived environment information, which is often determined manually after several trial-and-error efforts. In off-road environments, due to the presence of several types of terrain features, handcrafting the cost value associated with each feature is challenging. Moreover, different handcrafted cost values can lead to different paths for the same environment, which is undesirable. In this paper, we address the problem of learning costmap values from perception for robust vehicle path planning. We propose a novel framework, called CAMEL, that uses a deep learning approach to learn parameters from demonstrations, yielding an adaptive and robust costmap for path planning. CAMEL was trained on multi-modal datasets such as RELLIS-3D. The evaluation of CAMEL was carried out on an off-road scene simulator (MAV) and on field data from the IISER-B campus. We also performed a real-world implementation of CAMEL on a ground rover. The results show flexible and robust motion of the vehicle without collisions in unstructured terrain.
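A minimal sketch of learning costmap values from demonstration follows: per-cell terrain features are mapped to a scalar cost, trained so that cells visited by demonstrated trajectories cost less than unvisited cells. The feature set, margin loss, and network size are assumptions, not CAMEL's actual design.

# Toy costmap-from-demonstration sketch; the data is synthetic and the margin
# objective is an illustrative assumption.
import torch
import torch.nn as nn

cost_net = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(cost_net.parameters(), lr=1e-2)

# Synthetic stand-in data: 8-dim features for cells on / off the demo path.
on_path = torch.randn(128, 8) + 1.0
off_path = torch.randn(128, 8) - 1.0

for _ in range(200):
    # Margin loss: demonstrated cells should cost less than non-demonstrated ones.
    loss = torch.relu(cost_net(on_path) - cost_net(off_path) + 1.0).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(cost_net(on_path).mean().item() < cost_net(off_path).mean().item())  # True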
Recent model-based attacks on image classifiers have overwhelmingly focused on single-object (i.e., single dominant object) images. Different from such settings, we address the more practical problem of generating adversarial perturbations using multi-object (i.e., multiple dominant objects) images, since these represent most real-world scenes. Our goal is to design an attack strategy that can learn from such natural scenes by leveraging the local patch differences that occur inherently in such images (e.g., the difference between a local patch on the object 'person' and one on the object 'bike' in a traffic scene). Our key idea is that, to misclassify an adversarial multi-object image, every local patch in the image should confuse the victim classifier. Based on this, we propose a novel generative attack (called Local Patch Difference, or LPD-Attack), in which a novel contrastive loss function uses the aforementioned local differences in the feature space of multi-object scenes to optimize the perturbation generator. Through various experiments across diverse victim convolutional neural networks, we show that our approach outperforms baseline generative attacks with highly transferable perturbations when evaluated under different white-box and black-box settings.
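As a simplified illustration of a patch-wise adversarial objective in this spirit, the sketch below splits an image into local patches and pushes the victim's features for each adversarial patch away from the corresponding clean-patch features; the paper's actual contrastive formulation and perturbation generator are not reproduced, and the victim feature extractor and patch size are assumptions.

# Simplified patch-wise divergence objective (not the LPD-Attack loss itself).
import torch
import torch.nn.functional as F

def patchwise_divergence_loss(victim_feats, clean_img, adv_img, patch=32):
    losses = []
    # Slide a non-overlapping grid of local patches over both images.
    for i in range(0, clean_img.shape[-2], patch):
        for j in range(0, clean_img.shape[-1], patch):
            clean_p = clean_img[..., i:i + patch, j:j + patch]
            adv_p = adv_img[..., i:i + patch, j:j + patch]
            # Negative objective: minimizing similarity pushes features apart.
            sim = F.cosine_similarity(victim_feats(clean_p), victim_feats(adv_p), dim=-1)
            losses.append(sim.mean())
    return torch.stack(losses).mean()   # minimize so every patch confuses the victim

# Toy victim feature extractor (stand-in for a pretrained CNN's feature head).
victim_feats = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 64))
clean = torch.rand(1, 3, 128, 128)
adv = (clean + 0.03 * torch.randn_like(clean)).clamp(0, 1)
print(patchwise_divergence_loss(victim_feats, clean, adv).item())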