Existing image classification datasets used in computer vision tend to have a uniform distribution of images across object categories. In contrast, the natural world is heavily imbalanced, as some species are more abundant and easier to photograph than others. To encourage further progress in challenging real world conditions we present the iNaturalist species classification and detection dataset, consisting of 859,000 images from over 5,000 different species of plants and animals. It features visually similar species, captured in a wide variety of situations, from all over the world. Images were collected with different camera types, have varying image quality, feature a large class imbalance, and have been verified by multiple citizen scientists. We discuss the collection of the dataset and present extensive baseline experiments using state-of-the-art computer vision classification and detection models. Results show that current nonensemble based methods achieve only 67% top one classification accuracy, illustrating the difficulty of the dataset. Specifically, we observe poor results for classes with small numbers of training examples suggesting more attention is needed in low-shot learning.
translated by 谷歌翻译
It is desirable for detection and classification algorithms to generalize to unfamiliar environments, but suitable benchmarks for quantitatively studying this phenomenon are not yet available. We present a dataset designed to measure recognition generalization to novel environments. The images in our dataset are harvested from twenty camera traps deployed to monitor animal populations. Camera traps are fixed at one location, hence the background changes little across images; capture is triggered automatically, hence there is no human bias. The challenge is learning recognition in a handful of locations, and generalizing animal detection and classification to new locations where no training data is available. In our experiments state-of-the-art algorithms show excellent performance when tested at the same location where they were trained. However, we find that generalization to new locations is poor, especially for classification systems.
translated by 谷歌翻译
The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the chal-
translated by 谷歌翻译
弱监督的对象本地化(WSOL)旨在学习仅使用图像级类别标签编码对象位置的表示形式。但是,许多物体可以在不同水平的粒度标记。它是动物,鸟还是大角的猫头鹰?我们应该使用哪些图像级标签?在本文中,我们研究了标签粒度在WSOL中的作用。为了促进这项调查,我们引入了Inatloc500,这是一个新的用于WSOL的大规模细粒基准数据集。令人惊讶的是,我们发现选择正确的训练标签粒度比选择最佳的WSOL算法提供了更大的性能。我们还表明,更改标签粒度可以显着提高数据效率。
translated by 谷歌翻译
We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in precise object localization. Our dataset contains photos of 91 objects types that would be easily recognizable by a 4 year old. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.
translated by 谷歌翻译
Progress on object detection is enabled by datasets that focus the research community's attention on open challenges. This process led us from simple images to complex scenes and from bounding boxes to segmentation masks. In this work, we introduce LVIS (pronounced 'el-vis'): a new dataset for Large Vocabulary Instance Segmentation. We plan to collect ∼2 million high-quality instance segmentation masks for over 1000 entry-level object categories in 164k images. Due to the Zipfian distribution of categories in natural images, LVIS naturally has a long tail of categories with few training samples. Given that state-of-the-art deep learning methods for object detection perform poorly in the low-sample regime, we believe that our dataset poses an important and exciting new scientific challenge. LVIS is available at http://www.lvisdataset.org.
translated by 谷歌翻译
全球城市可免费获得大量的地理参考全景图像,以及各种各样的城市物体上的位置和元数据的详细地图。它们提供了有关城市物体的潜在信息来源,但是对象检测的手动注释是昂贵,费力和困难的。我们可以利用这种多媒体来源自动注释街道级图像作为手动标签的廉价替代品吗?使用Panorams框架,我们引入了一种方法,以根据城市上下文信息自动生成全景图像的边界框注释。遵循这种方法,我们仅以快速自动的方式从开放数据源中获得了大规模的(尽管嘈杂,但都嘈杂,但对城市数据集进行了注释。该数据集涵盖了阿姆斯特丹市,其中包括771,299张全景图像中22个对象类别的1400万个嘈杂的边界框注释。对于许多对象,可以从地理空间元数据(例如建筑价值,功能和平均表面积)获得进一步的细粒度信息。这样的信息将很难(即使不是不可能)单独根据图像来获取。为了进行详细评估,我们引入了一个有效的众包协议,用于在全景图像中进行边界框注释,我们将其部署以获取147,075个地面真实对象注释,用于7,348张图像的子集,Panorams-clean数据集。对于我们的Panorams-Noisy数据集,我们对噪声以及不同类型的噪声如何影响图像分类和对象检测性能提供了广泛的分析。我们可以公开提供数据集,全景噪声和全景清洁,基准和工具。
translated by 谷歌翻译
The PASCAL Visual Object Classes (VOC) challenge is a benchmark in visual object category recognition and detection, providing the vision and machine learning communities with a standard dataset of images and annotation, and standard evaluation procedures. Organised annually from 2005 to present, the challenge and its associated dataset has become accepted as the benchmark for object detection.This paper describes the dataset and evaluation procedure. We review the state-of-the-art in evaluated methods for both classification and detection, analyse whether the methods are statistically different, what they are learning from the images (e.g. the object or its context), and what the methods find easy or confuse. The paper concludes with lessons learnt in the three year history of the challenge, and proposes directions for future improvement and extension.
translated by 谷歌翻译
本文推动了在图像中分解伪装区域的信封,成了有意义的组件,即伪装的实例。为了促进伪装实例分割的新任务,我们将在数量和多样性方面引入DataSet被称为Camo ++,该数据集被称为Camo ++。新数据集基本上增加了具有分层像素 - 明智的地面真理的图像的数量。我们还为伪装实例分割任务提供了一个基准套件。特别是,我们在各种场景中对新构造的凸轮++数据集进行了广泛的评估。我们还提出了一种伪装融合学习(CFL)伪装实例分割框架,以进一步提高最先进的方法的性能。数据集,模型,评估套件和基准测试将在我们的项目页面上公开提供:https://sites.google.com/view/ltnghia/research/camo_plus_plus
translated by 谷歌翻译
Transferring the knowledge learned from large scale datasets (e.g., ImageNet) via fine-tuning offers an effective solution for domain-specific fine-grained visual categorization (FGVC) tasks (e.g., recognizing bird species or car make & model). In such scenarios, data annotation often calls for specialized domain knowledge and thus is difficult to scale. In this work, we first tackle a problem in large scale FGVC. Our method won first place in iNaturalist 2017 large scale species classification challenge. Central to the success of our approach is a training scheme that uses higher image resolution and deals with the long-tailed distribution of training data. Next, we study transfer learning via fine-tuning from large scale datasets to small scale, domainspecific FGVC datasets. We propose a measure to estimate domain similarity via Earth Mover's Distance and demonstrate that transfer learning benefits from pre-training on a source domain that is similar to the target domain by this measure. Our proposed transfer learning outperforms Im-ageNet pre-training and obtains state-of-the-art results on multiple commonly used FGVC datasets.
translated by 谷歌翻译
我们介绍了Caltech Fish计数数据集(CFC),这是一个用于检测,跟踪和计数声纳视频中鱼类的大型数据集。我们将声纳视频识别为可以推进低信噪比计算机视觉应用程序并解决多对象跟踪(MOT)和计数中的域概括的丰富数据来源。与现有的MOT和计数数据集相比,这些数据集主要仅限于城市中的人和车辆的视频,CFC来自自然世界领域,在该域​​中,目标不容易解析,并且无法轻易利用外观功能来进行目标重新识别。 CFC允许​​研究人员训练MOT和计数算法并评估看不见的测试位置的概括性能。我们执行广泛的基线实验,并确定在MOT和计数中推进概括的最新技术的关键挑战和机会。
translated by 谷歌翻译
The success of deep learning in vision can be attributed to: (a) models with high capacity; (b) increased computational power; and (c) availability of large-scale labeled data. Since 2012, there have been significant advances in representation capabilities of the models and computational capabilities of GPUs. But the size of the biggest dataset has surprisingly remained constant. What will happen if we increase the dataset size by 10× or 100×? This paper takes a step towards clearing the clouds of mystery surrounding the relationship between 'enormous data' and visual deep learning. By exploiting the JFT-300M dataset which has more than 375M noisy labels for 300M images, we investigate how the performance of current vision tasks would change if this data was used for representation learning. Our paper delivers some surprising (and some expected) findings. First, we find that the performance on vision tasks increases logarithmically based on volume of training data size. Second, we show that representation learning (or pretraining) still holds a lot of promise. One can improve performance on many vision tasks by just training a better base model. Finally, as expected, we present new state-of-theart results for different vision tasks including image classification, object detection, semantic segmentation and human pose estimation. Our sincere hope is that this inspires vision community to not undervalue the data and develop collective efforts in building larger datasets.
translated by 谷歌翻译
我们介绍了RP2K,这是一个新的大型零售产品数据集,用于细粒度图像分类。与以前的数据集专注于相对较少的产品,我们在归属于2000年不同产品的架子上收集超过500,000件零售产品图像。我们的数据集旨在推进零售对象识别的研究,该研究具有大量应用,如自动架审计和基于图像的产品信息检索。我们的数据集享受以下属性:(1)它是迄今为止产品类别的最大规模数据集。(2)所有图像在具有自然灯光的物理零售商店手动捕获,匹配真实应用的场景。(3)我们为每个对象提供丰富的注释,包括尺寸,形状和口味/气味。我们相信我们的数据集可以使计算机视觉研究和零售业受益。我们的数据集在https://www.pinlandata.com/rp2k_dataset上公开提供。
translated by 谷歌翻译
This paper introduces a video dataset of spatiotemporally localized Atomic Visual Actions (AVA). The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently. The key characteristics of our dataset are: (1) the definition of atomic visual actions, rather than composite actions; (2) precise spatio-temporal annotations with possibly multiple annotations for each person; (3) exhaustive annotation of these atomic actions over 15-minute video clips; (4) people temporally linked across consecutive segments; and (5) using movies to gather a varied set of action representations. This departs from existing datasets for spatio-temporal action recognition, which typically provide sparse annotations for composite actions in short video clips.AVA, with its realistic scene and action complexity, exposes the intrinsic difficulty of action recognition. To benchmark this, we present a novel approach for action localization that builds upon the current state-of-the-art methods, and demonstrates better performance on JHMDB and UCF101-24 categories. While setting a new state of the art on existing datasets, the overall results on AVA are low at 15.6% mAP, underscoring the need for developing new approaches for video understanding.
translated by 谷歌翻译
海洋正在经历前所未有的快速变化,在负责任管理所需的时空尺度上,视觉监测海洋生物群是一项艰巨的任务。由于研究界寻求基准,因此所需的数据收集的数量和速率迅速超过了我们处理和分析它们的能力。机器学习的最新进展可以对视觉数据进行快速,复杂的分析,但由于缺乏数据标准化,格式不足以及对大型标签数据集的需求,在海洋中取得了有限的成功。为了满足这一需求,我们构建了Fathomnet,这是一个开源图像数据库,该数据库标准化和汇总了经过精心策划的标记数据。 Fathomnet已被海洋动物,水下设备,碎片和其他概念的现有标志性和非偶像图像所播种,并允许分布式数据源的未来贡献。我们展示了如何使用Fathomnet数据在其他机构视频上训练和部署模型,以减少注释工作,并在与机器人车辆集成时启用自动跟踪水下概念。随着Fathomnet继续增长并结合了社区的更多标记数据,我们可以加速视觉数据以实现健康且可持续的全球海洋。
translated by 谷歌翻译
The 1$^{\text{st}}$ Workshop on Maritime Computer Vision (MaCVi) 2023 focused on maritime computer vision for Unmanned Aerial Vehicles (UAV) and Unmanned Surface Vehicle (USV), and organized several subchallenges in this domain: (i) UAV-based Maritime Object Detection, (ii) UAV-based Maritime Object Tracking, (iii) USV-based Maritime Obstacle Segmentation and (iv) USV-based Maritime Obstacle Detection. The subchallenges were based on the SeaDronesSee and MODS benchmarks. This report summarizes the main findings of the individual subchallenges and introduces a new benchmark, called SeaDronesSee Object Detection v2, which extends the previous benchmark by including more classes and footage. We provide statistical and qualitative analyses, and assess trends in the best-performing methodologies of over 130 submissions. The methods are summarized in the appendix. The datasets, evaluation code and the leaderboard are publicly available at https://seadronessee.cs.uni-tuebingen.de/macvi.
translated by 谷歌翻译
The previous fine-grained datasets mainly focus on classification and are often captured in a controlled setup, with the camera focusing on the objects. We introduce the first Fine-Grained Vehicle Detection (FGVD) dataset in the wild, captured from a moving camera mounted on a car. It contains 5502 scene images with 210 unique fine-grained labels of multiple vehicle types organized in a three-level hierarchy. While previous classification datasets also include makes for different kinds of cars, the FGVD dataset introduces new class labels for categorizing two-wheelers, autorickshaws, and trucks. The FGVD dataset is challenging as it has vehicles in complex traffic scenarios with intra-class and inter-class variations in types, scale, pose, occlusion, and lighting conditions. The current object detectors like yolov5 and faster RCNN perform poorly on our dataset due to a lack of hierarchical modeling. Along with providing baseline results for existing object detectors on FGVD Dataset, we also present the results of a combination of an existing detector and the recent Hierarchical Residual Network (HRN) classifier for the FGVD task. Finally, we show that FGVD vehicle images are the most challenging to classify among the fine-grained datasets.
translated by 谷歌翻译
缺陷增加了建筑项目的成本和持续时间。自动缺陷检测将减少文档工作,这是降低延迟建筑项目的缺陷风险所必需的。由于混凝土是一种广泛使用的建筑材料,因此这项工作着重于检测蜂窝,这是混凝土结构的实质缺陷,甚至可能影响结构完整性。首先,比较图像是从网络上刮下来或从实际实践中获得的。结果表明,Web图像仅代表蜂窝的选择,并且不会捕获完整的差异。其次,对MASK R-CNN和EFIDENENET-B0进行了培训,用于评估实例分割和基于斑块的分类,分别达到47.7%的精度和34.2%的召回率以及68.5%的精度和55.7%的召回率。尽管这些模型的性能不足以完全自动化缺陷检测,但这些模型可用于积极学习中,集成到缺陷文档系统中。总之,CNN可以帮助检测混凝土中的蜂窝。
translated by 谷歌翻译
尽管最先进的物体检测方法显示出令人信服的性能,但模型通常对对抗的攻击和分发数据不稳健。我们介绍了一个新的数据集,天然对手对象(Nao),以评估物体检测模型的稳健性。 Nao包含7,934个图像和9,943个对象,这些对象未经修改,代表了现实世界的情景,但导致最先进的检测模型以高信任误入歧途。与标准MSCOCO验证集相比,在NAO上评估时,高效的平均平均精度(MAP)降低74.5%。此外,通过比较各种对象检测架构,我们发现Mscoco验证集上的更好性能不一定转化为NAO的更好性能,这表明不能通过培训更准确的模型来简单地实现鲁棒性。我们进一步调查为什么NA​​O中难以检测和分类的原因。洗牌图像贴片的实验表明,模型对局部质地过于敏感。此外,使用集成梯度和背景替换,我们发现检测模型依赖于边界框内的像素信息,并且在预测类标签时对背景上下文不敏感。 Nao可以在https://drive.google.com/drive/folders/15p8sowojku6sseihlets86orfytgezi8下载。
translated by 谷歌翻译
开放式对象检测(OSOD)最近引起了广泛的关注。它是在正确检测/分类已知对象的同时检测未知对象。我们首先指出,最近的研究中考虑的OSOD方案,该方案考虑了类似于开放式识别(OSR)的无限种类的未知物体,这是一个基本问题。也就是说,我们无法确定要检测到的内容,而对于这种无限的未知对象,这是检测任务所必需的。这个问题导致了对未知对象检测方法的性能的评估困难。然后,我们介绍了OSOD的新颖方案,该方案仅处理与已知对象共享超级类别的未知对象。它具有许多真实的应用程序,例如检测越来越多的细粒对象。这个新环境摆脱了上述问题和评估困难。此外,由于已知和未知对象之间的视觉相似性,它使检测到未知对象更加现实。我们通过实验结果表明,基于标准检测器类别预测的不确定性的简单方法优于先前设置中测试的当前最新OSOD方法。
translated by 谷歌翻译