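Multi-view depth estimation methods typically require the computation of a multi-view cost volume, which leads to huge memory consumption and slow inference. Furthermore, multi-view matching can fail for texture-less surfaces, reflective surfaces, and moving objects. For such failure modes, single-view depth estimation methods are often more reliable. To this end, we propose MaGNet, a novel framework for fusing single-view depth probability with multi-view geometry, to improve the accuracy, robustness, and efficiency of multi-view depth estimation. For each frame, MaGNet estimates a single-view depth probability distribution, parameterized as a pixel-wise Gaussian. The distribution estimated for the reference frame is then used to sample per-pixel depth candidates. Such probabilistic sampling enables the network to achieve higher accuracy while evaluating fewer depth candidates. We also propose depth consistency weighting for the multi-view matching score, to ensure that the multi-view depth is consistent with the single-view predictions. The proposed method achieves state-of-the-art performance on ScanNet, 7-Scenes, and KITTI. Qualitative evaluation demonstrates that our method is more robust against challenging artifacts such as texture-less/reflective surfaces and moving objects.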
Translated by Google Translate
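The per-pixel candidate sampling described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes candidates are placed at the midpoints of equal-probability bins of the predicted Gaussian, and the function names and the `coverage` parameter are hypothetical.

```python
from statistics import NormalDist


def sample_depth_candidates(mean, std, num_candidates=5, coverage=0.95):
    """Depth candidates for one pixel, drawn from N(mean, std^2).

    Assumption (not from the abstract): candidates sit at the midpoints of
    equal-probability bins covering the central `coverage` mass.
    `std` must be positive, otherwise NormalDist.inv_cdf raises an error.
    """
    dist = NormalDist(mu=mean, sigma=std)
    lower = (1.0 - coverage) / 2.0          # probability mass cut from each tail
    bin_mass = coverage / num_candidates    # mass assigned to each candidate
    return [dist.inv_cdf(lower + (k + 0.5) * bin_mass)
            for k in range(num_candidates)]


def sample_depth_map(mean_map, std_map, num_candidates=5):
    """Apply the per-pixel sampler to 2D lists of means and standard deviations."""
    return [
        [sample_depth_candidates(m, s, num_candidates)
         for m, s in zip(mean_row, std_row)]
        for mean_row, std_row in zip(mean_map, std_map)
    ]


# A confident pixel (small std) yields tightly clustered candidates,
# an uncertain pixel (large std) spreads them over a wider depth range.
confident = sample_depth_candidates(mean=2.0, std=0.05)
uncertain = sample_depth_candidates(mean=2.0, std=0.50)
```

Because the candidates follow the predicted distribution, a small number of them can cover the plausible depth range of each pixel, which is the intuition behind evaluating fewer candidates than a uniformly sampled cost volume.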
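The aim of this work is to detect anomalous events in video and automatically generate high-level explanations of them. Understanding the cause of an anomalous event is crucial, as the required response depends on its nature and severity. Recent works typically use object or action classifiers to detect anomalous events and provide labels for them. However, this constrains detection systems to a finite set of known classes and prevents generalization to unknown objects or behaviours. Here we show how to robustly detect anomalies without the use of object or action classifiers, yet still recover the high-level reason behind the event. We make the following contributions: (1) a method that uses saliency maps to decouple the explanation of anomalous events from object and action classifiers, (2) a demonstration of how to improve the quality of saliency maps using a novel neural architecture that learns discrete representations of video by predicting future frames, and (3) a 60% improvement over state-of-the-art anomaly explanation methods on a subset of the public benchmark X-MAN dataset.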
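This paper addresses the problem of 3D human body shape and pose estimation from RGB images. Some recent approaches to this task predict probability distributions over human body model parameters conditioned on the input images. This is motivated by the ill-posed nature of the problem, wherein multiple 3D reconstructions may match the image evidence, particularly when some parts of the body are locally occluded. However, body shape parameters in widely used body models (e.g. SMPL) control global deformations over the whole body surface. Distributions over these global shape parameters cannot meaningfully capture uncertainty in shape estimates associated with locally occluded body parts. In contrast, we present a method that (i) predicts distributions over local body shape in the form of semantic body measurements and (ii) uses a linear mapping to transform a local distribution over body measurements into a global distribution over SMPL shape parameters. We show that our method outperforms the current state of the art in terms of identity-dependent body shape estimation accuracy on the SSP-3D dataset and on a private dataset of tape-measured humans, by probabilistically combining local body measurement distributions predicted from multiple images of a subject.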
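One common way to combine per-image measurement distributions probabilistically is a product of Gaussians. The sketch below assumes each image yields an independent Gaussian estimate of the same body measurement; the abstract does not spell out the combination rule, so this is an illustration rather than the paper's method, and the function name is hypothetical.

```python
def fuse_gaussian_measurements(means, variances):
    """Fuse independent Gaussian estimates of one body measurement.

    Assumption: the estimates are independent Gaussians, so their product is
    Gaussian with summed precisions and a precision-weighted mean.
    """
    if not means or len(means) != len(variances):
        raise ValueError("means and variances must be non-empty and equal length")
    precisions = [1.0 / v for v in variances]   # precision = 1 / variance
    fused_variance = 1.0 / sum(precisions)
    fused_mean = fused_variance * sum(m * p for m, p in zip(means, precisions))
    return fused_mean, fused_variance


# Two equally confident views: the fused mean is their average and the
# variance halves.
mean_a, var_a = fuse_gaussian_measurements([80.0, 84.0], [4.0, 4.0])

# One view with the body part occluded (large variance) barely moves the
# estimate from the confident view.
mean_b, var_b = fuse_gaussian_measurements([80.0, 90.0], [1.0, 100.0])
```

The second call shows why local uncertainty matters: a measurement predicted with high variance in an occluded view contributes little to the fused estimate.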
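Predicting the 3D shape and pose of static objects from single RGB images is an important research area in modern computer vision. Its applications range from augmented reality to robotics and digital content creation. Typically, this task is performed through direct object shape and pose predictions, which are inaccurate. A promising research direction ensures meaningful shape predictions by retrieving CAD models from large-scale databases and aligning them to the objects observed in the image. However, existing work does not take the object geometry into account, leading to inaccurate object pose predictions, especially for unseen objects. In this work we demonstrate how cross-domain keypoint matches from an RGB image to a rendered CAD model allow for more precise object pose predictions than those obtained through direct prediction. We further show that keypoint matches can be used not only to estimate the pose of an object, but also to modify the shape of the object itself. This is important because the accuracy achievable with object retrieval alone is inherently limited by the available CAD models. Allowing shape adaptation bridges the gap between the retrieved CAD model and the observed shape. We demonstrate our approach on the challenging Pix3D dataset. The proposed geometric shape prediction improves the AP mesh from 33.2 to 37.8 on seen objects and from 8.2 to 17.1 on unseen objects. Furthermore, when following the proposed shape adaptation, we demonstrate more accurate shape predictions without closely matching CAD models. Code is publicly available at https://github.com/florianlanger/leveraging_geometry_for_shape_estimation.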
We present a novel and practical deep fully convolutional neural network architecture for semantic pixel-wise segmentation termed SegNet. This core trainable segmentation engine consists of an encoder network and a corresponding decoder network, followed by a pixel-wise classification layer. The architecture of the encoder network is topologically identical to the 13 convolutional layers in the VGG16 network [1]. The role of the decoder network is to map the low-resolution encoder feature maps to full-input-resolution feature maps for pixel-wise classification. The novelty of SegNet lies in the manner in which the decoder upsamples its lower-resolution input feature map(s). Specifically, the decoder uses pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear upsampling. This eliminates the need for learning to upsample. The upsampled maps are sparse and are then convolved with trainable filters to produce dense feature maps. We compare our proposed architecture with the widely adopted FCN [2] and also with the well-known DeepLab-LargeFOV [3] and DeconvNet [4] architectures. This comparison reveals the memory versus accuracy trade-off involved in achieving good segmentation performance. SegNet was primarily motivated by scene understanding applications. Hence, it is designed to be efficient in terms of both memory and computational time during inference. It also has significantly fewer trainable parameters than competing architectures and can be trained end-to-end using stochastic gradient descent. We also performed a controlled benchmark of SegNet and other architectures on both road scene and SUN RGB-D indoor scene segmentation tasks. These quantitative assessments show that SegNet provides good performance with competitive inference time and the most memory-efficient inference compared to other architectures. We also provide a Caffe implementation of SegNet and a web demo at http://mi.eng.cam.ac.uk/projects/segnet/.
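The index-based upsampling described above can be illustrated on a single 2D feature map. The sketch below is a plain-Python toy, not the Caffe implementation mentioned in the abstract; the function names and the 2x2 non-overlapping window are assumptions made for illustration.

```python
def max_pool_with_indices(fmap, size=2):
    """Non-overlapping size x size max pooling over a 2D list.

    Returns the pooled map and, for each output cell, the (row, col)
    position of the maximum in the input. These are the pooling indices.
    """
    pooled, indices = [], []
    for r in range(0, len(fmap) - size + 1, size):
        pooled_row, index_row = [], []
        for c in range(0, len(fmap[0]) - size + 1, size):
            window = [(fmap[r + i][c + j], (r + i, c + j))
                      for i in range(size) for j in range(size)]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def max_unpool(features, indices, out_shape):
    """Place each low-resolution value at its stored index; all else is zero."""
    rows, cols = out_shape
    out = [[0.0] * cols for _ in range(rows)]
    for feature_row, index_row in zip(features, indices):
        for value, (r, c) in zip(feature_row, index_row):
            out[r][c] = value
    return out


encoder_map = [[1, 3, 2, 0],
               [4, 2, 1, 5],
               [0, 1, 7, 2],
               [2, 6, 3, 1]]
pooled, indices = max_pool_with_indices(encoder_map)

# The decoder upsamples its own low-resolution features with the encoder's
# stored indices; no upsampling weights are learned.
decoder_features = [[0.5, 1.5],
                    [2.5, 3.5]]
sparse_map = max_unpool(decoder_features, indices, out_shape=(4, 4))
```

The resulting `sparse_map` is mostly zeros, which is why the abstract notes that the upsampled maps are then convolved with trainable filters to produce dense feature maps.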
Figure 1: PoseNet: Convolutional neural network monocular camera relocalization. Relocalization results for an input image (top), the predicted camera pose of a visual reconstruction (middle), shown again overlaid in red on the original image (bottom). Our system relocalizes to within approximately 2 m and 6° for large outdoor scenes spanning 50,000 m². For an online demonstration, please see our project webpage: mi.eng.cam.ac.uk/projects/relocalisation/
This paper proposes a question-answering system that can answer questions whose supporting evidence is spread over multiple (potentially long) documents. The system, called Visconde, uses a three-step pipeline to perform the task: decompose, retrieve, and aggregate. The first step decomposes the question into simpler questions using a few-shot large language model (LLM). Then, a state-of-the-art search engine is used to retrieve candidate passages from a large collection for each decomposed question. In the final step, we use the LLM in a few-shot setting to aggregate the contents of the passages into the final answer. The system is evaluated on three datasets: IIRC, Qasper, and StrategyQA. Results suggest that current retrievers are the main bottleneck and that readers are already performing at the human level as long as relevant passages are provided. The system is also shown to be more effective when the model is induced to give explanations before answering a question. Code is available at https://github.com/neuralmind-ai/visconde.
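The decompose, retrieve, and aggregate steps can be sketched as a small orchestration function. Everything below is hypothetical scaffolding: the three callables stand in for the few-shot LLM and the search engine, the toy stand-ins exist only so the sketch runs, and none of it reflects the actual Visconde code or API.

```python
from typing import Callable, Dict, List


def answer_multi_doc_question(
    question: str,
    decompose: Callable[[str], List[str]],
    retrieve: Callable[[str, int], List[str]],
    aggregate: Callable[[str, List[str]], str],
    top_k: int = 3,
) -> str:
    """Three-step pipeline; the callables are hypothetical stand-ins."""
    sub_questions = decompose(question)          # step 1: simpler questions
    passages: List[str] = []
    for sub_question in sub_questions:           # step 2: search per sub-question
        for passage in retrieve(sub_question, top_k):
            if passage not in passages:          # skip duplicate passages
                passages.append(passage)
    return aggregate(question, passages)         # step 3: compose the answer


# Toy stand-ins so the sketch runs without any model or index.
def toy_decompose(question: str) -> List[str]:
    return [part.strip() + "?" for part in question.rstrip("?").split(" and ")]


TOY_INDEX: Dict[str, List[str]] = {
    "Who wrote Dune?": ["Dune was written by Frank Herbert."],
    "when was it published?": ["Dune was first published in 1965."],
}


def toy_retrieve(sub_question: str, top_k: int) -> List[str]:
    return TOY_INDEX.get(sub_question, [])[:top_k]


def toy_aggregate(question: str, passages: List[str]) -> str:
    return " ".join(passages)


answer = answer_multi_doc_question(
    "Who wrote Dune and when was it published?",
    toy_decompose, toy_retrieve, toy_aggregate,
)
```

Passing the three steps in as callables keeps the retriever swappable, which matters given the abstract's finding that retrieval, not reading, is the main bottleneck.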
A systematic review was conducted of machine-learning strategies for improving the generalizability (cross-subject and cross-session) of electroencephalography (EEG)-based emotion classification. In this context, the non-stationarity of EEG signals is a critical issue and can lead to the dataset shift problem. Several architectures and methods have been proposed to address this issue, mainly based on transfer learning. A total of 418 papers were retrieved from the Scopus, IEEE Xplore, and PubMed databases through a search query focusing on modern machine-learning techniques for generalization in EEG-based emotion assessment. Among these papers, 75 were found eligible based on their relevance to the problem. Studies that lacked a specific cross-subject or cross-session validation strategy, or that used other biosignals as support, were excluded. Based on an analysis of the selected papers, a taxonomy of the studies employing machine learning (ML) methods is proposed, together with a brief discussion of the different ML approaches involved. The studies with the best results in terms of average classification accuracy are identified, suggesting that transfer learning methods perform better than other approaches. The impact of (i) the theoretical models of emotion and (ii) psychological screening of the experimental sample on classifier performance is also discussed.
Plastic shopping bags that are carried away from roadsides and become tangled on cotton plants can end up at cotton gins if they are not removed before harvest. Such bags may not only cause problems in the ginning process but may also become embedded in cotton fibers, reducing their quality and marketable value. Therefore, the bags must be detected, located, and removed before the cotton is harvested. Manually detecting and locating these bags in cotton fields is labor-intensive, time-consuming, and costly. To address these challenges, we apply four variants of YOLOv5 (YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x) to detect plastic shopping bags in RGB (red, green, and blue) images acquired by Unmanned Aircraft Systems (UAS). We also report fixed-effect model tests of the influence of bag color and YOLOv5 variant on average precision (AP), mean average precision (mAP@50), and accuracy. In addition, we demonstrate the effect of the height of plastic bags on detection accuracy. Bag color had a significant effect (p < 0.001) on accuracy across all four variants, while it did not show a significant effect on AP with YOLOv5m (p = 0.10) or YOLOv5x (p = 0.35) at the 95% confidence level. Similarly, the YOLOv5 variant did not show a significant effect on the AP (p = 0.11) or accuracy (p = 0.73) of white bags, but it had significant effects on the AP (p = 0.03) and accuracy (p = 0.02) of brown bags, as well as on mAP@50 (p = 0.01) and inference speed (p < 0.0001). Additionally, the height of plastic bags had a significant effect (p < 0.0001) on overall detection accuracy. The findings reported in this paper can help speed up the removal of plastic bags from cotton fields before harvest, thereby reducing the amount of contaminants that end up at cotton gins.
Bi-encoders and cross-encoders are widely used in many state-of-the-art retrieval pipelines. In this work we study the generalization ability of these two types of architectures across a wide range of parameter counts in both in-domain and out-of-domain scenarios. We find that the number of parameters and the early query-document interactions of cross-encoders play a significant role in the generalization ability of retrieval models. Our experiments show that increasing model size results in marginal gains on in-domain test sets, but much larger gains in new domains never seen during fine-tuning. Furthermore, we show that cross-encoders largely outperform bi-encoders of similar size in several tasks. On the BEIR benchmark, our largest cross-encoder surpasses a state-of-the-art bi-encoder by more than 4 average points. Finally, we show that using bi-encoders as first-stage retrievers provides no gains over a simpler retriever such as BM25 on out-of-domain tasks. The code is available at https://github.com/guilhermemr04/scaling-zero-shot-retrieval.git