In this paper, we present the Multi-view Extended Videos with Identities (MEVID) dataset for large-scale video person re-identification (ReID) in the wild. To our knowledge, MEVID is the most varied video person ReID dataset, spanning an extensive indoor and outdoor environment across nine unique dates in a 73-day window, various camera viewpoints, and entity clothing changes. Specifically, we label the identities of 158 unique people wearing 598 outfits, taken from 8,092 tracklets with an average length of about 590 frames, seen in 33 camera views from the very-large-scale MEVA person activities dataset. While other datasets have more unique identities, MEVID emphasizes a richer set of information about each individual, such as: 4 outfits/identity vs. 2 outfits/identity in CCVID, 33 viewpoints across 17 locations vs. 6 in 5 simulated locations for MTA, and 10 million frames vs. 3 million for LS-VID. Because MEVID is based on the MEVA video dataset, it also inherits data that is intentionally demographically balanced to the continental United States. To accelerate the annotation process, we developed a semi-automatic annotation framework and GUI that combines state-of-the-art real-time models for object detection, pose estimation, person ReID, and multi-object tracking. We evaluate several state-of-the-art methods on MEVID challenge problems and comprehensively quantify their robustness to changes of outfit, scale, and background location. Our quantitative analysis of the realistic, unique aspects of MEVID shows that significant challenges remain in video person ReID and indicates important directions for future research.
translated by 谷歌翻译
We introduce a language generation task grounded in a popular video game environment. KNUDGE (KNowledge Constrained User-NPC Dialogue GEneration) involves generating dialogue trees conditioned on an ontology captured in natural language passages providing quest and entity specifications. KNUDGE is constructed from side quest dialogues drawn directly from game data of Obsidian Entertainment's The Outer Worlds, leading to real-world complexities in generation: (1) dialogues are branching trees as opposed to linear chains of utterances; (2) utterances must remain faithful to the game lore--character personas, backstories, and entity relationships; and (3) a dialogue must accurately reveal new quest-related details to the human player. We report results for supervised and in-context learning techniques, finding there is significant room for future work on creating realistic game-quality dialogues.
Are extralinguistic signals such as image pixels crucial for inducing constituency grammars? While past work has shown substantial gains from multimodal cues, we investigate whether such gains persist in the presence of rich information from large language models (LLMs). We find that our approach, LLM-based C-PCFG (LC-PCFG), outperforms previous multimodal methods on the task of unsupervised constituency parsing, achieving state-of-the-art performance on a variety of datasets. Moreover, LC-PCFG reduces parameter count by over 50% and speeds up training by 1.7x relative to image-aided models and by more than 5x relative to video-aided models. These results challenge the notion that extralinguistic signals such as image pixels are needed for unsupervised grammar induction, and point to the need for better text-only baselines when evaluating the need for multimodality in the task.
Large language models (LLMs) have shown impressive results across a variety of tasks while requiring little or no direct supervision. Further, there is mounting evidence that LLMs may have potential in information-seeking scenarios. We believe the ability of an LLM to attribute the text that it generates is likely to be crucial for both system developers and users in this setting. We propose and study Attributed QA as a key first step in the development of attributed LLMs. We develop a reproducible evaluation framework for the task, using human annotations as a gold standard and a correlated automatic metric that we show is suitable for development settings. We describe and benchmark a broad set of architectures for the task. Our contributions give some concrete answers to two key questions (How to measure attribution? and How well do current state-of-the-art methods perform on attribution?), and give some hints as to how to address a third key question (How to build LLMs with attribution?).
We introduce Structured 3D Features, a model based on a novel implicit 3D representation that pools pixel-aligned image features onto dense 3D points sampled from a parametric, statistical human mesh surface. The 3D points have associated semantics and can move freely in 3D space. This allows for optimal coverage of the person of interest, beyond just the body shape, which in turn helps model accessories, hair, and loose clothing. Owing to this, we present a complete 3D transformer-based attention framework which, given a single image of a person in an unconstrained pose, generates an animatable 3D reconstruction with albedo and illumination decomposition, using a single end-to-end model that is trained semi-supervised and needs no additional postprocessing. We show that our S3F model surpasses the previous state of the art on various tasks, including monocular 3D reconstruction as well as albedo and shading estimation. Moreover, we show that the proposed methodology allows novel view synthesis, relighting, and re-posing of the reconstruction, and can naturally be extended to handle multiple input images (e.g., different views of a person, or the same view in different poses, in video). Finally, we demonstrate the editing capabilities of our model for 3D virtual try-on applications.
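Clinical diagnosis of the eye is performed over multiple data modalities, including scalar clinical labels, vectorized biomarkers, two-dimensional fundus images, and three-dimensional Optical Coherence Tomography (OCT) scans. Clinical practitioners use all available data modalities to diagnose and treat eye diseases such as Diabetic Retinopathy (DR) or Diabetic Macular Edema (DME). Enabling the use of machine learning algorithms in the ophthalmic medical domain requires research into the relationships and interactions between all relevant data over a treatment period. Existing datasets are limited in that they neither provide such data nor consider explicit modeling of the relationships between the data modalities. In this paper, we introduce the Ophthalmic Labels for Investigating Visual Eye Semantics (OLIVES) dataset, which addresses the above limitation. This is the first OCT and near-IR fundus dataset that includes clinical labels, biomarker labels, disease labels, and time-series patient treatment information from associated clinical trials. The dataset consists of 1268 near-IR fundus images, each with at least 49 OCT scans and 16 biomarkers, along with 4 clinical labels and a disease diagnosis of DR or DME. In total, there is data for 96 eyes collected over a period of at least two years, with each eye treated for an average of 66 weeks and 7 injections. We benchmark the utility of the OLIVES dataset for ophthalmic data, and provide benchmarks and concrete research directions for core and emerging machine learning paradigms within medical image analysis.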
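Modern neural networks use building blocks such as convolutions that are equivariant to arbitrary 2D translations. However, these vanilla blocks are not equivariant to arbitrary 3D translations in the projective manifold. Even so, all monocular 3D detectors use vanilla blocks to obtain 3D coordinates, a task for which the vanilla blocks are not designed. This paper takes the first step towards convolutions that are equivariant to arbitrary 3D translations in the projective manifold. Since depth is the hardest quantity to estimate in monocular detection, this paper proposes the Depth EquiVarIAnt NeTwork (DEVIANT), built with existing scale-equivariant steerable blocks. As a result, DEVIANT is equivariant to depth translations in the projective manifold, whereas vanilla networks are not. The additional depth equivariance forces DEVIANT to learn consistent depth estimates. DEVIANT therefore achieves state-of-the-art monocular 3D detection results on the KITTI and Waymo datasets in the image-only category and performs competitively with methods that use extra information. Moreover, DEVIANT works better than vanilla networks in cross-dataset evaluation. Code and models at https://github.com/abhi1kumar/deviant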
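The starting point of the abstract above, that an ordinary convolution is equivariant to translations of its input, can be checked numerically. The following is a minimal sketch under stated assumptions: it uses a 1D circular convolution so that the identity holds exactly at the boundaries, and the function names `circular_conv` and `shift` are hypothetical helpers, not part of the DEVIANT code. It illustrates only plain translation equivariance, not the depth equivariance the paper proposes.

```python
def circular_conv(signal, kernel):
    """Circular cross-correlation of a 1D signal with a small kernel."""
    n = len(signal)
    return [
        sum(kernel[j] * signal[(i + j) % n] for j in range(len(kernel)))
        for i in range(n)
    ]


def shift(signal, t):
    """Circularly translate a 1D signal by t positions to the right."""
    n = len(signal)
    return [signal[(i - t) % n] for i in range(n)]


# Illustrative values only.
signal = [0.0, 1.0, 3.0, 2.0, 0.0, -1.0, 4.0, 0.5]
kernel = [1.0, -2.0, 1.0]

# Equivariance: filtering then translating equals translating then filtering.
conv_then_shift = shift(circular_conv(signal, kernel), 3)
shift_then_conv = circular_conv(shift(signal, 3), kernel)
is_equivariant = all(
    abs(a - b) < 1e-12 for a, b in zip(conv_then_shift, shift_then_conv)
)
```

Here `is_equivariant` is true for any shift amount, because translating the input only re-indexes which samples each kernel tap sees.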
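A common approach to testing for fairness issues in text-based classifiers is through the use of counterfactuals: does the classifier output change if a sensitive attribute in the input is changed? Existing counterfactual generation methods typically rely on wordlists or templates, producing simple counterfactuals that do not take into account grammar, context, or subtle sensitive-attribute references, and they may miss issues that the wordlist creators had not considered. In this paper, we introduce a task for generating counterfactuals that overcome these shortcomings, and demonstrate how large language models (LLMs) can be leveraged to make progress on this task. We show that this LLM-based method can produce complex counterfactuals that existing methods cannot, comparing the performance of various counterfactual generation methods on the Civil Comments dataset and showing their value in evaluating a toxicity classifier.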
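The wordlist-based baseline that the abstract above describes can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's method: the `SWAPS` word list, the `toy_classifier` (which has a deliberately planted spurious dependence on one term), and the helper names are all hypothetical. It shows the test itself (swap the attribute term, then check whether the decision flips) and why blind substitution ignores grammar and context.

```python
import re

# Hypothetical word list of swappable attribute terms (illustrative only).
SWAPS = {"he": "she", "she": "he", "men": "women", "women": "men"}


def wordlist_counterfactual(text):
    """Swap each listed term for its counterpart, ignoring grammar and context."""
    def swap(match):
        word = match.group(0)
        repl = SWAPS.get(word.lower(), word)
        return repl.capitalize() if word[0].isupper() else repl
    return re.sub(r"[A-Za-z]+", swap, text)


def toy_classifier(text):
    """Stand-in classifier with a planted spurious dependence on 'women'."""
    tokens = re.findall(r"[a-z]+", text.lower())
    score = 2 + 5 * ("terrible" in tokens) + 3 * ("women" in tokens)
    return score >= 5  # True means the text is flagged


def flips(text):
    """True if the classifier's decision changes under the counterfactual."""
    return toy_classifier(text) != toy_classifier(wordlist_counterfactual(text))
```

With these toy definitions, `flips("Men are great drivers.")` is true because only the swapped sentence is flagged, while a sentence containing no listed term can never flip. That second case is the blind spot the abstract points to: anything the word list does not cover goes untested.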
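We propose a novel optimization-based paradigm for fitting 3D human models to images and scans. In contrast to existing approaches that directly regress the parameters of a low-dimensional statistical body model (e.g., SMPL) from input images, we train an ensemble of per-vertex neural field networks. The network predicts, in a distributed manner, the vertex descent direction based on neural features extracted at the current vertex projection. At inference, we employ this network, dubbed LVD, within a gradient-descent optimization pipeline until convergence, which typically occurs in a fraction of a second even when all vertices are initialized at a single point. An exhaustive evaluation demonstrates that our approach is able to capture the bodies of clothed people with very different body shapes, achieving a significant improvement over the state of the art. LVD is also applicable to 3D model fitting of humans and hands, for which we show a significant improvement over the SOTA with a much simpler and faster method.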
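The inference loop described above (start all vertices at a single point, repeatedly apply a predicted per-vertex step, stop at convergence) can be sketched as follows. This is a toy illustration under stated assumptions, not the LVD implementation: `predicted_step` is a hypothetical stand-in that steps toward a known target, whereas the actual network would infer the direction from image features at the vertex projection. The names `fit`, `rate`, `tol`, and `max_iters` are also assumptions.

```python
def predicted_step(vertex, target, rate=0.5):
    """Stand-in for the learned per-vertex predictor.

    Assumption: it steps toward a known target. The real network would
    instead infer this direction from features at the vertex projection.
    """
    return [rate * (t - v) for v, t in zip(vertex, target)]


def fit(targets, init=(0.0, 0.0, 0.0), tol=1e-6, max_iters=100):
    """Iteratively move every vertex by its predicted step until steps vanish."""
    # All vertices start at the same single point.
    vertices = [list(init) for _ in targets]
    for iteration in range(1, max_iters + 1):
        steps = [predicted_step(v, t) for v, t in zip(vertices, targets)]
        vertices = [
            [c + s for c, s in zip(v, step)] for v, step in zip(vertices, steps)
        ]
        # Converged once the largest coordinate update is below tolerance.
        if max(abs(s) for step in steps for s in step) < tol:
            return vertices, iteration
    return vertices, max_iters


# Illustrative target positions for three vertices.
targets = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, -1.0]]
fitted, iterations = fit(targets)
```

Each vertex is updated independently of the others, which mirrors the "distributed" per-vertex prediction in the abstract. The fixed halving step is only a convenience for the toy example.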
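In this paper, we propose a novel approach to enhance the 3D body pose estimation of a person computed from videos captured by a single wearable camera. The key idea is to leverage high-level features that link first- and third-person views in a joint embedding space. To learn such an embedding space, we introduce First2Third-Pose, a new paired and synchronized dataset of nearly 2,000 videos depicting human activities captured from both first- and third-person perspectives. We explicitly consider spatial- and motion-domain features, combined using a semi-Siamese architecture trained in a self-supervised fashion. Experimental results demonstrate that the joint multi-view embedding space learned with our dataset is useful for extracting discriminative features from arbitrary single-view egocentric videos, without requiring domain adaptation or knowledge of camera parameters. We achieve a significant improvement in egocentric 3D body pose estimation on two unconstrained datasets over three supervised state-of-the-art approaches. Our dataset and code will be available for research purposes.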