We propose a Temporal Voting Network (TVNet) for action localization in untrimmed videos. It incorporates a novel Voting Evidence Module to locate temporal boundaries more accurately, in which temporal contextual evidence is accumulated to predict frame-level probabilities of start and end action boundaries. Our action-independent evidence module is incorporated within a pipeline to compute confidence scores and action classes. We achieve an average mAP of 34.6% on ActivityNet-1.3, in particular outperforming previous methods at the 0.95 IoU threshold. When combined with PGCN, TVNet achieves 59.1% mAP at 0.5 IoU on THUMOS14, outperforming prior work at all thresholds. Our code is available at https://github.com/hanielwang/tvnet.
Temporal action proposal generation is an important yet challenging problem, since temporal proposals with rich action content are indispensable for analysing real-world videos with long duration and a high proportion of irrelevant content. This problem requires methods that not only generate proposals with precise temporal boundaries, but also retrieve proposals that cover ground-truth action instances with high recall and high overlap using relatively few proposals. To address these difficulties, we introduce an effective proposal generation method, named Boundary-Sensitive Network (BSN), which adopts a "local to global" fashion. Locally, BSN first locates temporal boundaries with high probabilities, then directly combines these boundaries as proposals. Globally, with the Boundary-Sensitive Proposal feature, BSN retrieves proposals by evaluating the confidence of whether a proposal contains an action within its region. We conduct experiments on two challenging datasets, ActivityNet-1.3 and THUMOS14, where BSN outperforms other state-of-the-art temporal action proposal generation methods with high recall and high temporal precision. Finally, further experiments demonstrate that by combining existing action classifiers, our method significantly improves the state-of-the-art temporal action detection performance.
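To make the "local to global" recipe above concrete, here is a minimal Python sketch (not the BSN code) that assumes per-snippet start and end probabilities are already available, picks candidate boundaries with a placeholder peak/threshold rule, and pairs them into proposals scored by a simple probability product; the threshold, peak rule and maximum length are illustrative assumptions.

```python
import numpy as np

def peaks(prob, thr=0.5):
    """Indices that are local maxima or exceed a threshold (placeholder rule)."""
    keep = []
    for t in range(len(prob)):
        is_peak = 0 < t < len(prob) - 1 and prob[t] > prob[t - 1] and prob[t] > prob[t + 1]
        if prob[t] >= thr or is_peak:
            keep.append(t)
    return keep

def pair_boundaries(start_prob, end_prob, max_len=64):
    """Combine candidate starts and ends into (start, end, score) proposals."""
    proposals = []
    for s in peaks(start_prob):
        for e in peaks(end_prob):
            if s < e <= s + max_len:
                proposals.append((s, e, start_prob[s] * end_prob[e]))
    return sorted(proposals, key=lambda p: -p[2])

rng = np.random.default_rng(0)
start_p, end_p = rng.random(100), rng.random(100)   # stand-ins for boundary probabilities
print(pair_boundaries(start_p, end_p)[:5])
```

In BSN the retrieved proposals are additionally re-scored with the Boundary-Sensitive Proposal feature; the product score above only stands in for that confidence evaluation.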
Temporal action proposal generation (TAPG) is a challenging task that aims to locate action instances in untrimmed videos with temporal boundaries. To evaluate the confidence of proposals, existing works typically predict the action scores of proposals, supervised by the temporal Intersection-over-Union (tIoU) between proposals and the ground truth. In this paper, we innovatively propose a general auxiliary Background Constraint idea to further suppress low-quality proposals, by utilizing the background prediction score to restrict the confidence of proposals. In this way, the Background Constraint concept can be easily plugged into existing TAPG methods (e.g., BMN, GTAD). From this perspective, we propose the Background Constraint Network (BCNet) to further exploit the rich information of action and background. Specifically, we introduce an Action-Background Interaction module for reliable confidence evaluation, which models the inconsistency between action and background through attention mechanisms at the frame and clip levels. Extensive experiments are conducted on two popular benchmarks, ActivityNet-1.3 and THUMOS14. The results demonstrate that our method outperforms state-of-the-art methods. Equipped with an existing action classifier, our method also achieves remarkable performance on the temporal action localization task.
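The background-constraint idea itself is simple enough to sketch; the snippet below (an assumption-laden illustration, not BCNet) down-weights a proposal's action confidence by the mean background probability predicted inside its span. The averaging rule and the multiplicative suppression are placeholders for whatever form the actual method uses.

```python
import numpy as np

def constrained_confidence(action_conf, bg_prob, start, end):
    """Suppress a proposal's confidence with the mean background probability
    predicted inside its temporal span (simple illustrative rule)."""
    bg = float(np.mean(bg_prob[start:end]))
    return action_conf * (1.0 - bg)

rng = np.random.default_rng(1)
bg_prob = rng.random(200)            # per-snippet background probability (stand-in)
print(constrained_confidence(0.9, bg_prob, 40, 80))
```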
Temporal action detection aims to localize the boundaries of actions in videos. Current boundary-matching-based methods enumerate and calculate all possible boundary matchings to generate proposals. However, these methods neglect long-range context aggregation in boundary prediction. Meanwhile, due to the similar semantics of adjacent matchings, local semantic aggregation over densely generated matchings cannot improve semantic richness and discrimination. In this paper, we propose an end-to-end proposal generation method named Dual Context Aggregation Network (DCAN) to aggregate context at two levels, namely the boundary level and the proposal level, for generating high-quality action proposals and thereby improving the performance of temporal action detection. Specifically, we design Multi-Path Temporal Context Aggregation (MTCA) to achieve smooth context aggregation at the boundary level and precise evaluation of boundaries. For matching evaluation, Coarse-to-fine Matching (CFM) is designed to aggregate context and refine the matching map from coarse to fine. We conduct extensive experiments on ActivityNet v1.3 and THUMOS-14. DCAN obtains 35.39% average mAP on ActivityNet v1.3 and reaches 54.14% mAP on THUMOS-14, demonstrating that DCAN can generate high-quality proposals and achieve state-of-the-art performance. We release the code at https://github.com/cg1177/dcan.
Temporal action localization aims to predict the boundary and category of each action instance in untrimmed long videos. Most previous methods based on anchors or proposals neglect the global-local context interaction across the entire video sequence. Moreover, their multi-stage designs cannot generate action boundaries and categories directly. To address these issues, this paper proposes a novel end-to-end model called Adaptive Perception Transformer (AdaPerFormer for short). Specifically, AdaPerFormer explores a dual-branch multi-head self-attention mechanism. One branch handles global perception attention, which can model the entire video sequence and aggregate globally relevant contexts, while the other branch concentrates on a local convolutional shift to aggregate intra-frame and inter-frame information through our bidirectional shift operation. The end-to-end design produces the boundaries and categories of video actions without extra steps. Extensive experiments together with ablation studies are provided to reveal the effectiveness of our design. Our method achieves state-of-the-art accuracy on the THUMOS14 dataset (in terms of mAP@0.5, 42.6% mAP@0.7 and 62.7% average mAP) and obtains competitive performance on the ActivityNet-1.3 dataset with 36.1% average mAP. Code and models are available at https://github.com/soupero/adaperformer.
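As a rough illustration of the local branch's bidirectional shift, the sketch below (not the authors' implementation) moves one fraction of the feature channels one step backward in time and another fraction one step forward, so each snippet mixes information from its neighbours before any local convolution; the channel split and zero padding are assumptions.

```python
import numpy as np

def bidirectional_shift(x, frac=0.25):
    """Shift the first `frac` of channels one step back in time and the next
    `frac` one step forward, leaving the remaining channels untouched."""
    T, C = x.shape
    k = int(C * frac)
    out = x.copy()
    out[:-1, :k] = x[1:, :k]             # shift backward (future -> current)
    out[-1, :k] = 0.0                    # zero padding at the sequence end
    out[1:, k:2 * k] = x[:-1, k:2 * k]   # shift forward (past -> current)
    out[0, k:2 * k] = 0.0                # zero padding at the sequence start
    return out

feats = np.random.default_rng(6).random((64, 32))   # T x C snippet features (stand-in)
print(bidirectional_shift(feats).shape)
```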
Temporal action proposal generation is a challenging and promising task which aims to locate temporal regions in real-world videos where actions or events may occur. Current bottom-up proposal generation methods can generate proposals with precise boundaries, but cannot efficiently generate adequately reliable confidence scores for retrieving proposals. To address these difficulties, we introduce the Boundary-Matching (BM) mechanism to evaluate confidence scores of densely distributed proposals, which denotes a proposal as a matching pair of its starting and ending boundaries and combines all densely distributed BM pairs into the BM confidence map. Based on the BM mechanism, we propose an effective, efficient and end-to-end proposal generation method, named Boundary-Matching Network (BMN), which generates proposals with precise temporal boundaries as well as reliable confidence scores simultaneously. The two branches of BMN are jointly trained in a unified framework. We conduct experiments on two challenging datasets, THUMOS-14 and ActivityNet-1.3, where BMN shows significant performance improvement with remarkable efficiency and generalizability. Further, combined with an existing action classifier, BMN can achieve state-of-the-art temporal action detection performance.
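A boundary-matching confidence map can be pictured as a 2-D array indexed by duration and start time, where each cell scores one densely enumerated proposal. The toy sketch below uses a random array in place of the network output and simply flattens the map back into ranked (start, end, confidence) candidates; it illustrates only how the map indexes proposals, not the BMN model.

```python
import numpy as np

T, D = 100, 32                       # temporal length, maximum proposal duration
rng = np.random.default_rng(2)
bm_map = rng.random((D, T))          # bm_map[d, s] scores the proposal [s, s + d + 1)

proposals = []
for d in range(D):
    for s in range(T):
        e = s + d + 1
        if e <= T:
            proposals.append((s, e, float(bm_map[d, s])))

proposals.sort(key=lambda p: -p[2])
print(proposals[:5])                 # top-scoring (start, end, confidence) candidates
```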
Temporal action localization has long been studied in computer vision. Existing state-of-the-art action localization methods divide each video into multiple action units (i.e., proposals in two-stage methods and segments in one-stage methods) and then perform recognition or regression on each of them individually, without explicitly exploiting their relations during learning. In this paper, we claim that the relations between action units play an important role in action localization, and that a more powerful action detector should not only capture the local content of each action unit but also allow a wider field of view on the context related to it. To this end, we propose a general graph convolutional module (GCM) that can be easily plugged into existing action localization methods, including both the two-stage and one-stage paradigms. Specifically, we first construct a graph, where each action unit is represented as a node and the relation between two action units as an edge. Here we use two types of relations, one for capturing the temporal connections between different action units and the other for characterizing their semantic relationships. Particularly for the temporal connections in two-stage methods, we further explore two different kinds of edges, one connecting overlapping action units and the other connecting surrounding but disjoint units. On the constructed graph, we then apply graph convolutional networks (GCNs) to model the relations among different action units, which is able to learn more informative representations that enhance action localization. Experimental results show that our GCM consistently improves the performance of existing action localization methods, including two-stage methods (e.g., CBR and R-C3D) and one-stage methods (e.g., D-SSAD), verifying the generality and effectiveness of our GCM.
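Below is a toy sketch of the graph construction described above, with the two kinds of temporal edges used in the two-stage setting: edges between overlapping action units (tIoU above a threshold) and edges between surrounding but disjoint units (close centres, zero overlap), followed by one mean-aggregation step standing in for a graph convolution. The thresholds and the normalised aggregation are illustrative assumptions, not the GCM formulation.

```python
import numpy as np

def tiou(a, b):
    """Temporal IoU between two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def build_adjacency(units, iou_thr=0.3, dist_thr=5.0):
    """Edges between overlapping units (tIoU) and surrounding disjoint units (centre distance)."""
    n = len(units)
    A = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            overlap = tiou(units[i], units[j]) > iou_thr
            ci = 0.5 * (units[i][0] + units[i][1])
            cj = 0.5 * (units[j][0] + units[j][1])
            surrounding = tiou(units[i], units[j]) == 0.0 and abs(ci - cj) < dist_thr
            if overlap or surrounding:
                A[i, j] = A[j, i] = 1.0
    return A

units = [(0, 4), (2, 6), (10, 14), (15, 18)]         # (start, end) action units
feats = np.random.default_rng(3).random((len(units), 8))
A = build_adjacency(units)
A = A / A.sum(axis=1, keepdims=True)                 # row-normalised aggregation
print((A @ feats).shape)                             # relation-enhanced unit features
```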
Temporal action localization (TAL) is the task of identifying a set of actions in a video, which involves localizing the start and end frames and classifying each action instance. Existing methods address this task by using predefined anchor windows or heuristic bottom-up boundary-matching strategies, which are major bottlenecks at inference time. In addition, a main challenge is the inability to capture long-range actions due to the lack of global contextual information. In this paper, we present an anchor-free framework, referred to as HTNet, which predicts a set of <start time, end time, class> triplets from a video based on a Transformer architecture. After predicting coarse boundaries, we refine them through a background feature sampling (BFS) module and hierarchical Transformers, which enables our model to aggregate global contextual information and effectively exploit the inherent semantic relationships within a video. We demonstrate how our method localizes accurate action instances and achieves state-of-the-art performance on two TAL benchmark datasets: THUMOS14 and ActivityNet 1.3.
The main challenge of temporal action localization is to retrieve subtle human actions from various co-occurring ingredients, e.g., context and background, in an untrimmed video. Although previous approaches have achieved substantial progress by devising advanced action detectors, they still suffer from these co-occurring ingredients, which often dominate the actual action content in videos. In this paper, we explore two orthogonal but complementary aspects of a video snippet, namely the action features and the co-occurrence features. In particular, we develop a novel auxiliary task by decoupling these two kinds of features within a video snippet and recombining them to generate a new feature representation with more salient action information for accurate action localization. We term our method RefactorNet, which first explicitly factorizes the action content and regularizes its co-occurring features, and then synthesizes a new action-dominated video representation. Extensive experimental results and ablation studies on THUMOS14 and ActivityNet v1.3 demonstrate that our new representation, combined with a simple action detector, can significantly improve action localization performance.
Temporal action detection (TAD) is extensively studied in the video understanding community by generally following the object detection pipeline in images. However, complex designs are not uncommon in TAD, such as two-stream feature extraction, multi-stage training, complex temporal modeling, and global context fusion. In this paper, we do not aim to introduce any novel technique for TAD. Instead, we study a simple, straightforward, yet must-know baseline given the current status of complex design and low detection efficiency in TAD. In our simple baseline (termed BasicTAD), we decompose the TAD pipeline into several essential components: data sampling, backbone design, neck construction, and detection head. We extensively investigate the existing techniques in each component for this baseline, and more importantly, perform end-to-end training over the entire pipeline thanks to the simplicity of design. As a result, this simple BasicTAD yields an astounding and real-time RGB-only baseline very close to the state-of-the-art methods with two-stream inputs. In addition, we further improve BasicTAD by preserving more temporal and spatial information in the network representation (termed PlusTAD). Empirical results demonstrate that our PlusTAD is very efficient and significantly outperforms previous methods on the THUMOS14 and FineAction datasets. Meanwhile, we also perform in-depth visualization and error analysis on our proposed method and try to provide more insights on the TAD problem. Our approach can serve as a strong baseline for future TAD research. The code and model will be released at https://github.com/MCG-NJU/BasicTAD.
It has been found that temporal action proposal generation, which aims to discover temporal action instances within start and end frame ranges in untrimmed videos, can largely benefit from proper exploitation of temporal and semantic context. The latest efforts were devoted to considering temporal context and similarity-based semantic context through self-attention modules. However, they still suffer from cluttered background information and limited contextual feature learning. In this paper, we propose a novel Pyramid Region-based Slot Attention (PRSlot) module to address these issues. Instead of using similarity computation, our PRSlot module directly learns local relations in an encoder-decoder manner and generates the representation of a local region enhanced by attention over the input features, termed a slot. Specifically, on the input snippet-level features, the PRSlot module takes the target snippet as the query and its surrounding region as the key, and then generates a slot representation for each query-key pair by aggregating the local snippet context with a parallel pyramid strategy. Based on the PRSlot module, we present a novel Pyramid Region-based Slot Attention Network, termed PRSA-Net, to learn a unified visual representation with rich temporal and semantic context for better proposal generation. Extensive experiments are conducted on the two widely adopted THUMOS14 and ActivityNet-1.3 benchmarks. Our PRSA-Net outperforms other state-of-the-art methods. In particular, we improve AR@100 from the previous best 50.67% to 56.12% for proposal generation, and raise mAP under 0.5 tIoU from 51.9% to 58.7% for action detection on THUMOS14. Code is available at https://github.com/handhand123/prsa-net
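The query-key slot idea can be sketched as attending from each target snippet to a fixed local window around it; a real pyramid would repeat this at several window widths in parallel. The snippet below is only an illustration under these assumptions, with plain softmax attention standing in for the learned encoder-decoder slot update.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def local_region_slots(feats, radius=4):
    """For each snippet (query), attend over its surrounding window (keys)
    and pool a region-enhanced slot vector."""
    T, C = feats.shape
    slots = np.zeros_like(feats)
    for t in range(T):
        lo, hi = max(0, t - radius), min(T, t + radius + 1)
        region = feats[lo:hi]                          # keys/values from the local region
        attn = softmax(region @ feats[t] / np.sqrt(C))
        slots[t] = attn @ region                       # slot representation for snippet t
    return slots

feats = np.random.default_rng(7).random((50, 16))      # snippet-level features (stand-in)
print(local_region_slots(feats).shape)                 # one slot per target snippet
```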
Weakly-supervised temporal action localization (WTAL) in untrimmed videos has emerged as a practical but challenging task, since only video-level labels are available. Existing approaches typically leverage off-the-shelf segment-level features, which suffer from spatial incompleteness and temporal incoherence, limiting their performance. In this paper, we tackle this problem from a new perspective by enhancing segment-level representations with a simple yet effective graph convolutional network, namely the Action Complement Graph Network (ACGNet). It enables the current video segment to perceive spatial-temporal dependencies from others that potentially convey complementary clues, implicitly mitigating the negative effects caused by the two issues above. In this way, the segment-level features become more discriminative and more robust to spatial-temporal variations, contributing to higher localization accuracy. More importantly, the proposed ACGNet works as a universal module that can be flexibly plugged into different WTAL frameworks while maintaining end-to-end training. Extensive experiments are conducted on the THUMOS'14 and ActivityNet1.2 benchmarks, where the state-of-the-art results clearly demonstrate the superiority of the proposed approach.
We address temporal action localization in untrimmed long videos. This is important because videos in real applications are usually unconstrained and contain multiple action instances plus video content of background scenes or other activities. To address this challenging issue, we exploit the effectiveness of deep networks in temporal action localization via three segment-based 3D ConvNets: (1) a proposal network identifies candidate segments in a long video that may contain actions; (2) a classification network learns a one-vs-all action classification model to serve as initialization for the localization network; and (3) a localization network fine-tunes the learned classification network to localize each action instance. We propose a novel loss function for the localization network to explicitly consider temporal overlap and achieve high temporal localization accuracy. In the end, only the proposal network and the localization network are used during prediction. On two large-scale benchmarks, our approach achieves significantly superior performance compared with other state-of-the-art systems: mAP increases from 1.7% to 7.4% on MEXaction2 and from 15.0% to 19.0% on THUMOS 2014.
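The key property of such a localization loss is that it couples classification confidence with temporal overlap, so a confident score on a poorly overlapping segment is penalised. The sketch below is only an illustrative stand-in with an assumed functional form, not the loss proposed in the paper.

```python
import numpy as np

def overlap_aware_loss(score, iou, alpha=0.5, eps=1e-6):
    """Illustrative overlap-aware loss: the higher the tIoU, the more the segment
    is pushed towards a high action score, and vice versa."""
    ce_pos = -np.log(np.clip(score, eps, 1.0))          # encourage high score
    ce_neg = -np.log(np.clip(1.0 - score, eps, 1.0))    # encourage low score
    weight = np.power(np.clip(iou, 0.0, 1.0), alpha)    # overlap-dependent weighting
    return float(weight * ce_pos + (1.0 - weight) * ce_neg)

print(overlap_aware_loss(score=0.9, iou=0.8))   # high overlap: small loss
print(overlap_aware_loss(score=0.9, iou=0.1))   # low overlap: confident score is penalised
```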
To balance the annotation labor and the granularity of supervision, single-frame annotation has been introduced in temporal action localization. It provides a rough temporal location for an action but implicitly overstates the supervision from the annotated frame during training, leading to confusion between actions and backgrounds, i.e., action incompleteness and background false positives. To tackle these two challenges, in this work we present the Snippet Classification model and the Dilation-Erosion module. In the Dilation-Erosion module, we expand the potential action segments with a loose criterion to alleviate the problem of action incompleteness, and then remove the background from the potential action segments to alleviate the problem of background false positives. Relying on the single-frame annotation and the output of the snippet classification, the Dilation-Erosion module mines pseudo snippet-level ground truth, hard backgrounds and evident backgrounds, which in turn further train the Snippet Classification model. This forms a cyclic dependency. Furthermore, we propose a new embedding loss to aggregate the features of action instances with the same label and separate the features of actions from backgrounds. Experiments on THUMOS14 and ActivityNet 1.2 validate the effectiveness of the proposed method. Code has been made publicly available (https://github.com/LingJun123/single-frame-TAL).
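Here is a one-dimensional sketch of the dilate-then-erode idea: starting from the annotated frame, the segment is expanded while snippet scores stay above a loose threshold, and its ends are then trimmed back where scores fall below a stricter one. The thresholds, the score sequence and the purely greedy expansion are assumptions made for illustration, not the module as published.

```python
import numpy as np

def dilate_erode(scores, anchor, loose=0.2, strict=0.6):
    """Expand around an annotated frame while scores stay above a loose threshold,
    then erode the ends whose scores fall below a stricter threshold."""
    left = anchor
    while left > 0 and scores[left - 1] > loose:          # dilation to the left
        left -= 1
    right = anchor
    while right < len(scores) - 1 and scores[right + 1] > loose:  # dilation to the right
        right += 1
    while left < anchor and scores[left] < strict:        # erosion from the left
        left += 1
    while right > anchor and scores[right] < strict:      # erosion from the right
        right -= 1
    return left, right

scores = np.array([0.1, 0.3, 0.4, 0.9, 0.95, 0.8, 0.3, 0.25, 0.1])  # snippet action scores
print(dilate_erode(scores, anchor=4))   # pseudo segment around the single annotated frame
```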
Distracted driving causes thousands of deaths every year, and how to apply deep learning methods to prevent these tragedies has become a crucial problem. In Track3 of the 6th AI City Challenge, researchers provide a high-quality video dataset with dense action annotations. Due to the data scale and unclear action boundaries, the dataset presents a unique challenge for precisely localizing all the different actions and classifying their categories. In this paper, we make full use of the multi-view synchronization among videos and conduct robust Multi-View Practice (MVP) for driving action localization. To avoid overfitting, we fine-tune SlowFast with Kinetics-700 pre-training as the feature extractor. The features of different views are then passed to ActionFormer to generate candidate action proposals. To localize all actions precisely, we design an elaborate post-processing procedure, including model voting, threshold filtering and duplicate removal. The results show that our MVP is robust for driving action localization, achieving a 28.49% F1-score on the Track3 test set.
Detecting actions in untrimmed videos is an important yet challenging task. In this paper, we present the structured segment network (SSN), a novel framework which models the temporal structure of each action instance via a structured temporal pyramid. On top of the pyramid, we further introduce a decomposed discriminative model comprising two classifiers, respectively for classifying actions and determining completeness. This allows the framework to effectively distinguish positive proposals from background or incomplete ones, thus leading to both accurate recognition and localization. These components are integrated into a unified network that can be efficiently trained in an end-to-end fashion. Additionally, a simple yet effective temporal action proposal scheme, dubbed temporal actionness grouping (TAG) is devised to generate high quality action proposals. On two challenging benchmarks, THUMOS14 and ActivityNet, our method remarkably outperforms previous state-of-the-art methods, demonstrating superior accuracy and strong adaptivity in handling actions with various temporal structures.
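In the spirit of the structured temporal pyramid, the rough numpy sketch below augments a proposal span with surrounding starting and ending stages, pools the central stage at two pyramid levels, and concatenates the result into the vector that would feed the action and completeness classifiers. The stage extents and pyramid depth are placeholders, not the SSN configuration.

```python
import numpy as np

def pyramid_pool(feats, levels=(1, 2)):
    """Average-pool a (T, C) span at several temporal granularities and concatenate."""
    parts = []
    for k in levels:
        for chunk in np.array_split(feats, k, axis=0):
            parts.append(chunk.mean(axis=0))
    return np.concatenate(parts)

def structured_pyramid_feature(video_feats, start, end):
    """Pool starting / central / ending stages of an augmented proposal."""
    length = end - start
    s0 = max(0, start - length // 2)
    e0 = min(len(video_feats), end + length // 2)
    starting = video_feats[s0:start].mean(axis=0)      # context before the proposal
    ending = video_feats[end:e0].mean(axis=0)          # context after the proposal
    central = pyramid_pool(video_feats[start:end])     # pyramid over the proposal itself
    return np.concatenate([starting, central, ending])

video = np.random.default_rng(4).random((200, 16))     # T x C snippet features (stand-in)
print(structured_pyramid_feature(video, 80, 120).shape)
```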
Temporal action localization plays an important role in video analysis, aiming to localize and classify actions in untrimmed videos. Previous methods usually predict actions on a feature space of a single temporal scale. However, the temporal features at a low-level scale lack sufficient semantics for action classification, while a high-level scale cannot provide rich details of the action boundaries. To address this issue, we propose to predict actions on a feature space of multiple temporal scales. Specifically, we use refined feature pyramids of different scales to pass semantics from high-level scales to low-level scales. Furthermore, to establish a long temporal scale over the entire video, we use a spatial-temporal Transformer encoder to capture long-range dependencies among video frames. The refined features with long-range dependencies are then fed into a classifier for coarse action prediction. Finally, to further improve prediction accuracy, we propose a frame-level self-attention module to refine the classification and boundaries of each action instance. Extensive experiments show that the proposed method outperforms state-of-the-art approaches on the THUMOS14 dataset and achieves comparable performance on the ActivityNet1.3 dataset. Compared with A2Net (TIP20, avg{0.3:0.7}), Sub-Action (CSVT2022, avg{0.1:0.5}) and AFSD (CVPR21, avg{0.3:0.7}), the proposed method improves performance by 12.6%, 17.4% and 2.2%, respectively.
The potential for agents, whether embodied or software, to learn by observing other agents performing procedures involving objects and actions is rich. Current research on automatic procedure learning heavily relies on action labels or video subtitles, even during the evaluation phase, which makes them infeasible in real-world scenarios. This leads to our question: can the human-consensus structure of a procedure be learned from a large set of long, unconstrained videos (e.g., instructional videos from YouTube) with only visual evidence? To answer this question, we introduce the problem of procedure segmentation: to segment a video procedure into category-independent procedure segments. Given that no large-scale dataset is available for this problem, we collect a large-scale procedure segmentation dataset with procedure segments temporally localized and described; we use cooking videos and name the dataset YouCook2. We propose a segment-level recurrent network for generating procedure segments by modeling the dependencies across segments. The generated segments can be used as pre-processing for other tasks, such as dense video captioning and event parsing. We show in our experiments that the proposed model outperforms competitive baselines in procedure segmentation.
Self-attention based Transformer models have shown impressive results for image classification and object detection, and have recently been applied to video understanding. Inspired by this success, we investigate the application of Transformer networks for temporal action localization in videos. To this end, we present ActionFormer, a simple yet powerful model that identifies actions in time and recognizes their categories in a single shot, without using action proposals or relying on pre-defined anchor windows. ActionFormer combines a multi-scale feature representation with local self-attention, and uses a lightweight decoder to classify every moment in time and estimate the corresponding action boundaries. We show that this orchestrated design yields major improvements over prior works. Without bells and whistles, ActionFormer reaches 71.0% mAP at tIoU=0.5 on THUMOS14, outperforming the best prior model by 14.1 absolute percentage points. Furthermore, ActionFormer shows strong results on ActivityNet 1.3 (36.6% average mAP) and EPIC-Kitchens 100 (+13.5% average mAP over prior works). Our code is available at http://github.com/happyharrycn/actionformer_release.
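In this anchor-free, single-shot setting, decoding reduces to reading off, for every moment, a class score and two boundary offsets. The sketch below uses random arrays as stand-ins for the network outputs and a placeholder score threshold; it illustrates only the decoding step, not the ActionFormer model, and a suppression step such as NMS would normally follow.

```python
import numpy as np

T, num_classes = 128, 20
rng = np.random.default_rng(5)
cls_scores = rng.random((T, num_classes))     # per-moment class probabilities (stand-in)
offsets = rng.random((T, 2)) * 16             # per-moment (left, right) boundary distances

detections = []
for t in range(T):
    c = int(cls_scores[t].argmax())
    score = float(cls_scores[t, c])
    if score > 0.95:                           # placeholder score threshold
        start = max(0.0, t - offsets[t, 0])
        end = min(float(T), t + offsets[t, 1])
        detections.append((start, end, c, score))

detections.sort(key=lambda d: -d[3])
print(detections[:5])                          # (start, end, class, score) candidates
```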
Text-based video segmentation aims to segment an actor in video sequences by specifying the actor and its performed action with a textual query. Due to the problem of semantic asymmetry, previous methods fail to align the video content with the textual query in a fine-grained manner according to the actor and its action. Semantic asymmetry means that the two modalities contain different amounts of semantic information during multi-modal fusion. To alleviate this problem, we propose a novel actor and action modular network that localizes the actor and its action separately in two individual modules. Specifically, we first learn the actor- and action-related content from the video and the textual query, and then match them in a symmetrical manner to localize the target tube. The target tube contains the desired actor and action, and is then fed into a fully convolutional network to predict segmentation masks of the actor. Our method also establishes the association of objects across multiple frames with the proposed temporal proposal aggregation mechanism. This enables our method to segment the video effectively and maintain the temporal consistency of predictions. The whole model allows joint learning of actor-action matching and segmentation, and achieves state-of-the-art performance for both single-frame segmentation and full video segmentation on the A2D Sentences and J-HMDB Sentences datasets.