Being able to forecast the popularity of new garment designs is very important in an industry as fast-paced as fashion, both in terms of profitability and of reducing the problem of unsold inventory. Here, we attempt to address this task in order to provide informative forecasts to fashion designers within a virtual reality designer application, allowing them to fine-tune their creations based on current consumer preferences within an interactive and immersive environment. To achieve this, we have to deal with the following central challenges: (1) the proposed method should not hinder the creative process and thus has to rely only on the garment's visual characteristics, (2) a new garment lacks historical data from which to extrapolate its future popularity, and (3) fashion trends in general are highly dynamic. To this end, we develop a computer vision pipeline fine-tuned on fashion imagery in order to extract relevant visual features along with the category and attributes of the garment. We propose a hierarchical label sharing (HLS) pipeline for automatically capturing hierarchical relations among fashion categories and attributes. Moreover, we propose MuQAR, a Multimodal Quasi-AutoRegressive neural network that forecasts the popularity of new garments by combining their visual and categorical features, while an autoregressive neural network models the popularity time series of the garment's category and attributes. Both the proposed HLS and MuQAR prove capable of surpassing the current state-of-the-art on key benchmark datasets: DeepFashion for image classification and VISUELLE for new garment sales forecasting.
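To make the quasi-autoregressive idea concrete, here is a minimal PyTorch sketch (our illustration, not the authors' released code) that fuses a garment's visual and category embeddings with a recurrent encoding of its category's popularity time series; all module names, dimensions, and the forecasting head are assumptions.

```python
import torch
import torch.nn as nn

class MuQARSketch(nn.Module):
    """Hypothetical sketch of a multimodal quasi-autoregressive forecaster:
    visual and categorical embeddings of a new garment are fused with an
    RNN encoding of the popularity history of its category/attributes."""

    def __init__(self, visual_dim=512, n_categories=50, cat_dim=32,
                 ts_hidden=64, horizon=12):
        super().__init__()
        self.cat_embed = nn.Embedding(n_categories, cat_dim)
        # Autoregressive branch: encodes the popularity time series of the
        # garment's category (the garment itself has no history).
        self.ts_encoder = nn.GRU(input_size=1, hidden_size=ts_hidden,
                                 batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(visual_dim + cat_dim + ts_hidden, 128),
            nn.ReLU(),
            nn.Linear(128, horizon),  # popularity forecast per future step
        )

    def forward(self, visual_feats, category_ids, category_popularity):
        # visual_feats: (B, visual_dim) from a fashion-tuned vision model
        # category_ids: (B,) long tensor
        # category_popularity: (B, T, 1) past popularity of the category
        _, h = self.ts_encoder(category_popularity)
        fused = torch.cat([visual_feats, self.cat_embed(category_ids),
                           h.squeeze(0)], dim=-1)
        return self.head(fused)

model = MuQARSketch()
out = model(torch.randn(4, 512), torch.randint(0, 50, (4,)),
            torch.randn(4, 24, 1))
print(out.shape)  # torch.Size([4, 12])
```

The key design point is that the autoregressive branch operates on the category/attribute series, since the new garment itself has no history to extrapolate from.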
In this paper, we address the problem of image splicing localization with a multi-stream network architecture that processes the raw RGB image in parallel with other handcrafted forensic signals. Unlike previous methods that either use only the RGB images or stack several signals in a channel-wise manner, we propose an encoder-decoder architecture that consists of multiple encoder streams. Each stream is fed with either the tampered image or one of the handcrafted signals and processes it separately, capturing relevant information from each input independently. Finally, the features extracted from the multiple streams are fused in the bottleneck of the architecture and propagated to the decoder network that generates the output localization map. We experiment with two handcrafted algorithms, i.e., DCT and Splicebuster. Our proposed approach is benchmarked on three public forensics datasets, demonstrating competitive performance against several existing methods and achieving state-of-the-art results, e.g., 0.898 AUC on CASIA.
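As a rough illustration of the multi-stream design (not the paper's exact architecture), the sketch below feeds an RGB image and a single-channel handcrafted signal through separate encoders, concatenates their features at the bottleneck, and decodes a localization map; layer counts and sizes are placeholders.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Downsampling encoder stage: conv + ReLU + stride-2 pooling.
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                         nn.ReLU(), nn.MaxPool2d(2))

class MultiStreamSplicingNet(nn.Module):
    """Illustrative two-stream encoder-decoder: one encoder for the RGB
    image, one for a handcrafted forensic signal (e.g., a DCT-based map);
    features are fused at the bottleneck and decoded into a localization map."""

    def __init__(self):
        super().__init__()
        self.rgb_encoder = nn.Sequential(conv_block(3, 32), conv_block(32, 64))
        self.forensic_encoder = nn.Sequential(conv_block(1, 32), conv_block(32, 64))
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 2, stride=2), nn.ReLU(),
            nn.Conv2d(32, 1, 1),  # per-pixel tampering logit
        )

    def forward(self, rgb, forensic_map):
        fused = torch.cat([self.rgb_encoder(rgb),
                           self.forensic_encoder(forensic_map)], dim=1)
        return self.decoder(fused)  # (B, 1, H, W) localization logits

net = MultiStreamSplicingNet()
mask = net(torch.randn(2, 3, 64, 64), torch.randn(2, 1, 64, 64))
print(mask.shape)  # torch.Size([2, 1, 64, 64])
```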
The sheer volume of online user-generated content has rendered content moderation technologies essential for protecting digital platform audiences from content that may cause anxiety, worry, or concern. Despite the efforts towards developing automated solutions to tackle this problem, creating accurate models remains challenging due to the lack of adequate task-specific training data. This limitation stems largely from the fact that manually annotating such data is a highly demanding procedure that could severely affect the annotators' emotional well-being. In this paper, we propose the CM-Refinery framework, which leverages large-scale multimedia datasets to automatically extend initial training datasets with hard examples that can refine content moderation models, while significantly reducing the involvement of human annotators. We apply our method on two model adaptation strategies designed with respect to the different challenges observed while collecting data, i.e., lack of (i) task-specific negative data or (ii) both positive and negative data. Additionally, we introduce a diversity criterion applied to the data collection process that further enhances the generalization performance of the refined models. The proposed method is evaluated on the Not Safe for Work (NSFW) and disturbing content detection tasks on benchmark datasets, achieving accuracy improvements of 1.32% and 1.94% compared to the state of the art, respectively. Finally, it significantly reduces human involvement, as 92.54% of data are automatically annotated in the case of disturbing content, while no human intervention is required for the NSFW task.
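A minimal sketch of the data-refinement loop, under the assumption that hard examples are those the current model is least confident about and that diversity is enforced via a minimum embedding distance; `model.embed` and `model.classify` are hypothetical hooks, and the thresholds are illustrative.

```python
import torch

def select_hard_diverse(model, pool, tau_low=0.4, tau_high=0.6,
                        min_dist=0.5, device="cpu"):
    """Illustrative selection loop (not the authors' exact procedure):
    keep unlabeled samples whose predicted probability falls near the
    decision boundary (hard examples) and whose embeddings are far from
    already-selected ones (diversity criterion)."""
    selected, kept_embs = [], []
    model.eval()
    with torch.no_grad():
        for x in pool:  # pool yields single image tensors
            emb = model.embed(x.unsqueeze(0).to(device))   # hypothetical hook
            prob = torch.sigmoid(model.classify(emb)).item()
            if not (tau_low <= prob <= tau_high):
                continue  # confident prediction -> not a hard example
            if all(torch.dist(emb, e).item() >= min_dist for e in kept_embs):
                selected.append((x, prob))
                kept_embs.append(emb)
    return selected
```

Selected samples would then be added to the training set, with human annotators consulted only for the fraction that the automatic criteria cannot resolve.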
In this paper, we introduce MINTIME, a video deepfake detection approach that captures spatial and temporal anomalies and handles instances of multiple people in the same video and variations in face sizes. Previous approaches disregard such information either by using simple a-posteriori aggregation schemes, i.e., average or max operation, or using only one identity for the inference, i.e., the largest one. On the contrary, the proposed approach builds on a Spatio-Temporal TimeSformer combined with a Convolutional Neural Network backbone to capture spatio-temporal anomalies from the face sequences of multiple identities depicted in a video. This is achieved through an Identity-aware Attention mechanism that attends to each face sequence independently based on a masking operation and facilitates video-level aggregation. In addition, two novel embeddings are employed: (i) the Temporal Coherent Positional Embedding that encodes each face sequence's temporal information and (ii) the Size Embedding that encodes the size of the faces as a ratio to the video frame size. These extensions allow our system to adapt particularly well in the wild by learning how to aggregate information of multiple identities, which is usually disregarded by other methods in the literature. It achieves state-of-the-art results on the ForgeryNet dataset with an improvement of up to 14% AUC in videos containing multiple people and demonstrates ample generalization capabilities in cross-forgery and cross-dataset settings. The code is publicly available at https://github.com/davide-coccomini/MINTIME-Multi-Identity-size-iNvariant-TIMEsformer-for-Video-Deepfake-Detection.
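The two proposed embeddings and the masking mechanism can be illustrated schematically as follows; the exact formulations in MINTIME may differ, so the sinusoidal size code and the additive attention mask below are assumptions for exposition.

```python
import torch

def identity_attention_mask(identity_ids):
    """Illustrative identity-aware masking (a simplification of MINTIME's
    mechanism): tokens may only attend to tokens from the same face
    sequence, so each identity is modeled independently before
    video-level aggregation."""
    # identity_ids: (N,) identity index of each spatio-temporal token
    same = identity_ids.unsqueeze(0) == identity_ids.unsqueeze(1)  # (N, N)
    mask = torch.zeros(same.shape)
    mask[~same] = float("-inf")  # additive mask for attention logits
    return mask

def size_embedding(face_boxes, frame_hw, dim=16):
    """Illustrative size embedding: encode each face's area as a ratio of
    the frame area with a small sinusoidal code (assumed formulation)."""
    h, w = frame_hw
    areas = (face_boxes[:, 2] * face_boxes[:, 3]) / (h * w)  # (N,) ratios
    freqs = torch.arange(dim // 2).float()
    angles = areas.unsqueeze(1) * torch.pow(10000.0, -freqs / (dim // 2))
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=1)

ids = torch.tensor([0, 0, 1, 1])
print(identity_attention_mask(ids))  # blocks attention across identities
boxes = torch.tensor([[10., 10., 64., 80.], [0., 0., 128., 128.]])  # x, y, w, h
print(size_embedding(boxes, (720, 1280)).shape)  # torch.Size([2, 16])
```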
Class imbalance poses a major challenge for machine learning, as most supervised learning models tend to exhibit bias towards the majority class and underperform on the minority classes. Cost-sensitive learning tackles this problem by treating the classes differently, usually through a fixed, user-defined misclassification cost matrix provided as input to the learner. Tuning such parameters is a challenging task that requires domain knowledge; moreover, wrong adjustments may lead to overall predictive performance deterioration. In this work, we propose a novel cost-sensitive approach for imbalanced data that dynamically adjusts the misclassification costs in response to the model's performance, rather than using a fixed misclassification cost matrix. Our approach, called AdaCC, is parameter-free, as it relies on the cumulative behavior of the boosted model to adjust the misclassification costs for the next boosting round, and it comes with theoretical guarantees regarding the training error. Experiments on 27 real-world datasets from different domains demonstrate the superiority of our method over 12 state-of-the-art cost-sensitive approaches, with consistent improvements across different measures, e.g., AUC [0.3%-28.56%], balanced accuracy [3.4%-21.4%], gmean [4.8%-45%], and recall [7.4%-85.5%].
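A simplified, self-contained sketch of the dynamic-cost idea (not the published AdaCC algorithm, which has its own update rule and theoretical guarantees): per-class costs are recomputed each round from the cumulative ensemble's per-class error and scale the boosting weight update.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adacc_like_boost(X, y, rounds=20):
    """Sketch of dynamic cost-sensitive boosting for binary labels {0, 1}:
    after each round, the cumulative ensemble's per-class error rates set
    the costs that scale the next weight update, replacing a fixed matrix."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    learners, alphas = [], []
    costs = np.ones(2)  # dynamic per-class costs, start neutral
    for _ in range(rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        learners.append(stump)
        alphas.append(alpha)
        # Cumulative ensemble prediction so far (labels mapped to {-1, +1})
        agg = np.sign(sum(a * (l.predict(X) * 2 - 1)
                          for a, l in zip(alphas, learners)))
        ens_pred = (agg > 0).astype(int)
        # Dynamic costs: each class's cost tracks its cumulative error rate
        for c in (0, 1):
            mask = y == c
            if mask.any():
                costs[c] = 1.0 + np.mean(ens_pred[mask] != y[mask])
        # Cost-weighted AdaBoost-style update
        w *= np.exp(alpha * costs[y] * (pred != y))
        w /= w.sum()
    return learners, alphas

X = np.random.randn(200, 5)
y = (np.random.rand(200) < 0.2).astype(int)  # imbalanced toy labels
learners, alphas = adacc_like_boost(X, y)
```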
For fashion outfits to be considered aesthetically pleasing, the garments composing them need to be compatible in terms of visual aspects, such as style, category, and color. With the advent and omnipresence of computer vision deep learning models, interest has also increased in the task of visual compatibility detection, with the aim of developing high-quality fashion outfit recommendation systems. Previous works have defined visual compatibility as a binary classification task, where the items in an outfit are considered either fully compatible or fully incompatible. However, this is not applicable to outfit-maker applications in which users create their own outfits and need to know which specific items may be incompatible with the rest of the outfit. To address this, we propose the Visual InCompatibility TransfORmer (VICTOR), which is optimized for two tasks: 1) overall compatibility as regression and 2) the detection of mismatching items. Unlike previous works that rely either on feature extraction from ImageNet-pretrained models or on end-to-end fine-tuning, we utilize fashion-specific contrastive language-image pre-training to fine-tune computer vision neural networks on fashion imagery. Moreover, we build upon the Polyvore outfit benchmark to generate partially mismatching outfits, creating a new dataset termed Polyvore-MISFITs, which is used to train VICTOR. A series of ablation and comparative analyses show that the proposed architecture can compete with, and even surpass, the current state-of-the-art on Polyvore datasets while reducing instance-wise floating-point operations by 88%, striking a balance between high performance and efficiency.
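A schematic of the dual-task formulation, assuming (for illustration) a vanilla transformer encoder over per-item image embeddings with a pooled regression head and a per-item classification head; VICTOR's actual backbone and heads may be configured differently.

```python
import torch
import torch.nn as nn

class VictorSketch(nn.Module):
    """Illustrative dual-task head in the spirit of VICTOR: a transformer
    encoder contextualizes per-garment embeddings; a pooled head regresses
    overall outfit compatibility while a per-item head flags mismatches."""

    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.compat_head = nn.Linear(dim, 1)    # outfit-level regression
        self.mismatch_head = nn.Linear(dim, 1)  # per-item mismatch logit

    def forward(self, item_embs):
        # item_embs: (B, n_items, dim) image embeddings of the outfit items
        ctx = self.encoder(item_embs)
        compatibility = self.compat_head(ctx.mean(dim=1)).squeeze(-1)  # (B,)
        mismatch_logits = self.mismatch_head(ctx).squeeze(-1)  # (B, n_items)
        return compatibility, mismatch_logits

model = VictorSketch()
score, item_logits = model(torch.randn(2, 5, 256))
print(score.shape, item_logits.shape)  # torch.Size([2]) torch.Size([2, 5])
```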
In this work, we aim to classify the nodes of unstructured peer-to-peer networks under communication uncertainty, such as the users of decentralized social networks. Graph Neural Networks (GNNs) are known to improve the accuracy of simpler classifiers in centralized settings by exploiting naturally occurring network links, but graph convolutional layers are challenging to implement in decentralized settings when node neighbors are not constantly available. We address this problem by employing decoupled GNNs, where base classifier predictions and errors are diffused through the graph after training. For this, we deploy pre-trained and gossip-trained base classifiers and implement peer-to-peer graph diffusion under communication uncertainty. In particular, we develop an asynchronous decentralized formulation of diffusion that converges to the same predictions linearly with respect to communication rates. We experiment on three real-world graphs with node features and labels and simulate peer-to-peer networks with uniformly random communication frequencies; given a portion of known labels, our decentralized graph diffusion achieves comparable accuracy to centralized GNNs.
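The following toy simulation conveys the asynchronous diffusion principle in plain Python (a simplification: the paper also diffuses base-classifier errors and covers gossip-trained classifiers): nodes wake up at random times and re-estimate their scores from whatever neighbor values they currently hold.

```python
import random

def gossip_diffusion(neighbors, priors, alpha=0.9, steps=5000):
    """Toy asynchronous decentralized diffusion: each node keeps a score
    vector seeded by its base-classifier prediction and, whenever a random
    communication occurs, re-estimates it from known neighbor values.
    Estimates drift toward the centralized diffusion result."""
    scores = {v: list(p) for v, p in priors.items()}
    for _ in range(steps):
        v = random.choice(list(neighbors))  # random node "wakes up"
        if not neighbors[v]:
            continue
        dims = range(len(priors[v]))
        avg = [sum(scores[u][d] for u in neighbors[v]) / len(neighbors[v])
               for d in dims]
        scores[v] = [(1 - alpha) * priors[v][d] + alpha * avg[d] for d in dims]
    return scores

graph = {0: [1], 1: [0, 2], 2: [1]}                       # undirected chain
priors = {0: [1.0, 0.0], 1: [0.0, 0.0], 2: [0.0, 1.0]}    # class scores
print(gossip_diffusion(graph, priors))
```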
Reliable image geolocation is crucial for several applications, from social media geo-tagging to fake news detection. State-of-the-art geolocation methods surpass human performance on the task of geolocation estimation from images. However, no method assesses the suitability of an image for this task, which results in unreliable and erroneous estimations for images containing no geolocation clues. In this paper, we define the task of image localizability, i.e., the suitability of an image for geolocation, and propose a selective prediction methodology to address it. In particular, we propose two novel selection functions that exploit the output probability distributions of geolocation models to infer localizability at different scales. Our selection functions are benchmarked against the most widely used selective prediction baselines, outperforming them in all cases. By abstaining from predicting non-localizable images, we improve geolocation accuracy at the city scale to 70.5%, thus making current geolocation models reliable for real-world applications.
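The general selective-prediction recipe looks as follows; the two functions shown (maximum cell probability and negative entropy) are common generic choices used here for illustration, not necessarily the paper's proposed selection functions, which exploit the distributions at different geographic scales.

```python
import torch

def max_prob_score(probs):
    """Baseline selection function: confidence = highest cell probability."""
    return probs.max(dim=-1).values

def neg_entropy_score(probs, eps=1e-12):
    """Entropy-based selection function: peaked distributions over
    geographic cells score high, diffuse ones score low."""
    return (probs * (probs + eps).log()).sum(dim=-1)

def selective_geolocate(probs, threshold):
    # probs: (B, n_cells) model output distribution over geographic cells.
    # Abstain (return -1) when the selection score is below the threshold.
    scores = neg_entropy_score(probs)
    preds = probs.argmax(dim=-1)
    preds[scores < threshold] = -1  # -1 marks "non-localizable"
    return preds

probs = torch.softmax(torch.randn(4, 100), dim=-1)
print(selective_geolocate(probs, threshold=-4.0))
```

The threshold trades coverage for accuracy: raising it abstains on more images but makes the remaining predictions more reliable.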
Computer vision (CV) has achieved remarkable results, outperforming humans in several tasks. Nonetheless, it may lead to significant discrimination if not handled properly, as CV systems highly depend on the data they are fed with and can learn and amplify the biases contained in such data. The problem of understanding and discovering biases is therefore of utmost importance. However, no comprehensive survey on bias in visual datasets exists. Hence, this work aims to: i) describe the biases that can manifest in visual datasets; ii) review the literature on methods for bias discovery and quantification in visual datasets; and iii) discuss existing attempts to collect bias-aware visual datasets. A key conclusion of our study is that the problem of bias discovery and quantification in visual datasets remains open, and there is room for improvement both in terms of methods and in the range of biases that can be addressed. Moreover, there is no such thing as a bias-free dataset, so scientists and practitioners should become aware of the biases in their datasets and make them explicit. To this end, we propose a checklist for spotting different types of bias during visual dataset collection.
In this paper, we address the problem of high-performance and computationally efficient content-based video retrieval in large-scale datasets. Current methods typically propose either: (i) fine-grained approaches employing spatio-temporal representations and similarity calculations, achieving high performance at a high computational cost, or (ii) coarse-grained approaches representing/indexing videos as global vectors, where the spatio-temporal structure is lost, providing lower performance but also lower computational cost. In this work, we propose a knowledge distillation framework, called Distill-and-Select (DnS), that, starting from a well-performing fine-grained teacher network, learns: a) student networks at different retrieval performance and computational efficiency trade-offs, and b) a selector network that at test time rapidly directs samples to the appropriate student in order to maintain both high retrieval performance and high computational efficiency. We train several students with different architectures and arrive at different trade-offs of performance and efficiency, i.e., speed and storage requirements, including fine-grained students that use binary representations. Importantly, the proposed scheme allows knowledge distillation on large, unlabelled datasets, which leads to good students. We evaluate DnS on five public datasets across three different video retrieval tasks and demonstrate a) that our students achieve state-of-the-art performance in several cases, and b) that the DnS framework provides an excellent trade-off between retrieval performance, computational speed, and storage space. In specific configurations, the proposed method achieves similar mAP to the teacher while being 20 times faster and requiring 240 times less storage space. The collected dataset and implementation are publicly available at: https://github.com/mever-team/distill-and-select.
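The selector idea can be sketched as follows (an illustration of the routing principle, not the released implementation): a cheap student scores every pair and a small gating network decides which pairs justify invoking the expensive fine-grained student; the gate's input features are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectorRouting(nn.Module):
    """Schematic of Distill-and-Select routing: a coarse-grained student
    scores all query-target pairs; a selector decides per pair whether the
    cheap score suffices or the fine-grained student must be invoked."""

    def __init__(self, coarse_student, fine_student, feat_dim=512):
        super().__init__()
        self.coarse, self.fine = coarse_student, fine_student
        self.selector = nn.Sequential(nn.Linear(feat_dim + 1, 64), nn.ReLU(),
                                      nn.Linear(64, 1))

    def forward(self, query_feat, target_feat):
        cheap = self.coarse(query_feat, target_feat)  # (B,) similarity
        gate_in = torch.cat([query_feat - target_feat,
                             cheap.unsqueeze(-1)], dim=-1)
        use_fine = torch.sigmoid(self.selector(gate_in)).squeeze(-1) > 0.5
        scores = cheap.clone()
        if use_fine.any():  # fine student runs only on routed pairs
            scores[use_fine] = self.fine(query_feat[use_fine],
                                         target_feat[use_fine])
        return scores

# Stand-in students for the demo; real ones would be distilled networks.
coarse = lambda q, t: F.cosine_similarity(q, t, dim=-1)
fine = lambda q, t: F.cosine_similarity(q, t, dim=-1) * 1.1
router = SelectorRouting(coarse, fine)
print(router(torch.randn(8, 512), torch.randn(8, 512)).shape)  # torch.Size([8])
```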