Depth estimation is essential for various important real-world applications, such as autonomous driving. However, it suffers from severe performance degradation in high-velocity scenarios, since conventional cameras can only capture blurred images. To deal with this problem, the spike camera is designed to capture pixel-wise luminance intensity at a high frame rate. However, depth estimation with a spike camera remains very challenging using traditional monocular or stereo depth estimation algorithms, which are based on photometric consistency. In this paper, we propose a novel Uncertainty-Guided Depth Fusion (UGDF) framework to fuse the predictions of monocular and stereo depth estimation networks for spike cameras. Our framework is motivated by the fact that stereo spike depth estimation achieves better results at close range, while monocular spike depth estimation obtains better results at long range. Therefore, we introduce a dual-task depth estimation architecture with a joint training strategy and estimate the distributed uncertainty to fuse the monocular and stereo results. To demonstrate the advantage of spike depth estimation over traditional camera depth estimation, we contribute a spike-depth dataset named CitySpike20K, which contains 20K paired samples, for spike depth estimation. UGDF achieves state-of-the-art results on CitySpike20K, surpassing all monocular and stereo spike depth estimation baselines, and we conduct extensive experiments to evaluate the effectiveness and generalization of our method on CitySpike20K. To the best of our knowledge, our framework is the first dual-task fusion framework for spike camera depth estimation. The code and dataset will be released.
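The fusion rule at the heart of such a framework is easy to sketch: weight each branch's per-pixel depth by the inverse of its estimated uncertainty, so the more confident branch dominates at each pixel. A minimal sketch in PyTorch (function and variable names are illustrative, not from the released code):

```python
import torch

def fuse_depths(mono_depth, mono_unc, stereo_depth, stereo_unc, eps=1e-6):
    """Fuse monocular and stereo depth maps pixel-wise.

    Each prediction is weighted by the inverse of its estimated
    uncertainty, so whichever branch is more confident at a given
    pixel dominates the fused result.
    """
    w_mono = 1.0 / (mono_unc + eps)
    w_stereo = 1.0 / (stereo_unc + eps)
    return (w_mono * mono_depth + w_stereo * stereo_depth) / (w_mono + w_stereo)

# Toy example: the stereo branch is more confident here, so the fused
# depth lands closer to the stereo prediction.
mono = torch.full((1, 1, 4, 4), 10.0)     # monocular depth (m)
stereo = torch.full((1, 1, 4, 4), 9.0)    # stereo depth (m)
fused = fuse_depths(mono, torch.full_like(mono, 0.5),
                    stereo, torch.full_like(stereo, 0.1))
print(fused.mean())                        # ~9.17, dominated by stereo
```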
Neuromorphic spike cameras generate data streams with high temporal resolution in a bio-inspired way, which has vast potential in real-world applications such as autonomous driving. In contrast to RGB streams, spike streams have an inherent advantage in overcoming motion blur, leading to more accurate depth estimation for high-speed objects. However, it is almost impossible to train a spike depth estimation network in a supervised manner, since obtaining paired depth labels for temporally dense spike streams is laborious and challenging. In this paper, instead of building a spike stream dataset with full depth labels, we transfer knowledge from open-source RGB datasets (e.g., KITTI) and estimate spike depth in an unsupervised manner. The key challenges of such a problem lie in the modality gap between the RGB and spike modalities, as well as the domain gap between the labeled source RGB and unlabeled target spike domains. To overcome these challenges, we introduce a cross-modality cross-domain (BiCross) framework for unsupervised spike depth estimation. Our method narrows the enormous gap between source RGB and target spike by introducing an intermediate simulated source spike domain. To be specific, for the cross-modality phase, we propose a novel Coarse-to-Fine Knowledge Distillation (CFKD), which transfers image-level and pixel-level knowledge from source RGB to source spike. This design leverages the abundant semantic information of the RGB modality and the dense temporal information of the spike modality, respectively. For the cross-domain phase, we introduce an Uncertainty-Guided Mean-Teacher (UGMT) to generate reliable pseudo labels with uncertainty estimation, alleviating the shift between the source spike and target spike domains. Besides, we propose a Global-Level Feature Alignment method (GLFA) to align the features between the two domains and generate more reliable pseudo labels.
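As a rough illustration of the uncertainty-guided mean-teacher idea (the variance-based uncertainty proxy and all names here are assumptions for illustration, not the paper's exact formulation): a teacher that tracks an exponential moving average of the student predicts depth on unlabeled target spike streams, and only pixels whose predictions stay stable under perturbation contribute to the pseudo-label loss.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    # Teacher weights follow an exponential moving average of the student
    # (teacher is initialized as a frozen copy of the student).
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)

def pseudo_label_loss(student, teacher, spike_batch, n_samples=4, thresh=0.05):
    # Uncertainty proxy: variance of teacher predictions under input noise.
    with torch.no_grad():
        preds = torch.stack([
            teacher(spike_batch + 0.01 * torch.randn_like(spike_batch))
            for _ in range(n_samples)
        ])
        pseudo = preds.mean(dim=0)
        uncertainty = preds.var(dim=0)
    mask = (uncertainty < thresh).float()   # keep only confident pixels
    per_pixel = F.l1_loss(student(spike_batch), pseudo, reduction="none")
    return (mask * per_pixel).mean()
```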
Each grid block in a 3D geological model requires a rock type that represents all the physical and chemical properties of that block. The properties that classify rock types are lithology, permeability, and capillary pressure. Scientists and engineers determine these properties using conventional laboratory measurements, which apply destructive methods to the sample or alter some of its properties (i.e., wettability, permeability, and porosity), because the measurement processes involve sample crushing, fluid flow, or fluid saturation. Recently, digital rock physics (DRP) has emerged to quantify these properties from micro-computed tomography (uCT) and magnetic resonance imaging (MRI) images. However, the literature has not attempted rock typing in a fully digital context. We propose performing digital rock typing (DRT) by: (1) integrating the latest DRP advances in a novel process that honors digital determination of rock properties; (2) digitalizing the latest rock typing approaches in carbonates; and (3) introducing a novel carbonate rock typing process that leverages computer vision capabilities to provide more insight into heterogeneous carbonate rock texture.
Responsive listening during face-to-face conversations is a critical element of social interaction and is well established in psychological research. By responding to a speaker's utterances, tone, or behaviors with non-verbal signals in real time, listeners show how they are engaged in the conversation. In this work, we build the Responsive Listener Dataset (RLD), a conversational video corpus collected from public resources, featuring 67 speakers and 76 listeners with three different attitudes. We define the responsive listening head generation task as the synthesis of a non-verbal head with motions and expressions conditioned on the audio and visual signals of the speaker. Unlike speech-driven gesture or talking head generation, we introduce more modalities in this task, hoping to benefit several research fields, including human-to-human interaction, video-to-video translation, cross-modal understanding, and generation. Furthermore, we release an attitude-conditioned listening head generation baseline. Project page: \url{https://project.mhzhou.com/rld}.
Permeability has a dominant influence on the flow properties of natural fluids. Lattice Boltzmann simulators determine the permeability of nano- and micro-pore networks, but they carry out millions of flow-dynamics calculations, with accumulated errors and a high cost in computing power. To predict permeability efficiently and consistently, we propose a morphology decoder, a machine-learning-based parallel and serial flow reconstruction from 3D micro-computed tomography and nuclear magnetic resonance images. For 3D vision, we introduce the controllable-measurable volume as a new supervised segmentation, in which a unique set of voxel intensities corresponds to grain and pore-throat sizes. The morphology decoder demarcates and aggregates morphological boundaries in a novel way to produce permeability. The morphology decoder method consists of five novel processes described in this paper: (1) geometrical 3D permeability; (2) machine-learning-guided 3D property recognition of rock morphology; (3) a 3D image-property integrated model for permeability; (4) an MRI permeability imager; and (5) the morphology decoder, the process that integrates the other four novel processes.
Automated image processing algorithms can improve the quality, efficiency, and consistency of classifying the morphology of heterogeneous carbonate rock, and can seamlessly process large volumes of data and images. Geoscientists face difficulty in choosing the best approach for determining petrophysical properties from rock images, micro-computed tomography (uCT), or magnetic resonance imaging (MRI). Most successful work is on homogeneous rocks, focuses on 2D images with less attention to 3D, and requires numerical simulation. Currently, image analysis methods converge toward three approaches: image processing, artificial intelligence, and image processing combined with artificial intelligence. In this work, we propose two methods to determine porosity from 3D uCT and MRI images: an image processing method, the Image Resolution Optimized Gaussian Algorithm (IROGA); and an advanced image recognition method enabled by machine learning, the Machine Learning Difference of Gaussians Random Forest (MLDGRF). We built reference 3D micro-models and collected images to calibrate the IROGA and MLDGRF methods. To assess the predictive capability of these calibrated methods, we ran them on 3D uCT and MRI images of natural heterogeneous carbonate rock. We measured the porosity and lithology of the carbonate rock with three industry-standard methods to obtain reference values. Notably, compared with the three experimental measurements, IROGA and MLDGRF achieve accuracies of 96.2% and 97.1%, and 91.7% and 94.4%, respectively. We measured limestone and pyrite reference values using two methods: X-ray powder diffraction and grain density measurements. MLDGRF produces lithology (limestone and pyrite) volumes with 97.7% accuracy.
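To make the porosity computation concrete, here is a generic sketch of Gaussian-smoothed thresholding on a 3D volume (an assumption-level simplification, not the calibrated IROGA or MLDGRF pipeline): smoothing suppresses resolution noise, and porosity is read off as the fraction of voxels classified as pore.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def porosity_from_volume(volume, sigma=1.0, pore_threshold=0.5):
    """Estimate porosity of a 3D image volume.

    Voxels darker than the threshold after Gaussian smoothing are
    treated as pore space; porosity is the pore-voxel fraction.
    """
    smoothed = gaussian_filter(volume.astype(np.float32), sigma=sigma)
    return float((smoothed < pore_threshold).mean())

# Synthetic check: a solid block with a dark slab occupying ~30% of it.
vol = np.ones((64, 64, 64), dtype=np.float32)   # bright solid matrix
vol[:, :, :19] = 0.0                            # dark pore region
print(porosity_from_volume(vol))                # ~0.30
```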
In-context learning, as a new paradigm in NLP, allows the model to rapidly adapt to various tasks with only a handful of prompts and examples. But in computer vision, the difficulty for in-context learning lies in the fact that tasks vary significantly in their output representations, so it is unclear how to define general-purpose task prompts that the vision model can understand and transfer to out-of-domain tasks. In this work, we present Painter, a generalist model which addresses these obstacles with an "image"-centric solution: redefine the output of core vision tasks as images, and specify task prompts as images as well. With this idea, our training process is extremely simple: it performs standard masked image modeling on the stitch of input and output image pairs. This makes the model capable of performing tasks conditioned on visible image patches. Thus, during inference, we can adopt a pair of input and output images from the same task as the input condition, to indicate which task to perform. Without bells and whistles, our generalist Painter achieves competitive performance compared to well-established task-specific models on seven representative vision tasks, ranging from high-level visual understanding to low-level image processing, and significantly outperforms recent generalist models on several challenging tasks. Surprisingly, our model shows the capability of completing out-of-domain tasks which do not exist in the training data, such as open-category keypoint detection and object segmentation, validating the powerful task transferability of in-context learning.
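Conceptually, inference reduces to masked completion over a stitched canvas. A schematic sketch (the 2x2 layout, the `painter` callable, and the tensor shapes are placeholders, not the released API):

```python
import torch

def in_context_inference(painter, prompt_in, prompt_out, query_in):
    """Painter-style in-context inference, schematically.

    A task example (prompt_in -> prompt_out) is stitched with the query
    image; the model completes the masked bottom-right quadrant, which
    is read back as the query's predicted output image.
    """
    top = torch.cat([prompt_in, prompt_out], dim=-1)                 # example pair
    bottom = torch.cat([query_in, torch.zeros_like(query_in)], dim=-1)  # masked slot
    canvas = torch.cat([top, bottom], dim=-2)                        # 2x2 image grid
    completed = painter(canvas)                                      # masked image modeling
    h, w = query_in.shape[-2:]
    return completed[..., -h:, -w:]                                  # query's output
```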
We launch EVA, a vision-centric foundation model to explore the limits of visual representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained to reconstruct the masked-out image-text-aligned vision features conditioned on visible image patches. Via this pretext task, we can efficiently scale up EVA to one billion parameters and set new records on a broad range of representative vision downstream tasks, such as image recognition, video action recognition, object detection, instance segmentation, and semantic segmentation, without heavy supervised training. Moreover, we observe that quantitative changes in scaling EVA result in qualitative changes in transfer learning performance that are not present in other models. For instance, EVA takes a great leap in the challenging large-vocabulary instance segmentation task: our model achieves almost the same state-of-the-art performance on the LVISv1.0 dataset, with over a thousand categories, as on the COCO dataset, with only eighty categories. Beyond a pure vision encoder, EVA can also serve as a vision-centric, multi-modal pivot to connect images and text. We find that initializing the vision tower of a giant CLIP from EVA can greatly stabilize training and outperform a from-scratch counterpart with much fewer samples and less compute, providing a new direction for scaling up and accelerating the costly training of multi-modal foundation models. To facilitate future research, we release all the code and models at https://github.com/baaivision/EVA.
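The pretext objective can be sketched as regressing frozen CLIP vision features at masked token positions from the visible patches. A simplified rendering (the `vit` and `clip_vision` callables and the cosine loss are stand-ins; the exact tokenization and loss follow the released code):

```python
import torch
import torch.nn.functional as F

def eva_pretext_loss(vit, clip_vision, images, mask):
    """Reconstruct masked-out CLIP vision features from visible patches.

    mask: (B, N) boolean over patch tokens, True where masked out.
    """
    with torch.no_grad():
        target = clip_vision(images)      # (B, N, D) frozen CLIP features
    pred = vit(images, mask=mask)         # (B, N, D) student predictions
    # Negative cosine similarity, evaluated on masked tokens only.
    cos = F.cosine_similarity(pred[mask], target[mask], dim=-1)
    return -cos.mean()
```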
This paper presents ReasonFormer, a unified reasoning framework for mirroring the modular and compositional reasoning process of humans in complex decision-making. Inspired by dual-process theory in cognitive science, the representation module (automatic thinking) and reasoning modules (controlled thinking) are decoupled to capture different levels of cognition. On top of the representation module, the pre-trained reasoning modules are modular and specialized in specific, fundamental reasoning skills (e.g., logic, simple QA, etc.). To mimic the controlled compositional thinking process, different reasoning modules are dynamically activated and composed in both parallel and cascaded manners, controlling which reasoning skills are activated and how deep the reasoning process goes to solve the current problem. The unified reasoning framework solves multiple tasks with a single model, and is trained and performs inference in an end-to-end manner. Evaluated on 11 datasets requiring different reasoning skills and complexity, ReasonFormer demonstrates substantial performance boosts, revealing its compositional reasoning ability. Few-shot experiments exhibit better generalization by learning to compose pre-trained skills for new tasks with limited data, and by decoupling the representation module from the reasoning modules. Further analysis shows the modularity of the reasoning modules, as different tasks activate distinct reasoning skills at different reasoning depths.
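A toy sketch of the dynamic composition idea (module and router names are illustrative, not the paper's implementation): a router scores the pre-trained skill modules per input, their outputs are mixed in parallel, and the mix is cascaded for a configurable reasoning depth.

```python
import torch
import torch.nn as nn

class SkillRouter(nn.Module):
    """Mix skill-module outputs with input-dependent weights, cascaded
    over several reasoning steps."""
    def __init__(self, hidden, skill_modules, depth=2):
        super().__init__()
        self.skills = nn.ModuleList(skill_modules)  # e.g., logic, simple QA, ...
        self.gate = nn.Linear(hidden, len(skill_modules))
        self.depth = depth

    def forward(self, x):                           # x: (B, hidden)
        for _ in range(self.depth):                 # cascaded composition
            weights = torch.softmax(self.gate(x), dim=-1)            # (B, K)
            outs = torch.stack([m(x) for m in self.skills], dim=1)   # (B, K, H)
            x = (weights.unsqueeze(-1) * outs).sum(dim=1)            # parallel mix
        return x

router = SkillRouter(16, [nn.Linear(16, 16) for _ in range(3)])
print(router(torch.randn(2, 16)).shape)             # torch.Size([2, 16])
```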
Communication helps agents obtain information about others, so that better coordinated behaviors can be learned. Some existing work communicates predicted future trajectories to others, hoping to provide clues about what an agent will do for better coordination. However, circular dependencies sometimes arise when agents are treated synchronously, making it hard to coordinate decision-making. In this paper, we propose a novel communication scheme, Sequential Communication (SeqComm). SeqComm treats agents asynchronously (upper-level agents make decisions before lower-level ones) and has two communication phases. In the negotiation phase, agents determine the priority of decision-making by communicating the hidden states of their observations and comparing the values of their intentions, which are obtained by modeling the environment dynamics. In the launching phase, upper-level agents take the lead in making decisions and communicate them to lower-level agents. Theoretically, we prove that the policies learned by SeqComm are guaranteed to improve monotonically and converge. Empirically, we show that SeqComm outperforms existing methods in a variety of multi-agent cooperative tasks.
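The two-phase scheme can be sketched in a few lines (a simplified rendering with illustrative names; the real method learns the world model and policies end-to-end): the negotiation phase orders agents by the estimated value of their intentions, and the launching phase lets each agent act conditioned on the actions already announced by higher-priority agents.

```python
def seqcomm_step(agents, observations, value_of_intention, act):
    """One SeqComm-style decision step, schematically.

    Negotiation: rank agents by the value of their modeled intentions.
    Launching: agents decide in that order, each conditioning on the
    actions announced by the agents ahead of it.
    """
    order = sorted(agents,
                   key=lambda a: value_of_intention(a, observations[a]),
                   reverse=True)
    announced = {}
    for agent in order:                    # upper-level agents decide first
        announced[agent] = act(agent, observations[agent], dict(announced))
    return announced
```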