Paper Title
What Can You Learn from Your Muscles? Learning Visual Representation from Human Interactions
Paper Authors
Paper Abstract
Learning effective representations of visual data that generalize to a variety of downstream tasks has been a long quest for computer vision. Most representation learning approaches rely solely on visual data such as images or videos. In this paper, we explore a novel approach, where we use human interaction and attention cues to investigate whether we can learn better representations compared to visual-only representations. For this study, we collect a dataset of human interactions capturing body part movements and gaze in their daily lives. Our experiments show that our "muscly-supervised" representation that encodes interaction and attention cues outperforms a state-of-the-art visual-only method, MoCo (He et al., 2020), on a variety of target tasks: scene classification (semantic), action recognition (temporal), depth estimation (geometric), dynamics prediction (physics), and walkable surface estimation (affordance). Our code and dataset are available at: https://github.com/ehsanik/muscleTorch.
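To make the idea of learning from interaction and attention cues concrete, the sketch below shows one plausible setup: a visual encoder whose features are trained to predict body-part movement and gaze from egocentric frames. The backbone choice, head sizes, loss weighting, and all names (MuscleSupervisedNet, interaction_loss, w_gaze) are illustrative assumptions, not the paper's exact architecture; see the muscleTorch repository for the actual implementation.

```python
# Minimal sketch, assuming a ResNet-18 backbone and two auxiliary heads
# (body-part movement, gaze). All hyperparameters here are hypothetical.
import torch
import torch.nn as nn
import torchvision.models as models


class MuscleSupervisedNet(nn.Module):
    """Visual encoder supervised by human interaction and attention cues."""

    def __init__(self, num_body_parts=6, gaze_dim=2, feat_dim=512):
        super().__init__()
        resnet = models.resnet18(weights=None)
        # Drop the classification layer; keep globally pooled features.
        self.encoder = nn.Sequential(*list(resnet.children())[:-1])
        self.movement_head = nn.Linear(feat_dim, num_body_parts)  # per-part movement logits
        self.gaze_head = nn.Linear(feat_dim, gaze_dim)            # normalized gaze (x, y)

    def forward(self, frames):
        feats = self.encoder(frames).flatten(1)  # (B, feat_dim)
        return self.movement_head(feats), self.gaze_head(feats)


def interaction_loss(move_logits, gaze_pred, move_labels, gaze_target, w_gaze=1.0):
    """Illustrative loss mix: multi-label movement prediction plus gaze regression."""
    move_loss = nn.functional.binary_cross_entropy_with_logits(move_logits, move_labels)
    gaze_loss = nn.functional.mse_loss(gaze_pred, gaze_target)
    return move_loss + w_gaze * gaze_loss


if __name__ == "__main__":
    net = MuscleSupervisedNet()
    frames = torch.randn(4, 3, 224, 224)                 # batch of video frames
    move_labels = torch.randint(0, 2, (4, 6)).float()    # which body parts moved
    gaze_target = torch.rand(4, 2)                       # where the person looked
    loss = interaction_loss(*net(frames), move_labels, gaze_target)
    loss.backward()
```

After such pretraining, the encoder alone would be fine-tuned or frozen for the downstream tasks listed in the abstract (scene classification, action recognition, depth, dynamics, walkable surfaces), analogous to how a MoCo-pretrained backbone is evaluated.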