Paper Title
Learning Hierarchical Image Segmentation For Recognition and By Recognition
Paper Authors
Paper Abstract
Large vision and language models learned directly through image-text associations often lack detailed visual substantiation, whereas image segmentation tasks are treated separately from recognition, learned with supervision but without interconnection. Our key observation is that, while an image can be recognized in multiple ways, each image has a consistent part-and-whole visual organization. Segmentation should thus be treated not as an end task to be mastered through supervised learning, but as an internal process that evolves with and supports the ultimate goal of recognition. We propose to integrate a hierarchical segmenter into the recognition process, training and adapting the entire model solely on image-level recognition objectives. We learn hierarchical segmentation for free alongside recognition, automatically uncovering part-to-whole relationships that not only underpin but also enhance recognition. Enhancing the Vision Transformer (ViT) with adaptive segment tokens and graph pooling, our model surpasses ViT in unsupervised part-whole discovery, semantic segmentation, image classification, and efficiency. Notably, our model (trained on 1M unlabeled ImageNet images) outperforms SAM (trained on 11M images and 1 billion masks) by an absolute 8% in mIoU on PartImageNet object segmentation.
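The abstract does not spell out how "adaptive segment tokens and graph pooling" coarsen the token set, so the following is only a minimal NumPy sketch of one plausible soft graph-pooling step: patch tokens are softly assigned to a smaller set of segment tokens by feature similarity and then average-pooled. The names `segment_pool`, `segment_queries`, and the temperature `tau` are illustrative assumptions, not the paper's actual design.

```python
import numpy as np

def segment_pool(tokens, segment_queries, tau=1.0):
    """Hypothetical sketch: softly assign N patch tokens to K segment
    tokens (K < N) and pool, yielding a coarser level of the hierarchy.

    tokens:          (N, D) patch token features from a ViT layer
    segment_queries: (K, D) segment token features
    Returns pooled (K, D) segment features and the (N, K) soft assignment.
    """
    # Similarity between every patch token and every segment token.
    logits = tokens @ segment_queries.T / tau            # (N, K)
    # Softmax over segments: each patch distributes itself among segments.
    logits -= logits.max(axis=1, keepdims=True)
    assign = np.exp(logits)
    assign /= assign.sum(axis=1, keepdims=True)          # rows sum to 1
    # Column-normalized weighted averaging: graph pooling into K tokens.
    weights = assign / (assign.sum(axis=0, keepdims=True) + 1e-8)
    pooled = weights.T @ tokens                          # (K, D)
    return pooled, assign

rng = np.random.default_rng(0)
tokens = rng.standard_normal((196, 64))    # e.g. 14x14 patch tokens
queries = rng.standard_normal((16, 64))    # 16 segment tokens
pooled, assign = segment_pool(tokens, queries)
```

Stacking such pooling steps with decreasing K would produce progressively coarser segment levels, matching the part-to-whole hierarchy the abstract describes; the real model learns these assignments end-to-end from image-level recognition objectives alone.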