Paper Title
Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs
Paper Authors
Paper Abstract
We tackle open-world semantic segmentation, which aims at learning to segment arbitrary visual concepts in images, using only image-text pairs without dense annotations. Existing open-world segmentation methods have shown impressive advances by employing contrastive learning (CL) to learn diverse visual concepts and transferring the learned image-level understanding to the segmentation task. However, these CL-based methods suffer from a train-test discrepancy: they only consider image-text alignment during training, whereas segmentation requires region-text alignment at test time. In this paper, we propose a novel Text-grounded Contrastive Learning (TCL) framework that enables a model to learn region-text alignment directly. Our method generates a segmentation mask for a given text, extracts a text-grounded image embedding from the masked region, and aligns it with the text embedding via TCL. By learning region-text alignment directly, our framework encourages the model to directly improve the quality of the generated segmentation masks. In addition, for a rigorous and fair comparison, we present a unified evaluation protocol covering 8 widely used semantic segmentation datasets. TCL achieves state-of-the-art zero-shot segmentation performance by large margins on all datasets. Code is available at https://github.com/kakaobrain/tcl.
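The abstract describes the core mechanism: predict a mask from a text query, pool image features inside that mask into a text-grounded image embedding, and align it with the text embedding contrastively. The following is a minimal sketch of that idea, not the authors' implementation (see the linked repository for the official code); the tensor shapes, the masked-average-pooling step, and the InfoNCE-style loss are assumptions for illustration.

```python
# Minimal sketch of region-text contrastive alignment as described in the abstract.
# image_feats, text_embs, and mask_logits are assumed to come from hypothetical
# image/text encoders and a mask decoder; they are not the actual TCL modules.
import torch
import torch.nn.functional as F

def region_text_contrastive_loss(image_feats, text_embs, mask_logits, temperature=0.07):
    """
    image_feats: (B, D, H, W) dense image features.
    text_embs:   (B, D) text embeddings for the paired captions.
    mask_logits: (B, 1, H, W) text-grounded mask logits, one mask per pair.
    """
    masks = mask_logits.sigmoid()                               # soft masks in [0, 1]
    # Masked average pooling -> one text-grounded image embedding per pair.
    grounded = (image_feats * masks).sum(dim=(2, 3)) / (masks.sum(dim=(2, 3)) + 1e-6)
    grounded = F.normalize(grounded, dim=-1)                    # (B, D)
    text_embs = F.normalize(text_embs, dim=-1)                  # (B, D)
    # Contrastive alignment: matched region-text pairs are positives,
    # all other pairs in the batch serve as negatives.
    logits = grounded @ text_embs.t() / temperature             # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```

Because the loss is computed from the masked region rather than the whole image, gradients flow through the predicted masks, which is how directly learning region-text alignment can improve mask quality during training.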