Deep Multi-Branch Aggregation Network for Real-Time Semantic Segmentation in Street Scenes

Paper Title

Deep Multi-Branch Aggregation Network for Real-Time Semantic Segmentation in Street Scenes

Authors

Xi Weng, Yan Yan, Genshun Dong, Chang Shu, Biao Wang, Hanzi Wang, Ji Zhang

Abstract

Real-time semantic segmentation, which aims to achieve high segmentation accuracy at real-time inference speed, has received substantial attention over the past few years. However, many state-of-the-art real-time semantic segmentation methods tend to sacrifice some spatial details or contextual information for fast inference, thus leading to degradation in segmentation quality. In this paper, we propose a novel Deep Multi-branch Aggregation Network (called DMA-Net) based on the encoder-decoder structure to perform real-time semantic segmentation in street scenes. Specifically, we first adopt ResNet-18 as the encoder to efficiently generate various levels of feature maps from different stages of convolutions. Then, we develop a Multi-branch Aggregation Network (MAN) as the decoder to effectively aggregate different levels of feature maps and capture the multi-scale information. In MAN, a lattice enhanced residual block is designed to enhance feature representations of the network by taking advantage of the lattice structure. Meanwhile, a feature transformation block is introduced to explicitly transform the feature map from the neighboring branch before feature aggregation. Moreover, a global context block is used to exploit the global contextual information. These key components are tightly combined and jointly optimized in a unified network. Extensive experimental results on the challenging Cityscapes and CamVid datasets demonstrate that our proposed DMA-Net respectively obtains 77.0% and 73.6% mean Intersection over Union (mIoU) at the inference speed of 46.7 FPS and 119.8 FPS by only using a single NVIDIA GTX 1080Ti GPU. This shows that DMA-Net provides a good tradeoff between segmentation quality and speed for semantic segmentation in street scenes.
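The abstract reports accuracy as mean Intersection over Union (mIoU). As a minimal sketch of how that metric is commonly computed, the snippet below works on flat sequences of class labels. The function name `mean_iou`, the `ignore_label=255` default (a common convention for unlabeled pixels), and the toy inputs are assumptions for illustration only; they are not taken from the paper, and benchmark evaluations typically accumulate the counts over a whole dataset rather than a single image.

```python
def mean_iou(pred, target, num_classes, ignore_label=255):
    """Mean Intersection over Union for two flat sequences of class labels.

    ignore_label is an assumed convention for pixels excluded from scoring.
    """
    intersection = [0] * num_classes
    pred_count = [0] * num_classes
    target_count = [0] * num_classes

    for p, t in zip(pred, target):
        if t == ignore_label:
            continue  # unlabeled ground-truth pixels do not affect the score
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            intersection[p] += 1

    ious = []
    for c in range(num_classes):
        union = pred_count[c] + target_count[c] - intersection[c]
        if union > 0:  # skip classes absent from both prediction and ground truth
            ious.append(intersection[c] / union)

    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six pixels, three classes, one ignored pixel.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 255]
print(mean_iou(pred, target, num_classes=3))
```

Here the per-class IoUs are 1/2, 2/3, and 1, so the mean is 13/18.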
