Paper Title
Multi-Person Pose Estimation with Enhanced Feature Aggregation and Selection
Paper Authors
Abstract
We propose a novel Enhanced Feature Aggregation and Selection network (EFASNet) for multi-person 2D human pose estimation. Thanks to its enhanced feature representation, our method handles crowded, cluttered, and occluded scenes well. More specifically, we propose a Feature Aggregation and Selection Module (FASM) that constructs hierarchical multi-scale feature aggregation and makes the aggregated features discriminative, yielding more accurate fine-grained representations and thus more precise joint locations. We then apply a simple Feature Fusion (FF) strategy that effectively fuses high-resolution spatial features with low-resolution semantic features to obtain more reliable context information for well-estimated joints. Finally, we build a Dense Upsampling Convolution (DUC) module to generate more precise predictions; it recovers fine joint details that are usually lost in common upsampling processes. As a result, the predicted keypoint heatmaps are more accurate. Comprehensive experiments demonstrate that the proposed approach outperforms state-of-the-art methods on three benchmark datasets: the recent large-scale CrowdPose dataset, the COCO keypoint detection dataset, and the MPII Human Pose dataset. Our code will be released upon acceptance.
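As background for the DUC module mentioned in the abstract: dense upsampling convolution, as introduced in prior segmentation work, is commonly realized as a convolution that expands the channel dimension by a factor of r² at low resolution, followed by a pixel-shuffle rearrangement into an r-times-larger spatial map, so that every output pixel is predicted directly rather than interpolated. The sketch below shows only that rearrangement step in NumPy; the function name `duc_rearrange` and the exact channel layout are our illustrative assumptions, not details taken from this paper.

```python
import numpy as np

def duc_rearrange(x, r):
    """Pixel-shuffle rearrangement used by dense upsampling convolution (sketch).

    x : array of shape (r*r*C, H, W), the low-resolution feature map
        produced by a preceding convolution layer (assumed layout).
    r : integer upsampling factor.
    Returns an array of shape (C, H*r, W*r).
    """
    c2, h, w = x.shape
    c = c2 // (r * r)
    # Split the channel axis into (C, r, r): each group of r*r channels
    # holds the r x r sub-pixel predictions for one output channel.
    x = x.reshape(c, r, r, h, w)
    # Interleave the sub-pixel axes with the spatial axes: (C, H, r, W, r).
    x = x.transpose(0, 3, 1, 4, 2)
    # Merge into the full-resolution map.
    return x.reshape(c, h * r, w * r)
```

With r = 2 and a single output channel, the four input channels at each low-resolution location become the 2x2 block of output pixels at the corresponding high-resolution location, which is how DUC can recover per-pixel detail that plain bilinear upsampling cannot.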