Paper Title

Dynamic Graph Reasoning for Multi-person 3D Pose Estimation

Authors

Qiu, Zhongwei; Yang, Qiansheng; Wang, Jian; Fu, Dongmei

Abstract

Multi-person 3D pose estimation is a challenging task because of occlusion and depth ambiguity, especially in crowded scenes. To address these problems, most existing methods model body context cues by enhancing feature representations with graph neural networks or by adding structural constraints. However, these methods are not robust because of their single-root formulation, which decodes 3D poses from a root node with a pre-defined graph. In this paper, we propose GR-M3D, which models Multi-person 3D pose estimation with dynamic Graph Reasoning. The decoding graph in GR-M3D is predicted instead of pre-defined. In particular, it first generates several data maps and enhances them with a scale and depth aware refinement module (SDAR). Multiple root keypoints and dense decoding paths for each person are then estimated from these data maps. Based on them, dynamic decoding graphs are built by assigning path weights to the decoding paths, where the path weights are inferred from the enhanced data maps. This process is named dynamic graph reasoning (DGR). Finally, the 3D poses are decoded according to the dynamic decoding graph of each detected person. By adopting soft path weights that depend on the input data, GR-M3D implicitly adjusts the structure of the decoding graph, which makes the decoding graphs adapt to different input persons and handle occlusion and depth ambiguity better than previous methods. We empirically show that the proposed bottom-up approach even outperforms top-down methods and achieves state-of-the-art results on three 3D pose datasets.
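
The idea of soft path weights can be illustrated with a small sketch. This is not the authors' implementation: the function name `decode_joint`, the softmax normalization, and the array shapes are all assumptions made for illustration. It only shows how one joint could be decoded as a weighted blend of candidates reached along several decoding paths from several root keypoints, so that a path starting from an occluded root can be down-weighted.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def decode_joint(root_positions, path_offsets, path_logits):
    """Fuse per-path candidates for a single joint (hypothetical helper).

    root_positions: (P, 3) 3D position of the root each path starts from.
    path_offsets:   (P, 3) predicted displacement from that root to the joint.
    path_logits:    (P,)   unnormalized path scores (assumed to be inferred
                           from the enhanced data maps).
    Returns the fused (3,) joint position and the (P,) soft path weights.
    """
    candidates = root_positions + path_offsets  # one candidate per path
    weights = softmax(path_logits)              # soft weights, sum to 1
    return weights @ candidates, weights


# Two paths from two different roots that agree on the joint location.
roots = np.array([[0.0, 0.0, 2.0], [0.0, 0.5, 2.0]])
offsets = np.array([[0.2, 0.5, 0.0], [0.2, 0.0, 0.0]])
logits = np.array([0.0, 0.0])
joint, weights = decode_joint(roots, offsets, logits)
```

With equal logits the two paths contribute equally. Raising one logit shifts the fused joint toward that path's candidate, which is the sense in which soft weights change the effective decoding graph without changing its set of paths.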
