Paper Title
Static and Animated 3D Scene Generation from Free-form Text Descriptions
Paper Authors
Paper Abstract
Generating coherent and useful image/video scenes from a free-form textual description is a technically very difficult problem. Textual descriptions of the same scene can vary greatly from person to person, or even for the same person from time to time. Because the choice of words and syntax varies from one description to another, it is challenging for a system to reliably produce a consistent, desirable output from different forms of language input. Prior work on scene generation has mostly been confined to rigid sentence structures for the text input, which restrict users' freedom to write descriptions. In our work, we study a new pipeline that aims to generate static as well as animated 3D scenes from different types of free-form textual scene descriptions without any major restrictions. In particular, to keep our study practical and tractable, we focus on a small subspace of all possible 3D scenes, containing various combinations of cubes, cylinders, and spheres. We design a two-stage pipeline. In the first stage, we encode the free-form text using an encoder-decoder neural architecture. In the second stage, we generate a 3D scene based on the generated encoding. Our neural architecture uses a state-of-the-art language model as the encoder to leverage rich contextual encodings, and a new multi-head decoder to predict multiple features of an object in the scene simultaneously. For our experiments, we generate a large synthetic dataset containing 1,300,000 and 1,400,000 samples of unique static and animated scene descriptions, respectively. We achieve 98.427% accuracy on the test set in detecting the 3D object features. Our work is a proof of concept of one approach to the problem, and we believe that, with enough training data, the same pipeline can be extended to handle an even broader set of 3D scene generation problems.
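To make the architecture described above concrete, the following is a minimal sketch of the encoder plus multi-head decoder idea. It assumes a PyTorch implementation with a pre-trained BERT encoder from the Hugging Face transformers library; the class name, the three feature heads (shape, color, size), and all dimensions are illustrative assumptions, not the authors' actual code.

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class MultiHeadSceneDecoder(nn.Module):
    """Hypothetical multi-head decoder: each head predicts one feature of an
    object in the scene from the same contextual text encoding."""
    def __init__(self, hidden_size=768, num_shapes=3, num_colors=8, num_sizes=3):
        super().__init__()
        self.shape_head = nn.Linear(hidden_size, num_shapes)  # cube / cylinder / sphere
        self.color_head = nn.Linear(hidden_size, num_colors)  # illustrative color classes
        self.size_head = nn.Linear(hidden_size, num_sizes)    # e.g. small / medium / large

    def forward(self, encoding):
        # encoding: (batch, hidden_size) pooled sentence representation
        return {
            "shape": self.shape_head(encoding),
            "color": self.color_head(encoding),
            "size": self.size_head(encoding),
        }

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")  # stand-in for the language-model encoder
decoder = MultiHeadSceneDecoder()

text = "A large red sphere sits to the left of a small blue cube."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    pooled = encoder(**inputs).pooler_output  # (1, 768) contextual encoding of the description
logits = decoder(pooled)                      # simultaneous per-feature predictions

In this sketch the feature heads share the encoder output, so all object features are predicted in one forward pass; the second stage of the pipeline would then map the predicted features to an actual 3D scene in a renderer, which is not shown here.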