Paper Title
NTULM: Enriching Social Media Text Representations with Non-Textual Units
Paper Authors
Paper Abstract
On social media, additional context is often present in the form of annotations and metadata such as the post's author, mentions, hashtags, and hyperlinks. We refer to these annotations as Non-Textual Units (NTUs). We posit that NTUs provide social context beyond their textual semantics, and that leveraging these units can enrich social media text representations. In this work, we construct an NTU-centric social heterogeneous network to co-embed NTUs. We then principally integrate these NTU embeddings into a large pretrained language model by fine-tuning with these additional units. This adds context to noisy, short-text social media. Experiments show that utilizing NTU-augmented text representations significantly outperforms existing text-only baselines by 2-5\% relative points on many downstream tasks, highlighting the importance of context for social media NLP. We also highlight that including NTU context in the initial layers of the language model, alongside the text, is better than using it after the text embedding has been generated. Our work leads to the generation of holistic, general-purpose social media content embeddings.
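The abstract contrasts two ways of combining NTU embeddings with text: early fusion (injecting NTUs into the initial layers of the language model alongside the token embeddings) and late fusion (concatenating NTU information with the finished text embedding). The following is a minimal NumPy sketch of the two options; all dimensions, variable names, and the projection/pooling choices are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not taken from the paper).
d_ntu, d_model, n_tokens, n_ntus = 64, 128, 10, 3

token_embeddings = rng.normal(size=(n_tokens, d_model))  # from the LM's embedding layer
ntu_embeddings = rng.normal(size=(n_ntus, d_ntu))        # from the heterogeneous network

# Assumed learned projection mapping NTU embeddings into the LM's embedding space.
W_proj = rng.normal(size=(d_ntu, d_model)) / np.sqrt(d_ntu)

# Early fusion: prepend projected NTU embeddings to the token sequence,
# so the transformer attends jointly over text and NTUs from the first layer.
early_fused_input = np.concatenate([ntu_embeddings @ W_proj, token_embeddings], axis=0)
assert early_fused_input.shape == (n_ntus + n_tokens, d_model)

# Late fusion (the weaker alternative per the abstract): pool the NTUs and
# concatenate with the text embedding only after the LM has already run.
text_vec = token_embeddings.mean(axis=0)   # stand-in for the LM's pooled output
late_fused = np.concatenate([text_vec, ntu_embeddings.mean(axis=0)])
assert late_fused.shape == (d_model + d_ntu,)
```

In the early-fusion variant the model can condition every attention layer on the social context; in the late-fusion variant the text representation is fixed before the NTU signal is seen, which is the limitation the abstract points to.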