Paper Title
ImPaKT: A Dataset for Open-Schema Knowledge Base Construction
Paper Authors
Paper Abstract
Large language models have ushered in a golden age of semantic parsing. The seq2seq paradigm allows for open-schema and abstractive attribute and relation extraction given only small amounts of finetuning data. Language model pretraining has simultaneously enabled great strides in natural language inference, reasoning about entailment and implication in free text. These advances motivate us to construct ImPaKT, a dataset for open-schema information extraction, consisting of around 2500 text snippets from the C4 corpus, in the shopping domain (product buying guides), professionally annotated with extracted attributes, types, attribute summaries (attribute schema discovery from idiosyncratic text), many-to-one relations between compound and atomic attributes, and implication relations. We release this data in the hope that it will be useful for fine-tuning semantic parsers for information extraction and knowledge base construction across a variety of domains. We evaluate the power of this approach by fine-tuning the open-source UL2 language model on a subset of the dataset, extracting a set of implication relations from a corpus of product buying guides, and conducting human evaluations of the resulting predictions.
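To make the described setup concrete, below is a minimal sketch (not the authors' actual pipeline) of fine-tuning an open-source UL2 checkpoint as a seq2seq semantic parser that maps a buying-guide snippet to an annotated implication relation. It assumes the Hugging Face checkpoint `google/ul2`; the file `impakt_train.jsonl` and its field names (`snippet`, `implication`) are hypothetical placeholders for an export of the ImPaKT annotations.

```python
# Hedged sketch: seq2seq fine-tuning of UL2 for implication extraction.
# Assumptions: the "google/ul2" checkpoint (20B params; substitute a smaller
# T5 variant for smoke tests) and a hypothetical JSONL file of ImPaKT-style
# (snippet, implication) pairs.
import json
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer, T5ForConditionalGeneration

MODEL_NAME = "google/ul2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

class ImplicationDataset(Dataset):
    """Pairs a source snippet with its annotated implication, e.g.
    'a waterproof shell keeps you dry' -> 'waterproof => keeps you dry'."""
    def __init__(self, path):
        self.examples = [json.loads(line) for line in open(path)]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        ex = self.examples[idx]
        src = tokenizer("extract implication: " + ex["snippet"],
                        truncation=True, max_length=512, return_tensors="pt")
        tgt = tokenizer(ex["implication"], truncation=True, max_length=64,
                        return_tensors="pt")
        return {"input_ids": src.input_ids.squeeze(0),
                "attention_mask": src.attention_mask.squeeze(0),
                "labels": tgt.input_ids.squeeze(0)}

def collate(batch):
    # Pad sources with the tokenizer's pad id; pad labels with -100,
    # which the seq2seq loss ignores.
    def stack(key, pad_value):
        seqs = [b[key] for b in batch]
        return torch.nn.utils.rnn.pad_sequence(
            seqs, batch_first=True, padding_value=pad_value)
    return {"input_ids": stack("input_ids", tokenizer.pad_token_id),
            "attention_mask": stack("attention_mask", 0),
            "labels": stack("labels", -100)}

train_loader = DataLoader(ImplicationDataset("impakt_train.jsonl"),
                          batch_size=4, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

model.train()
for batch in train_loader:
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

At inference time the fine-tuned model would be prompted with the same `"extract implication: ..."` prefix and decoded with `model.generate`, and the generated relations could then be judged by human raters as described in the abstract.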