The DocRED Dataset and Its Baselines

2023-12-17 09:42:27

DocRED is a large-scale, human-annotated, general-domain dataset for document-level relation extraction, released by thunlp in 2019. Its data is drawn from Wikipedia and Wikidata.

Dataset:

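Human-annotated data: 5,053 Wikipedia documents, with about 130,000 entities and more than 50,000 relations
Distantly supervised data: 101,873 Wikipedia documents (100k+), with about 2.55 million entities and about 880,000 relations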

1. Data construction pipeline:

Stage 1: Distantly Supervised Annotation Generation

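For the introductory section of each Wikipedia document, NER is first run with spaCy, and the entity mentions are then linked to Wikidata (mentions with the same KB ID are merged). Relations between the entities are obtained by querying Wikidata. (It appears that entities missing from Wikidata are discarded, so every entity and relation that remains exists in Wikidata.)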
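The merge-and-query step can be sketched as follows. This is only an illustration under stated assumptions: the mention dictionaries, the `kb_id` field, and the in-memory `kb_triples` lookup are hypothetical stand-ins for the output of a real entity linker and a real Wikidata query, not the authors' code.

```python
from collections import defaultdict
from itertools import permutations


def merge_mentions_by_kb_id(mentions):
    """Group linked mentions into entities keyed by KB ID; unlinked mentions are dropped."""
    entities = defaultdict(list)
    for mention in mentions:
        if mention.get("kb_id") is not None:
            entities[mention["kb_id"]].append(mention)
    return dict(entities)


def distant_labels(entities, kb_triples):
    """Label every ordered entity pair with the relations the KB stores for it."""
    labels = []
    for head, tail in permutations(entities, 2):
        for relation in sorted(kb_triples.get((head, tail), ())):
            labels.append((head, relation, tail))
    return labels


# Toy input (hypothetical): in practice the IDs come from entity linking
# and the triples from querying Wikidata.
mentions = [
    {"name": "Marie Curie", "sent_id": 0, "kb_id": "Q7186"},
    {"name": "Curie", "sent_id": 1, "kb_id": "Q7186"},
    {"name": "Warsaw", "sent_id": 0, "kb_id": "Q270"},
    {"name": "the laboratory", "sent_id": 2, "kb_id": None},
]
kb_triples = {("Q7186", "Q270"): {"P19"}}  # P19 = place of birth

entities = merge_mentions_by_kb_id(mentions)
labels = distant_labels(entities, kb_triples)
```

Both mentions of Marie Curie collapse into one entity, the unlinked mention disappears, and the only label produced is the one the toy KB actually contains.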

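Note: documents shorter than 128 words, and documents with fewer than 4 entities or fewer than 4 relations, are discarded, which leaves 107,050 documents. Of these, 5,053 documents are randomly selected, together with the 96 most frequent relations, for human annotation.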
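As a rough sketch, the filtering rule could look like this. It assumes documents are already stored in the DocRED JSON layout (`sents`, `vertexSet`, `labels`); the function and constant names are made up for illustration.

```python
MIN_WORDS = 128      # documents shorter than this are discarded
MIN_ENTITIES = 4     # documents with fewer entities are discarded
MIN_RELATIONS = 4    # documents with fewer relations are discarded


def keep_document(doc):
    """Return True if the document passes the length, entity, and relation thresholds."""
    n_words = sum(len(sent) for sent in doc["sents"])
    return (
        n_words >= MIN_WORDS
        and len(doc["vertexSet"]) >= MIN_ENTITIES
        and len(doc["labels"]) >= MIN_RELATIONS
    )


def filter_corpus(docs):
    """Keep only the documents that satisfy all three thresholds."""
    return [doc for doc in docs if keep_document(doc)]
```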

Stage 2: Named Entity and Coreference Annotation

Annotators review and correct the entities from Stage 1 and merge coreferent mentions.

Stage 3: Entity Linking

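Each named entity mention is mapped to a candidate set of Wikidata items. Several sources are combined to build this set, such as name matching, TagMe recommendations, and hyperlinks to Wikidata items, so that relations are not missed because of a linking error.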
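The idea of pooling candidates from several sources can be sketched like this. Everything here is hypothetical: `alias_index` stands in for a name/alias lookup over Wikidata, and `hyperlinked` and `recommended` stand in for items obtained from hyperlinks and from a tool such as TagMe; no real TagMe call is made.

```python
def candidate_items(mention, alias_index, hyperlinked=(), recommended=()):
    """Union of candidate item IDs from name matching, hyperlinks, and a linker."""
    candidates = set(alias_index.get(mention.lower(), ()))  # literal name/alias match
    candidates.update(hyperlinked)                          # items hyperlinked to the mention
    candidates.update(recommended)                          # entity-linker recommendations
    return candidates


# Toy data with placeholder IDs (not real Wikidata IDs).
alias_index = {"paris": {"Q_city", "Q_myth"}}
candidates = candidate_items("Paris", alias_index,
                             hyperlinked=["Q_city"],
                             recommended=["Q_film"])
```

Using a set keeps a candidate only once even when several sources propose it, and a source that returns nothing simply contributes no items.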

Stage 4: Relation and Supporting Evidence Collection

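Based on the Stage 3 results, relations are first recommended by RE models and by distant supervision; human annotators then verify them and select the supporting evidence.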

We recommend 19.9 relation instances per document from entity linking, and 7.8 from RE models for supplement.
Finally 57.2% relation instances from entity linking and 48.2% from RE models are reserved.

Stage 5: Distantly Supervised Data Generation

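The documents left over after removing the human-annotated ones become the distantly supervised data. To keep its distribution consistent with the annotated data, a BERT model is fine-tuned on the annotated data and used to predict named entities. Each entity mention is linked to a single Wikidata item, mentions with the same KB ID are again merged, and relations are then labeled via distant supervision.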

2. Dataset files:

rel_info.json
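Stores the relation information, 96 relation types in total, in the following format (each key is a Wikidata property ID; see the paper's appendix for details):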

{"P6": "head of government", "P17": "country", "P19": "place of birth", "P20": "place of death", "P22": "father", "P25": "mother", "P26": "spouse", "P27": "country of citizenship", ...}

train_annotated.json: human-annotated training set

(In the human-annotated data) at least 40.7% of the relational facts can only be inferred from multiple sentences:
According to the statistics on our human-annotated corpus sampled from Wikipedia documents, at least 40.7% relational facts can only be extracted from multiple sentences, which is not negligible

train_distant.json: distantly supervised training set
dev.json: development (validation) set
test.json: test set
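A minimal sketch of working with the relation file: it parses a few entries copied from the rel_info.json excerpt and builds integer class indices. The inline string replaces reading the real file so the snippet is self-contained, and reserving index 0 for "no relation" is an assumption made for illustration, not something the dataset files prescribe.

```python
import json

# A few entries copied from the rel_info.json excerpt; with the real file this
# would be: rel2name = json.load(open("rel_info.json", encoding="utf-8"))
rel_info_text = '{"P6": "head of government", "P17": "country", "P19": "place of birth"}'
rel2name = json.loads(rel_info_text)

# Map each property ID to a class index; index 0 is kept free for "no relation"
# (an assumption of this sketch).
rel2idx = {pid: idx for idx, pid in enumerate(sorted(rel2name), start=1)}

# Reverse lookup from a readable name back to the property ID.
name2rel = {name: pid for pid, name in rel2name.items()}
```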

3. Data format:

Data Format:
{
  'title',
  'sents':     [
                  [word in sent 0],
                  [word in sent 1]
               ],
  'vertexSet': [
                  [
                    { 'name': mention_name,
                      'sent_id': id of the sentence containing the mention,
                      'pos': position of the mention in that sentence,
                      'type': NER_type},
                    {another mention}
                  ],
                  [another entity]
                ],
  'labels':   [
                {
                  'h': idx of head entity in vertexSet,
                  't': idx of tail entity in vertexSet,
                  'r': relation,
                  'evidence': ids of the evidence sentences
                }
              ]
}
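To make the layout concrete, here is a small sketch that builds a toy document in this format and resolves its labels into readable triples. The document content is invented, and treating `pos` as a `[start, end)` token span is an assumption of this sketch.

```python
# Toy document in the format above (content invented for illustration).
doc = {
    "title": "Toy document",
    "sents": [
        ["Marie", "Curie", "was", "born", "in", "Warsaw", "."],
        ["Curie", "later", "moved", "to", "Paris", "."],
    ],
    "vertexSet": [
        [  # entity 0: two coreferent mentions
            {"name": "Marie Curie", "sent_id": 0, "pos": [0, 2], "type": "PER"},
            {"name": "Curie", "sent_id": 1, "pos": [0, 1], "type": "PER"},
        ],
        [  # entity 1
            {"name": "Warsaw", "sent_id": 0, "pos": [5, 6], "type": "LOC"},
        ],
    ],
    "labels": [{"h": 0, "t": 1, "r": "P19", "evidence": [0]}],
}


def mention_text(doc, mention):
    """Recover the mention tokens, assuming 'pos' is a [start, end) token span."""
    start, end = mention["pos"]
    return " ".join(doc["sents"][mention["sent_id"]][start:end])


def readable_triples(doc):
    """Turn each label into (head name, relation ID, tail name) using the first mention."""
    triples = []
    for label in doc["labels"]:
        head = doc["vertexSet"][label["h"]][0]["name"]
        tail = doc["vertexSet"][label["t"]][0]["name"]
        triples.append((head, label["r"], tail))
    return triples


triples = readable_triples(doc)
```

Note that `h` and `t` index entities (lists of mentions) in `vertexSet`, not individual mentions, which is why the sketch picks the first mention to get a display name.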

4. Data analysis

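Types of reasoning needed to extract the relations: pure pattern matching, logical (multi-hop) reasoning, coreference reasoning, common-sense reasoning, ...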
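One rough way to gauge how much cross-sentence reasoning the data needs is to count the labels whose supporting evidence spans more than one sentence. This is a proxy under the assumption that every label carries an `evidence` list; it is not necessarily how the multi-sentence figure quoted for the human-annotated set was derived.

```python
def multi_sentence_share(docs):
    """Fraction of labels whose evidence covers more than one distinct sentence."""
    total = 0
    multi = 0
    for doc in docs:
        for label in doc["labels"]:
            total += 1
            if len(set(label["evidence"])) > 1:
                multi += 1
    return multi / total if total else 0.0


# Toy input: one single-sentence fact and one fact supported by two sentences.
toy_docs = [
    {"labels": [
        {"h": 0, "t": 1, "r": "P17", "evidence": [1]},
        {"h": 0, "t": 2, "r": "P19", "evidence": [0, 2]},
    ]}
]
share = multi_sentence_share(toy_docs)
```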

5. Baselines used

CNN, LSTM, BiLSTM, Context-aware Attention
