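DocRED is a large-scale, human-annotated, general-domain dataset for document-level relation extraction, released by thunlp in 2019. It is built from Wikipedia and Wikidata. paper, code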
Dataset:
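Human-annotated data: drawn from 5,053 Wikipedia documents, with about 130k entities and 50k relations
Distantly supervised data: drawn from 101,873 Wikipedia documents (100k+), with about 2.55M entities and 880k relations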
1. Data processing pipeline:
Stage 1: Distantly Supervised Annotation Generation
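Starting from the introductory section of each Wikipedia document: run NER with spaCy, link the entities to Wikidata (mentions with the same KB id are merged), and query Wikidata for the relations between the entities. (It appears that entities not found in Wikidata are discarded, so every resulting entity and relation exists in Wikidata.)
P.S. Documents shorter than 128 words, or with fewer than 4 entities or fewer than 4 relations, are discarded, which leaves 107,050 documents. Of these, 5,053 are randomly selected and annotated by humans for the 96 most frequent relation types.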
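The size filter described above can be sketched in a few lines. This is only an illustration, not the authors' code: it assumes documents are already in the released JSON layout (`sents`, `vertexSet`, `labels`, described in section 3), and the names `MIN_WORDS`, `MIN_ITEMS`, `keep_document` and `filter_documents` are made up for this example.

```python
from typing import Any, Dict, List

MIN_WORDS = 128  # documents shorter than this are discarded
MIN_ITEMS = 4    # minimum number of entities, and of relations

def keep_document(doc: Dict[str, Any]) -> bool:
    """Return True if the document passes the Stage 1 size filter."""
    n_words = sum(len(sent) for sent in doc["sents"])
    n_entities = len(doc["vertexSet"])
    n_relations = len(doc.get("labels", []))
    return (
        n_words >= MIN_WORDS
        and n_entities >= MIN_ITEMS
        and n_relations >= MIN_ITEMS
    )

def filter_documents(docs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only the documents that pass the filter, preserving order."""
    return [doc for doc in docs if keep_document(doc)]
```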
Stage 2: Named Entity and Coreference Annotation
Human annotators review the entities from Stage 1 and merge coreferent mentions.
Stage 3: Entity Linking
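Each named entity mention is mapped to a set of candidate Wikidata items. Several sources are combined here (name matching, TagMe recommendations, Wikidata hyperlinks, etc.) so that relations are not missed because of a linking error.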
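Combining several candidate sources amounts to taking a set union per mention. The sketch below is hypothetical throughout: `alias_index`, `hyperlink_targets` and `recommender` stand in for name matching, hyperlinks and a TagMe-style recommender, and the item ids are toy placeholders rather than real Wikidata ids.

```python
from typing import Callable, Dict, Iterable, Set

def candidate_items(
    mention: str,
    alias_index: Dict[str, Set[str]],
    hyperlink_targets: Dict[str, Set[str]],
    recommender: Callable[[str], Iterable[str]],
) -> Set[str]:
    """Union the candidate KB items proposed by each source for one mention."""
    candidates: Set[str] = set()
    # Source 1: items whose name or alias literally matches the mention.
    candidates |= alias_index.get(mention.lower(), set())
    # Source 2: items hyperlinked from this mention in the document.
    candidates |= hyperlink_targets.get(mention, set())
    # Source 3: suggestions from an external entity linker (stubbed here).
    candidates |= set(recommender(mention))
    return candidates

# Toy data; the ids are placeholders.
alias_index = {"paris": {"Q_PARIS_CITY", "Q_PARIS_MYTH"}}
hyperlink_targets = {"Paris": {"Q_PARIS_CITY"}}
result = candidate_items("Paris", alias_index, hyperlink_targets, lambda m: ["Q_PARIS_TEXAS"])
```

Keeping a set of candidates, instead of committing to one item early, is what lets the later relation step recover from a wrong top-ranked link.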
Stage 4: Relation and Supporting Evidence Collection
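First, relations are predicted from the Stage 3 results using RE models and distant supervision. Annotators then review the predictions and select the supporting evidence.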
We recommend 19.9 relation instances per document from entity linking, and 7.8 from RE models for supplement.
Finally 57.2% relation instances from entity linking and 48.2% from RE models are reserved.
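Stage 5: Distantly Supervised Data Generation
All documents other than the human-annotated ones are used as distantly supervised data. To keep the distribution consistent with the annotated data, a BERT model fine-tuned on the annotated data is used for NER. Each entity mention is linked to a single Wikidata item, mentions with the same KB id are again merged, and relations are then labeled by distant supervision.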
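The two mechanical steps here, merging mentions by KB id and labeling entity pairs from the knowledge base, can be illustrated as follows. This is a minimal sketch under stated assumptions, not the authors' pipeline: the `kb_id` field, the in-memory `kb_triples` set (standing in for a Wikidata query) and the `Q_...` ids are all invented for the example; only `P17` ("country") comes from rel_info.json.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (head KB id, relation id, tail KB id)

def merge_mentions_by_kb_id(mentions: List[dict]) -> Dict[str, List[dict]]:
    """Group mentions that were linked to the same KB id into one entity."""
    entities: Dict[str, List[dict]] = defaultdict(list)
    for mention in mentions:
        entities[mention["kb_id"]].append(mention)
    return dict(entities)

def distant_labels(entity_ids: Iterable[str], kb_triples: Iterable[Triple]) -> List[Triple]:
    """Keep every KB triple whose head and tail both occur in the document."""
    present = set(entity_ids)
    return sorted(
        t for t in kb_triples
        if t[0] in present and t[2] in present and t[0] != t[2]
    )

# Toy data; the Q-ids are placeholders, not real Wikidata ids.
mentions = [
    {"name": "Paris", "kb_id": "Q_PARIS"},
    {"name": "the French capital", "kb_id": "Q_PARIS"},
    {"name": "France", "kb_id": "Q_FRANCE"},
]
kb_triples = {
    ("Q_PARIS", "P17", "Q_FRANCE"),
    ("Q_PARIS", "P17", "Q_ELSEWHERE"),  # tail not in the document, so dropped
}
entities = merge_mentions_by_kb_id(mentions)
labels = distant_labels(entities.keys(), kb_triples)
```

Because the labels come from the KB and not from the text, a pair can be labeled with a relation the document never expresses, which is the usual source of noise in distantly supervised data.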
2. Dataset files:
- rel_info.json
Stores the relation information, 96 relation types in total. Format (keys are Wikidata IDs; see the paper's appendix for details): {"P6": "head of government", "P17": "country", "P19": "place of birth", "P20": "place of death", "P22": "father", "P25": "mother", "P26": "spouse", "P27": "country of citizenship", ...}
- train_annotated.json: human-annotated training set
(In the human-annotated data) at least 40.7% of the relational facts can only be inferred from multiple sentences: According to the statistics on our human-annotated corpus sampled from Wikipedia documents, at least 40.7% relational facts can only be extracted from multiple sentences, which is not negligible
- train_distant.json: distantly supervised training set
- dev.json: validation set
- test.json: test set
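A small loader for the files listed above might look like this. It is a sketch that assumes all files sit in one directory and that each split file holds a JSON list of documents; `load_docred` and `relation_name` are names chosen for this example, not part of any official code.

```python
import json
from pathlib import Path
from typing import Any, Dict, List, Tuple

SPLITS = ("train_annotated", "train_distant", "dev", "test")

def load_json(path: Path) -> Any:
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def load_docred(data_dir: str) -> Tuple[Dict[str, str], Dict[str, List[dict]]]:
    """Load rel_info.json plus whichever split files exist in data_dir."""
    root = Path(data_dir)
    rel_info = load_json(root / "rel_info.json")
    splits = {}
    for name in SPLITS:
        path = root / f"{name}.json"
        if path.exists():  # e.g. skip train_distant.json if not downloaded
            splits[name] = load_json(path)
    return rel_info, splits

def relation_name(rel_info: Dict[str, str], relation_id: str) -> str:
    """Map a Wikidata property id such as 'P17' to its readable name."""
    return rel_info.get(relation_id, relation_id)
```

Usage would be along the lines of `rel_info, splits = load_docred("path/to/data")`, where the path is whatever directory the files were saved to.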
3. Data format:
{
  'title',
  'sents': [
    [words in sent 0],
    [words in sent 1]
  ],
  'vertexSet': [
    [
      { 'name': mention_name,
        'sent_id': id of the sentence containing the mention,
        'pos': position of mention in a sentence,
        'type': NER_type },
      {another mention}
    ],
    [another entity]
  ],
  'labels': [
    {
      'h': idx of head entity in vertexSet,
      't': idx of tail entity in vertexSet,
      'r': relation,
      'evidence': evidence sentences' id
    }
  ]
}
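To make the format concrete, here is a sketch that walks one document and turns each label into a readable triple with its evidence sentences. The toy document is invented for illustration, and the code assumes that `pos` is a `[start, end)` token span and that the first mention of an entity can serve as its display name; neither is stated above.

```python
from typing import Dict, List, Tuple

def describe_relations(doc: dict, rel_info: Dict[str, str]) -> List[Tuple[str, str, str, List[str]]]:
    """Return (head name, relation name, tail name, evidence sentences) per label."""
    results = []
    for label in doc.get("labels", []):
        # 'h' and 't' index into vertexSet; each entity is a list of mentions.
        head = doc["vertexSet"][label["h"]][0]["name"]
        tail = doc["vertexSet"][label["t"]][0]["name"]
        relation = rel_info.get(label["r"], label["r"])
        evidence = [" ".join(doc["sents"][i]) for i in label["evidence"]]
        results.append((head, relation, tail, evidence))
    return results

# Toy document in the format above (not taken from the dataset).
doc = {
    "title": "Toy",
    "sents": [["Paris", "is", "the", "capital", "of", "France", "."]],
    "vertexSet": [
        [{"name": "Paris", "sent_id": 0, "pos": [0, 1], "type": "LOC"}],
        [{"name": "France", "sent_id": 0, "pos": [5, 6], "type": "LOC"}],
    ],
    "labels": [{"h": 0, "t": 1, "r": "P17", "evidence": [0]}],
}
triples = describe_relations(doc, {"P17": "country"})
# triples == [("Paris", "country", "France", ["Paris is the capital of France ."])]
```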
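4. Data analysis
Types of reasoning needed for the relations: pure pattern matching, logical reasoning (multi-hop), coreference, common-sense reasoning…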
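One rough, automatic proxy for how often cross-sentence reasoning is needed is the share of labels whose `evidence` list covers more than one sentence. This is an assumption-laden sketch: it is not the procedure behind the 40.7% figure quoted in section 2, and `multi_sentence_ratio` is a name made up here.

```python
from typing import Iterable

def multi_sentence_ratio(docs: Iterable[dict]) -> float:
    """Fraction of labels whose supporting evidence spans 2+ distinct sentences."""
    total = 0
    multi = 0
    for doc in docs:
        for label in doc.get("labels", []):
            total += 1
            if len(set(label.get("evidence", []))) > 1:
                multi += 1
    return multi / total if total else 0.0

# Toy input: one single-sentence label and one two-sentence label.
toy_docs = [{"labels": [{"evidence": [0]}, {"evidence": [0, 2]}]}]
ratio = multi_sentence_ratio(toy_docs)
```

This only works on splits that carry evidence annotations, such as the human-annotated training set.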
5. Baselines used
CNN, LSTM, BiLSTM, Context-aware Attention