Automatic Text Difficulty Classifier Assisting the Selection Of Adequate Reading Materials For European Portuguese Teaching --paper

the system uses existing Natural Language Processing (NLP) tools, a parser and an hyphenator, and two corpora, previously annotated by readability level.

hyphenator:

h_en.pairs('beautiful'
[['beau', 'tiful'], [u'beauti', 'ful']]

the system extracts 52 features, grouped in 7 groups: parts-of-speech (POS), syllables, words, chunks and phrases, averages and frequencies, and some extra features.

语言:葡萄牙语

one based on a five-levels scale
(A1, A2, B1, B2, C1)
a second experiment based in a simplified
three-levels scale (A, B and C)

3 nlp工具
STRING:相当于葡萄牙语的nltk
The YAH Hyphenator:This is a rule-based system that applies
various word processing division rules.

hypotaxis 从属结构
parataxis 并列结构

4 特征
The set of 52 features extracted by the system consists
in: (i) part-of-speech (POS) tags, chunks, words
and sentences features; (ii) verb features and different
metrics involving averages and frequencies; (iii)
several metrics involving syllables; and (iv) extra features.

名词、命名体识别对文本理解很重要
句法结构:名词短语、介词短语
助动词可以形成更长更复杂的动词链
hypotaxis 从属结构
parataxis 并列结构
Word frequency:unigram-based,拉普拉斯平滑
动词、名词比例,句长

上一篇:一个web应用的诞生(3)--美化一下


下一篇:PHP标准库 (SPL) 笔记