Preprocessing

clean_context

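I use regular expressions to replace or strip special symbols (newlines, periods, apostrophes, fractions, standalone numbers, markup tags and other punctuation), then lowercase the text and split it into tokens on a predefined set of delimiters.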

Parameters

ctx_in: the input string.
has_target: a boolean flag, False by default; it is not used in the function body shown here.
output: a list of tokens.

Example

input: "Even though supervised ones tend to perform best in terms of accuracy, they often lose ground to more flexible knowledge-based solutions, which do not require training by a word expert for every disambiguation target."
output: ['even', 'though', 'supervised', 'ones', 'tend', 'to', 'perform', 'best', 'in', 'terms', 'of', 'accuracy', 'they', 'often', 'lose', 'ground', 'to', 'more', 'flexible', 'knowledgebased', 'solutions', 'which', 'do', 'not', 'require', 'training', 'by', 'a', 'word', 'expert', 'for', 'every', 'disambiguation', 'target']

import re


def clean_context(ctx_in, has_target=False):
    # Note: has_target is accepted but not used in this function body.
    replace_newline = re.compile(r"\n")
    replace_dot = re.compile(r"\.")
    replace_cite = re.compile(r"'")
    replace_frac = re.compile(r"[\d]*frac[\d]+")
    replace_num = re.compile(r"\s\d+\s")
    rm_context_tag = re.compile(r'<.{0,1}context>')
    rm_cit_tag = re.compile(r'\[[eb]quo\]')
    rm_misc = re.compile(r"[\[\]\$`()%/,\.:;-]")

    # Replace newlines, periods and apostrophes with spaces
    # (the commented alternatives would insert marker tokens instead).
    ctx = replace_newline.sub(' ', ctx_in)  # or ' <eop> '
    ctx = replace_dot.sub(' ', ctx)  # or ' <eos> '
    ctx = replace_cite.sub(' ', ctx)  # or ' <cite> '

    # Replace fractions and standalone numbers with placeholder tokens.
    ctx = replace_frac.sub(' <frac> ', ctx)
    ctx = replace_num.sub(' <number> ', ctx)

    # Strip citation tags, context tags and any remaining punctuation.
    ctx = rm_cit_tag.sub(' ', ctx)
    ctx = rm_context_tag.sub('', ctx)
    ctx = rm_misc.sub('', ctx)

    # Lowercase, split on the delimiters and drop empty strings.
    word_list = [word for word in re.split(r'`|, | +|\? |! |: |; |\(|\)|_|,|\.|"|“|”|\'', ctx.lower()) if word]
    return word_list
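
To make the placeholder behaviour easier to see, here is a minimal sketch that reuses only two of the patterns above. The helper name tokenize_sketch and the sample sentence are my own illustrations, not part of the original module, and the sketch skips the other substitutions that clean_context performs.

```python
import re

# Pared-down copies of two patterns from clean_context.
replace_num = re.compile(r"\s\d+\s")          # an integer with whitespace on both sides
rm_misc = re.compile(r"[\[\]\$`()%/,\.:;-]")  # punctuation that is deleted outright


def tokenize_sketch(text):
    """Hypothetical helper: substitute, strip punctuation, lowercase, split."""
    text = replace_num.sub(" <number> ", text)
    text = rm_misc.sub("", text)
    return [tok for tok in re.split(r" +", text.lower()) if tok]


tokens = tokenize_sketch("The knowledge-based model used 300 features.")
print(tokens)
# ['the', 'knowledgebased', 'model', 'used', '<number>', 'features']
```

Two things are worth noticing. The hyphen is deleted rather than replaced by a space, which is why "knowledge-based" becomes the single token "knowledgebased". And because the number pattern requires whitespace on both sides, a number at the very start or end of the string is left as it is in this sketch.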

lemmatize_data

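I lemmatize each token to reduce inflected forms to a common base form. WordNetLemmatizer treats every token as a noun unless told otherwise, so plural nouns are reduced (for example, 'terms' becomes 'term') while inflected verbs such as 'supervised' are left unchanged.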

Parameters

input_data: a list of tokens returned by the clean_context function.
output: a list of lemmatized tokens.

Example

input: ['even', 'though', 'supervised', 'ones', 'tend', 'to', 'perform', 'best', 'in', 'terms', 'of', 'accuracy', 'they', 'often', 'lose', 'ground', 'to', 'more', 'flexible', 'knowledgebased', 'solutions', 'which', 'do', 'not', 'require', 'training', 'by', 'a', 'word', 'expert', 'for', 'every', 'disambiguation', 'target']
output: ['even', 'though', 'supervised', 'one', 'tend', 'to', 'perform', 'best', 'in', 'term', 'of', 'accuracy', 'they', 'often', 'lose', 'ground', 'to', 'more', 'flexible', 'knowledgebased', 'solution', 'which', 'do', 'not', 'require', 'training', 'by', 'a', 'word', 'expert', 'for', 'every', 'disambiguation', 'target']

from nltk.stem import WordNetLemmatizer


def lemmatize_data(input_data):
    result = []
    wnl = WordNetLemmatizer()
    # lemmatize() uses the noun part of speech by default.
    for token in input_data:
        result.append(wnl.lemmatize(token))
    return result

In summary

Now I combine a pandas DataFrame with the two utility functions described above to preprocess the data.

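import nltk
from KaiCode.preprocessing import clean_context, lemmatize_data
from nltk.corpus import stopwords
import re
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
# Corpus is assumed to be a pandas DataFrame, loaded earlier, with a 'full_text' column.
# Remove rows with missing text, if any (dropna on the column alone would not drop rows from Corpus).
Corpus = Corpus.dropna(subset=['full_text'])
# Clean and lemmatize each entry, giving one list of tokens per row.
Corpus['full_text'] = [lemmatize_data(clean_context(entry)) for entry in Corpus['full_text']]
# Drop stop words and join the remaining tokens back into a single string.
Corpus['full_text'] = [' '.join([token for token in entry if token not in stop_words]) for entry in Corpus['full_text']]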

Step 1: drop every row whose full_text is missing.
Step 2: clean and lemmatize each entry, which gives one list of tokens per row.
Step 3: remove the stop words from each token list and join the remaining tokens back into a string.
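
The three steps can be tried on a toy DataFrame without downloading any NLTK data. In this sketch, clean_stub and lemmatize_stub are hypothetical stand-ins for clean_context and lemmatize_data, and the four-word stop list is an assumption made for illustration; the real pipeline uses NLTK's English stop words.

```python
import pandas as pd


# Hypothetical stand-ins so the sketch runs on its own.
def clean_stub(text):
    return text.lower().split()


def lemmatize_stub(tokens):
    return tokens


stop_words = {"the", "a", "of", "is"}  # assumed tiny list, for illustration only

corpus = pd.DataFrame({"full_text": ["The cat is a pet", None, "A bag of words"]})

# Step 1: drop rows whose text is missing.
corpus = corpus.dropna(subset=["full_text"]).reset_index(drop=True)

# Step 2: turn each text into a list of clean tokens.
tokens = corpus["full_text"].apply(lambda text: lemmatize_stub(clean_stub(text)))

# Step 3: filter out stop words and join the rest back into one string.
corpus["full_text"] = tokens.apply(
    lambda toks: " ".join(tok for tok in toks if tok not in stop_words)
)

print(corpus["full_text"].tolist())
# ['cat pet', 'bag words']
```

Calling dropna on the DataFrame with subset=["full_text"] removes the whole row, so the column and the rest of the frame stay aligned before the two column assignments.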
