clean_context
I replace some special symbols using regular expressions, then split the text on a set of predefined delimiters.
Parameters
ctx_in: the input string.
has_target: accepted by the function but not used in the body shown here.
output: a list of tokens.
Example
input: “Even though supervised ones tend to perform best in terms of accuracy, they often lose ground to more flexible knowledge-based solutions, which do not require training by a word expert for every disambiguation target.”
output: ['even', 'though', 'supervised', 'ones', 'tend', 'to', 'perform', 'best', 'in', 'terms', 'of', 'accuracy', 'they', 'often', 'lose', 'ground', 'to', 'more', 'flexible', 'knowledgebased', 'solutions', 'which', 'do', 'not', 'require', 'training', 'by', 'a', 'word', 'expert', 'for', 'every', 'disambiguation', 'target']
import re

def clean_context(ctx_in, has_target=False):
    # Note: has_target is accepted but not used in this body.
    # Patterns for symbols that are replaced by a space or a placeholder.
    replace_newline = re.compile(r"\n")
    replace_dot = re.compile(r"\.")
    replace_cite = re.compile(r"'")
    replace_frac = re.compile(r"[\d]*frac[\d]+")
    replace_num = re.compile(r"\s\d+\s")
    # Patterns for markup tags and punctuation that are removed.
    rm_context_tag = re.compile(r'<.{0,1}context>')
    rm_cit_tag = re.compile(r'\[[eb]quo\]')
    rm_misc = re.compile(r"[\[\]\$`()%/,\.:;-]")

    ctx = replace_newline.sub(' ', ctx_in)  # alternative: ' <eop> '
    ctx = replace_dot.sub(' ', ctx)  # alternative: ' <eos> '
    ctx = replace_cite.sub(' ', ctx)  # alternative: ' <cite> '
    ctx = replace_frac.sub(' <frac> ', ctx)
    ctx = replace_num.sub(' <number> ', ctx)
    ctx = rm_cit_tag.sub(' ', ctx)
    ctx = rm_context_tag.sub('', ctx)
    ctx = rm_misc.sub('', ctx)

    # Lowercase, split on the remaining delimiters, and drop empty strings.
    word_list = [word for word in re.split(r'`|, | +|\? |! |: |; |\(|\)|_|,|\.|"|“|”|\'', ctx.lower()) if word]
    return word_list
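One detail of the `replace_num` pattern is worth checking. Because `\s\d+\s` consumes the whitespace on both sides of a number, two numbers separated by a single space cannot both match. The sketch below compares it with a lookaround variant. The variant and its name `replace_num_lookaround` are assumptions for illustration, not part of the function above.

```python
import re

# Pattern used in clean_context: whitespace, digits, whitespace.
replace_num = re.compile(r"\s\d+\s")

# Hypothetical variant: lookarounds check for whitespace without consuming it.
replace_num_lookaround = re.compile(r"(?<=\s)\d+(?=\s)")

text = "we ran 10 20 trials"

# The space after "10" is consumed, so "20" no longer has a leading space to match.
consumed = replace_num.sub(" <number> ", text)

# The surrounding spaces stay in place, so both numbers are replaced.
lookaround = replace_num_lookaround.sub("<number>", text)

print(consumed)    # we ran <number> 20 trials
print(lookaround)  # we ran <number> <number> trials
```

Neither pattern matches a number at the very start or end of the string, since both require whitespace on each side.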
lemmatize_data
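I lemmatize each token so that inflected forms are reduced to a common base form. WordNetLemmatizer treats every token as a noun unless told otherwise, which is why only the plural nouns change in the example below.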
Parameters
input_data: a list of tokens returned by the clean_context function.
output: a list of lemmatized tokens.
Example
input: ['even', 'though', 'supervised', 'ones', 'tend', 'to', 'perform', 'best', 'in', 'terms', 'of', 'accuracy', 'they', 'often', 'lose', 'ground', 'to', 'more', 'flexible', 'knowledgebased', 'solutions', 'which', 'do', 'not', 'require', 'training', 'by', 'a', 'word', 'expert', 'for', 'every', 'disambiguation', 'target']
output: ['even', 'though', 'supervised', 'one', 'tend', 'to', 'perform', 'best', 'in', 'term', 'of', 'accuracy', 'they', 'often', 'lose', 'ground', 'to', 'more', 'flexible', 'knowledgebased', 'solution', 'which', 'do', 'not', 'require', 'training', 'by', 'a', 'word', 'expert', 'for', 'every', 'disambiguation', 'target']
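from nltk.stem import WordNetLemmatizer

def lemmatize_data(input_data):
    result = []
    wnl = WordNetLemmatizer()
    for token in input_data:
        # lemmatize() treats the token as a noun unless a pos argument is given.
        result.append(wnl.lemmatize(token))
    return result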
In summary
Now I use a DataFrame together with the two utility functions described above to preprocess the data.
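import nltk
from nltk.corpus import stopwords
from KaiCode.preprocessing import clean_context, lemmatize_data

# Download the NLTK resources (only needed once).
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Corpus is assumed to be a pandas DataFrame with a 'full_text' column, loaded earlier.
# Remove rows whose 'full_text' is missing, if any.
Corpus.dropna(subset=['full_text'], inplace=True)
# Clean and lemmatize each entry, giving one list of tokens per row.
Corpus['full_text'] = [lemmatize_data(clean_context(entry)) for entry in Corpus['full_text']]
# Drop stop words and join the remaining tokens back into a single string.
Corpus['full_text'] = [' '.join([token for token in entry if token not in stop_words]) for entry in Corpus['full_text']]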
Step 1: drop all rows whose full_text is empty.
Step 2: turn each entry into a list of clean, lemmatized tokens.
Step 3: remove the stop words from each entry and join the remaining tokens back into a string.
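The three steps can also be traced on a plain Python list, without a DataFrame or NLTK. This is only a sketch: `toy_clean`, `toy_lemmatize`, and `TOY_STOP_WORDS` are hypothetical stand-ins for `clean_context`, `lemmatize_data`, and the NLTK English stop-word list, and they are much cruder than the real ones.

```python
import re

# Hypothetical stand-ins, used only to keep the example free of NLTK and pandas.
TOY_STOP_WORDS = {"the", "a", "of", "to", "in"}

def toy_clean(text):
    """Lowercase the text and split it on anything that is not a letter."""
    return [tok for tok in re.split(r"[^a-z]+", text.lower()) if tok]

def toy_lemmatize(tokens):
    """Strip a trailing plural 's' (a crude stand-in for WordNetLemmatizer)."""
    return [tok[:-1] if tok.endswith("s") and len(tok) > 3 else tok for tok in tokens]

rows = ["The terms of the solutions.", None, "A word expert in training"]

# Step 1: drop empty rows.
rows = [row for row in rows if row is not None]

# Step 2: turn each row into a list of clean, lemmatized tokens.
token_rows = [toy_lemmatize(toy_clean(row)) for row in rows]

# Step 3: remove stop words and join the remaining tokens back into a string.
cleaned = [" ".join(tok for tok in toks if tok not in TOY_STOP_WORDS) for toks in token_rows]

print(cleaned)  # ['term solution', 'word expert training']
```

Keeping the three steps as separate passes mirrors the DataFrame version, where each step rewrites the full_text column in turn.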