The dictionary
# Membership lookup in a list is O(n) (linear scan)
# Converting to a set gives O(1) average-time lookups (hash-based)
vocab=set([line.rstrip() for line in open('vocab.txt')])
print(vocab)
{'rts', 'tactual', 'unavoidably', 'Interstate', 'Compared', 'vulcanized', 'Shakya', 'lowliest', 'Weber', 'Vocational', 'destroyer', 'Flying', 'showmanship', 'spurred', 'Devout', 'Amstutz', 'totality', 'Sides', 'Dietetic', 'JYJ', 'struggled', 'translated', 'myth', 'alienate', 'corona', 'exemplifies', 'showcase', 'drawers', 'stringently', 'companion', 'hopping', 'Reference', 'sauces', 'requiring', 'strove', 'volens', 'complex', 'confronted', 'playbacks', 'confinement', 'sorption', 'black', 'Churches', 'Erie', 'manila', 'Rumanians', 'ancestry', 'sank', 'reagents', 'favorable', 'thorny', 'lifts', 'rhododendron', 'abduction', 'accenting', 'platinum', 'carries', 'barrel', 'detract', 'resisted', '102', 'Magnum', 'attained', 'engineer', 'naturalness', 'Wellesley', 'fiche', 'multistage', 'Sandalwood', 'stress', 'emotion', 'concluded', 'justitia', 'tumbled', 'activation', 'instance', 'harsher', 'phantom', 'sanctions', 'abortion', 'Gianicolo', 'Ching', 'limply', 'Stronghold', 'draftees', 'campagna', 'badinage', 'Couperin', 'Seminar', 'choked', 'involvement', 'unveiled', 'excite', 'jabbed', 'rotations', 'arteriosclerosis', 'geology', 'Redland', 'Mill', 'Schramm', 'candle', 'brave', 'Lovelace', 'Compagnie'}
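The list-vs-set difference can be sketched with a small timing comparison (a hypothetical toy vocabulary; absolute numbers will vary by machine):

```python
import timeit

# Hypothetical toy vocabulary; a real one would differ.
words = [f"word{i}" for i in range(100000)]
vocab_list = words
vocab_set = set(words)

# Membership in a list scans linearly; in a set it is a hash lookup.
t_list = timeit.timeit(lambda: "word99999" in vocab_list, number=100)
t_set = timeit.timeit(lambda: "word99999" in vocab_set, number=100)
print(f"list: {t_list:.4f}s  set: {t_set:.6f}s")
```

Looking up a word near the end of the list is the worst case for the linear scan, while the set lookup cost is essentially independent of vocabulary size.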
Generate all candidate sets
# Generate all (valid) candidate words
def generate_candidates(word):
    '''
    word: the given (misspelled) input
    Returns all valid candidate words
    '''
    # Generate words at edit distance 1
    # 1. insert  2. delete  3. replace
    # Assume the 26 lowercase letters
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    # print(splits)
    # [('', 'apple'), ('a', 'pple'), ('ap', 'ple'), ('app', 'le'), ('appl', 'e'), ('apple', '')]
    # insert
    inserts = [L + c + R for L, R in splits for c in letters]
    # delete
    deletes = [L + R[1:] for L, R in splits if R]
    # replace
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    candidates = set(inserts + deletes + replaces)
    return [cand for cand in candidates if cand in vocab]
print(generate_candidates('apple'))
['apply', 'apple', 'ample', 'apples']
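When the edit-distance-1 set yields no word in the vocabulary, one option is to compose the same edits twice to reach edit distance 2. A minimal sketch reusing the split/insert/delete/replace construction above (`edits1`/`edits2` are hypothetical helper names):

```python
letters = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    # All strings one insert, delete, or replace away from word.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    inserts = [L + c + R for L, R in splits for c in letters]
    deletes = [L + R[1:] for L, R in splits if R]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    return set(inserts + deletes + replaces)

def edits2(word):
    # Everything reachable by two single edits, i.e. edit distance <= 2.
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}
```

The result would then be filtered against the vocabulary exactly as in `generate_candidates`, e.g. `[w for w in edits2(word) if w in vocab]`. Note that `edits2` grows roughly quadratically in word length times 26, so it should only be used as a fallback.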
Load the corpus
import nltk
nltk.download('punkt')
from nltk.corpus import reuters
#Load the Reuters corpus bundled with nltk
categories=reuters.categories()
corpus=reuters.sents(categories=categories)
Build the language model: bigram
#Build the bigram language model
term_count={}
bigram_count={}
for doc in corpus:
    # Prepend <s> so that e.g. "<s> I like basketball ." lets us estimate P(I|<s>)
    doc = ['<s>'] + doc
    # The last word has nothing after it, hence len(doc)-1
    for i in range(0, len(doc) - 1):
        # bigram: [i, i+1]
        term = doc[i]
        bigram = doc[i:i+2]
        if term in term_count:
            term_count[term] += 1
        else:
            term_count[term] = 1
        # Join with a space so keys stay unambiguous
        # (''.join would collide, e.g. ('a','bc') and ('ab','c') both give 'abc')
        bigram = ' '.join(bigram)
        if bigram in bigram_count:
            bigram_count[bigram] += 1
        else:
            bigram_count[bigram] = 1
print(term_count)
print(bigram_count)
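With the counts in hand, the add-one (Laplace) smoothed bigram probability P(w2|w1) = (count(w1 w2) + 1) / (count(w1) + V) used during correction can be isolated as a small helper. A sketch with hypothetical toy counts in the same shape as `term_count`/`bigram_count`:

```python
import numpy as np

# Hypothetical toy counts for illustration only.
term_count = {'<s>': 2, 'I': 2, 'like': 1, 'basketball': 1}
bigram_count = {'<s> I': 2, 'I like': 1, 'like basketball': 1}
V = len(term_count)

def bigram_logprob(w1, w2):
    # Add-one smoothing; unseen bigrams back off to log(1/V).
    bg = w1 + ' ' + w2
    if w1 in term_count and bg in bigram_count:
        return np.log((bigram_count[bg] + 1.0) / (term_count[w1] + V))
    return np.log(1.0 / V)

print(bigram_logprob('<s>', 'I'))   # log((2+1)/(2+4)) = log(0.5)
print(bigram_logprob('I', 'hate'))  # unseen bigram -> log(1/4)
```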
User typo probabilities
# User typo probability statistics - channel probability
channel_prob = {}
for line in open('spell-errors.txt'):
    items = line.split(":")
    correct = items[0].strip()
    mistakes = [item.strip() for item in items[1].strip().split(",")]
    channel_prob[correct] = {}
    for mis in mistakes:
        channel_prob[correct][mis] = 1.0/len(mistakes)
print(channel_prob)
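A self-contained run of the parsing above on one hypothetical line in the `spell-errors.txt` format (`correct: mistake1, mistake2`), showing the nested structure and the uniform probability assigned to each misspelling:

```python
# Hypothetical example line; the real file's contents may differ.
line = "raining: rainning, raning"

items = line.split(":")
correct = items[0].strip()
mistakes = [item.strip() for item in items[1].strip().split(",")]

channel_prob = {correct: {mis: 1.0 / len(mistakes) for mis in mistakes}}
print(channel_prob)  # {'raining': {'rainning': 0.5, 'raning': 0.5}}
```

Splitting the probability mass uniformly is a simplification; with real error-frequency data, each misspelling could be weighted by how often it occurs.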
Correction
# Vocabulary size of the language model
V = len(term_count)
import numpy as np
file = open('testdata.txt','r')
for line in file:
    items = line.rstrip().split('\t')
    words = items[2].split()
    # words = ["I", "like", "playing"]
    for idx, word in enumerate(words):
        if word not in vocab:
            # Replace word with the most likely correct one
            # Step 1: generate all valid candidates
            candidates = generate_candidates(word)
            # One option: if candidates == [], generate more candidates
            # TODO: generate a larger candidate set (e.g. edit distance 2)
            if not candidates:
                continue
            probs = []
            # For each candidate, compute its score:
            # score = p(correct) * p(mistake|correct)
            #       = log p(correct) + log p(mistake|correct)
            # Keep the candidate with the highest score
            for candi in candidates:
                prob = 0
                # a. channel probability
                # If the pair is not in the map, give it a very small probability
                if candi in channel_prob and word in channel_prob[candi]:
                    prob += np.log(channel_prob[candi][word])
                else:
                    prob += np.log(0.001)
                # b. language-model probability
                # Use <s> as the previous word at the start of the sentence
                prev = words[idx - 1] if idx > 0 else '<s>'
                bg = prev + ' ' + candi
                if prev in term_count and bg in bigram_count:
                    prob += np.log((bigram_count[bg] + 1.0) /
                                   (term_count[prev] + V))
                # TODO: also score the following bigram [candi, next_word]
                # prob += np.log(bigram probability)
                else:
                    prob += np.log(1.0 / V)
                probs.append(prob)
            max_idx = probs.index(max(probs))
            print(word, candidates[max_idx])
Sample output:
protectionst protectionist
products. products
long-run, long-run
gain. gain
gain. gain
17, 17e
17, 17e
retaiation retaliation
cost. cost
cost. cost
cost. cost
busines, business
ltMC.T. ltMC.T
U.S., U.S.
Murtha, Murtha
worried. worried
seriousnyss seriousness
aganst against
.............