自动补全与语言模型
overview
你将会:
- create language model(LM) from text corpus to
- Estimate probability of word sequences
- Estimate probability of a word following a sequence of words
- Apply this concept to autocomplete a sentence with most likely suggestions
language model(LM) 的其他应用:
-
Speech recognition 使用语言模型知道是
I saw a van
而不是
eyes awe of an
-
Spelling correction 用于拼写更正,如
He entered the ship to buy some groceies
,P(entered the shop to buy)>P(entered the ship to buy) -
Augmentative communication 增强通信系统
Predict most likely word from menu for people unable to physically talk or sign.(Newell et al.,1998)
Learning objectives to sentence auto-complete :
- Process text corpus to N-gram language model 通过句子的前一个词来预测下一个词
- Out of vocabulary words 调整模型使其处理在训练中没有出现的词
- Smoothing for previously unseen N-grams 假装未出现过的单词至少出现一次
- Language model evaluation
N-grams and probabilities
N-gram介绍
An N-gram is a sequence of N words是单词序列(也可以是字符或者其他元素),在处理语料库时,标点符号被视为单词,其他的特殊字符如代码都会去掉
sequence probabilities
-
给定句子,如何计算概率?
P(the teacher drinks tea)=?
-
会议条件概率和链式规则
\[P(B|A)=\frac{P(A,B)}{P(A)}→P(A,B)=P(A)P(B|A)\\ P(A,B,C,D)=P(A)P(B|A)P(C|A)P(C|A,B)P(D|A,B,C) \]所以P(the teacher drinks tea)=P(the)P(teacher|the)P(drinks|the teacher)P(tea|the teacher drinks)
这是理想情况,但是实际上有局限
Sentence not in corpus:
-
问题:语料库几乎从不包含我们感兴趣的确切句子甚至更长的子序列!
Input:
the teacher drinks tea
计算公式为P(the teacher drinks tea)=P(the)P(teacher|the)P(drinks|the teacher)P(tea|the teacher drinks),但是对于P(tea|the teacher drinks)这一项
\[P(tea|the\ teacher\ drinks\ tea)=\frac{C(the\ teacher\ drinks\ tea)\leftarrow both }{C(the\ teacher\ drinks)\leftarrow likely\ 0} \]随着句子变长,越来越多单词的确切顺序可能性会越来越小,
所以P(the teacher drinks tea)<P(drinks)
Approximation of sequence probabilities
P(the teacher drinks tea)
,如果不是在drinks
之前查找所有单词,就只需要考虑前一个单词(即P(tea|the teacher drinks)的前一个是P(drinks|the teacher))drinks
,使用 Bigram(二元组)近似,简化公式
只需要应用马尔可夫假设而不是查找所有单词
-
Markov assumption: only last N words matter 只看前n个单词
-
Bigram
\[P(w_n|w_1^{n-1})\approx P(w_n|w_{n-1}) \] -
N-gram
\[P(w_n|w_1^{n-1})\approx P(w_n|w_{n-N+1}^{n-1}) \] -
Entire sentence modeled with bigram
\[P(w_1^n)\approx \prod_{i=1}^nP(w_i|w_{i-1})\\ P(w_1^n)\approx P(w_1)P(w_2|w_1)...P(w_n|w_{n-1}) \]
Starting and ending sentence
start of sentence token<s>
the teacher drinks tea
P(the teacher drinks tea)≈P(the)P(teacher|the)P(drinks|teacher)P(tea|drinks)
P(the)
因为the没有上下文,不能计算bigram概率,所以需要做预测,添加一个特殊术语<s>the teacher drinks tea
将每一个词都变得可以计算
P(<s> the teacher drinks tea)≈P(the|<s>)P(teacher|the)P(drinks|teacher)P(tea|drinks)
类似的原理也使用于N-grams
start of sentence token <s> for N-grams:
-
Trigram: P(the teacher drinks tea)
≈P(the)P(teacher|the)P(drinks|teacher)P(tea|drinks)
缺少的上下文可以解决:
添加两个开头token:
the teacher drinks tea => <s> <s> the teacher drinks tea
所以现在句子的概率是
\[P(w_1^n)\approx P(w_1|<s>\ <s>)P(w_2|<s>w_1)...P(w_n|w_{n-1}w_{n-1}) \] -
N-gram model: add N-1 start tokens <s>
End of sentence token </s>
来由
计算C(drinks w)的总数=1,而C(drinks)=2,为了继续在条件概率中用简化的公式,需要加入结尾标记
我们希望所有长度单词可能性和为1,如下图
solution
-
Bigram
<s> the teacher drinks tea => <s> the teacher drinks tea </s>
P(the|<s>)P(teacher|the)P(drinks|teacher)P(tea|drinks)P(</s>|tea)
Corpus:
<s> Lyn drinks chocolate </s>
<s> John drinks </s>
-
N-grams
N-grams => just ones </s>
E.g. Trigram:
the teacher drinks tea => <s> <s> the teacher drinks tea </s>
Example-bigram
现在尝试计算Lyn
以<s>
开头的概率,因为<s>
在语料库出现了三次,而Lyn
以<s>
开头出现两次,所以
计算<s> Lyn drinks chocolate </s>
第一个2/3是<s> Lyn
,1/2是Lyn drinks
,1/2是drinks chocolate
,2/2是chocolate </s>
得到1/6,比我们想象的1/3要低
N-gram language model
Outline:
- Count matrix 将语料库处理到矩阵中
- Probabilities matrix 将count 矩阵转换为概率矩阵
- Language model
- Log probability to avoid underflow
- Generative language model
Count matrix
-
Rows: unique corpus (N-1)-grams
-
Columns: unique corpus words
-
Bigram
count matrix
通过两个单词的滑动窗口在corpus上的移动,将矩阵中的值加一
Probability matrix
\[P(w_n|w_{n-N+1}^{n-1})=\frac{C(w_{n-N+1}^{n-1},w_n)}{C(w_{n-N+1}^{n-1})} \]- Divide each cell by its row sum\[sum(row)=\sum_{w\in V}C(w_{n-N+1}^{n-1},w_n)=C(w_{n-N+1}^{n-1}) \]
Language model
-
probability matrix => language model
-
sentence probability
-
next word prediction
-
Log probability
\[P(w_1^n)\approx\prod_{i=1}^nP(w_i|w_{i-1})\\ log(a*b)=log(a)+log(b) \]Generative Language model
从头或以很少的提示生成文本
算法:
- Choose sentence start 即
<s>
- Choose next bigram starting with previous word 选取二元单词组
- coninue until
</s>
is picked 重复直到选取了结束符号
Evaluation model
Test data
-
将预料库分割为 train/validation/test
-
For small corpora
- 80% Train
- 10% validation
- 10% test
-
For largecorpora (typical for text)
- 98% train
- 1% validaton
- 1% test
Test data-split method
Perplexity
困惑度为样本中复杂度的量度,如文本复杂度;
困惑度告诉我们文本是由人类编写而不是由一个随机选择单词的简单程序编写:
由人类编写则proplexity低,随机生成则高
\[PP(W)=P(s_1,s_2,...,s_m)^{-\frac{1}{m}}\\ \]W→test set containing m sentencess s
i→i-th sentence in the test set,each ending with </s>
m→number of all words in entire test set W including
</s> but not including <s>
Eg. m=100 (100个单词)
第一个模型:
\[P(W)=0.9=>PP(W)=0.9^{-\frac{1}{100}}=1.00105416 \]P(W)=0.9测试集概率0.9说明可以较好的预测。同时困惑度很低
第二个模型:
\[P(W)=10^{-250}=>PP(W)=(10^{-25})^{-\frac{1}{100}}\approx 316 \]-
Perplexity 越小越好
好的模型在20-60之间
-
Character level models PP < word-based models PP
Perplexity for bigram models
\[PP(W)=\sqrt[m]{\prod_{i=1}^m\prod_{j=1}^{|s_i|}\frac{1}{P(w_j^{(i)}|w_{j-1}^{(i)})}}\\ w_j^{(i)} → j-th\ word\ in\ i-th\ sentence \]一样的概率面对不同的测试集时,m越大,PP越低
- concatenate all sentences in W
如果测试集中所以句子都是串联在一起的,公式可简化为
\[PP(W)=\sqrt[m]{\prod_{i=1}^m\frac{1}{P(w_i|w_{i-1})}}\\ w_i→i-th\ word\ in\ test\ set \]Log perplexity
使用Log更容易计算,因为好的模型PP在20-60,所以LogPP在4.3-5.9
\[logPP(W)=-\frac{1}{m}\sum_{i=1}^mlog_2(P(w_i|w_{i-1})) \]Unigram困惑度很高,Bigram才合理一些,Trigram困惑度更好
<UNK> Unknown word
Out of vocabulary words
- Closed vs. Open vocabularies
- Unknown word = Out of vocabulary word (OOV)
- special tag <UNK> in corpus and in input 用特殊词UNK建模
Using <UNK> in corpus
- 建立词汇表 V
- Replace any word in corpus and not in V by <UNK>
- Count the probabilities with <UNK>as with any other word
如何建立词汇表
-
Criteria 标准
- Min word frenquency f :最小词频f,超过f的选入V中,替换所有其他单词为UNK
- Max |V|, include words by frequency :决定最大词汇量,选入词频高的直到装满
-
Use <UNK> sparingly 谨慎使用
UNK会让困惑度降低,但是有可能模型会生成一系列UNK代替有意义的句子
-
Perplexity -ony compare LMs with same V 只比较相同词汇表的语言模型
Smoothing
Add-one smoothing (Laplacian smoothing)
\[P(w_n|w_{n-1})=\frac{C(w_{n-1,w_n})+1}{\sum_{w\in V}(C(w_{n-1},w)+1)}\\ =\frac{C(w_{n-1},w_n)+1}{C(w_{n-1})+V} \]Add-k smoothing
\[P(w_n|w_{n-1})=\frac{C(w_{n-1},wn)+k}{\sum_{w\in V(C(w_{n-1},w)+k)}}\\ =\frac{C(w_{n-1},w_n)+k}{C(w_{n-1})+k*V} \]其他更先进的平滑
Kneser-Ney
Good-Turing
Backoff
-
If N-gram missing => user (N-1) gram,... 如果Ngram缺失就用N-1 gram,如果N-1 gram缺失就用 N-2 gram
-
Probability discounting e.g. Katz backoff
一般来说低阶模型出现的频率更高(低阶的元组包含在高阶元组中,阶数越高越稀疏),从而更加可靠,于是乎,未出现的高阶元组可以利用高阶元组中包含的低阶元组对其进行平滑。总之,高阶模型可靠时候便使用高阶模型,否则使用低阶模型。
-
“Stupid" backoff
即对一个语言模型,每当某个片段计数为0时,就往前一个词计算概率,并设有一个权重0.4
$$ S\left(w_{i} \mid w_{i-k+1}^{i-1}\right)= \quad\left\{\begin{aligned} \frac{f\left(w_{i-k+1}^{i}\right)}{f\left(w_{i-k+1}^{i-1}\right)} & \text { if } f\left(w_{i-k+1}^{i}\right)>0 \\ \alpha S\left(w_{i} \mid w_{i-k+2}^{i-1}\right) & \text { otherwise } \end{aligned}\right. \\ $$
-
interpolation插值
如在Tri-gram中
\[\hat P(chocolate|John drinks)=0.7*P(chocolate|John\ drinks)+0.2*P(chocolate|drinks)+0.1*P(chocolate) \] \[\hat P(w_n|w_{n-2}\ w_{n-1})=\lambda_1\ *P(w_n|w_{n-2}\ w_{n-1})+\\ \lambda_2*P(w_n|w_{n-1})+\lambda*P(w_n)\\ \mathop{\sum}_{i}\lambda_i=1 \]Summary
-
N-Grams and probabilities
-
Approximate sentence probability from N-Grams
-
Build language model from corpus
-
Fix missing information
-
Out of vocabulary words with <UNK>
-
Missing N-Gram in corpus with smoothing, backoff and interpolation
-
Evaluate language model with perplexity
-
Coding assignment!