KMP算法简介
KMP算法是用来做字符串匹配的,比如目标字符串:'acabacacd’是否能在源字符串:'acfacabacabacacdkacfacabacabacacdk’找到匹配。传统的暴力解法复杂度是O(nk)的,k是目标字符串长度,n是源字符串长度。所以有优化的KMP算法复杂度为O(n)
算法核心
这种算法的核心在于如何处理某个字符不match的情况。对于传统暴力算法,当某个位置字符不match的时候,就会往下顺移一位,这也是为什么暴力算法的复杂度是O(nk)。而KMP在某个位置字符不match的时候,会跳过某些不需要的比较,牵涉到的核心概念是“Longest Proper Prefix which is Suffix” (lps数组)。lps数组是一个整数数组,长度与目标字符串一致,每个位置的整数值代表“以这个位置index为结尾的后缀传和这个位置的数字作为index的之前的前缀串相同”。这个确实很不好理解,但是总体的意思就是,有了这个数组,当某个位置字符不match的时候,我们就无需一位一位顺移,而是可以跳 (具体的原理非常难以理解,所以建议暂时先理解这个high level的idea)
lps数组计算
这个数据计算具体的implementation也是非常的难以理解,虽然只有寥寥几行代码,但是想要完全理解为什么是这样依旧非常困难。如果想要完全搞清楚这个意义,建议看下面两个详细教程:教程1,教程2
下面是根据我浅显的理解写的implementation。更详细的解释在上面两个教程里面也有。想不通了就回去看看
def create_LPS(w):
# initialize the array with all 0s
LPS = [0] * len(w)
# tp means the pointer at the prefix position
tp = 0
# bp means the pointer at the suffix position
bp = 1
# iterate through the whole word
while bp < len(w):
# if two pointers points to the same char, we give a value in the lps array and increment two pointers
if w[tp] == w[bp]:
LPS[bp] = tp + 1
tp += 1
bp += 1
# if two pointers points to different char, there are two posibilities:
else:
# 1. if tp is 0, we keep tp and put zero value in lps, and only increment bp by one (we don't increment tp because no match char found)
if tp == 0:
LPS[bp] = 0
bp += 1
# 2. else, comes the part that very difficult to understand, we update the tp using the value in lps
else:
tp = LPS[tp-1]
return LPS
完整KMP算法
有了这个lps数组,一旦遇到不match的字符,直接根据lps数组跳过无需比较的位置
LPS = create_LPS(word)
pt = 0
pw = 0
while pt < len(text):
# if current place match, keep going until found the complete match or a mismatch
if text[pt] == word[pw]:
pt += 1
pw += 1
if pw == len(word):
print('found',pt-len(word),pt)
# we can jump again if we want to search for multiple matching once one match is already found
pw = LPS[pw-1]
# if current place does not match, two conditions
else:
# if pointer in word is 0, means the beginning does not match, we only need to increment the pointer in the text
if pw == 0:
pt += 1
# if pointer in word is not 0, means there is some place we can jump according to the LPS array
else:
pw = LPS[pw-1]
在具体的详细教程后面再写吧,确实比较难解释。。。