KMP算法简介-string匹配

KMP算法简介

KMP算法是用来做字符串匹配的,比如目标字符串:'acabacacd’是否能在源字符串:'acfacabacabacacdkacfacabacabacacdk’找到匹配。传统的暴力解法复杂度是O(nk)的,k是目标字符串长度,n是源字符串长度。所以有优化的KMP算法复杂度为O(n)

算法核心

这种算法的核心在于如何处理某个字符不match的情况。对于传统暴力算法,当某个位置字符不match的时候,就会往下顺移一位,这也是为什么暴力算法的复杂度是O(nk)。而KMP在某个位置字符不match的时候,会跳过某些不需要的比较,牵涉到的核心概念是“Longest Proper Prefix which is Suffix” (lps数组)。lps数组是一个整数数组,长度与目标字符串一致,每个位置的整数值代表“以这个位置index为结尾的后缀传和这个位置的数字作为index的之前的前缀串相同”。这个确实很不好理解,但是总体的意思就是,有了这个数组,当某个位置字符不match的时候,我们就无需一位一位顺移,而是可以跳 (具体的原理非常难以理解,所以建议暂时先理解这个high level的idea)

lps数组计算

这个数据计算具体的implementation也是非常的难以理解,虽然只有寥寥几行代码,但是想要完全理解为什么是这样依旧非常困难。如果想要完全搞清楚这个意义,建议看下面两个详细教程:教程1教程2
下面是根据我浅显的理解写的implementation。更详细的解释在上面两个教程里面也有。想不通了就回去看看

def create_LPS(w):
    # initialize the array with all 0s
    LPS = [0] * len(w)
    # tp means the pointer at the prefix position
    tp = 0
    # bp means the pointer at the suffix position
    bp = 1
    # iterate through the whole word
    while bp < len(w):
        # if two pointers points to the same char, we give a value in the lps array and increment two pointers
        if w[tp] == w[bp]:
            LPS[bp] = tp + 1
            tp += 1 
            bp += 1 
        # if two pointers points to different char, there are two posibilities:
        else:
            # 1. if tp is 0, we keep tp and put zero value in lps, and only increment bp by one (we don't increment tp because no match char found)
            if tp == 0:
                LPS[bp] = 0
                bp += 1 
            # 2. else, comes the part that very difficult to understand, we update the tp using the value in lps
            else:
                tp = LPS[tp-1]
                
    return LPS

完整KMP算法

有了这个lps数组,一旦遇到不match的字符,直接根据lps数组跳过无需比较的位置

LPS = create_LPS(word)
pt = 0
pw = 0

while pt < len(text):
    # if current place match, keep going until found the complete match or a mismatch
    if text[pt] == word[pw]:
        pt += 1 
        pw += 1
        if pw == len(word):
            print('found',pt-len(word),pt)
            # we can jump again if we want to search for multiple matching once one match is already found
            pw = LPS[pw-1]
    # if current place does not match, two conditions
    else:
        # if pointer in word is 0, means the beginning does not match, we only need to increment the pointer in the text
        if pw == 0:
            pt += 1 
        # if pointer in word is not 0, means there is some place we can jump according to the LPS array
        else:
            pw = LPS[pw-1]

在具体的详细教程后面再写吧,确实比较难解释。。。

引用:

教程1教程2

上一篇:KMP子串查找算法


下一篇:查找字符串