Flashtext：大规模数据清洗的利器

2023-01-27 16:33:43

Flashtext：大规模数据清洗的利器

在这篇文章中，我们将介绍一种新的关键字搜索和替换的算法：Flashtext 算法。Flashtext 算法是一个高效的字符搜索和替换算法。该算法的时间复杂度不依赖于搜索或替换的字符的数量。比如，对于一个文档有 N 个字符，和一个有 M 个词的关键词库，那么时间复杂度就是 O(N) 。这个算法比我们一般的正则匹配法快很多，因为正则匹配的时间复杂度是 O(M * N)。这个算法和 Aho Corasick 算法也有一点不同，因为它不匹配子字符串。

Flashtext 算法被设计为只匹配完整的单词。比如，我们输入一个单词 {Apple}，那么这个算法就不会去匹配 “I like Pineapple” 中的 apple。这个算法也被设计为首先匹配最长字符串。在举个例子，比如我们有这样一个数据集 {Machine， Learning，Machine Learning}，一个文档 “I like Machine Learning”，那么我们的算法只会去匹配 “Machine Learning” ，因为这是最长匹配。

这个算法我们已经在 Github 上面实现了，用的是 Python 语言。

1. 介绍

在信息检索领域，关键字搜索和替代都是很常见的问题。我们经常想要在一个特定的文本中搜索特定的关键词，或者在文本中替代一个特定的关键词。

例如：

关键字搜索：假设我们有一个软件工程师的简历 (D)，我们拥有一个 20k 的编程技巧词库 corpus = {Java, python, javascript, machien learning, …}。我们想要从这个词库 corpus 中哪些词在这个简历中出现了，即 corpus ∩ D。
关键字替换：另一个常用的用例是当我们有一个同义词的语料库（不同的拼写表示的是同一个单词），比如 corpus = {javascript: [‘javascript’, ‘javascripting’, ‘java script’], …} ，在软件工程师的简历中我们需要使用标准化的词，所有我们需要用一个标准化的词来替换不同简历中的同义词。

为了去解决上述这些问题，正则表达式是最常用的一个技术。虽然正则表达式可以很好的解决这个问题，但是当我们的数据量增大时，这个速度就会非常慢了。如果我们的文档达到百万级别时，那么这个运行时间将达到几天的量级。比如下面的图1，正则表达式在一个 10k 的词库中查找 15k 个关键词的时间差不多是 0.165 秒。但是对于 Flashtext 而言只需要 0.002 秒。因此，在这个问题上 Flashtext 的速度大约比正则表达式快 82 倍。

随着我们需要处理的字符越来越多，正则表达式的处理速度几乎都是线性增加的。然而，Flashtext 几乎是一个常量。在本文中，我们将着重讨论正则表达式与 Flashtext 之间的性能区别。我们还将详细的描述 Flashtext 算法及其工作原理，和一些基准测试。

1.1 用于关键字搜索的正则表达式

正则表达式是一种非常灵活和有用的模式匹配方式。比如我们在文本中搜索一个匹配 “\d{4}”，它表示任何 4 位数字匹配，如 2017。我们利用 Python 代码可以实现这样一个功能，如下：

import re

compiled_regex = re.compile(r'\b2017\b|\b\d{4}\b')

compiled_regex.findall('In 2017 2311 is my birthday.')

# output

['2017', '2311']

Flashtext：大规模数据清洗的利器

这里 ‘\b’ 用来表示单词边界，它会去匹配特殊字符，如 ‘space’，’period’，’new line’ 等。

1.2 用于关键字替换的正则表达式

我们也可以使用正则表达式来制作一个标准化术语的替换脚本，比如我们可以编写一个 Python 脚本来用 “javascript” 替换 “java script”。如下：

import re

re.sub(r"\bjava script\b", 'javascript', 'java script is awesome.')

# output

javascript is awesome.

Flashtext：大规模数据清洗的利器

2. Flashtext

Flashtext 是一种基于 Trie 字典数据结构和 Aho Corasick 的算法。它的工作方式是，首先它将所有相关的关键字作为输入。使用这些关键字建立一个 trie 字典，如下图3所示：

Flashtext：大规模数据清洗的利器
start 和 eot 是两个特殊的字符，用来定义词的边界，这和我们上面提到的正则表达式是一样的。这个 trie 字典就是我们后面要用来搜索和替换的数据结构。

2.1 利用 Flashtext 进行搜索

对于输入字符串（文档），我们对字符进行逐个遍历。当我们在文档中的字符序列 <\b>word<\b> 匹配到字典中的 word 时（start 和 eot 分别是字符序列的开始标签和结束标签），我们认为这是一个完整匹配了。我们将匹配到的字符序列所对应的标准关键字进行输出，具体如下：

Flashtext：大规模数据清洗的利器

2.2 利用 Flashtext 进行替换

对于输入字符串（文档），我们对字符进行逐个遍历它。我们先创建一个空的字符串，当我们字符序列中的 <\b>word<\b> 无法在 Trie 字典中找到匹配时，那么我们就简单的原始字符复制到返回字符串中。但是，当我们可以从 Trie 字典中找到匹配时，那么我们将将匹配到的字符的标准字符复制到返回字符串中。因此，返回字符串是输入字符串的一个副本，唯一的不同是替换了匹配到的字符序列，具体如下：

Flashtext：大规模数据清洗的利器

2.3 Flashtext 算法

Flashtext 算法那主要分为三部分，我们接下来将对每一部分进行单独分析：

构建 Trie 字典；
关键字搜索；
关键字替换；

2.3.1 构建 Trie 字典

为了构建 trie 字典，我们必须设立一个空的节点指向我们的空字典。这个节点被用作所有单词的起点。我们在字典中插入一个单词。这个单词中的下一个字符在本字典中作为关键字，并且这个指针需要再次指向一个空字典。这个过程不断重复，直到我们达到单词中的最后一个字符。当我们到达单词的末尾时，我们插入一个特殊的字符（eot）来表示词尾。

输入

关键词 w = c1c2c3…cn，其中 ci 表示一个输入字符。标准词我们用 s 来表示。

代码：用于 Flashtext 的初始化并向字典添加关键字

class FlashText(object):

    def __init__(self, case_sensitive=False):

        self._keyword = '_keyword_'   # end of term (eot) and key to store standardized name

        sef._white_space_chars = set(['.', '\t', '\n', '\a', ' ', ','])

        self.keyword_trie_dict = dict()

        sefl.case_sensitive = case_sensitive

    def add_keyword(self, keyword, clean_name = None):

        if not clean_name and keyword:

            clean_name = keyword

        if keyword and clean_name:

            # if both keyword and clean_name are not empty.

            if not self.case_sensitive:

                # if not case_sensitive then lowercase the keyword

                keyword = keyword.lower()

            current_dict = self.keyword_trie_dict

            for letter in keyword:

                current_dict = current_dict.setdefault(letter, {})

            current_dict[self._keyword] = clean_name

输出

上述程序将会创建一个字典，如图3所示。

2.3.2 关键字搜索

一旦我们将所有的词都添加到 trie 字典中，我们就可以在输入字符串中找到关键字。

输入

字符串 x = a1a2…an，其中 ai 是输入字符串 x 中的第 i 个字符。

代码：Python 代码用来获取字典中的输入字符串中的关键字。

def extract_keywords(self, sentence):

    keywords_extracted = []

    if not self.case_sensitive:

        # if not case_sensitive then lowercase the sentence

        sentence = sentence.lower()

    current_dict = self.keyword_trie_dict

    sequence_and_pos = 0

    idx = 0

    sentence_len = len(sentence)

    while idx < sentence_len:

        char = sentence[idx]

        # when we reach a character that might denote word end

        if char not in self.non_word_boundaries:

            # if eot is present in current_dict

            if self._keyword in current_dict or char in current_dict:

                # update longest sequence found

                sequence_found = None

                longest_sequence_found = None

                is_longer_seq_found = False

                if self._keyword in current_dict:

                    sequence_found = current_dict[self._keyword]

                    longest_sequence_found = current_dict[self._keyword]

                    sequence_end_pos = idx

                # re look for longest_sequence from this position

                if char in current_dict:

                    current_dict_continued = current_dict[char]

                    idy = idx + 1

                    while idy < sentence_len:

                        inner_char = sentence[idy]

                        if inner_char not in self.non_word_boundaries and self._keyword in current_dict_continued:

                            # update longest sequence found

                            longest_sequence_found = current_dict_continued[self._keyword]

                            sequence_end_ops = idy

                            is_longer_seq_found = True

                        if inner_char in current_dict_continued:

                            current_dict_continued = current_dict_continued[inner_char]

                        else:

                            break

                        idy += 1

                    else:

                        # end of sentence reached

                        if self._keyword in current_dict_continued:

                            # update longest sequence found

                            longest_sequence_found = current_dict_continued[self._keyword]

                            sequence_end_pos = idy

                            is_longer_seq_found = True

                    if is_longer_seq_found:

                        idx = sequence_end_pos

                current_dict = self.keyword_trie_dict

                if longest_sequence_found:

                    keywords_extracted.append(longest_sequence_found)

            else:

                # we reset current_dict

                current_dict = self.keyword_trie_dict

        elif char in current_dict:

            # char is present in current dictionary position

            current_dict = current_dict[char]

        else:

            # we reset current_dict

            current_dict = self.keyword_trie_dict

            # skip to end of word

            idy = idx + 1

            while idy < sentence_len:

                char = sentence[idy]

                if char not in self.non_word_boundaries:

                    break

                idy += 1

            idx = idy

        # if we are end of sentence and have a sequence discovered

        if idx + 1 >= sentence_len:

            if self._keyword in current_dict:

                sequence_found = current_dict[self._keyword]

                keywords_extracted.append(sequence_found)

        idx += 1

    return keywords_extracted

输出

返回一个列表，输出字符串 x 中找到的所有标准化之后的字，如图 4 所示。

2.3.3 关键字替换

我们使用相同的字典用标准化的字来替换输入字符串中的关键字。

输入

输入字符串 x = a1a2…an，其中 ai 表示第 i 个字符。

代码：Python 代码用于从输入字符串中用标准词替换。

def replace_keywords(self, sentence):

    new_sentence = ''

    orig_sentence = sentence

    if not self.case_sensitive:

        sentence = sentence.lower()

    current_word = ''

    current_dict = self.keyword_trie_dict

    current_white_space = ''

    sequence_end_pos = 0

    idx = 0

    sentence_len = len(sentence)

    while idx < sentence_len:

        char = sentence[idx]

        current_word += orig_sentence[idx]

        # when we reach whitespace

        if char not in self.non_word_boundaries:

            current_white_space = char

            # if end is present in current_dict

            if self._keyword in current_dict or char in current_dict:

                # update longest sequence found

                sequence_found = None

                longest_sequence_found = None

                is_longer_seq_found = False

                if self._keyword in current_dict:

                    sequence_found = curretn_dcit[self._keyword]

                    longest_sequence_found = current_dict[self._keyword]

                    sequence_end_pos = idx

                # re look for longest_sequence from this position

                if char in current_dict:

                    current_dict_continued = current_dict[char]

                    current_word_continued = current_word

                    idy = idx + 1

                    while idy < sentence_len:

                        inner_char = sentence[idy]

                        current_word_continued += orig_sentence[idy]

                        if inner_char not in self.non_word_boundaries and self._keyword in current_dict_continuted:

                            # update longest sequence found

                            current_white_space = inner_char

                            longest_sequence_found = current_dict_continued[self._keyword]

                            sequence_end_pos = idy

                            is_longer_seq_found = True

                        if inner_char in current_dict_continued:

                            current_dict_continued = curretn_dict_continued[inner_char]

                        else:

                            break

                        idy += 1

                    else:

                        # end of sentence reached.

                        if self._keyword in current_dict_continued:

                            # update longest sequence found

                            current_white_space = ''

                            longest_sequence_found = current_dict_continued[self._keyword]

                            sequence_end_pos = idy

                            is_longer_seq_found = True

                    if is_longer_seq_found:

                        idx = sequence_end_pos

                        current_word = current_word_continued

                current_dict = self.keyword_trie_dict

                if longest_sequence_found:

                    new_sentence += longest_sequence_found + curretn_white_space

                    current_word = ''

                    current_white_space = ''

                else:

                    new_sentence += current_word

                    current_word = ''

                    current_white_space = ''

            else:

                # we reset current_dict

                current_dict = self.keyword_trie_dict

                new_sentence += current_word

                current_word = ''

                current_white_space = ''

        elif char in current_dict:

            # we can continue from this char

            current_dict = current_dict[char]

        else:

            # we reset current_dict

            current_dict = self.keyword_trie_dict

            # skip to end of word

            idy = idx + 1

            while idy < sentence_len:

                char  = sentence[idy]

                current_word += orig_sentence[idy]

                if char not in self.non_word_boundaries:

                    break

                idy += 1

            idx = idy

            new_sentence += current_word

            current_word = ''

            current_white_space = ''

        # if we are end of sentence and have a sequence disv=convered

        if idx + 1 >= sentence_len:

            if self._keyword in current_dict:

                sequence_found = current_dict[self._keyword]

                new_sentence += sequence_found

        idx += 1

    return new_sentence

输出

在字符串 x 中找到需要替换的词，然后用标准词进行替换输出，如图 5 所示。

3. Flashtext 和正则表达式的基准测试

正如在图 1 和图 2 中所展示的，Flashtext 比正则表达式的处理速度要快很多。现在我们来做一些基准测试来更加说明这个问题。

3.1 关键字搜索

我们利用 Python 代码来实现这个关键字搜索的基准测试。首先，我们会随机创建一个 10K 的语料库。然后，我们将从单词列表中选择 1K 的词用来创建一个新的文档。

我们将从语料库中选择 k 个词，其中 k ∈ {0, 1000, 2000, .. , 20000}。我们将使用正则表达式和 Flashtext 分别对这个文档中的关键词进行搜索，然后对比分析。具体 Python 代码如下：

from flashtext.keyword import KeywordProcessor

import random

import re

import string

import time

def get_word_of_length(str_length):

    # generate a random word of given length

    return ''.join(random.choice(string.ascii_lowercase) for _ in range(str_length))

# generate a list of 100K words of randomly chosen size

all_words = [get_word_of_length(random.choice([3, 4, 5, 6, 7, 8])) for i in range(100000)]

print('Count  |  FlashText  | Regex  ')

print('------------------------------')

for keywords_length in [0, 1000, 5000, 10000, 15000]:

    # chose 1000 terms and create a string to search in.

    all_words_chosen = random.sample(all_words, 1000)

    story = ' '.join(all_words_chosen)

    # get unique keywrods from the list of words generated.

    unique_keywords_sublist = list(set(random.sample(all_words, keywords_length)))

    # compile Regex

    compiled_re = re.compile('|'.join([r'\b' + keyword + r'\b' for keyword in unique_keywords_sublist]))

    # add keywords to Flashtext

    keyword_processor = KeywordProcessor()

    keyword_processor.add_keywords_from_list(unique_keywords_sublist)

    # time the modules

    start = time.time()

    _ = keyword_processor.extract_keywords(story)

    mid = time.time()

    _ = compiled_re.findall(story)

    end = time.time()

    # print output

    print(str(keywords_length).ljust(6), '|',

          "{0:.5f}".format(mid - start).ljust(9), '|',

          "{0:.5f}".format(end - mid).ljust(9), '|')

    # output: Data for Figure 1

3.2 关键词替换

下面的代码是用来做关键词替换的 Python 代码。

from flashtext.keyword import KeywordProcessor

import random

import string

import re

import time

def get_word_of_length(str_length):

    # generate a random word of given length

    return ''.join(random.choice(string.ascii_lowercase) for _ in range(str_length))

# generate a list of 100K words of randomly chosen size

all_words = [get_word_of_length(random.choice([3, 4, 5, 6, 7, 8])) for i in range(100000)]

print('Count  | FlashText | Regex    ')

print('-------------------------------')

for keywords_length in range(1, 20002, 1000):

    # chose 5000 terms and create a string to search in.

    all_words_chosen = random.sample(all_words, 5000)

    story = ' '.join(all_words_chosen)

    # get unique keywords from the list of words generated.

    unique_keywords_sublist = list(set(random.sample(all_words, keywords_length)))

    # compile regex

    # source: https://*.com/questions/6116978/python-replace-multiple-strings

    rep = dict([(key, '_keyword_') for key in unique_keywords_sublist])

    compiled_re = re.compile("|".join(rep.keys()))

    # add keywords to flashtext

    keyword_processor = KeywordProcessor()

    for keyword in unique_keywords_sublist:

        keyword_processor.add_keyword(keyword, '_keyword_')

    # time the modules

    start = time.time()

    _ = keyword_processor.replace_keywords(story)

    mid = time.time()

    _ = compiled_re.sub(lambda m: rep[re.escape(m.group(0))], story)

    end = time.time()

    # print output

    print(str(keywords_length).ljust(6), '|',

          "{0:.5f}".format(mid - start).ljust(9), '|',

          "{0:.5f}".format(end - mid).ljust(9), '|',)

# output: Data for Figure 2

3.3 结论

正如我们在上面看到的对比结果，Flashtext 在关键字搜索和替换上面比正则表达式都快很多。特别是在处理大规模数据的时候，这个优势会更加的显示被体现出来。

4. Flashtext 使用文档

4.1 安装

pip install flashtext

4.2 使用例子

4.2.1 关键字提取

>>> from flashtext import KeywordProcessor

>>> keyword_processor = KeywordProcessor()

>>> # keyword_processor.add_keyword(<unclean name>, <standardised name>)

>>> keyword_processor.add_keyword('Big Apple', 'New York')

>>> keyword_processor.add_keyword('Bay Area')

>>> keywords_found = keyword_processor.extract_keywords('I love Big Apple and Bay Area.')

>>> keywords_found

>>> # ['New York', 'Bay Area']

4.2.2 关键字替换

>>> keyword_processor.add_keyword('New Delhi', 'NCR region')

>>> new_sentence = keyword_processor.replace_keywords('I love Big Apple and new delhi.')

>>> new_sentence

>>> # 'I love New York and NCR region.'

4.2.3 区分大小写字母

>>> from flashtext import KeywordProcessor

>>> keyword_processor = KeywordProcessor(case_sensitive=True)

>>> keyword_processor.add_keyword('Big Apple', 'New York')

>>> keyword_processor.add_keyword('Bay Area')

>>> keywords_found = keyword_processor.extract_keywords('I love big Apple and Bay Area.')

>>> keywords_found

>>> # ['Bay Area']

4.2.4 关键字不清晰

>>> from flashtext import KeywordProcessor

>>> keyword_processor = KeywordProcessor()

>>> keyword_processor.add_keyword('Big Apple')

>>> keyword_processor.add_keyword('Bay Area')

>>> keywords_found = keyword_processor.extract_keywords('I love big Apple and Bay Area.')

>>> keywords_found

>>> # ['Big Apple', 'Bay Area']

4.2.5 同时添加多个关键词

>>> from flashtext import KeywordProcessor

>>> keyword_processor = KeywordProcessor()

>>> keyword_dict = {

>>>     "java": ["java_2e", "java programing"],

>>>     "product management": ["PM", "product manager"]

>>> }

>>> # {'clean_name': ['list of unclean names']}

>>> keyword_processor.add_keywords_from_dict(keyword_dict)

>>> # Or add keywords from a list:

>>> keyword_processor.add_keywords_from_list(["java", "python"])

>>> keyword_processor.extract_keywords('I am a product manager for a java_2e platform')

>>> # output ['product management', 'java']

4.2.6 删除关键字

>>> from flashtext import KeywordProcessor

>>> keyword_processor = KeywordProcessor()

>>> keyword_dict = {

>>>     "java": ["java_2e", "java programing"],

>>>     "product management": ["PM", "product manager"]

>>> }

>>> keyword_processor.add_keywords_from_dict(keyword_dict)

>>> print(keyword_processor.extract_keywords('I am a product manager for a java_2e platform'))

>>> # output ['product management', 'java']

>>> keyword_processor.remove_keyword('java_2e')

>>> # you can also remove keywords from a list/ dictionary

>>> keyword_processor.remove_keywords_from_dict({"product management": ["PM"]})

>>> keyword_processor.remove_keywords_from_list(["java programing"])

>>> keyword_processor.extract_keywords('I am a product manager for a java_2e platform')

>>> # output ['product management']

有时候我们会将一些特殊符号作为字符边界，比如空格，\ 等等。为了重新设定字边界，我们需要添加一些符号告诉算法，这是单词字符的一部分。

>>> from flashtext import KeywordProcessor

>>> keyword_processor = KeywordProcessor()

>>> keyword_processor.add_keyword('Big Apple')

>>> print(keyword_processor.extract_keywords('I love Big Apple/Bay Area.'))

>>> # ['Big Apple']

>>> keyword_processor.add_non_word_boundary('/')

>>> print(keyword_processor.extract_keywords('I love Big Apple/Bay Area.'))

>>> # []