20210611 Code implementation of word2vec

2024-04-11 13:19:19

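This post builds word vectors with a third-party package (gensim). Word2Vec is a word embedding method: for each word, it computes a distributed representation (usually just called a word vector) from the contexts in which the word appears in a given corpus. To some extent, these word vectors capture the semantics of each word.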

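1 Basic usage
1-1 Reading the corpus
There are 3 ways to do this.
1 Keep the corpus in memory, in the form [[word1,word2,word3...],[word1,word2,word3...],...], where each inner list is one tokenized document.
2 Via LineSentence
class gensim.models.word2vec.LineSentence(source, max_sentence_length=10000, limit=None)
source is the path of a readable file in which each line is one document; the document must already be tokenized, with words separated by spaces. max_sentence_length is the maximum document length (longer lines are split into several chunks), and limit is the number of documents (i.e. lines) to read from the start of the file.
3 Via PathLineSentences
class gensim.models.word2vec.PathLineSentences(source, max_sentence_length=10000, limit=None)
This approach is used less often. It is similar to LineSentence, but source is a directory containing several readable files. Each file must follow the same format that LineSentence expects, and all files in the directory are processed.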
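To make the file format and the three parameters concrete, here is a minimal pure-Python sketch of a reader that behaves roughly like the description above. The class name SimpleLineSentence and the toy corpus file are hypothetical and exist only for illustration; this is not gensim's actual implementation.

```python
import itertools
import os
import tempfile


class SimpleLineSentence:
    """Hypothetical stand-in for LineSentence: one tokenized document per line."""

    def __init__(self, source, max_sentence_length=10000, limit=None):
        self.source = source
        self.max_sentence_length = max_sentence_length
        self.limit = limit

    def __iter__(self):
        # Reopen the file on every pass so the corpus can be iterated repeatedly.
        with open(self.source, "r", encoding="utf-8") as f:
            # limit=None reads every line; limit=k reads only the first k lines.
            for line in itertools.islice(f, self.limit):
                tokens = line.split()
                # Documents longer than max_sentence_length are cut into chunks.
                for i in range(0, len(tokens), self.max_sentence_length):
                    yield tokens[i:i + self.max_sentence_length]


# Write a tiny pre-tokenized corpus: one document per line, words separated by spaces.
corpus_path = os.path.join(tempfile.mkdtemp(), "toy_corpus.txt")
with open(corpus_path, "w", encoding="utf-8") as f:
    f.write("小米 手机 性价比 高\n苹果 手机 系统 流畅\n安卓 系统 开放\n")

first_two = list(SimpleLineSentence(corpus_path, limit=2))
chunked = list(SimpleLineSentence(corpus_path, max_sentence_length=3, limit=1))
print(first_two)
print(chunked)
```

With limit=2 only the first two lines are returned, and with max_sentence_length=3 the four-word first line comes back as two chunks.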

import jieba
from gensim.models import word2vec

1-1-2 In-memory approach

# Load the custom dictionary
jieba.load_userdict("MobilePhone_Userdict.txt")
# Read the stop words into the stopwords set (a set keeps membership tests fast)
filepath = r'stopwords.txt'
with open(filepath, 'r', encoding='utf-8') as f:
    stopwords = {line.strip() for line in f}

# Read a file and tokenize each line (one document per line)
def readfile2wordlist(file_path):
    cut_word_list = []
    # utf-8-sig strips a leading BOM if the file has one
    with open(file_path, 'r', encoding="utf-8-sig") as f:
        for line in f:
            line = line.strip()
            seg_list = jieba.cut(line)
            # Drop stop words and whitespace-only tokens
            seg_list = [i for i in seg_list if i not in stopwords and i.strip()]
            cut_word_list.append(seg_list)
    return cut_word_list

# Corpus that has not been tokenized yet
file_path = 'mb.txt'
sentences = readfile2wordlist(file_path)
print(sentences[:10])
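The post does not show how a pre-tokenized file for LineSentence is produced. One possible way, sketched here under that assumption, is to write the in-memory token lists back to disk with one document per line and tokens joined by spaces. The function name write_line_sentence_file, the toy data, and the output file name are all hypothetical.

```python
import os
import tempfile


def write_line_sentence_file(tokenized_docs, out_path):
    """Write one document per line, tokens joined by single spaces."""
    with open(out_path, "w", encoding="utf-8") as f:
        for tokens in tokenized_docs:
            # Skip documents that became empty after stop-word removal.
            if tokens:
                f.write(" ".join(tokens) + "\n")


# Toy stand-in for the list returned by a tokenizing function.
toy_sentences = [["小米", "手机", "续航", "不错"], [], ["苹果", "系统", "流畅"]]
out_path = os.path.join(tempfile.mkdtemp(), "toy_train.txt")
write_line_sentence_file(toy_sentences, out_path)

with open(out_path, "r", encoding="utf-8") as f:
    written_lines = f.read().splitlines()
print(written_lines)
```

Writing the tokenized corpus once and streaming it later avoids re-running the tokenizer and keeps large corpora out of memory.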

1-1-3 File-based approach

file_path = 'mb_train.txt'
# Read the corpus with LineSentence
sentences = word2vec.LineSentence(file_path, max_sentence_length=10000, limit=None)
print(sentences)

-->
<gensim.models.word2vec.LineSentence object at 0x0000021AEDAD47C0>
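To see the actual contents of sentences, iterate over it with a for loop; it behaves somewhat like a generator.

for document in sentences:
    print(document) # the result is the same as with the in-memory approach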
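Although it feels like a generator, the corpus object has to survive more than one pass: training scans the corpus once to build the vocabulary and then again for each training epoch. A plain generator is exhausted after the first pass, whereas an object with its own `__iter__` starts over each time. The sketch below illustrates the difference with toy data; the names sentence_generator and RestartableCorpus are hypothetical.

```python
def sentence_generator(docs):
    """A generator: can be consumed only once."""
    for doc in docs:
        yield doc


class RestartableCorpus:
    """Hypothetical iterable whose __iter__ starts a fresh pass each time."""

    def __init__(self, docs):
        self.docs = docs

    def __iter__(self):
        for doc in self.docs:
            yield doc


toy_docs = [["小米", "手机"], ["苹果", "手机"]]

gen = sentence_generator(toy_docs)
gen_first_pass = list(gen)
gen_second_pass = list(gen)  # the generator is already exhausted

corpus = RestartableCorpus(toy_docs)
corpus_first_pass = list(corpus)
corpus_second_pass = list(corpus)  # a new iterator is created on each pass

print(len(gen_first_pass), len(gen_second_pass))
print(len(corpus_first_pass), len(corpus_second_pass))
```

The second pass over the generator yields nothing, which is why a restartable iterable (or a plain list) should be passed to the model rather than a one-shot generator.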

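2-1 Training word2vec semantic vectors
# For training, use the class Word2Vec (capital W and V), not the module name word2vec
# Note: the signature below is from gensim 3.x; gensim 4.0+ renames size to vector_size and iter to epochs
# class gensim.models.word2vec.Word2Vec(sentences=None, size=100, alpha=0.025, window=5, min_count=5,
# max_vocab_size=None, sample=1e-3, seed=1, workers=3, min_alpha=0.0001,
# sg=0, hs=0, negative=5, ns_exponent=0.75, cbow_mean=1, hashfxn=hash, iter=5, null_word=0,
# trim_rule=None, sorted_vocab=1, batch_words=MAX_WORDS_IN_BATCH, compute_loss=False, callbacks=(),
# max_final_vocab=None)
# sentences (iterable of iterables): the input corpus, in the same form as the one built above
# sg (int {1, 0}): the training algorithm. 1 selects skip-gram; otherwise CBOW is used.
# hs: whether to use hierarchical softmax. 1 enables it, 0 disables it.
# size (int): dimensionality of the word vectors.
# window (int): maximum distance between the current word and the predicted word within a sentence.
# min_count (int): ignore all words whose total frequency is lower than this value.
# alpha: the initial learning rate.

# Training is complete once this line has run. Here hierarchical softmax is enabled, every word is kept
# (min_count=1), the context window is 3 words on each side, and the vectors have 200 dimensions.

model = word2vec.Word2Vec(sentences, hs=1, min_count=1, window=3, size=200)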
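To illustrate what min_count and window control, the following pure-Python sketch filters out rare words and then lists the (center word, context word) pairs that fall inside the window. It is a simplified illustration, not gensim's training code: real training typically also subsamples frequent words and may shrink the window randomly. The function names build_vocab and context_pairs and the toy documents are hypothetical.

```python
from collections import Counter


def build_vocab(docs, min_count):
    """Keep only words whose total frequency is at least min_count."""
    counts = Counter(word for doc in docs for word in doc)
    return {word for word, n in counts.items() if n >= min_count}


def context_pairs(docs, window, min_count):
    """Return (center, context) pairs at most `window` positions apart."""
    vocab = build_vocab(docs, min_count)
    pairs = []
    for doc in docs:
        # Rare words are dropped before the window is applied.
        kept = [w for w in doc if w in vocab]
        for i, center in enumerate(kept):
            lo = max(0, i - window)
            hi = min(len(kept), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, kept[j]))
    return pairs


toy_docs = [["小米", "手机", "拍照", "清晰"], ["苹果", "手机", "系统", "流畅"]]
vocab = build_vocab(toy_docs, min_count=2)
pairs = context_pairs(toy_docs, window=1, min_count=1)
print(vocab)
print(pairs[:4])
```

With min_count=2 only 手机 survives because it is the only word that occurs twice, and with window=1 each word is paired only with its immediate neighbours.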

2-2 Saving the model
# model.save(file_name) # file_name: name of the file the model is stored in

model.save('mb_word2vec.bin')

2-3 Loading the model
# word2vec.Word2Vec.load(file_name) # file_name: name of the stored model file

model = word2vec.Word2Vec.load('mb_word2vec.bin')

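2-4 Using the model

# Get the vocabulary (gensim 3.x attribute; gensim 4.0+ uses model.wv.index_to_key)
print(model.wv.index2word)
# Get the word2vec vector of a word (raises KeyError if the word is not in the vocabulary)
model.wv['Apple']
model.wv['sudo']

# Compute the semantic similarity of two words
# 安卓 = Android, 苹果 = Apple, 金立 = Gionee, 小米 = Xiaomi
print(model.wv.similarity('安卓', '苹果'))
print(model.wv.similarity('金立', '小米'))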
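The similarity score returned for two words is the cosine similarity of their vectors. As a minimal sketch of that computation, using made-up three-dimensional vectors rather than real model output (the function name cosine_similarity and the vectors are hypothetical):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors for illustration only.
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]  # same direction as vec_a, different length
vec_c = [0.0, 0.0, 3.0]  # orthogonal to vec_a

sim_same_direction = cosine_similarity(vec_a, vec_b)
sim_orthogonal = cosine_similarity(vec_a, vec_c)
print(sim_same_direction, sim_orthogonal)
```

Vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0, so the measure reflects direction in the embedding space rather than magnitude.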

Notes on parts of the code:
1. strip()
https://blog.51cto.com/u_15149862/2812172
2. print(sentences[:10])
https://blog.51cto.com/u_15149862/2704954

Notes on the theory:
What is Word2Vec?
https://blog.51cto.com/u_15149862/2897151

码农公寓
