I have three different vocab.txt files (GloVe, tencent.ai, fastText).
Goal: use each vocab.txt to initialize its own jieba object, all within one Python file.
Method: if I define three separate jieba objects, each one should get its own cache file. That raises the question of how to pass a different cache file path to each object. In
/home/user/anaconda3/envs/py36/lib/python3.6/site-packages/jieba/__init__.py, change the parameters of Tokenizer.__init__() so that it accepts tmp_dir:
    class Tokenizer(object):

        def __init__(self, tmp_dir=None, dictionary=DEFAULT_DICT):
            self.lock = threading.RLock()
            if dictionary == DEFAULT_DICT:
                self.dictionary = dictionary
            else:
                self.dictionary = _get_abs_path(dictionary)
            self.FREQ = {}
            self.total = 0
            self.user_word_tag_tab = {}
            self.initialized = False
            self.tmp_dir = tmp_dir
            self.cache_file = None
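If editing the installed package is undesirable, the same effect may be achievable without the patch: Tokenizer keeps tmp_dir as an ordinary attribute and reads it when the prefix dict is first built, so it can be set right after construction. The following is a minimal sketch under that assumption. The names make_tokenizer and tokenizer_cls are hypothetical helpers introduced here, not part of jieba, and the cache directory is assumed to exist already.

```python
def make_tokenizer(cache_dir, vocab_path, tokenizer_cls=None):
    """Build a tokenizer whose cache file is written under cache_dir.

    Hypothetical helper: tokenizer_cls defaults to jieba.Tokenizer and is
    only a parameter so that a stand-in class can be used for testing.
    """
    if tokenizer_cls is None:
        # Imported lazily so the helper itself does not require jieba.
        from jieba import Tokenizer as tokenizer_cls
    tok = tokenizer_cls()
    # Set tmp_dir before anything triggers initialization;
    # load_userdict() initializes the tokenizer, so the order matters.
    tok.tmp_dir = cache_dir
    tok.load_userdict(vocab_path)
    return tok
```

With an unpatched jieba, a call such as make_tokenizer('glove.model', 'glove.model/vocab.txt') would then play the role of the patched constructor, one call per vocabulary.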
Result:
    import os

    from jieba import Tokenizer


    class Jieba(object):
        """Wrap a jieba Tokenizer that has its own cache directory and user dictionary."""

        def __init__(self, vocab_path, model_path):
            super(Jieba, self).__init__()
            # tmp_dir is accepted only by the patched Tokenizer.__init__ shown earlier.
            self.jieba = Tokenizer(tmp_dir=os.path.join("/home/user/models/serving_embedding_torch/model_path/torch/data", model_path))
            self.jieba.load_userdict(vocab_path)

        def seg(self, text):
            # Precise mode (cut_all=False); print the tokens as a list.
            print(list(self.jieba.cut(text, cut_all=False)))


    a = Jieba('glove.model/vocab.txt', 'glove.model')
    b = Jieba('tencent.model/vocab.txt', 'tencent.model')
    c = Jieba('fb.model/vocab.txt', 'fb.model')
    text = "区块链是一个好方向海派青年公寓龙爪槐"
    a.seg(text)
    b.seg(text)
    c.seg(text)
    (py36) user@big-001:~/models/serving_embedding_torch/model_path/torch/data$ python3 peel.py
    Building prefix dict from the default dictionary ...
    2019-10-17 17:14:20,745 DEBUG: Building prefix dict from the default dictionary ...
    Dumping model to file cache /home/user/models/serving_embedding_torch/model_path/torch/data/glove.model/jieba.cache
    2019-10-17 17:14:21,575 DEBUG: Dumping model to file cache /home/user/models/serving_embedding_torch/model_path/torch/data/glove.model/jieba.cache
    Loading model cost 0.899 seconds.
    2019-10-17 17:14:21,644 DEBUG: Loading model cost 0.899 seconds.
    Prefix dict has been built succesfully.
    2019-10-17 17:14:21,644 DEBUG: Prefix dict has been built succesfully.
    Building prefix dict from the default dictionary ...
    2019-10-17 17:14:26,352 DEBUG: Building prefix dict from the default dictionary ...
    Dumping model to file cache /home/user/models/serving_embedding_torch/model_path/torch/data/tencent.model/jieba.cache
    2019-10-17 17:14:27,101 DEBUG: Dumping model to file cache /home/user/models/serving_embedding_torch/model_path/torch/data/tencent.model/jieba.cache
    Loading model cost 0.805 seconds.
    2019-10-17 17:14:27,158 DEBUG: Loading model cost 0.805 seconds.
    Prefix dict has been built succesfully.
    2019-10-17 17:14:27,159 DEBUG: Prefix dict has been built succesfully.
    Building prefix dict from the default dictionary ...
    2019-10-17 17:18:41,279 DEBUG: Building prefix dict from the default dictionary ...
    Dumping model to file cache /home/user/models/serving_embedding_torch/model_path/torch/data/fb.model/jieba.cache
    2019-10-17 17:18:42,045 DEBUG: Dumping model to file cache /home/user/models/serving_embedding_torch/model_path/torch/data/fb.model/jieba.cache
    Loading model cost 0.822 seconds.
    2019-10-17 17:18:42,101 DEBUG: Loading model cost 0.822 seconds.
    Prefix dict has been built succesfully.
    2019-10-17 17:18:42,102 DEBUG: Prefix dict has been built succesfully.
    ['区块', '链是', '一个', '好', '方向', '海派', '青年', '公寓', '龙爪槐']
    ['区块链', '是', '一个', '好方向', '海派青年公寓', '龙爪槐']
    ['区块链', '是', '一个', '好', '方向', '海派', '青年', '公寓', '龙爪槐']
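To see at a glance where the three vocabularies disagree, the three segmentations printed above can be compared token by token. This is only a small sketch: the lists are copied from the run above in the order a, b, c, and only_in is a hypothetical helper name.

```python
text = "区块链是一个好方向海派青年公寓龙爪槐"

# Segmentations copied from the run above, keyed by model directory prefix.
results = {
    "glove": ['区块', '链是', '一个', '好', '方向', '海派', '青年', '公寓', '龙爪槐'],
    "tencent": ['区块链', '是', '一个', '好方向', '海派青年公寓', '龙爪槐'],
    "fb": ['区块链', '是', '一个', '好', '方向', '海派', '青年', '公寓', '龙爪槐'],
}


def only_in(tokens, other):
    """Return the tokens (in order) that do not occur in the other segmentation."""
    seen = set(other)
    return [t for t in tokens if t not in seen]


# Sanity check: every segmentation must reassemble into the original text.
for name, tokens in results.items():
    assert "".join(tokens) == text, name

# Tokens that the tencent vocabulary produces but the others do not.
tencent_vs_glove = only_in(results["tencent"], results["glove"])
tencent_vs_fb = only_in(results["tencent"], results["fb"])
print(tencent_vs_glove)
print(tencent_vs_fb)
```

The differences are the longer words ('区块链', '好方向', '海派青年公寓'), which is consistent with each object segmenting according to its own user dictionary.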