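A quick preprocessing pass over Tencent's 8-million-word embeddings (https://ai.tencent.com/ailab/nlp/zh/embedding.html)
The file is too large to open directly, so I read the first ten lines with a short Python script to see how the word vectors are stored.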
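if __name__ == '__main__':
    # Raw string, so the backslashes in the Windows path are not read as escapes.
    filename = r"D:\Dataforwork\Tencent_AILab_ChineseEmbedding\Tencent_AILab_ChineseEmbedding.txt"
    with open(filename, 'r', encoding='utf-8') as f:
        # Read only the first ten lines instead of loading the whole file.
        for i in range(10):
            line = f.readline()
            print(line)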
This prints:
8824330 200
</s> 0.002001 0.002210 -0.001915 -0.001639 0.000683 0.001511 0.000470 0.000106 -0.001802 0.001109 -0.002178 0.000625 -0.000376 -0.000479 -0.001658 -0.000941 0.001290 0.001513 0.001485 0.000799 0.000772 -0.001901 -0.002048 0.002485 0.001901 0.001545 -0.000302 0.002008 -0.000247 0.000367 -0.000075 -0.001492 0.000656 -0.000669 -0.001913 0.002377 0.002190 -0.000548 -0.000113 0.000255 -0.001819 -0.002004 0.002277 0.000032 -0.001291 -0.001521 -0.001538 0.000848 0.000101 0.000666 -0.002107 -0.001904 -0.000065 0.000572 0.001275 -0.001585 0.002040 0.000463 0.000560 -0.000304 0.001493 -0.001144 -0.001049 0.001079 -0.000377 0.000515 0.000902 -0.002044 -0.000992 0.001457 0.002116 0.001966 -0.001523 -0.001054 -0.000455 0.001001 -0.001894 0.001499 0.001394 -0.000799 -0.000776 -0.001119 0.002114 0.001956 -0.000590 0.002107 0.002410 0.000908 0.002491 -0.001556 -0.000766 -0.001054 -0.001454 0.001407 0.000790 0.000212 -0.001097 0.000762 0.001530 0.000097 0.001140 -0.002476 0.002157 0.000240 -0.000916 -0.001042 -0.000374 -0.001468 -0.002185 -0.001419 0.002139 -0.000885 -0.001340 0.001159 -0.000852 0.002378 -0.000802 -0.002294 0.001358 -0.000037 -0.001744 0.000488 0.000721 -0.000241 0.000912 -0.001979 0.000441 0.000908 -0.001505 0.000071 -0.000030 -0.001200 -0.001416 -0.002347 0.000011 0.000076 0.000005 -0.001967 -0.002481 -0.002373 -0.002163 -0.000274 0.000696 0.000592 -0.001591 0.002499 -0.001006 -0.000637 -0.000702 0.002366 -0.001882 0.000581 -0.000668 0.001594 0.000020 0.002135 -0.001410 -0.001303 -0.002096 -0.001833 -0.001600 -0.001557 0.001222 -0.000933 0.001340 0.001845 0.000678 0.001475 0.001238 0.001170 -0.001775 -0.001717 -0.001828 -0.000066 0.002065 -0.001368 -0.001530 -0.002098 0.001653 -0.002089 -0.000290 0.001089 -0.002309 -0.002239 0.000721 0.001762 0.002132 0.001073 0.001581 -0.001564 -0.001820 0.001987 -0.001382 0.000877 0.000287 0.000895 -0.000591 0.000099 -0.000843 -0.000563
的 0.209092 -0.165459 -0.058054 0.281176 0.102982 0.099868 0.047287 0.113531 0.202805 0.240482 0.026028 0.073504 0.010873 0.010201 -0.056060 -0.063864 -0.025928 -0.158832 -0.019444 -0.144610 -0.124821 0.000499 -0.050971 0.113983 0.088150 0.080318 -0.145976 0.093325 0.139695 -0.082682 -0.034356 0.061241 -0.090153 0.053166 -0.171991 -0.187834 0.115600 0.219545 -0.200234 -0.106904 0.033836 0.005707 0.484198 0.147382 -0.165274 0.094883 -0.202281 -0.638371 -0.127920 -0.212338 -0.250738 -0.022411 -0.315008 0.169237 -0.002799 0.019125 0.017462 0.028013 0.195060 0.036385 -0.051681 0.154037 0.214785 -0.179985 -0.020429 -0.044819 -0.074923 0.105441 -0.081715 -0.034099 -0.096518 -0.004290 0.095423 0.234515 -0.138332 0.134917 0.082070 0.051714 0.159327 0.061818 0.037091 0.239265 0.073274 0.170960 0.223636 -0.187691 -0.206850 -0.051000 -0.269477 -0.116970 0.213069 -0.096122 0.035362 -0.254648 0.021978 0.071687 0.109870 -0.104643 -0.175653 0.097061 -0.068692 0.196374 0.007704 0.072367 -0.275905 0.217282 -0.056664 -0.321484 -0.004813 -0.041167 -0.118400 -0.159937 0.065294 -0.092538 0.013975 -0.219047 -0.058431 -0.177256 -0.043169 -0.151647 -0.006049 -0.279595 -0.005488 0.096733 0.147219 0.197677 -0.088133 0.053465 0.038738 0.059665 -0.132819 0.019606 0.224926 -0.176136 -0.411968 -0.044071 -0.120198 -0.107929 -0.001640 0.036719 -0.243131 -0.273457 -0.317418 -0.079236 0.054842 -0.143945 0.168189 -0.013057 -0.145664 0.135278 0.029447 -0.141014 -0.183899 -0.080112 -0.113538 0.071163 0.134968 0.141939 0.144405 -0.249114 0.454654 -0.077072 -0.001521 0.298252 0.160275 0.085942 -0.213363 0.083022 -0.000400 0.134826 -0.000681 -0.017328 -0.026751 0.111903 0.010307 -0.124723 0.031472 0.081697 0.071449 0.011486 -0.091571 -0.039319 -0.112756 0.171106 0.026869 -0.077058 -0.052948 0.252645 -0.035071 0.040870 0.277828 0.085193 0.006959 -0.048913 0.279133 0.169515 0.068156 -0.278624 -0.173408 0.035439
。 0.128825 -0.267995 0.000795 0.263639 0.097538 0.010807 -0.028897 0.117577 0.156264 0.188688 0.039535 0.003052 -0.006556 -0.000542 -0.029001 -0.077706 -0.032283 -0.143411 -0.077546 -0.190037 -0.133469 -0.022468 -0.072961 0.144973 0.098850 0.084037 -0.190143 0.088755 0.138181 -0.105819 0.017681 0.097694 -0.041753 0.067210 -0.168004 -0.220518 0.109195 0.213823 -0.257998 -0.077525 0.020666 0.093258 0.479364 0.118419 -0.075531 0.027833 -0.218220 -0.606599 -0.070421 -0.199476 -0.233392 0.043417 -0.329303 0.209230 0.076914 -0.104362 0.062803 0.135248 0.140913 0.004518 0.025996 0.098697 0.151008 -0.097791 -0.107083 0.031974 -0.128378 0.083871 -0.084132 -0.081561 -0.130019 -0.000076 0.044137 0.141331 -0.015837 0.040789 0.059898 0.090438 0.109648 0.057426 0.018093 0.301312 0.052632 0.235123 0.242908 -0.095973 -0.095958 0.030708 -0.384714 -0.164777 0.173542 -0.068162 0.008059 -0.271924 0.051049 0.104490 0.010147 -0.105521 -0.127305 0.166114 -0.088172 0.226118 0.063335 0.109702 -0.240658 0.222202 0.038648 -0.355815 -0.015869 -0.118660 -0.072290 -0.113268 0.022802 -0.181717 0.042459 -0.112042 -0.146667 -0.162785 -0.164891 -0.124024 0.114017 -0.279552 0.043898 0.119786 0.074259 0.180340 -0.151963 0.054344 0.062367 0.081112 -0.201022 0.007162 0.285948 -0.080341 -0.373533 0.010599 -0.068789 -0.048280 0.024877 -0.023046 -0.260779 -0.348213 -0.316790 0.094536 0.131866 -0.186581 0.250213 0.002131 -0.133212 0.041250 -0.022561 -0.120896 -0.115173 -0.136738 -0.106566 -0.004902 0.111645 0.065397 0.250295 -0.107772 0.364187 0.091516 0.020277 0.247552 0.125857 0.037309 -0.204360 0.228403 0.029138 0.095838 -0.000847 0.018618 -0.083734 0.204566 -0.044542 -0.150615 -0.023076 -0.048712 0.146612 0.015543 -0.081624 0.049204 -0.175686 0.155870 0.020196 -0.102833 -0.046915 0.256491 -0.039505 0.110278 0.273165 0.062938 0.050126 -0.091656 0.221263 0.245667 0.096496 -0.286896 -0.273779 0.031194
, -0.098885 -0.335186 -0.092543 0.145954 0.062586 -0.031792 0.002963 0.095666 0.136642 0.228925 0.290372 0.031822 0.079649 0.004957 0.186027 -0.120654 -0.085876 -0.038904 -0.188163 -0.049018 -0.031889 -0.076527 -0.044837 0.179972 0.280003 -0.025408 -0.129556 0.204244 0.135897 0.040038 0.066776 0.229703 -0.005452 0.053202 -0.078578 -0.360414 0.217936 -0.122313 -0.097521 -0.371549 -0.126917 0.004912 0.572844 0.188218 0.075694 -0.068327 -0.113156 -0.520387 -0.015882 -0.063246 -0.297707 0.082145 -0.346270 0.305896 0.284688 -0.153902 0.063972 0.165861 0.052901 -0.124356 -0.022029 0.037506 -0.072065 -0.150246 -0.155110 -0.048658 -0.146387 0.076884 -0.191979 0.057570 -0.046671 0.100186 0.082296 0.329750 0.002845 0.143715 0.133200 0.173783 -0.008137 -0.027748 0.037880 0.351362 0.104906 0.196654 0.344083 -0.166356 0.093684 0.067884 -0.388054 -0.158803 0.189325 0.120307 -0.104332 -0.338101 0.059107 0.086494 -0.146971 -0.134498 -0.158104 0.193087 -0.149227 0.301898 0.039827 0.119830 -0.138713 0.048479 0.366490 -0.550691 0.030574 -0.211866 -0.015142 -0.217475 -0.082156 -0.169474 -0.026249 -0.139175 -0.068187 -0.220310 -0.305426 -0.120390 0.139736 -0.337384 0.066118 0.054236 0.126146 0.276481 -0.220184 0.035463 0.144391 0.097852 -0.075931 0.104917 0.396221 -0.025875 -0.320798 -0.042056 0.008985 -0.151966 -0.155902 -0.251964 -0.229793 -0.382696 -0.311796 0.126690 0.180148 -0.149632 0.153048 0.037665 -0.118928 0.090724 -0.031901 -0.138475 -0.087738 -0.117940 -0.139657 -0.026110 -0.047370 -0.088814 0.213824 0.125929 0.400300 0.213129 -0.108517 0.283643 0.068651 -0.031996 -0.250347 0.226049 0.175005 -0.007276 0.006655 0.053348 0.013796 0.198805 -0.052261 0.014431 -0.013838 -0.162294 0.111247 0.078699 0.000247 0.017076 -0.150237 0.071130 0.070735 0.024919 -0.139731 0.063105 -0.061940 0.165181 0.221614 0.173825 0.088357 -0.119238 0.135927 0.480134 0.144222 -0.322631 -0.299947 0.179642
了 0.257969 -0.288066 0.105582 0.281781 0.262464 0.024700 0.201691 0.005893 0.189025 0.123702 0.000702 -0.024353 0.074768 -0.120963 0.053227 -0.115575 -0.004054 -0.215095 -0.157094 -0.259438 0.062279 0.142544 -0.057884 0.149006 0.142659 -0.055425 -0.023676 0.048273 0.274754 -0.119763 -0.068765 0.031547 -0.045737 0.171814 -0.255463 -0.086406 0.171091 0.073908 -0.288000 0.051408 0.014997 0.005331 0.401295 0.095939 -0.045079 0.089809 -0.274839 -0.682658 -0.140976 -0.104340 -0.260310 0.023253 -0.258755 0.066476 -0.121786 0.001883 0.101805 0.168083 0.184144 0.069670 0.060302 0.190894 0.035508 0.005831 -0.160862 -0.194421 0.027660 -0.004459 -0.094092 -0.127693 -0.151391 0.004516 0.176035 0.156662 -0.035896 0.040939 0.056237 0.105161 0.208659 -0.029048 0.050157 0.265578 0.121959 0.337101 0.153380 -0.229313 -0.225776 0.144097 -0.110946 -0.118999 0.063307 0.043647 0.038288 -0.268226 -0.166147 0.131352 0.087143 -0.138708 -0.125388 0.124644 -0.029785 0.235381 -0.140240 -0.022642 -0.285102 0.207541 -0.066322 -0.421221 -0.050854 -0.180806 0.063572 -0.221671 0.144535 -0.106335 0.039128 -0.323642 0.136248 -0.268427 -0.005231 0.022319 0.100253 -0.346457 -0.058811 -0.041238 0.045793 0.122920 -0.153493 0.146689 0.018370 0.121537 -0.134108 -0.041995 0.198174 -0.130834 -0.339254 0.032741 -0.095624 0.119803 0.061630 0.166394 -0.330792 -0.224072 -0.270991 0.125234 -0.057835 -0.233979 0.228011 0.050660 -0.160570 0.180151 -0.083350 -0.156653 -0.135692 -0.145908 -0.096677 -0.066022 0.088564 -0.045435 0.054121 -0.357914 0.448524 0.141443 -0.094841 0.212140 0.053208 0.114341 -0.139839 0.239768 0.095185 0.049941 -0.141136 -0.093853 -0.107702 0.065235 -0.007106 -0.292004 0.048486 -0.063324 0.033387 0.068469 -0.210287 -0.039360 -0.273291 0.143838 -0.105371 -0.155323 -0.020802 0.499330 -0.100334 -0.001517 0.252245 0.093027 0.057058 0.062758 0.277739 0.240228 0.184851 -0.348845 -0.224438 0.065580
、 -0.232738 -0.288685 -0.112967 0.243445 0.063453 -0.011561 -0.008042 0.278480 0.095796 0.181892 -0.057935 0.079951 0.125428 0.016825 -0.255156 -0.064579 0.250334 -0.001956 0.027310 -0.044554 -0.371264 0.083583 -0.022147 -0.058193 0.087624 0.092437 -0.065254 0.195590 0.049069 -0.050154 0.115945 0.059864 0.195945 0.415346 -0.115533 -0.351305 0.211141 0.419050 -0.011927 -0.133281 0.154127 0.120905 0.361275 0.238737 0.020675 0.248509 -0.457214 -0.695160 0.073428 -0.235259 -0.171847 -0.067426 -0.322873 0.045355 0.003495 -0.035630 0.083483 0.139785 0.396549 -0.317331 0.075060 -0.115931 0.259889 -0.215689 -0.269958 -0.119417 -0.129016 -0.055042 -0.134885 0.205673 0.098041 0.017878 0.349834 0.326317 -0.190929 -0.027716 0.186459 0.112990 -0.105028 -0.338298 -0.033512 0.362206 0.131871 0.267529 0.199675 -0.083040 0.178176 -0.017078 -0.048815 -0.264116 0.308941 -0.101069 -0.033861 -0.300839 -0.038865 0.123880 0.077158 -0.102373 -0.234012 0.206596 -0.271171 0.355674 0.226836 0.271766 -0.140208 0.333076 0.049434 -0.232454 -0.198579 -0.282786 0.146072 -0.181280 0.052588 -0.238193 -0.264306 -0.197291 -0.121021 -0.019134 -0.134523 0.124645 0.063116 -0.529543 0.276034 0.221463 0.129541 0.249353 -0.156155 0.058278 0.134928 -0.089215 -0.135324 0.178495 0.467898 -0.105199 -0.421901 0.065678 0.164576 -0.115533 -0.122835 -0.313874 -0.112314 -0.434833 -0.407134 -0.158788 -0.045447 0.226856 0.589111 -0.189425 -0.186958 -0.056326 0.001208 -0.191341 -0.082234 0.077402 -0.200340 0.196309 -0.052946 -0.028392 0.425342 0.082913 0.513461 -0.038077 -0.002634 0.191342 0.266861 0.092981 -0.107972 0.214609 -0.247476 -0.031515 0.036883 0.108790 -0.013948 -0.064460 -0.093925 -0.234909 -0.289582 -0.107822 0.319905 -0.051341 -0.303210 0.078185 0.073010 0.140941 0.177822 0.158210 0.134216 0.113107 0.010003 -0.131399 0.231865 0.280990 0.148913 -0.318607 -0.000751 0.380837 -0.014786 0.199341 -0.243304 0.048623
“ 0.122462 -0.236367 0.083894 0.054253 0.076510 0.101018 -0.157237 0.032625 0.010705 0.285742 0.108899 -0.089643 0.093323 -0.072528 0.173572 0.032265 -0.144413 -0.382213 -0.127190 -0.104082 -0.112731 -0.019033 -0.153905 0.057671 0.141809 0.141354 -0.099416 0.046555 0.015345 -0.005729 -0.086419 0.047950 0.022829 0.175444 -0.244222 -0.108688 0.008861 0.206845 -0.368806 -0.191608 0.015095 0.131515 0.356959 0.152526 -0.127977 0.035397 -0.333871 -0.444046 -0.084535 -0.072250 -0.184907 0.261136 -0.224213 0.023162 0.042897 0.007341 0.218563 0.251269 0.068582 0.004821 0.049429 0.211055 0.167730 -0.103785 0.165069 0.176739 0.032481 -0.120746 -0.069707 -0.136771 -0.080435 0.045660 0.316021 0.011224 0.111150 0.097139 -0.042860 0.215534 0.075728 0.158713 0.133570 0.407183 0.081891 0.175441 -0.023760 -0.047137 -0.011409 0.098810 -0.290375 -0.232262 0.156042 -0.249738 0.054919 -0.149459 0.122781 -0.075528 0.044226 -0.070409 -0.180491 0.041084 -0.058707 0.266652 -0.023937 0.251588 -0.414041 0.582360 0.059408 -0.595803 -0.177864 -0.080663 0.086218 -0.240464 0.239131 0.034891 0.117690 -0.200251 -0.004631 -0.234003 -0.200123 0.003599 0.233423 -0.207720 -0.041890 0.002190 -0.123027 -0.018789 -0.022032 -0.012981 0.060304 0.168908 -0.179670 -0.056837 0.255375 -0.233489 -0.502085 -0.165754 0.002228 -0.057791 0.026687 -0.165600 -0.197780 -0.270278 -0.211250 0.030570 0.123118 -0.271800 0.122304 0.059079 -0.165242 0.095742 -0.247362 -0.288905 -0.059444 -0.239531 -0.227575 -0.004750 0.154330 0.027113 0.078265 -0.248037 0.560899 -0.039299 0.147028 0.121306 0.100932 0.055226 -0.186161 0.149411 0.092626 0.060446 0.175370 -0.078933 -0.060286 0.281467 0.068393 -0.075060 -0.251365 -0.156435 0.082520 0.124177 -0.067274 0.136409 -0.135146 0.263874 0.070770 0.162158 -0.077261 0.308476 -0.091539 0.126403 0.067644 0.045887 0.147458 -0.117867 0.019049 0.425965 -0.014370 -0.253605 -0.113628 0.106659
” 0.108815 -0.241997 0.082369 0.063541 0.054383 0.123942 -0.167157 0.021991 0.031640 0.262305 0.131012 -0.083021 0.101425 -0.081870 0.159832 0.031488 -0.146121 -0.333952 -0.129508 -0.088991 -0.127734 -0.016296 -0.123858 0.061714 0.168111 0.115000 -0.077976 0.069735 0.043334 0.015265 -0.077503 0.059783 0.031921 0.138752 -0.256197 -0.167446 0.026034 0.181538 -0.369619 -0.187011 0.021129 0.115088 0.349497 0.162885 -0.120980 0.045805 -0.316720 -0.443893 -0.106555 -0.111270 -0.191840 0.228302 -0.257261 -0.012916 0.041418 -0.002460 0.224961 0.247112 0.084027 -0.009874 0.085379 0.192162 0.158371 -0.088699 0.148566 0.190253 0.024584 -0.098989 -0.072661 -0.133928 -0.065171 0.092618 0.285007 0.005544 0.104271 0.101575 -0.028191 0.199444 0.093800 0.179894 0.099258 0.411289 0.070870 0.171074 -0.009207 -0.035159 0.007757 0.116820 -0.299558 -0.212453 0.163653 -0.264225 0.037925 -0.155702 0.136437 -0.081984 0.046029 -0.056247 -0.187565 0.011201 -0.033462 0.267564 -0.028860 0.261291 -0.414885 0.583737 0.028150 -0.585286 -0.182570 -0.070103 0.096262 -0.212538 0.215521 -0.006936 0.079491 -0.194930 -0.012284 -0.200511 -0.196653 -0.007712 0.216313 -0.186542 -0.058861 -0.012272 -0.109852 0.005773 -0.020779 -0.012360 0.084804 0.136474 -0.159033 -0.031105 0.239932 -0.207597 -0.489428 -0.131963 -0.018079 -0.076020 0.030093 -0.139627 -0.232451 -0.261891 -0.218330 0.063110 0.140156 -0.264597 0.113596 0.087695 -0.177811 0.114217 -0.241231 -0.294800 -0.062543 -0.245346 -0.239901 -0.009816 0.146287 -0.003436 0.116481 -0.240626 0.532931 -0.006128 0.155324 0.132983 0.096058 0.062446 -0.207618 0.135445 0.088684 0.039472 0.151041 -0.089525 -0.082138 0.249260 0.088288 -0.091733 -0.241065 -0.175282 0.072563 0.103602 -0.051329 0.105472 -0.143385 0.267700 0.075299 0.177726 -0.072988 0.288345 -0.078596 0.132786 0.073861 0.051939 0.150097 -0.122646 0.033182 0.424073 0.021534 -0.266058 -0.147493 0.129246
是 0.088422 -0.220535 0.042321 0.280248 0.158567 0.022675 0.104318 0.164016 0.175014 0.483962 0.128477 0.111126 0.031500 -0.000157 -0.097737 0.247697 -0.179488 -0.240289 -0.136103 -0.168298 -0.147298 0.021108 0.120252 0.076418 0.274915 0.074118 -0.109737 0.132861 0.303832 0.025579 -0.001565 0.046626 -0.174546 0.069820 -0.233192 -0.319214 0.102713 0.376593 -0.278854 -0.190161 0.064110 0.152048 0.481907 0.112520 -0.237646 0.036225 -0.054809 -0.675327 -0.004792 -0.201795 -0.314029 -0.002380 -0.308852 0.154706 -0.015906 -0.025239 -0.115753 0.012128 0.177280 0.193762 0.064173 0.136234 0.119523 -0.144031 0.001965 -0.019349 -0.090950 0.130274 -0.212409 -0.300901 -0.271812 0.048882 0.046211 0.264830 -0.007732 0.212248 0.170197 -0.036245 0.126349 0.037053 0.014784 0.282051 0.090078 0.191075 0.188490 -0.240984 0.005043 -0.012079 -0.339504 -0.255159 0.253130 -0.122627 0.070942 -0.228254 0.148199 0.088679 -0.028714 -0.072517 -0.252825 0.140066 0.037613 0.168659 0.053629 0.064994 -0.262946 0.252412 -0.095863 -0.409000 0.020319 -0.120344 -0.248578 -0.148610 0.120841 -0.206529 0.099486 -0.144703 0.096828 -0.118727 -0.185494 -0.068001 -0.054437 -0.155130 -0.012420 0.233209 0.015379 0.168919 -0.135505 0.030401 -0.111819 -0.008094 -0.198661 -0.005637 0.229044 -0.131276 -0.294233 0.013832 -0.035503 -0.125856 -0.158669 -0.120601 -0.322779 -0.117246 -0.162126 -0.140845 0.118956 -0.244393 0.266027 -0.039224 -0.294961 0.162399 -0.046758 -0.109945 -0.273991 -0.052170 -0.081680 0.060395 0.223162 0.063811 0.086414 -0.239465 0.311980 -0.043291 -0.100374 0.136088 0.180767 -0.064107 -0.222279 0.027649 -0.062463 0.141550 -0.009524 -0.122633 0.078290 0.118104 0.068727 -0.020049 -0.064007 0.141176 0.123716 -0.040491 -0.147908 -0.171939 0.018413 0.055367 0.191712 0.042501 -0.071932 0.301518 -0.111157 0.064090 0.038439 0.139372 0.075114 -0.148898 0.236133 0.344508 0.014930 -0.195834 -0.292445 0.097225
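The first line is a header: 8824330 entries with 200 dimensions each. The </s> on the first vector line is presumably an empty (end-of-sentence) token.
Clearly, some of these entries need to be filtered out.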
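The format seen in the output above (a header line followed by one token and its values per line) can be parsed as in the sketch below. The function names `parse_header` and `parse_row` are hypothetical, and splitting from the right is an assumption made here so that a token containing a space would not break the parse.

```python
def parse_header(line):
    """Parse the header line '<vocab_size> <dim>' into two integers."""
    vocab_size, dim = line.split()
    return int(vocab_size), int(dim)


def parse_row(line, dim):
    """Split one data line into (token, list of floats).

    Splitting from the right keeps the last `dim` fields as the vector,
    so whatever remains on the left is the token.
    """
    parts = line.rstrip().rsplit(" ", dim)
    if len(parts) != dim + 1:
        raise ValueError(f"expected a token plus {dim} values: {line[:40]!r}")
    return parts[0], [float(v) for v in parts[1:]]


vocab_size, dim = parse_header("8824330 200\n")
# A shortened three-value row, just to show the shape of the result.
token, vec = parse_row("</s> 0.002001 0.002210 -0.001915\n", 3)
```

Checking the number of values on every row against the header is a cheap way to catch a truncated or malformed line early.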
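r'^\d+$', # digits only
r'^(.)\1{2,}$', # the same character repeated three or more times
r'^([a-z])+$', # English letters only
r'^\d([a-z])+$', # a digit followed by letters
r'^([a-z])[a-z\-\.,;&:]+$', # English letters plus English punctuation
r'^.+[。、“”?!:()~\?\-……’‘)(/—【】].*$', # contains Chinese punctuation
r'^.+[\-\.,;&:\'!;>_<].*$', # contains English punctuation and is longer than one character
r'^[。、“”?!:()~\?\-……’‘))(/—【】]+$', # Chinese punctuation only
r'^[\-\.,;&:\'"!;>_<]+$', # English punctuation only
r'^[~!@#¥%……&*~「」=+]+$', # special characters
r'^[\s]+$', # whitespace only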
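One way such a pattern list could be applied is sketched below. Only a few of the patterns above are repeated here, and the names `FILTER_PATTERNS`, `COMPILED` and `should_drop` are hypothetical, not taken from the original script.

```python
import re

# A subset of the patterns listed above; the full list would go here.
FILTER_PATTERNS = [
    r'^\d+$',        # digits only
    r'^(.)\1{2,}$',  # the same character repeated three or more times
    r'^([a-z])+$',   # English letters only
    r'^[\s]+$',      # whitespace only
]

# Compile once, because the check runs for every entry in the file.
COMPILED = [re.compile(p) for p in FILTER_PATTERNS]


def should_drop(word):
    """Return True if the word matches any of the filter patterns."""
    return any(p.match(word) for p in COMPILED)


words = ["2019", "哈哈哈", "hello", "北京", "的"]
kept = [w for w in words if not should_drop(w)]
```

With this subset, only "北京" and "的" survive; the digit string, the repeated character and the all-letter word are dropped.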
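Next, I tried splitting each entry into its word and its vector.
# x is one line of the file, split on spaces
chars.append(x[0]) # the word
v = [float(i) for i in x[1:]] # the 200 vector components
vectors.append(v)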
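For context, the fragment above could sit inside a loop like the following. This is a sketch under assumptions: the function name `split_words_and_vectors` is hypothetical, and a small in-memory sample stands in for the real file.

```python
import io


def split_words_and_vectors(lines):
    """Collect words and vectors from an iterable of data lines."""
    chars, vectors = [], []
    for line in lines:
        x = line.rstrip().split(" ")
        if len(x) < 2:
            continue  # skip blank or malformed lines
        chars.append(x[0])
        vectors.append([float(i) for i in x[1:]])
    return chars, vectors


# Tiny stand-in for the real file: a header line plus two 3-dimensional rows.
sample = io.StringIO("2 3\n的 0.1 -0.2 0.3\n了 0.4 0.5 -0.6\n")
next(sample)  # skip the header line
chars, vectors = split_words_and_vectors(sample)
```

Iterating over the file object reads one line at a time, but the two lists still grow with every entry, so memory use is dominated by what is kept rather than by how the file is read.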
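It turned out that my little laptop could not hold this much data when everything was read in one go.
So, to get it to run on my machine, I did two things. First, keep only two-character words. Second, process the 8 million entries in four batches (the first full run froze the machine somewhere past 3 million entries, hence four).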
r'^[\u4E00-\u9FA5]{,1}$', # any single Chinese character
r'^[\u4E00-\u9FA5]{3,}$' # three or more Chinese characters
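The effect of these two length patterns on their own can be checked with a short sketch. The names `LENGTH_PATTERNS` and `is_filtered_by_length` are hypothetical.

```python
import re

# The two length patterns from above, compiled once.
LENGTH_PATTERNS = [
    re.compile(r'^[\u4E00-\u9FA5]{,1}$'),  # zero or one Chinese character
    re.compile(r'^[\u4E00-\u9FA5]{3,}$'),  # three or more Chinese characters
]


def is_filtered_by_length(word):
    """Return True for all-Chinese words that are not exactly two characters."""
    return any(p.match(word) for p in LENGTH_PATTERNS)


words = ["的", "北京", "腾讯公司", "AI"]
kept = [w for w in words if not is_filtered_by_length(w)]
```

Note that these two patterns only look at words made up entirely of Chinese characters, so a token such as "AI" passes through here and has to be handled by the other patterns.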
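That produced the word and vector files (I once opened an .npy file with the browser, and the file icon has looked like this ever since).
The batching is just a plain if/else. This is only a preprocessing step, and none of this code is used in the final search, so I did not worry about efficiency and kept everything as simple as possible.
Result: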
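The original batching code is not shown, so the following is only a guess at what a simple if/else split by line index might look like. The names `batch_bounds` and `iter_batch` are hypothetical, and equal-sized batches are an assumption.

```python
def batch_bounds(total, n_batches, k):
    """Return the half-open line range [start, end) for batch k (0-based)."""
    size = -(-total // n_batches)  # ceiling division
    start = k * size
    return start, min(start + size, total)


def iter_batch(lines, start, end):
    """Yield only the lines whose index falls in [start, end)."""
    for i, line in enumerate(lines):
        if i < start:
            continue  # before this batch: skip
        if i >= end:
            break     # past this batch: stop reading
        yield line


total = 8824330  # entry count taken from the file's header line
bounds = [batch_bounds(total, 4, k) for k in range(4)]
```

Each run would then pass one pair from `bounds` to `iter_batch` and save its own output file, so only a quarter of the entries is held in memory at a time.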