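1. Moses
- Moses is a phrase-based statistical machine translation system developed jointly by eight institutions, including the University of Edinburgh (UK) and RWTH Aachen University (Germany).
This post focuses on the tokenizer in mosesdecoder.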
GitHub repository: https://github.com/moses-smt/mosesdecoder
2. Installation and usage
2.1 Installation
Clone the GitHub repository directly:
git clone https://github.com/moses-smt/mosesdecoder.git
2.2 Using the tokenizer
Change to the directory that contains tokenizer.perl:
cd mosesdecoder/scripts/tokenizer/
tokenizer.perl accepts the following options:
Usage ./tokenizer.perl (-l [en|de|...]) (-threads 4) < textfile > tokenizedfile
Options:
-q ... quiet.
-a ... aggressive hyphen splitting.
-b ... disable Perl buffering.
-time ... enable processing time calculation.
-penn ... use Penn treebank-like tokenization.
-protected FILE ... specify file with patters to be protected in tokenisation.
-no-escape ... don't perform HTML escaping on apostrophy, quotes, etc.
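If the tokenizer needs to be called from a script rather than the shell, the options listed above can be assembled programmatically. The following is a minimal sketch, not part of Moses: the helper names `build_tokenizer_command` and `tokenize_file` are hypothetical, and only the `-l`, `-threads`, and `-no-escape` flags from the usage text are used.

```python
import subprocess


def build_tokenizer_command(script, lang="en", no_escape=True, threads=1):
    """Assemble the argument list for tokenizer.perl.

    Hypothetical helper; the flags come from the script's usage text.
    """
    cmd = ["perl", str(script), "-l", lang, "-threads", str(threads)]
    if no_escape:
        cmd.append("-no-escape")
    return cmd


def tokenize_file(script, src, dst, **options):
    """Run tokenizer.perl with src on stdin and dst on stdout."""
    cmd = build_tokenizer_command(script, **options)
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        # check=True raises CalledProcessError if perl exits with an error
        subprocess.run(cmd, stdin=fin, stdout=fout, check=True)
```

Assuming the working directory is mosesdecoder/scripts/tokenizer/, a call such as `tokenize_file("./tokenizer.perl", "input.en", "tokenizedfile.en", lang="en")` would be the equivalent of redirecting a file through the script on the command line.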
The tokenizer mainly separates punctuation from words; see tokenizer.perl for the details.
For example, take a file input.en:
Are you sure you want to cancel the upgrade?
Enemy's march trail's color will turn blue (originally red)
Clicking "Change Appearance" will replace your custom avatar with a default avatar.
Run:
perl ./tokenizer.perl -l en -no-escape < input.en > tokenizedfile.en
Output:
Are you sure you want to cancel the upgrade ?
Enemy 's march trail 's color will turn blue ( originally red )
Clicking " Change Appearance " will replace your custom avatar with a default avatar .
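To make the punctuation splitting concrete, here is a deliberately simplified Python sketch that reproduces the three lines above. It is not the Moses algorithm: the function name `simple_tokenize` is hypothetical, and the real tokenizer.perl does considerably more (for example, it uses per-language abbreviation lists so that the period in "Mr." is not split off), which this sketch ignores.

```python
import re


def simple_tokenize(line: str) -> str:
    """Rough approximation of punctuation splitting; NOT the Moses implementation."""
    # Put spaces around every character that is not a word character,
    # whitespace, or an apostrophe: "upgrade?" -> "upgrade ? "
    text = re.sub(r"([^\w\s'])", r" \1 ", line)
    # Split a clitic from the preceding word: "Enemy's" -> "Enemy 's"
    text = re.sub(r"(\w)'(\w)", r"\1 '\2", text)
    # Collapse the runs of whitespace introduced above
    return " ".join(text.split())


print(simple_tokenize("Enemy's march trail's color will turn blue (originally red)"))
# Enemy 's march trail 's color will turn blue ( originally red )
```

The sketch only shows the general idea of padding punctuation with spaces; for real data, use tokenizer.perl itself.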
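Notes:
- Pass -no-escape. Without it you get the result shown in the figure below, where characters such as ' and " are escaped as HTML entities (for example, 's becomes &apos;s and " becomes &quot;).
- The -l option takes a language code such as en or de; if you pass a language the tokenizer does not know, it falls back to en tokenization.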
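The effect of omitting -no-escape can be illustrated with a short Python sketch. The replacement table below is an assumption based on reading tokenizer.perl, so check it against your copy of the script; the function names `escape_like_moses` and `unescape_like_moses` are hypothetical and exist only for this illustration.

```python
# Replacement table assumed from tokenizer.perl; verify against your version.
MOSES_ESCAPES = [
    ("&", "&amp;"),   # must come first so later entities are not double-escaped
    ("|", "&#124;"),
    ("<", "&lt;"),
    (">", "&gt;"),
    ("'", "&apos;"),
    ('"', "&quot;"),
    ("[", "&#91;"),
    ("]", "&#93;"),
]


def escape_like_moses(tokenized: str) -> str:
    """Apply the replacements in table order."""
    for raw, entity in MOSES_ESCAPES:
        tokenized = tokenized.replace(raw, entity)
    return tokenized


def unescape_like_moses(escaped: str) -> str:
    """Undo the replacements in reverse order, restoring '&' last."""
    for raw, entity in reversed(MOSES_ESCAPES):
        escaped = escaped.replace(entity, raw)
    return escaped


print(escape_like_moses("Enemy 's march trail 's color"))
# Enemy &apos;s march trail &apos;s color
```

If a corpus has already been tokenized without -no-escape, a reverse mapping like `unescape_like_moses` is one way to recover the plain characters, again assuming the table matches the script.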