5 POS Tagger
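Overview: The part-of-speech (POS) tagger marks each token with its word type, based on the token itself and on its context. A token may have several possible POS tags, depending on the token and the context. The OpenNLP POS tagger uses a probability model to predict the correct POS tag from the tag set. To limit the possible tags for a token, a tag dictionary can be used, which improves the tagger's tagging and runtime performance.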
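To make the role of the tag dictionary concrete, here is a minimal sketch in plain Java, with no OpenNLP dependency. It is not how OpenNLP implements the feature: the class name, the method name, and the scores are all made up for illustration. It only shows the idea that a dictionary entry narrows the candidate tags before the most probable one is chosen.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class TagDictionarySketch {

    /**
     * Picks the highest-scoring tag for a token. If the dictionary lists the
     * token, only the listed tags are considered; otherwise every scored tag
     * is a candidate. Returns null when no candidate remains.
     */
    public static String bestTag(String token,
                                 Map<String, Double> scores,
                                 Map<String, Set<String>> tagDictionary) {
        Set<String> allowed = tagDictionary.get(token);
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (allowed != null && !allowed.contains(e.getKey())) {
                continue; // tag ruled out by the dictionary
            }
            if (e.getValue() > bestScore) {
                bestScore = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical scores for the token "you"; a real model would compute these.
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("JJ", 0.45);
        scores.put("PRP", 0.40);
        scores.put("NN", 0.15);

        // Hypothetical dictionary: "you" may only be a personal pronoun.
        Map<String, Set<String>> dict = new HashMap<>();
        dict.put("you", Collections.singleton("PRP"));

        System.out.println(bestTag("you", scores, new HashMap<>())); // no entry: JJ
        System.out.println(bestTag("you", scores, dict));            // restricted: PRP
    }
}
```

Fewer candidates per token also means fewer tags to score, which is where the runtime gain comes from.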
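API: The POS tagger training API supports training a new POS model. Three basic steps are needed:
- The application must open a sample data stream
- Call the POSTaggerME.train method
- Save the POSModel to a file or database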
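The sample data stream in the first step delivers tagged sentences. In OpenNLP's native training format each line holds one sentence, and each token is written as word_tag with whitespace between the pairs. The sketch below is plain Java with no OpenNLP dependency and is not the library's own parser (OpenNLP provides WordTagSampleStream for that); the class name, method name, and sample sentence are made up for illustration. It shows how one such line breaks down into parallel token and tag arrays.

```java
import java.util.ArrayList;
import java.util.List;

public class WordTagLineSketch {

    /**
     * Splits one training line into {tokens, tags}, cutting each item at its
     * last underscore. Throws if an item is not in word_tag form.
     */
    public static String[][] parse(String line) {
        List<String> tokens = new ArrayList<>();
        List<String> tags = new ArrayList<>();
        for (String item : line.trim().split("\\s+")) {
            int cut = item.lastIndexOf('_');
            if (cut <= 0 || cut == item.length() - 1) {
                throw new IllegalArgumentException("Not in word_tag form: " + item);
            }
            tokens.add(item.substring(0, cut));
            tags.add(item.substring(cut + 1));
        }
        return new String[][] {
            tokens.toArray(new String[0]),
            tags.toArray(new String[0])
        };
    }

    public static void main(String[] args) {
        // Hypothetical training sentence in word_tag form.
        String[][] parsed = parse("How_WRB are_VBP you_PRP ?_.");
        for (int i = 0; i < parsed[0].length; i++) {
            System.out.println(parsed[0][i] + " -> " + parsed[1][i]);
        }
    }
}
```

Cutting at the last underscore keeps tokens that themselves contain an underscore intact.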
Create a file named myText.txt on drive E: with the following content:
Hi. How are you? This is Mike.
Implementation 1:
package package01;

import opennlp.tools.cmdline.PerformanceMonitor;
import opennlp.tools.cmdline.postag.POSModelLoader;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;

public class Test05 {

    public static void main(String[] args) {
        try {
            Test05.POSTag();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * 4. POS Tagger
     * Sample output: Hi._NNP How_WRB are_VBP you?_JJ This_DT is_VBZ Mike._NNP
     *
     * https://*.com/questions/50668754/the-constructor-plaintextbylinestreamstringreader-is-undefined
     */
    public static void POSTag() throws IOException {
        // POSModelLoader prints the model loading time
        POSModel model = new POSModelLoader().load(new File("E:\\NLP_Practics\\models\\en-pos-maxent.bin"));
        // Reports throughput in sentences per second on System.err
        PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
        POSTaggerME tagger = new POSTaggerME(model);

        // The constructor PlainTextByLineStream(StringReader) is undefined in the
        // OpenNLP version used here (see the link above), so the two lines below
        // are replaced by reading the file through an InputStreamFactory.
        // String input = "Hi. How are you? This is Mike.";
        // ObjectStream<String> lineStream = new PlainTextByLineStream(new StringReader(input));
        Charset charset = Charset.forName("UTF-8");
        InputStreamFactory isf = new MarkableFileInputStreamFactory(new File("E:\\myText.txt"));
        ObjectStream<String> lineStream = new PlainTextByLineStream(isf, charset);

        perfMon.start();
        String line;
        while ((line = lineStream.read()) != null) {
            // Split on whitespace only, so punctuation stays attached to the words
            String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize(line);
            String[] tags = tagger.tag(tokens);
            // POSSample.toString() renders each pair as token_tag
            POSSample sample = new POSSample(tokens, tags);
            System.out.println(sample.toString());
            perfMon.incrementCounter();
        }
        perfMon.stopAndPrintFinalResult();
        System.out.println("--------------4-------------");
        lineStream.close();
    }
}
Output
Loading POS Tagger model ... done (0.566s)
Hi._NNP How_WRB are_VBP you?_JJ This_DT is_VBZ Mike._NNP
--------------4-------------
Average: 125.0 sent/s
Total: 1 sent
Runtime: 0.008s
Implementation 2:
package package01;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.util.InvalidFormatException;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class Test05 {

    public static void main(String[] args) throws IOException {
        Test05.POSMaxent("Hi. How are you? This is Mike.");
    }

    /**
     * Example of the OpenNLP POS tagging tool: the maximum-entropy tagger (pos-maxent).
     * JJ adjective, JJR comparative adjective, JJS superlative adjective
     * RB adverb, RBR comparative adverb, RBS superlative adverb
     * DT determiner
     * NN noun, NNS plural noun, NNP proper noun, NNPS plural proper noun
     * PRP personal pronoun, PRP$ possessive pronoun
     * VB verb base form, VBD past tense, VBN past participle,
     * VBZ third-person singular present, VBP non-third-person singular present,
     * VBG gerund or present participle
     */
    public static void POSMaxent(String str) throws InvalidFormatException, IOException {
        // Path to the POS model
        File posModeFile = new File("E:\\NLP_Practics\\models\\en-pos-maxent.bin");
        // try-with-resources closes the stream even if loading the model fails
        try (FileInputStream posModeStream = new FileInputStream(posModeFile)) {
            POSModel model = new POSModel(posModeStream);
            POSTaggerME tagger = new POSTaggerME(model);
            // Split the sentence into tokens
            String[] words = SimpleTokenizer.INSTANCE.tokenize(str);
            // Pass the tokenized sentence to the tagger
            String[] result = tagger.tag(words);
            for (int i = 0; i < words.length; i++) {
                System.out.print(words[i] + "/" + result[i] + " ");
            }
            System.out.println();
            // Output: Hi/UH ./. How/WRB are/VBP you/PRP ?/. This/DT is/VBZ Mike/NNP ./.
        }
    }
}
Output
Hi/UH ./. How/WRB are/VBP you/PRP ?/. This/DT is/VBZ Mike/NNP ./.
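The two implementations tag the same sentence differently (you?_JJ in the first, you/PRP ?/. in the second) because they tokenize it differently: WhitespaceTokenizer leaves punctuation attached to the word, while SimpleTokenizer splits it off. The sketch below reproduces that difference in plain Java without OpenNLP. The class and method names are made up for illustration, and the character-class splitter only approximates SimpleTokenizer; for instance, it keeps a run of punctuation such as "..." together as a single token.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenizerComparisonSketch {

    /** Splits on runs of whitespace only, so punctuation stays attached to words. */
    public static List<String> whitespaceTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    /** Starts a new token whenever the character class (letter, digit, other) changes. */
    public static List<String> charClassTokens(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int previousClass = -1;
        for (char c : text.toCharArray()) {
            // 0 = whitespace, 1 = letter, 2 = digit, 3 = anything else
            int cls = Character.isWhitespace(c) ? 0
                    : Character.isLetter(c) ? 1
                    : Character.isDigit(c) ? 2 : 3;
            if (cls != previousClass && current.length() > 0) {
                out.add(current.toString()); // class changed: close the token
                current.setLength(0);
            }
            if (cls != 0) {
                current.append(c); // whitespace separates tokens but is never kept
            }
            previousClass = cls;
        }
        if (current.length() > 0) {
            out.add(current.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "Hi. How are you? This is Mike.";
        System.out.println(whitespaceTokens(s)); // [Hi., How, are, you?, This, is, Mike.]
        System.out.println(charClassTokens(s));  // [Hi, ., How, are, you, ?, This, is, Mike, .]
    }
}
```

Since the tagger sees "you?" as one unknown-looking token in the first case, its guess (JJ) is less reliable than the PRP it assigns to the clean token "you" in the second.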