I'm using Lucene in a project and need a custom analyzer.
The code is:
    public class MyCommentAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents( String fieldName, Reader reader ) {
            Tokenizer source = new StandardTokenizer( Version.LUCENE_48, reader );
            TokenStream filter = new StandardFilter( Version.LUCENE_48, source );
            filter = new StopFilter( Version.LUCENE_48, filter, StandardAnalyzer.STOP_WORDS_SET );
            return new TokenStreamComponents( source, filter );
        }
    }
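I've built this, but now I'm stuck. I need a filter that lets through only certain words. This is the opposite of using stop words: instead of removing the terms that appear in a word list, keep only the terms that appear in it, like a prebuilt dictionary.
So StopFilter doesn't serve the purpose, and none of the filters Lucene provides seems to fit.
I think I need to write my own filter, but I don't know how.
Any suggestions?
Solution:
You should look to StopFilter for a starting point, so read the source!
Most of StopFilter's source consists of convenience methods for building the stop set. You can safely ignore all of that (unless you want to keep it around for building your keep set).
Cut all of that out, and StopFilter boils down to: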
    public final class StopFilter extends FilteringTokenFilter {
        private final CharArraySet stopWords;
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

        public StopFilter(Version matchVersion, TokenStream in, CharArraySet stopWords) {
            super(matchVersion, in);
            this.stopWords = stopWords;
        }

        @Override
        protected boolean accept() {
            return !stopWords.contains(termAtt.buffer(), 0, termAtt.length());
        }
    }
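FilteringTokenFilter is a very simple class. The key is just the accept method. It is called for the current term: if it returns true, the term is added to the output stream; if it returns false, the current term is discarded.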
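To make the accept contract concrete, here is a minimal Lucene-free sketch of the loop described above. It is a model, not Lucene's source: `SketchFilteringFilter`, `MinLengthSketchFilter`, and `AcceptLoopSketch` are hypothetical names, a plain `Iterator<String>` stands in for the token stream, and the position-increment bookkeeping that the real FilteringTokenFilter performs is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical, Lucene-free model of the accept() contract.
abstract class SketchFilteringFilter {
    private final Iterator<String> input;
    protected String term; // stands in for CharTermAttribute

    SketchFilteringFilter(Iterator<String> input) {
        this.input = input;
    }

    // Subclasses decide whether the current term survives.
    protected abstract boolean accept();

    // Advances to the next accepted term; returns false once the input is exhausted.
    final boolean incrementToken() {
        while (input.hasNext()) {
            term = input.next();
            if (accept()) {
                return true; // term is passed on to the consumer
            }
            // Rejected: drop the term and keep reading.
        }
        return false;
    }
}

// Example subclass: keeps only terms of at least minLength characters.
class MinLengthSketchFilter extends SketchFilteringFilter {
    private final int minLength;

    MinLengthSketchFilter(Iterator<String> input, int minLength) {
        super(input);
        this.minLength = minLength;
    }

    @Override
    protected boolean accept() {
        return term.length() >= minLength;
    }
}

class AcceptLoopSketch {
    // Collects every term the filter lets through.
    static List<String> drain(SketchFilteringFilter filter) {
        List<String> out = new ArrayList<>();
        while (filter.incrementToken()) {
            out.add(filter.term);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("a", "custom", "of", "analyzer");
        System.out.println(drain(new MinLengthSketchFilter(tokens.iterator(), 3)));
        // prints [custom, analyzer]
    }
}
```

The subclass supplies only the predicate; the skipping logic lives entirely in the base class, which is why a StopFilter variant needs so little code.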
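So the only thing you really need to change in StopFilter is to delete a single character (the `!`), so that accept returns the opposite of what it does now. Changing a few names here and there wouldn't hurt, either.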
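    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.util.CharArraySet;
    import org.apache.lucene.analysis.util.FilteringTokenFilter;
    import org.apache.lucene.util.Version;

    public final class KeepOnlyFilter extends FilteringTokenFilter {
        private final CharArraySet keepWords;
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

        public KeepOnlyFilter(Version matchVersion, TokenStream in, CharArraySet keepWords) {
            super(matchVersion, in);
            this.keepWords = keepWords;
        }

        @Override
        protected boolean accept() {
            // Same test as StopFilter, minus the negation: keep only listed terms.
            return keepWords.contains(termAtt.buffer(), 0, termAtt.length());
        }
    }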
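As a quick sanity check on the inversion, the following Lucene-free sketch runs the same tokens through both predicates. Everything in it is illustrative: `KeepVersusStopSketch` and its methods are hypothetical names, and a plain `Set<String>` stands in for `CharArraySet`.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical illustration: stop and keep predicates over the same word set.
class KeepVersusStopSketch {
    // Mirrors StopFilter.accept(): pass everything NOT in the set.
    static Predicate<String> stopAccept(Set<String> words) {
        return term -> !words.contains(term);
    }

    // Mirrors KeepOnlyFilter.accept(): pass only what IS in the set.
    static Predicate<String> keepAccept(Set<String> words) {
        return words::contains;
    }

    static List<String> filter(List<String> tokens, Predicate<String> accept) {
        return tokens.stream().filter(accept).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Set<String> words = new HashSet<>(Arrays.asList("lucene", "analyzer"));
        List<String> tokens = Arrays.asList("my", "lucene", "custom", "analyzer");

        System.out.println(filter(tokens, stopAccept(words))); // prints [my, custom]
        System.out.println(filter(tokens, keepAccept(words))); // prints [lucene, analyzer]
    }
}
```

The two outputs partition the input, which is exactly the effect of removing the negation. Note that matching in this sketch is case-sensitive; the analyzer in the question has no lowercasing step, so the keep set would probably need to match the tokens' case or be built to ignore case.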