Lucene.net (4.8.0) + PanGu Analyzer Notes, Part 1: Constructing an Analyzer and Its Internal Member ReuseStrategy

2021-09-27 21:29:07

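Preface: I am currently working on full-text search with Lucene.net and the PanGu analyzer, migrating a project that someone else built. The whole project has to move to ASP.NET Core 2.0, but it uses Lucene 3.6.0, and its PanGu analyzer also targets Lucene 3.6.0. Fortunately Lucene.net already has a version that runs on Core 2.0, the 4.8.0 beta. As for PanGu, someone is working on a port that appears to be finished but has not been tested yet. I will mark every change that comes from the Lucene upgrade in bold.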

Lucene.net 4.8.0

https://github.com/apache/lucenenet

PanGu analyzer (ready to use)

https://github.com/SilentCC/Lucene.Net.Analysis.PanGu

JIEba analyzer (ready to use)

https://github.com/SilentCC/JIEba-netcore2.0

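Lucene.net 4.8.0 changed quite a lot compared with Lucene.net 3.6.0. Here I record the problems I ran into during development, in the hope that this helps others who need to upgrade Lucene.net. This is also my first contact with Lucene, so I hope it helps people who are just starting to learn it.

I. Lucene's tokenizer: Analyzer

This is only a brief description of Lucene's Analyzer; I will write more detailed notes on it later. An Analyzer in Lucene is a tokenizer: it takes text (both the documents to be written to the index and the query conditions) and tokenizes it, producing a series of Tokens. Other tokenization tools we use, such as PanGu, all derive from Analyzer, and they extend the related classes and override the related methods. So how does an Analyzer take part in the search process?

1. When writing the index:

We need an IndexWriter. Note that the Lucene 3.6.0 way of constructing an IndexWriter has been dropped. The new constructor depends on an IndexWriterConfig class, which records the IndexWriter's properties and configuration; I will not go into it here. The IndexWriterConfig constructor in turn requires an Analyzer: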

    IndexWriterConfig(LuceneVersion matchVersion, Analyzer analyzer)

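So an Analyzer is used whenever we write to the index. The index is structured as follows: the unit of storage is the Document class, and a Document contains many Fields (name, value). You can think of a Document as a record in a database table and a Field as one of its columns. Take an article that we want to put into the index so that people can search for it later.

An article has many attributes: Title: xxx; Author: xxxx; Content: xxxx;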

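    // In 4.8.0 a field is created through a concrete type such as TextField or StringField.
    var document = new Document();
    document.Add(new TextField("Title", "Lucene", Field.Store.YES));
    document.Add(new StringField("Author", "dacc", Field.Store.YES));
    document.Add(new TextField("Content", "xxxxxx", Field.Store.YES));
    // indexWriter is an IndexWriter instance built from an IndexWriterConfig.
    indexWriter.AddDocument(document);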

That is roughly the process. The Analyzer's job here is to tokenize the value of each Field, splitting long content into small Tokens.
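The indexing steps above can be put together as a minimal sketch. It assumes the Lucene.Net and Lucene.Net.Analysis.Common 4.8.0 beta packages are referenced; the class name IndexingSketch and the sample field values are made up for illustration, and any Analyzer (PanGu included) can be passed in.

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Util;

// Hypothetical helper, not part of Lucene.Net or PanGu.
public static class IndexingSketch
{
    // Writes one document into the given directory and returns the document count of the index.
    public static int BuildIndex(Lucene.Net.Store.Directory directory, Analyzer analyzer)
    {
        // The analyzer is handed to the writer through IndexWriterConfig.
        var config = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer);
        using (var writer = new IndexWriter(directory, config))
        {
            var document = new Document();
            // TextField values are run through the analyzer.
            document.Add(new TextField("Title", "Lucene", Field.Store.YES));
            // StringField values are indexed as a single token, without analysis.
            document.Add(new StringField("Author", "dacc", Field.Store.YES));
            document.Add(new TextField("Content", "full-text search and tokenization", Field.Store.NO));
            writer.AddDocument(document);
        } // disposing the writer commits the pending changes

        using (var reader = DirectoryReader.Open(directory))
        {
            return reader.NumDocs;
        }
    }
}
```

For a quick experiment this could be called as IndexingSketch.BuildIndex(new RAMDirectory(), new StandardAnalyzer(LuceneVersion.LUCENE_48)).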

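2. When querying:

We also need an Analyzer here, although it is not strictly required, unlike with IndexWriter, which always needs one. Its job is to tokenize the query text. For example, if the query is "全文检索和分词", the Analyzer first breaks it into "全文检索" and "分词", then the index is searched for Fields containing those tokens, and the Documents that those Fields belong to are returned. I will not go into the details of searching here; I will write detailed notes on that later as well.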
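As a rough sketch of the query side, the snippet below parses the query text with an Analyzer and counts the hits. It assumes the Lucene.Net.QueryParser 4.8.0 beta package is referenced and that the index has an analyzed field named "Content"; the SearchSketch class and that field name are illustrative assumptions.

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers.Classic;
using Lucene.Net.Search;
using Lucene.Net.Util;

// Hypothetical helper, not part of Lucene.Net or PanGu.
public static class SearchSketch
{
    // Returns how many documents match queryText in the "Content" field.
    public static int CountHits(Lucene.Net.Store.Directory directory, Analyzer analyzer, string queryText)
    {
        using (var reader = DirectoryReader.Open(directory))
        {
            var searcher = new IndexSearcher(reader);
            // The parser uses the analyzer to split the query text into tokens.
            var parser = new QueryParser(LuceneVersion.LUCENE_48, "Content", analyzer);
            Query query = parser.Parse(queryText);
            TopDocs hits = searcher.Search(query, 10);
            return hits.TotalHits;
        }
    }
}
```

Using the same Analyzer for indexing and for querying keeps the query tokens consistent with the tokens stored in the index.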

II. The problem:

Now that we have a rough idea of what an Analyzer is, here is the problem I ran into:

1. Calling the Analyzer's GetTokenStream throws

Object reference not set to an instance of an object

This exception means that a reference whose value is null was dereferenced. So I went digging through the source and found:

    public TokenStream GetTokenStream(string fieldName, TextReader reader)
    {
        TokenStreamComponents components = reuseStrategy.GetReusableComponents(this, fieldName);
        TextReader r = InitReader(fieldName, reader);
        if (components == null)
        {
            components = CreateComponents(fieldName, r);
            reuseStrategy.SetReusableComponents(this, fieldName, components);
        }
        else
        {
            components.SetReader(r);
        }
        return components.TokenStream;
    }

The error was thrown on this statement:

    TokenStreamComponents components = reuseStrategy.GetReusableComponents(this, fieldName);

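reuseStrategy was null, so this statement failed. This is a good place to look inside Analyzer. GetTokenStream returns the Analyzer's TokenStream, which is the sequence of Tokens produced from the text. I will not go into what TokenStream does in detail, because that would take a lot of space. The key to obtaining the TokenStream is reuseStrategy. In the new version of Lucene, the TokenStream inside an Analyzer is reusable: an Analyzer caches its components per thread, so repeated GetTokenStream calls on the same thread reuse the same TokenStream instead of building a new one.

    internal DisposableThreadLocal<object> storedValue = new DisposableThreadLocal<object>();

The storedValue member of Analyzer is that thread-local cache, and it holds the reusable TokenStreamComponents (the Tokenizer together with the resulting TokenStream). reuseStrategy, which did not exist in Lucene 3.6.0, decides how storedValue is read and written. The ReuseStrategy class has the member functions GetReusableComponents and SetReusableComponents, which get and set those components.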

Here is the source of the ReuseStrategy class. It is an abstract class, nested inside Analyzer:

    public abstract class ReuseStrategy
    {
        /// <summary>
        /// Gets the reusable <see cref="TokenStreamComponents"/> for the field with the given name.
        /// </summary>
        /// <param name="analyzer"> <see cref="Analyzer"/> from which to get the reused components. Use
        ///        <see cref="GetStoredValue(Analyzer)"/> and <see cref="SetStoredValue(Analyzer, object)"/>
        ///        to access the data on the <see cref="Analyzer"/>. </param>
        /// <param name="fieldName"> Name of the field whose reusable <see cref="TokenStreamComponents"/>
        ///        are to be retrieved </param>
        /// <returns> Reusable <see cref="TokenStreamComponents"/> for the field, or <c>null</c>
        ///         if there was no previous components for the field </returns>
        public abstract TokenStreamComponents GetReusableComponents(Analyzer analyzer, string fieldName);

        /// <summary>
        /// Stores the given <see cref="TokenStreamComponents"/> as the reusable components for the
        /// field with the give name.
        /// </summary>
        /// <param name="analyzer"> Analyzer </param>
        /// <param name="fieldName"> Name of the field whose <see cref="TokenStreamComponents"/> are being set </param>
        /// <param name="components"> <see cref="TokenStreamComponents"/> which are to be reused for the field </param>
        public abstract void SetReusableComponents(Analyzer analyzer, string fieldName, TokenStreamComponents components);

        /// <summary>
        /// Returns the currently stored value.
        /// </summary>
        /// <returns> Currently stored value or <c>null</c> if no value is stored </returns>
        /// <exception cref="ObjectDisposedException"> if the <see cref="Analyzer"/> is closed. </exception>
        protected internal object GetStoredValue(Analyzer analyzer)
        {
            if (analyzer.storedValue == null)
            {
                throw new ObjectDisposedException(this.GetType().GetTypeInfo().FullName, "this Analyzer is closed");
            }
            return analyzer.storedValue.Get();
        }

        /// <summary>
        /// Sets the stored value.
        /// </summary>
        /// <param name="analyzer"> Analyzer </param>
        /// <param name="storedValue"> Value to store </param>
        /// <exception cref="ObjectDisposedException"> if the <see cref="Analyzer"/> is closed. </exception>
        protected internal void SetStoredValue(Analyzer analyzer, object storedValue)
        {
            if (analyzer.storedValue == null)
            {
                throw new ObjectDisposedException("this Analyzer is closed");
            }
            analyzer.storedValue.Set(storedValue);
        }
    }

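Another nested class in Analyzer, GlobalReuseStrategy, derives from the abstract ReuseStrategy class. Together these two classes implement setting and getting the TokenStreamComponents on an Analyzer, which makes the flow of GetTokenStream in Analyzer clear.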

    public sealed class GlobalReuseStrategy : ReuseStrategy
    {
        /// <summary>
        /// Sole constructor. (For invocation by subclass constructors, typically implicit.) </summary>
        [Obsolete("Don't create instances of this class, use Analyzer.GLOBAL_REUSE_STRATEGY")]
        public GlobalReuseStrategy()
        { }

        public override TokenStreamComponents GetReusableComponents(Analyzer analyzer, string fieldName)
        {
            return (TokenStreamComponents)GetStoredValue(analyzer);
        }

        public override void SetReusableComponents(Analyzer analyzer, string fieldName, TokenStreamComponents components)
        {
            SetStoredValue(analyzer, components);
        }
    }
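To make the mechanism concrete without pulling in Lucene, here is a small self-contained model of the same pattern. Every type in it (Components, MiniReuseStrategy, MiniGlobalStrategy, MiniAnalyzer) is a hypothetical stand-in invented for this sketch, not a Lucene.Net type, and System.Threading.ThreadLocal is used in place of DisposableThreadLocal. It shows both the per-thread reuse and why a null strategy produces the exception above.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for TokenStreamComponents.
public sealed class Components
{
    public string Reader;
}

// Hypothetical stand-in for ReuseStrategy.
public abstract class MiniReuseStrategy
{
    public abstract Components Get(MiniAnalyzer analyzer);
    public abstract void Set(MiniAnalyzer analyzer, Components components);
}

// Hypothetical stand-in for GlobalReuseStrategy: one cached value for all fields.
public sealed class MiniGlobalStrategy : MiniReuseStrategy
{
    public override Components Get(MiniAnalyzer analyzer) => analyzer.Stored.Value;
    public override void Set(MiniAnalyzer analyzer, Components components) => analyzer.Stored.Value = components;
}

// Hypothetical stand-in for Analyzer.
public sealed class MiniAnalyzer
{
    // One slot per analyzer instance and per thread, like storedValue.
    internal readonly ThreadLocal<Components> Stored = new ThreadLocal<Components>();
    private readonly MiniReuseStrategy strategy;

    // Counts how many times components had to be built from scratch.
    public int Created { get; private set; }

    public MiniAnalyzer(MiniReuseStrategy strategy)
    {
        this.strategy = strategy;
    }

    public Components GetComponents(string reader)
    {
        // Throws NullReferenceException when the strategy passed to the constructor was null.
        Components components = strategy.Get(this);
        if (components == null)
        {
            components = new Components { Reader = reader };
            Created++;
            strategy.Set(this, components);
        }
        else
        {
            // Reuse the cached components and only swap the input.
            components.Reader = reader;
        }
        return components;
    }
}
```

Calling GetComponents twice on one MiniAnalyzer from the same thread returns the same Components object, while new MiniAnalyzer(null).GetComponents("x") fails on the first line of the method, which mirrors the failure in GetTokenStream.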

Here is GetTokenStream again, this time with comments, showing how the Analyzer obtains and stores its TokenStream:

    public TokenStream GetTokenStream(string fieldName, TextReader reader)
    {
        // First fetch the TokenStreamComponents cached by a previous call, if any.
        TokenStreamComponents components = reuseStrategy.GetReusableComponents(this, fieldName);
        TextReader r = InitReader(fieldName, reader);
        // If there are none, create them.
        if (components == null)
        {
            components = CreateComponents(fieldName, r);
            // Store the new components so that the next call can reuse them.
            reuseStrategy.SetReusableComponents(this, fieldName, components);
        }
        else
        {
            // If the components were created earlier, just point them at the new reader.
            components.SetReader(r);
        }
        // Return the TokenStream.
        return components.TokenStream;
    }
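For context, this is roughly how a caller consumes the TokenStream returned by GetTokenStream. It is a minimal sketch assuming the Lucene.Net 4.8.0 beta package; the TokenizeSketch class is a name made up for this example.

```csharp
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.TokenAttributes;

// Hypothetical helper, not part of Lucene.Net or PanGu.
public static class TokenizeSketch
{
    // Runs text through the analyzer and collects the token terms in order.
    public static List<string> Tokenize(Analyzer analyzer, string fieldName, string text)
    {
        var tokens = new List<string>();
        using (TokenStream stream = analyzer.GetTokenStream(fieldName, new StringReader(text)))
        {
            // The attribute instance is updated in place on every IncrementToken call.
            ICharTermAttribute term = stream.AddAttribute<ICharTermAttribute>();
            stream.Reset();
            while (stream.IncrementToken())
            {
                tokens.Add(term.ToString());
            }
            stream.End();
        }
        return tokens;
    }
}
```

Because the components are reused, the stream from one call should be fully consumed and disposed before GetTokenStream is called again on the same thread.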

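So when we create an Analyzer, note that Analyzer has a constructor

    public Analyzer(ReuseStrategy reuseStrategy)
    {
        this.reuseStrategy = reuseStrategy;
    }

that sets the Analyzer's ReuseStrategy. I then found that the PanGu constructor I was using does not pass in a ReuseStrategy, so we need to supply a ReuseStrategy instance ourselves.

The PanGu analyzer's constructors: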

    public PanGuAnalyzer(bool originalResult)
        : this(originalResult, null, null)
    {
    }

    public PanGuAnalyzer(MatchOptions options, MatchParameter parameters)
        : this(false, options, parameters)
    {
    }

    public PanGuAnalyzer(bool originalResult, MatchOptions options, MatchParameter parameters)
        : base()
    {
        this.Initialize(originalResult, options, parameters);
    }

    public PanGuAnalyzer(bool originalResult, MatchOptions options, MatchParameter parameters, ReuseStrategy reuseStrategy)
        : base(reuseStrategy)
    {
        this.Initialize(originalResult, options, parameters);
    }

    protected virtual void Initialize(bool originalResult, MatchOptions options, MatchParameter parameters)
    {
        _originalResult = originalResult;
        _options = options;
        _parameters = parameters;
    }

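I was calling the second constructor, and the ReuseStrategy that ended up being passed in was null, so we need to provide an instance. In fact, Analyzer already supplies one for us:

    public static readonly ReuseStrategy GLOBAL_REUSE_STRATEGY = new GlobalReuseStrategy();

So a small change to the PanGu constructor is all it takes: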

    public PanGuAnalyzer(MatchOptions options, MatchParameter parameters)
        : this(false, options, parameters, Lucene.Net.Analysis.Analyzer.GLOBAL_REUSE_STRATEGY)
    {
    }
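If modifying the PanGu source is not an option, the four-argument constructor listed above can be called directly with the shared strategy. The sketch below assumes that PanGuAnalyzer lives in the Lucene.Net.Analysis.PanGu namespace and that MatchOptions and MatchParameter come from PanGu.Match with parameterless constructors; those namespaces and the PanGuAnalyzerFactory class are assumptions to check against the package actually in use.

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.PanGu; // assumed namespace of PanGuAnalyzer
using PanGu.Match;               // assumed namespace of MatchOptions and MatchParameter

// Hypothetical helper, not part of Lucene.Net or PanGu.
public static class PanGuAnalyzerFactory
{
    // Creates a PanGuAnalyzer with an explicit, non-null reuse strategy.
    public static Analyzer Create()
    {
        var options = new MatchOptions();
        var parameters = new MatchParameter();
        return new PanGuAnalyzer(false, options, parameters, Analyzer.GLOBAL_REUSE_STRATEGY);
    }
}
```

Passing the strategy explicitly keeps the fix in application code, at the cost of repeating it wherever an analyzer is created.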
