4. Creating an Index with Lucene

2022-07-13 17:24:40

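I. Specify a directory to hold the index

This example uses an absolute path. The process must have read and write permission on that location.

// Location of the Lucene index directory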
String indexDir = "E:\\develop\\demo\\lucene-learn\\lucene-index";
File luceneIndexDirectory = new File(indexDir);
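Since the directory has to be readable and writable, it can help to check that before handing the path to Lucene. The following is a minimal sketch using only the JDK; the class name IndexDirCheck, the helper prepareIndexDir, and the temp-directory location are hypothetical and not part of the original example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class IndexDirCheck {

    /** Creates the directory if needed and verifies that it is readable and writable. */
    public static Path prepareIndexDir(String dir) throws IOException {
        Path path = Paths.get(dir).toAbsolutePath();
        Files.createDirectories(path); // does nothing if the directory already exists
        if (!Files.isDirectory(path) || !Files.isReadable(path) || !Files.isWritable(path)) {
            throw new IOException("Index directory is not usable: " + path);
        }
        return path;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical location under the system temp directory
        Path tmp = Files.createTempDirectory("lucene-learn");
        Path indexPath = prepareIndexDir(tmp.resolve("lucene-index").toString());
        System.out.println("Index directory ready: " + indexPath);
    }
}
```

The returned Path can be passed directly to FSDirectory.open(...), which avoids the detour through java.io.File.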

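II. Create an IndexWriter to write the index and documents to the index directory

Directory has several implementations. The two file-system ones that come up most often are:

FSDirectory: the abstract base class for directories on the file system. FSDirectory.open(path) picks a suitable implementation for the current platform (MMapDirectory on most 64-bit JVMs).

MMapDirectory: reads the index through memory-mapped files, so it relies on virtual address space and the operating system's page cache rather than the Java heap. It suits 64-bit machines with plenty of free memory.

MaxBufferedDocs is a tuning parameter, while forceMerge is a one-off operation rather than a setting. Adjust both to the read/write speed of your disk.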

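// Open the index directory.
// open() is a static method inherited from FSDirectory, so this call behaves like FSDirectory.open(...);
// use new MMapDirectory(path) to force memory mapping.
Directory fsd = MMapDirectory.open(luceneIndexDirectory.toPath());
// Configure the IndexWriter that writes the index and documents to the index directory
// (IKAnalyzer is a third-party Chinese analyzer)
IndexWriterConfig indexWriterConfig = new IndexWriterConfig(new IKAnalyzer());
indexWriterConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
// Flush buffered documents to a new segment once 1000 documents are buffered
indexWriterConfig.setMaxBufferedDocs(1000);
IndexWriter writer = new IndexWriter(fsd, indexWriterConfig);
// Merge the segments already in the index down to at most 100.
// Fewer segments usually make searches faster, but the merge itself costs extra I/O.
writer.forceMerge(100);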
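To make the effect of setMaxBufferedDocs and forceMerge more concrete, here is a small sketch that counts segments before and after a merge. It rests on several assumptions that differ from the example above: Lucene 8.x on the classpath, an in-memory ByteBuffersDirectory instead of a disk directory, StandardAnalyzer instead of IKAnalyzer, and a hypothetical class name ForceMergeSketch.

```java
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class ForceMergeSketch {

    /** Indexes 10 tiny documents and returns {segments before, segments after forceMerge(1)}. */
    public static int[] segmentCounts() throws IOException {
        try (Directory dir = new ByteBuffersDirectory()) {
            IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
            config.setMaxBufferedDocs(2); // flush a new segment after every 2 buffered documents
            try (IndexWriter writer = new IndexWriter(dir, config)) {
                for (int i = 0; i < 10; i++) {
                    Document doc = new Document();
                    doc.add(new StringField("id", String.valueOf(i), Field.Store.YES));
                    writer.addDocument(doc);
                }
                writer.commit();
                int before = countSegments(dir);

                writer.forceMerge(1); // merge everything into a single segment
                writer.commit();
                return new int[] {before, countSegments(dir)};
            }
        }
    }

    /** Opens a reader on the last commit and counts its segments (one leaf per segment). */
    private static int countSegments(Directory dir) throws IOException {
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            return reader.leaves().size();
        }
    }

    public static void main(String[] args) throws IOException {
        int[] counts = segmentCounts();
        System.out.println("segments before merge: " + counts[0] + ", after merge: " + counts[1]);
    }
}
```

The count before the merge depends on the merge policy, so the sketch only relies on the count after forceMerge(1), which is one segment.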

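Note: only one IndexWriter can be open on an index directory at a time. Creating a second one before the first is closed throws a LockObtainFailedException.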
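The single-writer rule can be observed with a short experiment. This is a sketch under assumptions: Lucene 8.x, an in-memory ByteBuffersDirectory (whose default lock factory enforces the lock within one JVM), StandardAnalyzer in place of IKAnalyzer, and a hypothetical class name WriterLockSketch.

```java
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.LockObtainFailedException;

public class WriterLockSketch {

    /** Returns true if a second IndexWriter on the same Directory is rejected with a lock error. */
    public static boolean secondWriterIsRejected() throws IOException {
        try (Directory dir = new ByteBuffersDirectory();
             IndexWriter first = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            // An IndexWriterConfig cannot be shared between writers, so a new one is created here
            try (IndexWriter second = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                return false; // not expected: the first writer still holds the write lock
            } catch (LockObtainFailedException e) {
                return true;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("second writer rejected: " + secondWriterIsRejected());
    }
}
```

In an application this usually means sharing one long-lived IndexWriter across threads rather than opening a writer per request.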

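III. Define the field types: whether each field is stored, tokenized, and indexed

IndexOptions values:

Value	Description
NONE	Not indexed
DOCS	Only documents are indexed; term frequencies and positions are omitted
DOCS_AND_FREQS	Documents and term frequencies are indexed
DOCS_AND_FREQS_AND_POSITIONS	Documents, term frequencies, and term positions are indexed
DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS	Documents, term frequencies, term positions, and character offsets are indexed

Stored: whether the original value is stored. A stored field can be read back from a document returned by a search. A field that is not stored can still be searched if it is indexed, but its content cannot be retrieved.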

// Define the field types
// 1. id: stored, indexed, not tokenized (DOCS alone would be enough for exact-match lookups)
FieldType idFieldType = new FieldType();
idFieldType.setStored(true);
idFieldType.setTokenized(false);
idFieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);

// 2. title: stored, tokenized, indexed
FieldType titleFieldType = new FieldType();
titleFieldType.setStored(true);
titleFieldType.setTokenized(true);
titleFieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);

// 3. content: stored, tokenized, indexed
FieldType contentFieldType = new FieldType();
contentFieldType.setStored(true);
contentFieldType.setTokenized(true);
contentFieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
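The difference between a stored and an unstored field is easiest to see by reading a document back. The sketch below indexes a stored "title" and an unstored but indexed "body", searches on the body, and inspects what comes back. Assumptions: Lucene 8.x (IndexSearcher.doc(int) is used to load stored fields), an in-memory ByteBuffersDirectory, StandardAnalyzer instead of IKAnalyzer, and hypothetical names (StoredFieldSketch, textType, the "body" field).

```java
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class StoredFieldSketch {

    /** Builds a tokenized, indexed field type that is stored or not, depending on the flag. */
    private static FieldType textType(boolean stored) {
        FieldType type = new FieldType();
        type.setStored(stored);
        type.setTokenized(true);
        type.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
        return type;
    }

    /** Indexes one document, finds it via the unstored field, and returns {title, body} as read back. */
    public static String[] roundTrip() throws IOException {
        try (Directory dir = new ByteBuffersDirectory()) {
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                doc.add(new Field("title", "IndexOptions overview", textType(true)));
                doc.add(new Field("body", "Lucene index options", textType(false)));
                writer.addDocument(doc);
            } // closing the writer commits the document

            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                // StandardAnalyzer lower-cases tokens, so the indexed term is "lucene"
                TopDocs hits = searcher.search(new TermQuery(new Term("body", "lucene")), 1);
                Document found = searcher.doc(hits.scoreDocs[0].doc);
                return new String[] {found.get("title"), found.get("body")};
            }
        }
    }

    public static void main(String[] args) throws IOException {
        String[] values = roundTrip();
        System.out.println("title = " + values[0] + ", body = " + values[1]);
    }
}
```

The document is found through "body" even though that field's text cannot be retrieved, which is the usual trade-off for large content fields: searchable, but kept out of the stored data to save space.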

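IV. Create the document that holds the data and add each value with its field type

If there are several Document objects, create a List<Document> and add each assembled Document to the list.

// Create the document object and add the field values to it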
Document document = new Document();
document.add(new Field("id", "1", idFieldType));
document.add(new Field("title", "IndexOptions类说明", titleFieldType));
document.add(new Field("content", "IndexOptions是在lucene-core-x.jar包下面，其作用是在新建索引时候选择索引属性。", contentFieldType));

V. Add the document to the index

addDocument adds a single Document. To add a List<Document> in one call, use addDocuments instead.

// Add the document to the IndexWriter
writer.addDocument(document);
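For several documents, a batch can be collected in a list and passed to addDocuments. The sketch below is a minimal illustration under assumptions: Lucene 8.x, an in-memory ByteBuffersDirectory, StandardAnalyzer instead of IKAnalyzer, the convenience field classes StringField and TextField in place of custom FieldType objects, and a hypothetical class name AddDocumentsSketch.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class AddDocumentsSketch {

    /** Adds three documents in one addDocuments call and returns how many the writer now holds. */
    public static int indexBatch() throws IOException {
        List<Document> docs = new ArrayList<>();
        for (int i = 1; i <= 3; i++) {
            Document doc = new Document();
            doc.add(new StringField("id", String.valueOf(i), Field.Store.YES));          // not tokenized
            doc.add(new TextField("title", "document number " + i, Field.Store.YES));   // tokenized
            docs.add(doc);
        }
        try (Directory dir = new ByteBuffersDirectory();
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            writer.addDocuments(docs); // accepts any Iterable of documents
            // A near-real-time reader opened from the writer also sees uncommitted documents
            try (DirectoryReader reader = DirectoryReader.open(writer)) {
                return reader.numDocs();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("documents indexed: " + indexBatch());
    }
}
```

Note that addDocuments indexes the list as one block with adjacent document IDs, so calling addDocument in a loop is an equally valid choice when that grouping is not needed.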

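VI. Save the index and data to the index directory

After adding documents, remember to commit. Added documents are first buffered in memory and are only flushed to segment files when a writer threshold (such as MaxBufferedDocs) is reached, and even flushed segments do not become part of the index that readers see until commit() is called. If the program stops before a commit, everything added since the last commit is lost.

// Commit the index; commit() flushes buffered documents itself, so a separate flush() call is not needed
writer.commit();
// Close the writer to release the write lock and other resources
writer.close();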
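The distinction between flush() and commit() can be checked by opening a fresh reader on the directory after each call. This is a sketch under assumptions: Lucene 8.x, an in-memory ByteBuffersDirectory, StandardAnalyzer instead of IKAnalyzer, and a hypothetical class name CommitVisibilitySketch. It also uses try-with-resources so that the writer is closed, and its lock released, even if an exception is thrown.

```java
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class CommitVisibilitySketch {

    /** Returns the document counts a fresh reader sees {after flush(), after commit()}. */
    public static int[] visibleDocs() throws IOException {
        try (Directory dir = new ByteBuffersDirectory();
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            writer.commit(); // initial empty commit, so that a reader can be opened on the directory

            Document doc = new Document();
            doc.add(new StringField("id", "1", Field.Store.YES));
            writer.addDocument(doc);

            writer.flush(); // writes a segment, but does not publish a new commit point
            int afterFlush = countDocs(dir);

            writer.commit(); // publishes the segment; new readers now see the document
            int afterCommit = countDocs(dir);
            return new int[] {afterFlush, afterCommit};
        }
    }

    /** Opens a reader on the directory's latest commit and counts its documents. */
    private static int countDocs(Directory dir) throws IOException {
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            return reader.numDocs();
        }
    }

    public static void main(String[] args) throws IOException {
        int[] counts = visibleDocs();
        System.out.println("visible after flush: " + counts[0] + ", after commit: " + counts[1]);
    }
}
```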

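VII. Inspect the index with Luke

Download Luke 8.0.0, start it, and select the index directory to browse the index. Use a Luke release that matches the Lucene version that wrote the index, since a mismatched version may fail to open it.

Appendix: complete code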

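import java.io.File;
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.MMapDirectory;
import org.junit.Test;                        // assumption: JUnit 4; use org.junit.jupiter.api.Test for JUnit 5
import org.wltea.analyzer.lucene.IKAnalyzer;  // assumption: package used by common IK Analyzer builds

// The class name is a placeholder; the original shows only the test method
public class LuceneIndexTest {

    @Test
    public void buildIndex() {
        // Location of the Lucene index directory
        String indexDir = "E:\\develop\\demo\\lucene-learn\\lucene-index";
        File luceneIndexDirectory = new File(indexDir);
        // Open the index directory.
        // MMapDirectory uses memory-mapped files to speed up reads, at the cost of virtual address space.
        // MMapDirectory.open(...) resolves to the static FSDirectory.open(...), so both lines below behave the same;
        // use new MMapDirectory(path) to force memory mapping.
        //try (FSDirectory fsd = FSDirectory.open(luceneIndexDirectory.toPath())) {
        try (Directory fsd = MMapDirectory.open(luceneIndexDirectory.toPath())) {
            // Configure the index writer
            IndexWriterConfig indexWriterConfig = new IndexWriterConfig(new IKAnalyzer());
            indexWriterConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
            // Flush buffered documents to a new segment once 1000 documents are buffered
            indexWriterConfig.setMaxBufferedDocs(1000);

            IndexWriter writer = new IndexWriter(fsd, indexWriterConfig);
            // Merge the segments already in the index down to at most 100.
            // Fewer segments usually make searches faster, but the merge itself costs extra I/O.
            writer.forceMerge(100);

            // Define the field types
            // 1. id: stored, indexed, not tokenized
            FieldType idFieldType = new FieldType();
            idFieldType.setStored(true);
            idFieldType.setTokenized(false);
            idFieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
            // 2. title: stored, tokenized, indexed
            FieldType titleFieldType = new FieldType();
            titleFieldType.setStored(true);
            titleFieldType.setTokenized(true);
            titleFieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
            // 3. content: stored, tokenized, indexed
            FieldType contentFieldType = new FieldType();
            contentFieldType.setStored(true);
            contentFieldType.setTokenized(true);
            contentFieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);

            // Create the document object and add the field values to it
            Document document = new Document();
            document.add(new Field("id", "1", idFieldType));
            document.add(new Field("title", "IndexOptions类说明", titleFieldType));
            document.add(new Field("content", "IndexOptions是在lucene-core-x.jar包下面，其作用是在新建索引时候选择索引属性。", contentFieldType));

            // Add the document to the IndexWriter
            writer.addDocument(document);

            // Commit the index; commit() flushes buffered documents itself, so flush() is not needed
            writer.commit();
            // Close the writer to release the write lock
            writer.close();
        } catch (IOException e) {
            System.err.println("Failed to open or write the index directory");
            e.printStackTrace();
        }
    }
}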