ElasticSearch (2): Using a Chinese Analyzer in ElasticSearch

12/2/2021

- Storage

Author: lomtom

Personal website: lomtom.cn

WeChat official account: 博思奥园

Your support is my greatest motivation.


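The IK analyzer is an analysis plugin with good support for Chinese. Compared with the analyzers built into ES, it copes much better with the richness of the Chinese language.

Analyzers: ik_smart, ik_max_word. Tokenizers: ik_smart, ik_max_word.

Since v5.0.0, the analyzer and tokenizer named ik have been removed; use ik_smart and ik_max_word instead.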

1. Download:

Method 1:

Plugin home page: https://github.com/medcl/elasticsearch-analysis-ik
Pick a release from https://github.com/medcl/elasticsearch-analysis-ik/releases, or download one directly, for example: https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.10.1/elasticsearch-analysis-ik-7.10.1.zip
Unzip the downloaded file.
Under es/plugins, create an ik directory and copy all of the extracted files into it.
Restart the es service.
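
The manual steps above can also be scripted. The following is a minimal sketch, assuming the release zip has already been downloaded; the function name install_ik_from_zip and both path arguments are illustrative and not part of the plugin.

```python
import zipfile
from pathlib import Path


def install_ik_from_zip(zip_path, es_home):
    """Extract a downloaded IK release zip into <es_home>/plugins/ik.

    Both arguments are hypothetical local paths. ES still has to be
    restarted afterwards before it picks up the plugin.
    """
    target = Path(es_home) / "plugins" / "ik"
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    # Return the top-level entries so the caller can check the result.
    return sorted(p.name for p in target.iterdir())
```

The returned list is only a convenience for checking that the archive contents landed directly under plugins/ik rather than in a nested folder.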

Note: the IK version must match your ES version.

IK version	ES version
master	7.x -> master
6.x	6.x
5.x	5.x
1.10.6	2.4.6
1.9.5	2.3.5
1.8.1	2.2.1
1.7.0	2.1.1
1.5.0	2.0.0
1.2.6	1.0.0
1.2.5	0.90.x
1.1.3	0.20.x
1.0.0	0.16.2-> 0.19.0

Method 2:

Install with elasticsearch-plugin (supported from v5.5.1):

./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.3.0/elasticsearch-analysis-ik-6.3.0.zip
Note: replace 6.3.0 with your own elasticsearch version.
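
Since only the version number changes between the two download links shown in this post, the URL can be built from the version. This is a small sketch under the assumption that the same URL pattern holds for the version you need; the helper names are made up for illustration, and not every ES version necessarily has a matching release.

```python
RELEASES = "https://github.com/medcl/elasticsearch-analysis-ik/releases/download"


def ik_release_url(es_version):
    """Build the release zip URL for an ES version such as "7.10.1".

    Assumes the pattern of the links in this post; a release for the
    requested version may not exist.
    """
    return f"{RELEASES}/v{es_version}/elasticsearch-analysis-ik-{es_version}.zip"


def plugin_install_command(es_version):
    """Return the elasticsearch-plugin command as an argument list."""
    return ["./bin/elasticsearch-plugin", "install", ik_release_url(es_version)]
```

Returning the command as a list keeps it ready for subprocess.run without any shell quoting.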

2. Test:

1. First, create an index:

PUT
http://localhost:9200/test

2. Run an analysis test against that index. Result:

{
    "tokens": [
        {
            "token": "我",
            "start_offset": 0,
            "end_offset": 1,
            "type": "CN_CHAR",
            "position": 0
        },
        {
            "token": "自己",
            "start_offset": 1,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "长得",
            "start_offset": 3,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "可",
            "start_offset": 5,
            "end_offset": 6,
            "type": "CN_CHAR",
            "position": 3
        },
        {
            "token": "真",
            "start_offset": 6,
            "end_offset": 7,
            "type": "CN_CHAR",
            "position": 4
        },
        {
            "token": "好看",
            "start_offset": 7,
            "end_offset": 9,
            "type": "CN_WORD",
            "position": 5
        }
    ]
}

3. Custom dictionary:

POST	http://localhost:9200/test/_analyze
{
    "analyzer" : "ik_smart" ,
    "text" :"公众号博思奥园"
}

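By default, without a custom dictionary, IK splits 博思奥园 into individual characters. If we want it kept as a single token, we can define a custom dictionary.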

{
    "tokens": [
        {
            "token": "公众",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "号",
            "start_offset": 2,
            "end_offset": 3,
            "type": "CN_CHAR",
            "position": 1
        },
        {
            "token": "博",
            "start_offset": 3,
            "end_offset": 4,
            "type": "CN_CHAR",
            "position": 2
        },
        {
            "token": "思",
            "start_offset": 4,
            "end_offset": 5,
            "type": "CN_CHAR",
            "position": 3
        },
        {
            "token": "奥",
            "start_offset": 5,
            "end_offset": 6,
            "type": "CN_CHAR",
            "position": 4
        },
        {
            "token": "园",
            "start_offset": 6,
            "end_offset": 7,
            "type": "CN_CHAR",
            "position": 5
        }
    ]
}
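
The _analyze request above can also be sent from a script. The sketch below uses only the Python standard library; the function names are illustrative, and it assumes an ES node with the IK plugin is reachable at the base URL you pass in.

```python
import json
import urllib.request


def build_analyze_request(base_url, index, analyzer, text):
    """Prepare a POST <base_url>/<index>/_analyze request without sending it."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tokens_of(response):
    """Extract just the token strings from a parsed _analyze response."""
    return [item["token"] for item in response["tokens"]]


def analyze(base_url, index, analyzer, text):
    """Send the request to a running ES node and return the token strings."""
    request = build_analyze_request(base_url, index, analyzer, text)
    with urllib.request.urlopen(request) as reply:
        return tokens_of(json.load(reply))
```

With a node running locally, analyze("http://localhost:9200", "test", "ik_smart", "公众号博思奥园") sends the same request as shown in this section.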

1. In the ik/config directory, create a dictionary file named myext.dic and add the words you need, one per line:

公众号
博思奥园

2. Register the dictionary in ik/config/IKAnalyzer.cfg.xml through the ext_dict entry (the remote dictionary entries stay commented out):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
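	<comment>IK Analyzer extension configuration</comment>
	<!-- Users can configure their own extension dictionary here -->
	<entry key="ext_dict">myext.dic</entry>
	<!-- Users can configure their own extension stopword dictionary here -->
	<entry key="ext_stopwords"></entry>
	<!-- Users can configure a remote extension dictionary here -->
	<!-- <entry key="remote_ext_dict">words_location</entry> -->
	<!-- Users can configure a remote extension stopword dictionary here -->
	<!-- <entry key="remote_ext_stopwords">words_location</entry> -->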
</properties>
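
The two steps above can be checked with a short script. This is a sketch only: the helper names are invented for illustration, the paths are whatever your installation uses, and the ';' separator for multiple dictionaries follows the convention in the IK README.

```python
import xml.etree.ElementTree as ET


def write_dictionary(path, words):
    """Write an IK dictionary file: UTF-8, one word per line."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for word in words:
            handle.write(word + "\n")


def configured_ext_dicts(cfg_path):
    """Return the dictionary names listed under ext_dict in IKAnalyzer.cfg.xml."""
    root = ET.parse(cfg_path).getroot()
    for entry in root.iter("entry"):
        if entry.get("key") == "ext_dict":
            # Several dictionaries may be listed, separated by ';'.
            names = (entry.text or "").split(";")
            return [name.strip() for name in names if name.strip()]
    return []
```

For the configuration shown above, configured_ext_dicts should report myext.dic, which is a quick way to catch a typo in the file name before restarting ES.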

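3. After ES has been restarted so that the new dictionary is loaded, sending the same request again returns the result we want: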

{
    "tokens": [
        {
            "token": "公众号",
            "start_offset": 0,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "博思奥园",
            "start_offset": 3,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 1
        }
    ]
}

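4. Going further: hot-updating the IK dictionary

The plugin currently supports hot updates of the IK dictionary through the following entries in the IK configuration file mentioned above: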

<!-- Users can configure a remote extension dictionary here -->
<entry key="remote_ext_dict">location</entry>
<!-- Users can configure a remote extension stopword dictionary here -->
<entry key="remote_ext_stopwords">location</entry>

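Here location is a URL, for example http://localhost:8080/myext.dic. The request only has to satisfy the following two conditions for hot updates to work.

The HTTP response must carry two headers, Last-Modified and ETag. Both are strings; as soon as either one changes, the plugin fetches the new word list and updates its dictionary.

The response body must contain one word per line, with \n as the line separator.

Once both conditions are met, the dictionary is hot-updated without restarting the ES instance.

You can keep the hot words in a UTF-8 encoded .txt file served by nginx or another simple HTTP server. When the .txt file is modified, the HTTP server automatically returns the updated Last-Modified and ETag headers the next time a client requests the file. A separate tool can extract relevant vocabulary from your business systems and update this .txt file.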
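
If nginx is not at hand, the two conditions above can be met with a few lines of standard-library Python. This is only a sketch: the file name myext.txt, the port, and the choice of a content hash as the ETag are assumptions made for illustration, not something the plugin prescribes.

```python
import hashlib
import os
from email.utils import formatdate
from http.server import BaseHTTPRequestHandler, HTTPServer

DICT_PATH = "myext.txt"  # hypothetical hot-word file, UTF-8, one word per line


def dictionary_headers(path):
    """Return (headers, content) for the dictionary file at path."""
    with open(path, "rb") as handle:
        content = handle.read()
    headers = {
        # Both values change whenever the file is edited.
        "Last-Modified": formatdate(os.path.getmtime(path), usegmt=True),
        "ETag": '"' + hashlib.sha256(content).hexdigest() + '"',
        "Content-Type": "text/plain; charset=utf-8",
        "Content-Length": str(len(content)),
    }
    return headers, content


class DictionaryHandler(BaseHTTPRequestHandler):
    def _send(self, include_body):
        headers, content = dictionary_headers(DICT_PATH)
        self.send_response(200)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()
        if include_body:
            self.wfile.write(content)

    def do_GET(self):
        self._send(include_body=True)

    def do_HEAD(self):
        # Answer HEAD as well, so a client can check the headers cheaply.
        self._send(include_body=False)


def serve(port=8080):
    """Serve the dictionary on http://localhost:<port>/ until interrupted."""
    HTTPServer(("localhost", port), DictionaryHandler).serve_forever()
```

Calling serve() and pointing remote_ext_dict at the resulting URL would be one way to try hot updates locally; every request path returns the same file in this sketch.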

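4. Comparison

ES also ships with its own default analyzer, so we can compare it with the IK analyzers.

Again using the test index, run the same text through the analyzer built into ES and the analyzers provided by IK: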

POST	http://localhost:9200/test/_analyze
1. Built into ES (standard)
{
    "analyzer" : "standard" ,
    "text" :"公众号博思奥园，中国文化博大精深"
}
2. ik_max_word, provided by IK
{
    "analyzer" : "ik_max_word" ,
    "text" :"公众号博思奥园，中国文化博大精深"
}
3. ik_smart, provided by IK, with the custom dictionary
{
    "analyzer" : "ik_smart" ,
    "text" :"公众号博思奥园，中国文化博大精深"
}

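Comparing the three results, we can see the following.

The standard analyzer provided by ES is not well suited to Chinese: it splits the text into individual characters.
The IK analyzers, by contrast, recognize words and idioms such as 博大精深. The difference between ik_max_word and ik_smart is one of granularity:

ik_max_word splits the text at the finest granularity and emits every word it can find, including overlapping ones; ik_smart splits at the coarsest granularity and emits the fewest, longest words.

The three results are: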

1. standard
{
    "tokens": [
        {
            "token": "公",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<IDEOGRAPHIC>",
            "position": 0
        },
        {
            "token": "众",
            "start_offset": 1,
            "end_offset": 2,
            "type": "<IDEOGRAPHIC>",
            "position": 1
        },
        {
            "token": "号",
            "start_offset": 2,
            "end_offset": 3,
            "type": "<IDEOGRAPHIC>",
            "position": 2
        },
        {
            "token": "博",
            "start_offset": 3,
            "end_offset": 4,
            "type": "<IDEOGRAPHIC>",
            "position": 3
        },
        {
            "token": "思",
            "start_offset": 4,
            "end_offset": 5,
            "type": "<IDEOGRAPHIC>",
            "position": 4
        },
        {
            "token": "奥",
            "start_offset": 5,
            "end_offset": 6,
            "type": "<IDEOGRAPHIC>",
            "position": 5
        },
        {
            "token": "园",
            "start_offset": 6,
            "end_offset": 7,
            "type": "<IDEOGRAPHIC>",
            "position": 6
        },
        {
            "token": "中",
            "start_offset": 8,
            "end_offset": 9,
            "type": "<IDEOGRAPHIC>",
            "position": 7
        },
        {
            "token": "国",
            "start_offset": 9,
            "end_offset": 10,
            "type": "<IDEOGRAPHIC>",
            "position": 8
        },
        {
            "token": "文",
            "start_offset": 10,
            "end_offset": 11,
            "type": "<IDEOGRAPHIC>",
            "position": 9
        },
        {
            "token": "化",
            "start_offset": 11,
            "end_offset": 12,
            "type": "<IDEOGRAPHIC>",
            "position": 10
        },
        {
            "token": "博",
            "start_offset": 12,
            "end_offset": 13,
            "type": "<IDEOGRAPHIC>",
            "position": 11
        },
        {
            "token": "大",
            "start_offset": 13,
            "end_offset": 14,
            "type": "<IDEOGRAPHIC>",
            "position": 12
        },
        {
            "token": "精",
            "start_offset": 14,
            "end_offset": 15,
            "type": "<IDEOGRAPHIC>",
            "position": 13
        },
        {
            "token": "深",
            "start_offset": 15,
            "end_offset": 16,
            "type": "<IDEOGRAPHIC>",
            "position": 14
        }
    ]
}

2. ik_max_word
{
    "tokens": [
        {
            "token": "公众",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "号",
            "start_offset": 2,
            "end_offset": 3,
            "type": "CN_CHAR",
            "position": 1
        },
        {
            "token": "博",
            "start_offset": 3,
            "end_offset": 4,
            "type": "CN_CHAR",
            "position": 2
        },
        {
            "token": "思",
            "start_offset": 4,
            "end_offset": 5,
            "type": "CN_CHAR",
            "position": 3
        },
        {
            "token": "奥",
            "start_offset": 5,
            "end_offset": 6,
            "type": "CN_CHAR",
            "position": 4
        },
        {
            "token": "园",
            "start_offset": 6,
            "end_offset": 7,
            "type": "CN_CHAR",
            "position": 5
        },
        {
            "token": "中国文化",
            "start_offset": 8,
            "end_offset": 12,
            "type": "CN_WORD",
            "position": 6
        },
        {
            "token": "中国",
            "start_offset": 8,
            "end_offset": 10,
            "type": "CN_WORD",
            "position": 7
        },
        {
            "token": "国文",
            "start_offset": 9,
            "end_offset": 11,
            "type": "CN_WORD",
            "position": 8
        },
        {
            "token": "文化",
            "start_offset": 10,
            "end_offset": 12,
            "type": "CN_WORD",
            "position": 9
        },
        {
            "token": "博大精深",
            "start_offset": 12,
            "end_offset": 16,
            "type": "CN_WORD",
            "position": 10
        },
        {
            "token": "博大",
            "start_offset": 12,
            "end_offset": 14,
            "type": "CN_WORD",
            "position": 11
        },
        {
            "token": "精深",
            "start_offset": 14,
            "end_offset": 16,
            "type": "CN_WORD",
            "position": 12
        }
    ]
}

3. ik_smart
{
    "tokens": [
        {
            "token": "公众号",
            "start_offset": 0,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "博思奥园",
            "start_offset": 3,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "中国文化",
            "start_offset": 8,
            "end_offset": 12,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "博大精深",
            "start_offset": 12,
            "end_offset": 16,
            "type": "CN_WORD",
            "position": 3
        }
    ]
}
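
The granularity difference can be made concrete by looking for tokens whose span lies inside a longer token. The sketch below does this for the second half of the sentence, using offsets copied from the ik_max_word and ik_smart results above; the helper names are illustrative.

```python
def tok(text, start, end):
    """Build a token dict with the fields used from an _analyze response."""
    return {"token": text, "start_offset": start, "end_offset": end}


def nested_tokens(tokens):
    """Return tokens whose span lies inside another, longer token."""
    nested = []
    for current in tokens:
        length = current["end_offset"] - current["start_offset"]
        for other in tokens:
            longer = other["end_offset"] - other["start_offset"] > length
            inside = (other["start_offset"] <= current["start_offset"]
                      and current["end_offset"] <= other["end_offset"])
            if longer and inside:
                nested.append(current["token"])
                break
    return nested


# Tokens for "中国文化博大精深", taken from the results above.
IK_MAX_WORD = [
    tok("中国文化", 8, 12), tok("中国", 8, 10), tok("国文", 9, 11),
    tok("文化", 10, 12), tok("博大精深", 12, 16), tok("博大", 12, 14),
    tok("精深", 14, 16),
]
IK_SMART = [tok("中国文化", 8, 12), tok("博大精深", 12, 16)]

print(nested_tokens(IK_MAX_WORD))  # ['中国', '国文', '文化', '博大', '精深']
print(nested_tokens(IK_SMART))     # []
```

ik_max_word emits the shorter words in addition to the long ones, while ik_smart keeps only the longest segmentation.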

Link: https://lomtom.cn/2face247
