Background

Many projects need a search engine, but for Chinese content we cannot simply rely on a MySQL LIKE query. Chinese words are not separated by spaces, and searching domain-specific Chinese text well requires specialized vocabulary extracted from a corpus. That leaves us with two things to do:

  • 1 Extract specialized words/vocabulary (brands, character names, and so on) from an existing corpus; you can also polish the dictionary manually (a short jieba sketch follows this list).
  • 2 Import that specialized vocabulary into the search engine (Elasticsearch) so it can index those keywords as split by the IK analyzer.
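
To see why a specialized dictionary matters, here is a minimal Python sketch using jieba (introduced under Prerequisite below); the custom word 抗蓝光 and the sample product title are reused from the examples later in this post:

import jieba

sentence = "纪梵希润唇抗蓝光"   # sample product title, reused from the tokenizer test below

# Without a custom entry, the domain term 抗蓝光 may be split into generic pieces
print(jieba.lcut(sentence))

# After registering the specialized word, it is kept as one token
jieba.add_word("抗蓝光")
print(jieba.lcut(sentence))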

Prerequisite

Elasticsearch

Elasticsearch Analysis Plugin: IK

Jieba ("stutter"), a Chinese text segmentation tool: https://github.com/fxsjy/jieba

Another option: https://github.com/NLPchina/ansj_seg

How to do it

Install Elasticsearch/Kibana/IK Analyzer with Docker

docker-compose.yml

version: '2.2'
services:
  kibana:
    image: docker.elastic.co/kibana/kibana:7.11.2
    volumes:
      - ./kibana.yml:/usr/share/kibana/config/kibana.yml
    networks:
      - elastic
    ports:
      - 9601:9601
      - 8893:8893
    environment:
      SERVER_NAME: kibana.example.org
      ELASTICSEARCH_HOSTS: http://172.17.0.3:9200
  es01:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.11.2
    container_name: es01
    environment:
      - node.name=es01
      - cluster.name=es-docker-cluster
      - cluster.initial_master_nodes=es01
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms4g -Xmx4g"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - data01:/usr/share/elasticsearch/data
      - ./plugin/ik:/usr/share/elasticsearch/plugins/ik
    ports:
      - 0.0.0.0:9200:9200
    networks:
      - elastic

volumes:
  data01:
    driver: local

networks:
  elastic:
    driver: bridge
    ipam:
      driver: default
      config:
        - subnet: 172.33.238.0/24
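
With docker-compose.yml and kibana.yml in place (kibana.yml is shown next), a quick sketch to bring the stack up and confirm Elasticsearch is reachable and the IK plugin is loaded; adjust host/port if you changed the mappings above:

docker-compose up -d
curl http://localhost:9200                  # should return the cluster info JSON
curl http://localhost:9200/_cat/plugins?v   # the IK analysis plugin should appear here once plugin/ik is populated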

kibana.yml

server.port: 9601
server.host: "0.0.0.0"
enterpriseSearch.host: "http://172.17.0.3:9200"
elasticsearch.hosts: "http://172.17.0.3:9200"

The directory layout is shown below.
We need to download elasticsearch-analysis-ik-7.11.2.zip and unzip it into the plugin directory, since that directory is mapped as a volume in docker-compose.yml (a download sketch follows the tree).

./
├── docker-compose.yml
├── kibana.yml
└── plugin
    └── ik
        ├── commons-codec-1.9.jar
        ├── commons-logging-1.2.jar
        ├── config
        │   ├── extra_main.dic
        │   ├── extra_single_word.dic
        │   ├── extra_single_word_full.dic
        │   ├── extra_single_word_low_freq.dic
        │   ├── extra_stopword.dic
        │   ├── Cosmetics.dic
        │   ├── IKAnalyzer.cfg.xml
        │   ├── main.dic
        │   ├── preposition.dic
        │   ├── quantifier.dic
        │   ├── stopword.dic
        │   ├── suffix.dic
        │   └── surname.dic
        ├── elasticsearch-analysis-ik-7.11.2.jar
        ├── elasticsearch-analysis-ik-7.11.2.zip
        ├── httpclient-4.5.2.jar
        ├── httpcore-4.4.4.jar
        ├── plugin-descriptor.properties
        └── plugin-security.policy
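
To populate plugin/ik as shown above, download the IK release that matches your Elasticsearch version and unzip it there; a sketch (the release URL is an assumption based on the IK project's GitHub releases and may differ for your version):

mkdir -p plugin/ik
cd plugin/ik
wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.11.2/elasticsearch-analysis-ik-7.11.2.zip
unzip elasticsearch-analysis-ik-7.11.2.zip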

Cosmetics.dic is the file that contains the Chinese words that should be kept as single tokens.
Here is an example with a few cosmetics terms and brands

烟酰胺
抗蓝光
康萃乐
怡丽丝尔

Segmentation (get Cosmetics.dic from corpus)

Here we need the Jieba segmentation tool, which aims to be the best Python Chinese word segmentation module.
The corpus needs to be ready in plain-text (txt) format.
The corpus should contain the specialized Chinese words of your domain.

git clone https://github.com/fxsjy/jieba
cd jieba/test
python extract_tags_with_weight.py chinese-specialized-corpus.txt -k 10000 -w 1

Taking a cosmetics corpus as an example, the result should look like the output below; these words can be imported into Cosmetics.dic.

If possible, manually editing and refining the keywords brings a better search experience.

tag: 50ml		 weight: 0.213126
tag: 雅诗兰黛 weight: 0.155035
tag: ysl weight: 0.144286
tag: 新版 weight: 0.143799
tag: 面霜 weight: 0.139773
tag: 100ml weight: 0.132471
tag: 阿玛尼 weight: 0.116844
tag: 兰蔻 weight: 0.114584
tag: 粉底液 weight: 0.113962
tag: 香奈儿 weight: 0.108075
tag: 精华 weight: 0.104448
tag: 30ml weight: 0.100020
tag: sk2 weight: 0.099609
tag: 科颜氏 weight: 0.091035
tag: 面膜 weight: 0.089536
tag: 洁面 weight: 0.086686
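
If you would rather run the extraction from your own script than the bundled test file, jieba exposes the same TF-IDF keyword extraction through jieba.analyse; a minimal sketch, assuming the corpus filename used above:

import jieba.analyse

with open("chinese-specialized-corpus.txt", encoding="utf-8") as f:
    content = f.read()

# topK / withWeight mirror the -k and -w flags of extract_tags_with_weight.py
for tag, weight in jieba.analyse.extract_tags(content, topK=10000, withWeight=True):
    print("tag: %s\t weight: %f" % (tag, weight))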

IK Configuration

plugin/ik/config/IKAnalyzer.cfg.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!--Users can configure their own extended dictionary here-->
    <entry key="ext_dict">Cosmetics.dic</entry>
    <!--Users can configure their own extended stop word dictionary here-->
    <entry key="ext_stopwords"></entry>
    <!--Users can configure the remote extension dictionary here -->
    <entry key="remote_ext_dict">https://xxxxxx/word_for_elasticsearch.txt</entry>
    <!--The user can configure the remote extension stop word dictionary here-->
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
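
The ext_dict entry is read when Elasticsearch starts, so restart the es01 container after editing IKAnalyzer.cfg.xml or the dictionary file; the remote_ext_dict URL, by contrast, is polled periodically by the plugin, so remote words can be updated without a restart:

docker-compose restart es01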

Test the IK Tokenizer

Open the Kibana Dev Tools console: http://kibana.xxx.com/app/dev_tools#/console

POST _analyze
{
  "analyzer": "ik_smart",
  "text": "纪梵希润唇抗蓝光"
}

Here we can see the analyzer has taken effect: the text above has been split into three Chinese words.

{
  "tokens" : [
    {
      "token" : "纪梵希",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "润唇",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "抗蓝光",
      "start_offset" : 5,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 2
    }
  ]
}

ES Configuration

1: Create Index

Run the following in a shell to create the index:

curl -XPUT http://localhost:9200/product_inventory
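
You can confirm the index exists before adding the mapping (same localhost:9200 endpoint assumed):

curl http://localhost:9200/_cat/indices?v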

2: Create a mapping for the index with ik_max_word

Suppose we are building a search engine for e-commerce with the fields Id and product_desc,
where product_desc is the field we want to search with full text.

curl -XPOST http://localhost:9200/product_inventory/_mapping -H 'Content-Type: application/json' -d'{
  "properties" : {
    "@timestamp" : {
      "type" : "date"
    },
    "Id" : {
      "type" : "long"
    },
    "product_desc" : {
      "type" : "text",
      "analyzer": "ik_max_word",
      "search_analyzer": "ik_max_word"
    },
    "price" : {
      "type" : "long"
    },
    "updateTime" : {
      "type" : "date",
      "format" : "yyyy-MM-dd HH:mm:ss"
    }
  }
}'
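
To double-check that product_desc picked up the IK analyzer, fetch the mapping back:

curl http://localhost:9200/product_inventory/_mapping?pretty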

3: Insert Data to ES

Run in the Kibana Dev Console

POST /product_inventory/_create/1
{
  "product_desc": "抗蓝光洗面奶",
  "price": 3,
  "updateTime": "2021-01-03 12:22:11"
}
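
For more than a handful of documents, the _bulk API is the usual route; a sketch with made-up example rows, run in the Kibana Dev Console like the request above:

POST /product_inventory/_bulk
{ "index": { "_id": "2" } }
{ "product_desc": "怡丽丝尔防晒乳", "price": 5, "updateTime": "2021-01-04 09:10:00" }
{ "index": { "_id": "3" } }
{ "product_desc": "康萃乐益生菌", "price": 8, "updateTime": "2021-01-04 10:30:00" }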

4: Search Data from ES

Search sorted by _score; each result is returned with its _score.

POST /product_inventory/_search?size=995&from=0
{
  "query" : {
    "match": {
      "product_desc": "抗蓝光"
    }
  },
  "sort": {
    "_score" : "desc"
  }
}
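
If you also want to see which fragments matched, the same match query can request highlighting (a sketch; the field comes from the mapping above):

POST /product_inventory/_search
{
  "query": {
    "match": { "product_desc": "抗蓝光" }
  },
  "highlight": {
    "fields": { "product_desc": {} }
  }
}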