Many projects need a search engine, but for Chinese content we cannot simply rely on a MySQL LIKE query. Chinese words are not separated by spaces, and searching domain-specific Chinese sentences works much better when the search engine knows the specialized vocabulary of that domain. This requires us to do two things:
1 Extract specialized words/vocabulary (for example brand names or character names) from an existing corpus; you can also polish the dictionary manually.
2 Import that specialized vocabulary into the search engine (Elasticsearch), so that the IK analyzer can split text on those keywords and index them.
The directory layout is shown below. We need to download elasticsearch-analysis-ik-7.11.2.zip and unzip it into the plugin directory, which is mapped as a volume in docker-compose.yml.
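For reference, here is a minimal sketch of the relevant part of docker-compose.yml; the service name, host path, and port mapping are assumptions, and the only point is that the directory holding the unzipped IK plugin is mounted into the Elasticsearch container.

version: "3"
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.11.2
    environment:
      - discovery.type=single-node
    volumes:
      # host directory holding the unzipped elasticsearch-analysis-ik-7.11.2.zip (assumed path)
      - ./plugin/ik:/usr/share/elasticsearch/plugins/ik
    ports:
      - "9200:9200"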
Cosmetics.dic is the file that contains the Chinese words that should be kept as single tokens. Here is an example with a few cosmetics-related terms:
烟酰胺
抗蓝光
康萃乐
怡丽丝尔
Segmentation (get Cosmetics.dic from corpus)
Here we use the Jieba segmentation tool, which aims to be the best Python Chinese word segmentation module. The corpus needs to be prepared in txt format and should contain the specialized Chinese words of the domain. A sketch of extracting candidate words from the corpus is shown below.
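As a rough sketch (the file names corpus.txt and Cosmetics.dic and the frequency threshold are assumptions, not from the original setup), one simple way is to cut the corpus with Jieba, count the candidate words, and keep the frequent multi-character ones for manual polishing:

# -*- coding: utf-8 -*-
from collections import Counter

import jieba

MIN_FREQ = 5   # assumed threshold: keep words seen at least this many times
MIN_LEN = 2    # skip single characters

counter = Counter()
with open("corpus.txt", encoding="utf-8") as f:      # assumed corpus file name
    for line in f:
        # jieba.cut in exact mode splits each line into non-overlapping words
        for word in jieba.cut(line.strip()):
            if len(word) >= MIN_LEN and not word.isascii():
                counter[word] += 1

# one word per line is the format the IK analyzer expects in a .dic file
with open("Cosmetics.dic", "w", encoding="utf-8") as out:
    for word, freq in counter.most_common():
        if freq >= MIN_FREQ:
            out.write(word + "\n")

Jieba's TF-IDF keyword extraction (jieba.analyse.extract_tags) is another option for pulling out domain terms; either way, the resulting dictionary should still be polished by hand as mentioned above.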
Next, register the dictionary in IKAnalyzer.cfg.xml (in the IK plugin's config directory), either as a local ext_dict file or as a remote_ext_dict URL:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Users can configure their own extended dictionary here -->
    <entry key="ext_dict">/root/plugin/ik/config/huazhuangping.dic</entry>
    <!-- Users can configure their own extended stop word dictionary here -->
    <entry key="ext_stopwords"></entry>
    <!-- Users can configure the remote extension dictionary here -->
    <entry key="remote_ext_dict">https://xxxxxx/word_for_elasticsearch.txt</entry>
    <!-- The user can configure the remote extension stop word dictionary here -->
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
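After restarting Elasticsearch so the new dictionary is loaded, you can verify that a custom word survives as a single token with the _analyze API. The sketch below uses the official elasticsearch Python client (7.x) and assumes the cluster is reachable at localhost:9200:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed address

# ask the ik_max_word analyzer to tokenize a sentence that contains a dictionary word
result = es.indices.analyze(
    body={"analyzer": "ik_max_word", "text": "康萃乐益生菌胶囊"}
)
print([token["token"] for token in result["tokens"]])
# with the extension dictionary loaded, "康萃乐" should come back as one token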
Suppose we are building a search engine for an e-commerce site with the fields Id and product_desc, where product_desc is the field we want to search with full text.
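A minimal sketch of the index and query, assuming the elasticsearch Python client (7.x) and an index name of products (both assumptions): product_desc is analyzed with ik_max_word at index time and ik_smart at search time.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed address

# create the index; product_desc is tokenized by the IK analyzer
es.indices.create(
    index="products",  # assumed index name
    body={
        "mappings": {
            "properties": {
                "Id": {"type": "keyword"},
                "product_desc": {
                    "type": "text",
                    "analyzer": "ik_max_word",      # fine-grained tokens at index time
                    "search_analyzer": "ik_smart",  # coarser tokens at query time
                },
            }
        }
    },
)

# index a sample document and run a full-text match query against product_desc
es.index(index="products", id="1",
         body={"Id": "1", "product_desc": "烟酰胺美白精华液"},
         refresh=True)
resp = es.search(index="products",
                 body={"query": {"match": {"product_desc": "烟酰胺"}}})
print(resp["hits"]["total"]["value"])

Using ik_max_word for indexing and ik_smart for searching is a common trade-off: the index stores fine-grained tokens for recall, while queries are segmented more coarsely for precision.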