Many projects need a search engine, but for Chinese content we cannot simply rely on a MySQL LIKE query. Chinese words are not separated by spaces, and searching domain-specific Chinese sentences works much better when the search engine knows the specialized vocabulary of that domain. This requires us to do two things:
1 Extract specialized words/vocabulary (for example brand names or character names) from an existing corpus; you can also polish the dictionary manually.
2 Import that specialized vocabulary into the search engine (Elasticsearch), so that the IK analyzer can split text on those keywords and index them.
The directory layout is shown below. We need to download elasticsearch-analysis-ik-7.11.2.zip and unzip it into the plugin directory, which is mapped as a volume in docker-compose.yml.
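For reference, here is a minimal sketch of the relevant part of docker-compose.yml; the service name, host path, and port mapping are assumptions, and the only point is that the directory holding the unzipped IK plugin is mounted into the Elasticsearch container.

version: "3"
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.11.2
    environment:
      - discovery.type=single-node
    volumes:
      # host directory holding the unzipped elasticsearch-analysis-ik-7.11.2.zip (assumed path)
      - ./plugin/ik:/usr/share/elasticsearch/plugins/ik
    ports:
      - "9200:9200"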
Cosmetics.dic is the file that contains the Chinese words that should be kept as single tokens. Here is an example with a few cosmetics-related terms:
烟酰胺
抗蓝光
康萃乐
怡丽丝尔
Segmentation (get Cosmetics.dic from corpus)
Here we use the Jieba segmentation tool, which aims to be the best Python Chinese word segmentation module. The corpus needs to be prepared in txt format and should contain the specialized Chinese words of the domain. A sketch of extracting candidate words from the corpus is shown below.
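As a rough sketch (the file names corpus.txt and Cosmetics.dic and the frequency threshold are assumptions, not from the original setup), one simple way is to cut the corpus with Jieba, count the candidate words, and keep the frequent multi-character ones for manual polishing:

# -*- coding: utf-8 -*-
from collections import Counter

import jieba

MIN_FREQ = 5   # assumed threshold: keep words seen at least this many times
MIN_LEN = 2    # skip single characters

counter = Counter()
with open("corpus.txt", encoding="utf-8") as f:      # assumed corpus file name
    for line in f:
        # jieba.cut in exact mode splits each line into non-overlapping words
        for word in jieba.cut(line.strip()):
            if len(word) >= MIN_LEN and not word.isascii():
                counter[word] += 1

# one word per line is the format the IK analyzer expects in a .dic file
with open("Cosmetics.dic", "w", encoding="utf-8") as out:
    for word, freq in counter.most_common():
        if freq >= MIN_FREQ:
            out.write(word + "\n")

Jieba's TF-IDF keyword extraction (jieba.analyse.extract_tags) is another option for pulling out domain terms; either way, the resulting dictionary should still be polished by hand as mentioned above.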
Next, register the dictionary in IKAnalyzer.cfg.xml (in the IK plugin's config directory), either as a local ext_dict file or as a remote_ext_dict URL:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Users can configure their own extended dictionary here -->
    <entry key="ext_dict">/root/plugin/ik/config/huazhuangping.dic</entry>
    <!-- Users can configure their own extended stop word dictionary here -->
    <entry key="ext_stopwords"></entry>
    <!-- Users can configure the remote extension dictionary here -->
    <entry key="remote_ext_dict">https://xxxxxx/word_for_elasticsearch.txt</entry>
    <!-- The user can configure the remote extension stop word dictionary here -->
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
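After restarting Elasticsearch so the new dictionary is loaded, you can verify that a custom word survives as a single token with the _analyze API. The sketch below uses the official elasticsearch Python client (7.x) and assumes the cluster is reachable at localhost:9200:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed address

# ask the ik_max_word analyzer to tokenize a sentence that contains a dictionary word
result = es.indices.analyze(
    body={"analyzer": "ik_max_word", "text": "康萃乐益生菌胶囊"}
)
print([token["token"] for token in result["tokens"]])
# with the extension dictionary loaded, "康萃乐" should come back as one token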
Suppose we are building a search engine for an e-commerce site with the fields Id and product_desc, where product_desc is the field we want to search with full text.
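A minimal sketch of the index and query, assuming the elasticsearch Python client (7.x) and an index name of products (both assumptions): product_desc is analyzed with ik_max_word at index time and ik_smart at search time.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed address

# create the index; product_desc is tokenized by the IK analyzer
es.indices.create(
    index="products",  # assumed index name
    body={
        "mappings": {
            "properties": {
                "Id": {"type": "keyword"},
                "product_desc": {
                    "type": "text",
                    "analyzer": "ik_max_word",      # fine-grained tokens at index time
                    "search_analyzer": "ik_smart",  # coarser tokens at query time
                },
            }
        }
    },
)

# index a sample document and run a full-text match query against product_desc
es.index(index="products", id="1",
         body={"Id": "1", "product_desc": "烟酰胺美白精华液"},
         refresh=True)
resp = es.search(index="products",
                 body={"query": {"match": {"product_desc": "烟酰胺"}}})
print(resp["hits"]["total"]["value"])

Using ik_max_word for indexing and ik_smart for searching is a common trade-off: the index stores fine-grained tokens for recall, while queries are segmented more coarsely for precision.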