Posted to commits@doris.apache.org by yi...@apache.org on 2023/06/25 11:13:47 UTC

[doris] branch master updated: [Improvement](doc) improve ngram and inverted index documents #21091

This is an automated email from the ASF dual-hosted git repository.

yiguolei pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris.git


The following commit(s) were added to refs/heads/master by this push:
     new 69d5adaee3 [Improvement](doc) improve ngram and inverted index documents #21091
69d5adaee3 is described below

commit 69d5adaee3cb1e2595c586a9f984a481de239a1c
Author: Kang <kx...@gmail.com>
AuthorDate: Sun Jun 25 19:13:41 2023 +0800

    [Improvement](doc) improve ngram and inverted index documents #21091
---
 docs/en/docs/data-table/index/inverted-index.md    | 12 +++++------
 .../data-table/index/ngram-bloomfilter-index.md    |  2 +-
 docs/zh-CN/docs/data-table/index/inverted-index.md | 24 +++++++++++-----------
 .../data-table/index/ngram-bloomfilter-index.md    |  2 +-
 4 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/docs/en/docs/data-table/index/inverted-index.md b/docs/en/docs/data-table/index/inverted-index.md
index 57216d8ad4..331fa90491 100644
--- a/docs/en/docs/data-table/index/inverted-index.md
+++ b/docs/en/docs/data-table/index/inverted-index.md
@@ -74,15 +74,15 @@ The features for inverted index is as follows:
       - missing stands for no parser, the whole field is considered to be a term
       - "english" stands for english parser
       - "chinese" stands for chinese parser
-      - "unicode" stands for mixed-type word segmentation suitable for situations with a mix of Chinese and English. It can segment email prefixes and suffixes, IP addresses, and mixed characters and numbers, and can also segment Chinese characters into 1-gram.
+      - "unicode" stands for muti-language mixed word segmentation suitable for situations with a mix of Chinese and English. It can segment email prefixes and suffixes, IP addresses, and mixed characters and numbers, and can also segment Chinese characters one by one.
 
     - "parser_mode" is utilized to set the tokenizer/parser type for Chinese word segmentation.
-      - in "fine_grained" mode, the system will meticulously tokenize each possible segment.
-      - in "coarse_grained" mode, the system follows the maximization principle, performing accurate and comprehensive tokenization.
+      - in "fine_grained" mode, the system tend to generate short words, eg. 6 words '武汉' '武汉市' '市长' '长江' '长江大桥' '大桥' for '武汉长江大桥'.
+      - in "coarse_grained" mode, the system tend to generate long words, eg. 2 words '武汉市' '市长' '长江大桥' for '武汉长江大桥'.
       - default mode is "coarse_grained".
-    - "support_phrase" is utilized to specify if the index requires support for phrase mode. 
-      - "true" indicates that support is needed.
-      - "false" indicates that support is not needed.
+    - "support_phrase" is utilized to specify if the index requires support for phrase mode query MATCH_PHRASE
+      - "true" indicates that support is needed, but needs more storage for index.
+      - "false" indicates that support is not needed, and less storage for index. MATCH_ALL can be used for matching multi words without order.
       - default mode is "false".
   - COMMENT is optional
 
diff --git a/docs/en/docs/data-table/index/ngram-bloomfilter-index.md b/docs/en/docs/data-table/index/ngram-bloomfilter-index.md
index e3e04eb315..d804c28b7e 100644
--- a/docs/en/docs/data-table/index/ngram-bloomfilter-index.md
+++ b/docs/en/docs/data-table/index/ngram-bloomfilter-index.md
@@ -29,7 +29,7 @@ under the License.
 <version since="2.0.0">
 </version>
 
-In order to improve the like query performance, the NGram BloomFilter index was implemented, which referenced to the ClickHouse's ngrambf skip indices;
+In order to improve the like query performance, the NGram BloomFilter index was implemented.
 
 ## Create Column With NGram BloomFilter Index
 
diff --git a/docs/zh-CN/docs/data-table/index/inverted-index.md b/docs/zh-CN/docs/data-table/index/inverted-index.md
index 3ac4992519..15f7485d8e 100644
--- a/docs/zh-CN/docs/data-table/index/inverted-index.md
+++ b/docs/zh-CN/docs/data-table/index/inverted-index.md
@@ -52,7 +52,7 @@ Doris倒排索引的功能简要介绍如下:
 - 增加了字符串类型的全文检索
   - 支持字符串全文检索,包括同时匹配多个关键字MATCH_ALL、匹配任意一个关键字MATCH_ANY、匹配短语词组MATCH_PHRASE
   - 支持字符串数组类型的全文检索
-  - 支持英文、中文以及混合类型分词
+  - 支持英文、中文以及Unicode多语言分词
 - 加速普通等值、范围查询,覆盖bitmap索引的功能,未来会代替bitmap索引
   - 支持字符串、数值、日期时间类型的 =, !=, >, >=, <, <= 快速过滤
   - 支持字符串、数字、日期时间数组类型的 =, !=, >, >=, <, <=
@@ -72,16 +72,16 @@ Doris倒排索引的功能简要介绍如下:
     - parser指定分词器
       - 默认不指定代表不分词
       - english是英文分词,适合被索引列是英文的情况,用空格和标点符号分词,性能高
-      - chinese是中文分词,适合被索引列有中文或者中英文混合的情况,性能比english分词低
-      - unicode是混合类型分词,适用于中英文混合的情况。它能够对邮箱前缀和后缀、IP地址以及字符数字混合进行分词,并且可以对中文字符进行1-gram分词。
-    - parser_mode用于指定中文分词的模式
-      - fine_grained模式,系统将对可以进行分词的部分都进行详尽的分词处理
-      - coarse_grained模式,系统则依据最大化原则,执行精确且全面的分词操作
-      - 默认coarse_grained模式
-    - support_phrase用于指定索引是否需要支持短语模式
-      - true为需要
-      - false为不需要
-      - 默认false不需要
+      - chinese是中文分词,适合被索引列主要是中文的情况,性能比english分词低
+      - unicode是多语言混合类型分词,适用于中英文混合、多语言混合的情况。它能够对邮箱前缀和后缀、IP地址以及字符数字混合进行分词,并且可以对中文按字符分词。
+    - parser_mode用于指定分词的模式,目前parser = chinese时支持如下几种模式:
+      - fine_grained:细粒度模式,倾向于分出比较短的词,比如 '武汉长江大桥' 会分成 '武汉', '武汉市', '市长', '长江', '长江大桥', '大桥' 6个词
+      - coarse_grained:粗粒度模式,倾向于分出比较长的词,,比如 '武汉长江大桥' 会分成 '武汉市' '长江大桥' 2个词
+      - 默认coarse_grained
+    - support_phrase用于指定索引是否支持MATCH_PHRASE短语查询加速
+      - true为支持,但是索引需要更多的存储空间
+      - false为不支持,更省存储空间,可以用MATCH_ALL查询多个关键字
+      - 默认false
   - COMMENT 是可选的,用于指定注释
 
 ```sql
@@ -150,7 +150,7 @@ USE test_inverted_index;
 
 -- 创建表的同时创建了comment的倒排索引idx_comment
 --   USING INVERTED 指定索引类型是倒排索引
---   PROPERTIES("parser" = "english") 指定采用english分词,还支持"chinese"中文分词和"unicode"中英文混合分词,如果不指定"parser"参数表示不分词
+--   PROPERTIES("parser" = "english") 指定采用english分词,还支持"chinese"中文分词和"unicode"中英文多语言混合分词,如果不指定"parser"参数表示不分词
 CREATE TABLE hackernews_1m
 (
     `id` BIGINT,
diff --git a/docs/zh-CN/docs/data-table/index/ngram-bloomfilter-index.md b/docs/zh-CN/docs/data-table/index/ngram-bloomfilter-index.md
index ea6304253a..27e2b23592 100644
--- a/docs/zh-CN/docs/data-table/index/ngram-bloomfilter-index.md
+++ b/docs/zh-CN/docs/data-table/index/ngram-bloomfilter-index.md
@@ -29,7 +29,7 @@ under the License.
 <version since="2.0.0">
 </version>
 
-为了提升like的查询性能,增加了NGram BloomFilter索引,其实现主要参照了ClickHouse的ngrambf。
+为了提升like的查询性能,增加了NGram BloomFilter索引。
 
 ## NGram BloomFilter创建
 

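A note on the "support_phrase" change in the inverted-index.md hunk above: the new text contrasts MATCH_PHRASE (phrase query) with MATCH_ALL (multiple words, no order). The following is a minimal Python sketch of that difference only. It is not Doris code: the whitespace tokenizer and the function names `tokenize`, `match_all`, and `match_phrase` are assumptions made up for illustration, and real parsers ("english", "chinese", "unicode") segment text differently.

```python
def tokenize(text):
    """Lowercase and split on whitespace (hypothetical stand-in for a real parser)."""
    return text.lower().split()


def match_all(doc, query):
    """True if every query term occurs somewhere in the document, in any order."""
    doc_terms = set(tokenize(doc))
    return all(term in doc_terms for term in tokenize(query))


def match_phrase(doc, query):
    """True if the query terms occur in the document consecutively and in order."""
    doc_terms = tokenize(doc)
    query_terms = tokenize(query)
    n = len(query_terms)
    if n == 0:
        return False
    # Slide a window of n terms over the document and compare it to the query.
    return any(doc_terms[i:i + n] == query_terms
               for i in range(len(doc_terms) - n + 1))


doc = "the inverted index speeds up full text search"
print(match_all(doc, "search index"))       # both terms present, order ignored
print(match_phrase(doc, "search index"))    # terms are not adjacent in this order
print(match_phrase(doc, "inverted index"))  # terms are adjacent and in order
```

The phrase check depends on where each term sits in the text, while the unordered check only needs to know which terms are present. That is one plausible reading of why the doc says "true" needs more storage for the index.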
