Posted to commits@doris.apache.org by yi...@apache.org on 2023/06/25 11:13:47 UTC
[doris] branch master updated: [Improvement](doc) improve ngram and inverted index documents #21091
This is an automated email from the ASF dual-hosted git repository.
yiguolei pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris.git
The following commit(s) were added to refs/heads/master by this push:
new 69d5adaee3 [Improvement](doc) improve ngram and inverted index documents #21091
69d5adaee3 is described below
commit 69d5adaee3cb1e2595c586a9f984a481de239a1c
Author: Kang <kx...@gmail.com>
AuthorDate: Sun Jun 25 19:13:41 2023 +0800
[Improvement](doc) improve ngram and inverted index documents #21091
---
docs/en/docs/data-table/index/inverted-index.md | 12 +++++------
.../data-table/index/ngram-bloomfilter-index.md | 2 +-
docs/zh-CN/docs/data-table/index/inverted-index.md | 24 +++++++++++-----------
.../data-table/index/ngram-bloomfilter-index.md | 2 +-
4 files changed, 20 insertions(+), 20 deletions(-)
diff --git a/docs/en/docs/data-table/index/inverted-index.md b/docs/en/docs/data-table/index/inverted-index.md
index 57216d8ad4..331fa90491 100644
--- a/docs/en/docs/data-table/index/inverted-index.md
+++ b/docs/en/docs/data-table/index/inverted-index.md
@@ -74,15 +74,15 @@ The features for inverted index is as follows:
- missing stands for no parser, the whole field is considered to be a term
- "english" stands for english parser
- "chinese" stands for chinese parser
- - "unicode" stands for mixed-type word segmentation suitable for situations with a mix of Chinese and English. It can segment email prefixes and suffixes, IP addresses, and mixed characters and numbers, and can also segment Chinese characters into 1-gram.
+ - "unicode" stands for multi-language mixed word segmentation, suitable for text that mixes Chinese and English. It can segment email prefixes and suffixes, IP addresses, and mixed letters and digits, and it segments Chinese text character by character.
- "parser_mode" is utilized to set the tokenizer/parser type for Chinese word segmentation.
- - in "fine_grained" mode, the system will meticulously tokenize each possible segment.
- - in "coarse_grained" mode, the system follows the maximization principle, performing accurate and comprehensive tokenization.
+ - in "fine_grained" mode, the system tends to generate short words, e.g. 6 words '武汉' '武汉市' '市长' '长江' '长江大桥' '大桥' for '武汉长江大桥'.
+ - in "coarse_grained" mode, the system tends to generate long words, e.g. 2 words '武汉市' '长江大桥' for '武汉长江大桥'.
- default mode is "coarse_grained".
- - "support_phrase" is utilized to specify if the index requires support for phrase mode.
- - "true" indicates that support is needed.
- - "false" indicates that support is not needed.
+ - "support_phrase" specifies whether the index needs to support the phrase query MATCH_PHRASE.
+ - "true" indicates that support is needed; the index requires more storage.
+ - "false" indicates that support is not needed; the index requires less storage. MATCH_ALL can still be used to match multiple words regardless of order.
- default mode is "false".
- COMMENT is optional
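For illustration only, and not part of this commit: a minimal sketch of how the "parser", "parser_mode" and "support_phrase" properties described in the hunk above might be combined. The table name `demo_docs`, the index name `idx_content`, the column names, and the bucket and replication settings are assumptions chosen for the example.

```sql
-- Hypothetical table; all names and distribution settings are assumptions.
CREATE TABLE demo_docs
(
    `id` BIGINT,
    `content` STRING,
    -- Chinese parser in fine-grained mode, with phrase support enabled
    -- (per the text above, this costs more index storage).
    INDEX idx_content (`content`) USING INVERTED
        PROPERTIES("parser" = "chinese", "parser_mode" = "fine_grained", "support_phrase" = "true")
        COMMENT 'inverted index on content'
)
DUPLICATE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 1
PROPERTIES ("replication_num" = "1");

-- Phrase query: the case "support_phrase" = "true" is meant for.
SELECT count() FROM demo_docs WHERE `content` MATCH_PHRASE '长江大桥';

-- Multi-keyword query that ignores word order; does not need phrase support.
SELECT count() FROM demo_docs WHERE `content` MATCH_ALL '武汉 大桥';
```

If phrase queries are not needed, leaving "support_phrase" at its default of "false" should keep the index smaller, and the MATCH_ALL query remains available.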
diff --git a/docs/en/docs/data-table/index/ngram-bloomfilter-index.md b/docs/en/docs/data-table/index/ngram-bloomfilter-index.md
index e3e04eb315..d804c28b7e 100644
--- a/docs/en/docs/data-table/index/ngram-bloomfilter-index.md
+++ b/docs/en/docs/data-table/index/ngram-bloomfilter-index.md
@@ -29,7 +29,7 @@ under the License.
<version since="2.0.0">
</version>
-In order to improve the like query performance, the NGram BloomFilter index was implemented, which referenced to the ClickHouse's ngrambf skip indices;
+In order to improve the like query performance, the NGram BloomFilter index was implemented.
## Create Column With NGram BloomFilter Index
diff --git a/docs/zh-CN/docs/data-table/index/inverted-index.md b/docs/zh-CN/docs/data-table/index/inverted-index.md
index 3ac4992519..15f7485d8e 100644
--- a/docs/zh-CN/docs/data-table/index/inverted-index.md
+++ b/docs/zh-CN/docs/data-table/index/inverted-index.md
@@ -52,7 +52,7 @@ Doris倒排索引的功能简要介绍如下:
- 增加了字符串类型的全文检索
- 支持字符串全文检索,包括同时匹配多个关键字MATCH_ALL、匹配任意一个关键字MATCH_ANY、匹配短语词组MATCH_PHRASE
- 支持字符串数组类型的全文检索
- - 支持英文、中文以及混合类型分词
+ - 支持英文、中文以及Unicode多语言分词
- 加速普通等值、范围查询,覆盖bitmap索引的功能,未来会代替bitmap索引
- 支持字符串、数值、日期时间类型的 =, !=, >, >=, <, <= 快速过滤
- 支持字符串、数字、日期时间数组类型的 =, !=, >, >=, <, <=
@@ -72,16 +72,16 @@ Doris倒排索引的功能简要介绍如下:
- parser指定分词器
- 默认不指定代表不分词
- english是英文分词,适合被索引列是英文的情况,用空格和标点符号分词,性能高
- - chinese是中文分词,适合被索引列有中文或者中英文混合的情况,性能比english分词低
- - unicode是混合类型分词,适用于中英文混合的情况。它能够对邮箱前缀和后缀、IP地址以及字符数字混合进行分词,并且可以对中文字符进行1-gram分词。
- - parser_mode用于指定中文分词的模式
- - fine_grained模式,系统将对可以进行分词的部分都进行详尽的分词处理
- - coarse_grained模式,系统则依据最大化原则,执行精确且全面的分词操作
- - 默认coarse_grained模式
- - support_phrase用于指定索引是否需要支持短语模式
- - true为需要
- - false为不需要
- - 默认false不需要
+ - chinese是中文分词,适合被索引列主要是中文的情况,性能比english分词低
+ - unicode是多语言混合类型分词,适用于中英文混合、多语言混合的情况。它能够对邮箱前缀和后缀、IP地址以及字符数字混合进行分词,并且可以对中文按字符分词。
+ - parser_mode用于指定分词的模式,目前parser = chinese时支持如下几种模式:
+ - fine_grained:细粒度模式,倾向于分出比较短的词,比如 '武汉长江大桥' 会分成 '武汉', '武汉市', '市长', '长江', '长江大桥', '大桥' 6个词
+ - coarse_grained:粗粒度模式,倾向于分出比较长的词,比如 '武汉长江大桥' 会分成 '武汉市' '长江大桥' 2个词
+ - 默认coarse_grained
+ - support_phrase用于指定索引是否支持MATCH_PHRASE短语查询加速
+ - true为支持,但是索引需要更多的存储空间
+ - false为不支持,更省存储空间,可以用MATCH_ALL查询多个关键字
+ - 默认false
- COMMENT 是可选的,用于指定注释
```sql
@@ -150,7 +150,7 @@ USE test_inverted_index;
-- 创建表的同时创建了comment的倒排索引idx_comment
-- USING INVERTED 指定索引类型是倒排索引
--- PROPERTIES("parser" = "english") 指定采用english分词,还支持"chinese"中文分词和"unicode"中英文混合分词,如果不指定"parser"参数表示不分词
+-- PROPERTIES("parser" = "english") 指定采用english分词,还支持"chinese"中文分词和"unicode"中英文多语言混合分词,如果不指定"parser"参数表示不分词
CREATE TABLE hackernews_1m
(
`id` BIGINT,
diff --git a/docs/zh-CN/docs/data-table/index/ngram-bloomfilter-index.md b/docs/zh-CN/docs/data-table/index/ngram-bloomfilter-index.md
index ea6304253a..27e2b23592 100644
--- a/docs/zh-CN/docs/data-table/index/ngram-bloomfilter-index.md
+++ b/docs/zh-CN/docs/data-table/index/ngram-bloomfilter-index.md
@@ -29,7 +29,7 @@ under the License.
<version since="2.0.0">
</version>
-为了提升like的查询性能,增加了NGram BloomFilter索引,其实现主要参照了ClickHouse的ngrambf。
+为了提升like的查询性能,增加了NGram BloomFilter索引。
## NGram BloomFilter创建