Posted to commits@doris.apache.org by "ZhangYu0123 (via GitHub)" <gi...@apache.org> on 2023/04/24 16:05:10 UTC
[GitHub] [doris] ZhangYu0123 opened a new pull request, #19021: [Feature](index) support token_bf index for token search
ZhangYu0123 opened a new pull request, #19021:
URL: https://github.com/apache/doris/pull/19021
# Proposed changes
**Support token_bf index for token search:**
1. The token_bf index is mainly intended to speed up exact token search over English text. It splits each sentence into tokens at every character that is neither a letter nor a digit, and builds a bloom filter over those tokens. It can accelerate searches that use like, not like, startsWith, in, not in, and endsWith. This PR only supports like.
2. vs. the ngram_bf index, on English text:
(1) The token_bf index is roughly 100% faster (see the test below).
(2) It does not need a gram_size parameter.
3. vs. the inverted index:
The token_bf index is case sensitive.
4. Limitation:
In a like '%xxx%' query, the token_bf index is not used, because the bloom filter records whole tokens and cannot match part of a token. Use like '% xxx %' or the hastoken(xxx) function instead.
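The splitting and lookup described above can be sketched roughly as follows. This is a minimal Python illustration, not the PR's C++ implementation: the names `split_tokens`, `TinyBloom`, and `whole_tokens_in_like` are hypothetical, the salted blake2b hashing is an assumption, only ASCII letters and digits are treated as token characters, and LIKE escape sequences are ignored.

```python
import hashlib
import re


def split_tokens(text):
    """Split on every run of characters that are not ASCII letters or digits."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]


class TinyBloom:
    """Toy bloom filter; bit count and hash scheme are illustrative assumptions."""

    def __init__(self, num_bits=1024 * 8, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, token):
        # One salted hash per probe; tokens are hashed as-is, so lookups are case sensitive.
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(token.encode(), salt=bytes([i])).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, token):
        for pos in self._positions(token):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, token):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(token))


def whole_tokens_in_like(pattern):
    """Return the tokens of a LIKE pattern that are guaranteed to be complete.

    A token that touches a wildcard (% or _) may be only a fragment of a stored
    token, so it cannot be looked up in a filter built from whole tokens.
    """
    tokens = []
    for match in re.finditer(r"[A-Za-z0-9]+", pattern):
        before = pattern[match.start() - 1] if match.start() > 0 else ""
        after = pattern[match.end()] if match.end() < len(pattern) else ""
        if before not in ("%", "_") and after not in ("%", "_"):
            tokens.append(match.group())
    return tokens


bloom = TinyBloom()
for token in split_tokens("apache doris index"):
    bloom.add(token)

print(whole_tokens_in_like("%doris%"))    # no whole token, so the filter cannot help
print(whole_tokens_in_like("% doris %"))  # "doris" is bounded by spaces on both sides
```

In this sketch, a data block can be skipped only when some whole token from the pattern is reported as definitely absent by `might_contain`; a pattern that yields no whole tokens falls back to a normal scan.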
**Test:**
20 million rows (2kw) of data, BUCKETS 1
```
CREATE TABLE IF NOT EXISTS hits_url4 (
    UserID int,
    url text DEFAULT '',
    url_ngram3 text DEFAULT '',
    url_ngram6 text DEFAULT '',
    url_token text DEFAULT '',
    url_inverted text DEFAULT '',
    INDEX idx_ngrambf (`url_ngram3`) USING NGRAM_BF PROPERTIES("gram_size"="3", "bf_size"="1024") COMMENT 'url_ngram ngram_bf index',
    INDEX idx_ngrambf2 (`url_ngram6`) USING NGRAM_BF PROPERTIES("gram_size"="6", "bf_size"="1024") COMMENT 'url_ngram ngram_bf index',
    INDEX url_token (`url_token`) USING TOKEN_BF PROPERTIES("bf_size"="1024") COMMENT 'url_token_bf index',
    INDEX idx_inverted (`url_inverted`) USING INVERTED PROPERTIES("parser"="english") COMMENT 'url_inverted index'
)
DUPLICATE KEY(UserID)
DISTRIBUTED BY HASH(UserID) BUCKETS 1
PROPERTIES("replication_num" = "1");
```
| index type | query time | speedup vs. no index |
|--------|--------|--------|
| none | 0.76s <img width="618" alt="image" src="https://user-images.githubusercontent.com/67053339/233016348-dca7b81d-1ff8-4fb2-811a-02c09d7f8ce3.png"> | - |
| ngram_bf gram=6 | 0.56s <img width="656" alt="image" src="https://user-images.githubusercontent.com/67053339/233034418-9b304548-b1c4-429d-8321-ef8c56fdc8f1.png"> | 36% |
| ngram_bf gram=3 | 0.17s <img width="666" alt="image" src="https://user-images.githubusercontent.com/67053339/233015812-a425c8b5-cfd2-48b1-9f32-0cbe0bc34409.png"> | 347% |
| token_bf | 0.08s <img width="667" alt="image" src="https://user-images.githubusercontent.com/67053339/233014026-8f969ecf-b2ba-4c8f-9c7e-381a434a5bc6.png"> | 850% |
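The percentages in the last column of the table appear to be computed as the no-index time divided by the indexed time, minus one. That formula is inferred from the numbers, not stated in the PR. A short check under that assumption (the function name `speedup_percent` is hypothetical):

```python
def speedup_percent(baseline_s, indexed_s):
    """Relative speedup over the baseline in percent: (baseline / indexed - 1) * 100."""
    return (baseline_s / indexed_s - 1.0) * 100.0


baseline = 0.76  # seconds, the "none" row
timings = {"ngram_bf gram=6": 0.56, "ngram_bf gram=3": 0.17, "token_bf": 0.08}
for name, seconds in timings.items():
    print(f"{name}: {speedup_percent(baseline, seconds):.0f}%")
```

With the same formula, token_bf against ngram_bf with gram_size 3 (0.17s vs. 0.08s) comes out at roughly 112%, which is in line with the "100%" figure in the description.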
Issue Number: close #xxx
## Checklist(Required)
* [ ] Does it affect the original behavior
* [ ] Has unit tests been added
* [ ] Has document been added or modified
* [ ] Does it need to update dependencies
* [ ] Is this PR support rollback (If NO, please explain WHY)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] ZhangYu0123 commented on pull request #19021: [Feature](index) support token_bf index for token search
Posted by "ZhangYu0123 (via GitHub)" <gi...@apache.org>.
ZhangYu0123 commented on PR #19021:
URL: https://github.com/apache/doris/pull/19021#issuecomment-1524756553
run buildall
[GitHub] [doris] hello-stephen commented on pull request #19021: [Feature](index) support token_bf index for token search
Posted by "hello-stephen (via GitHub)" <gi...@apache.org>.
hello-stephen commented on PR #19021:
URL: https://github.com/apache/doris/pull/19021#issuecomment-1520525811
TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 33.74 seconds
stream load tsv: 428 seconds loaded 74807831229 Bytes, about 166 MB/s
stream load json: 24 seconds loaded 2358488459 Bytes, about 93 MB/s
stream load orc: 60 seconds loaded 1101869774 Bytes, about 17 MB/s
stream load parquet: 30 seconds loaded 861443392 Bytes, about 27 MB/s
https://doris-community-test-1308700295.cos.ap-hongkong.myqcloud.com/tmp/20230424170038_clickbench_pr_134221.html
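The stream load rates above are consistent with bytes divided by seconds divided by 2^20, truncated to an integer. This is inferred from the reported figures, not taken from the pipeline's source, and the function name `load_rate_mib_per_s` is hypothetical:

```python
def load_rate_mib_per_s(total_bytes, seconds):
    """Throughput in whole MiB/s (1 MiB = 2**20 bytes), truncated rather than rounded."""
    return total_bytes // seconds // 2**20


loads = {
    "tsv": (74807831229, 428),
    "json": (2358488459, 24),
    "orc": (1101869774, 60),
    "parquet": (861443392, 30),
}
for fmt, (size, secs) in loads.items():
    print(f"stream load {fmt}: about {load_rate_mib_per_s(size, secs)} MB/s")
```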
[GitHub] [doris] ZhangYu0123 commented on pull request #19021: [Feature](index) support token_bf index for token search
Posted by "ZhangYu0123 (via GitHub)" <gi...@apache.org>.
ZhangYu0123 commented on PR #19021:
URL: https://github.com/apache/doris/pull/19021#issuecomment-1521685727
run buildall
[GitHub] [doris] github-actions[bot] commented on pull request #19021: [Feature](index) support token_bf index for token search
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19021:
URL: https://github.com/apache/doris/pull/19021#issuecomment-1520473455
clang-tidy review says "All clean, LGTM! :+1:"
[GitHub] [doris] ZhangYu0123 commented on pull request #19021: [Feature](index) support token_bf index for token search
Posted by "ZhangYu0123 (via GitHub)" <gi...@apache.org>.
ZhangYu0123 commented on PR #19021:
URL: https://github.com/apache/doris/pull/19021#issuecomment-1520820145
run buildall
[GitHub] [doris] ZhangYu0123 commented on pull request #19021: [Feature](index) support token_bf index for token search
Posted by "ZhangYu0123 (via GitHub)" <gi...@apache.org>.
ZhangYu0123 commented on PR #19021:
URL: https://github.com/apache/doris/pull/19021#issuecomment-1527066400
run p1
[GitHub] [doris] ZhangYu0123 commented on pull request #19021: [Feature](index) support token_bf index for token search
Posted by "ZhangYu0123 (via GitHub)" <gi...@apache.org>.
ZhangYu0123 commented on PR #19021:
URL: https://github.com/apache/doris/pull/19021#issuecomment-1527053767
run p1
[GitHub] [doris] ZhangYu0123 commented on pull request #19021: [Feature](index) support token_bf index for token search
Posted by "ZhangYu0123 (via GitHub)" <gi...@apache.org>.
ZhangYu0123 commented on PR #19021:
URL: https://github.com/apache/doris/pull/19021#issuecomment-1520465043
run buildall
[GitHub] [doris] ZhangYu0123 commented on pull request #19021: [Feature](index) support token_bf index for token search
Posted by "ZhangYu0123 (via GitHub)" <gi...@apache.org>.
ZhangYu0123 commented on PR #19021:
URL: https://github.com/apache/doris/pull/19021#issuecomment-1520462449
run buildall
[GitHub] [doris] ZhangYu0123 commented on pull request #19021: [Feature](index) support token_bf index for token search
Posted by "ZhangYu0123 (via GitHub)" <gi...@apache.org>.
ZhangYu0123 commented on PR #19021:
URL: https://github.com/apache/doris/pull/19021#issuecomment-1521042656
run buildall
[GitHub] [doris] github-actions[bot] commented on a diff in pull request #19021: [Feature](index) support token_bf index for token search
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on code in PR #19021:
URL: https://github.com/apache/doris/pull/19021#discussion_r1175513933
##########
be/src/olap/itoken_extractor.h:
##########
@@ -93,6 +93,19 @@ struct NgramTokenExtractor final : public ITokenExtractorHelper<NgramTokenExtrac
private:
size_t n;
};
+
+/// Parser extracting all splits from string.
+struct SplitTokenExtractor final : public ITokenExtractorHelper<SplitTokenExtractor> {
+public:
+ SplitTokenExtractor() {}
Review Comment:
warning: use '= default' to define a trivial default constructor [modernize-use-equals-default]
```suggestion
SplitTokenExtractor() = default;
```
[GitHub] [doris] ZhangYu0123 commented on pull request #19021: [Feature](index) support token_bf index for token search
Posted by "ZhangYu0123 (via GitHub)" <gi...@apache.org>.
ZhangYu0123 commented on PR #19021:
URL: https://github.com/apache/doris/pull/19021#issuecomment-1521336748
run p0
[GitHub] [doris] ZhangYu0123 commented on pull request #19021: [Feature](index) support token_bf index for token search
Posted by "ZhangYu0123 (via GitHub)" <gi...@apache.org>.
ZhangYu0123 commented on PR #19021:
URL: https://github.com/apache/doris/pull/19021#issuecomment-1522508551
run buildall
[GitHub] [doris] ZhangYu0123 commented on pull request #19021: [Feature](index) support token_bf index for token search
Posted by "ZhangYu0123 (via GitHub)" <gi...@apache.org>.
ZhangYu0123 commented on PR #19021:
URL: https://github.com/apache/doris/pull/19021#issuecomment-1521192435
run p0
[GitHub] [doris] ZhangYu0123 commented on pull request #19021: [Feature](index) support token_bf index for token search
Posted by "ZhangYu0123 (via GitHub)" <gi...@apache.org>.
ZhangYu0123 commented on PR #19021:
URL: https://github.com/apache/doris/pull/19021#issuecomment-1521025368
run p0
[GitHub] [doris] ZhangYu0123 commented on pull request #19021: [Feature](index) support token_bf index for token search
Posted by "ZhangYu0123 (via GitHub)" <gi...@apache.org>.
ZhangYu0123 commented on PR #19021:
URL: https://github.com/apache/doris/pull/19021#issuecomment-1527145002
run p1
[GitHub] [doris] ZhangYu0123 commented on pull request #19021: [Feature](index) support token_bf index for token search
Posted by "ZhangYu0123 (via GitHub)" <gi...@apache.org>.
ZhangYu0123 commented on PR #19021:
URL: https://github.com/apache/doris/pull/19021#issuecomment-1526906134
run buildall
[GitHub] [doris] ZhangYu0123 commented on pull request #19021: [Feature](index) support token_bf index for token search
Posted by "ZhangYu0123 (via GitHub)" <gi...@apache.org>.
ZhangYu0123 commented on PR #19021:
URL: https://github.com/apache/doris/pull/19021#issuecomment-1525881139
run p0
[GitHub] [doris] ZhangYu0123 commented on pull request #19021: [Feature](index) support token_bf index for token search
Posted by "ZhangYu0123 (via GitHub)" <gi...@apache.org>.
ZhangYu0123 commented on PR #19021:
URL: https://github.com/apache/doris/pull/19021#issuecomment-1526510299
run p1
[GitHub] [doris] ZhangYu0123 commented on pull request #19021: [Feature](index) support token_bf index for token search
Posted by "ZhangYu0123 (via GitHub)" <gi...@apache.org>.
ZhangYu0123 commented on PR #19021:
URL: https://github.com/apache/doris/pull/19021#issuecomment-1523517890
run buildall
[GitHub] [doris] ZhangYu0123 commented on pull request #19021: [Feature](index) support token_bf index for token search
Posted by "ZhangYu0123 (via GitHub)" <gi...@apache.org>.
ZhangYu0123 commented on PR #19021:
URL: https://github.com/apache/doris/pull/19021#issuecomment-1527019675
run p1
Re: [PR] [Feature](index) support token_bf index for token search [doris]
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19021:
URL: https://github.com/apache/doris/pull/19021#issuecomment-1780221841
We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and feel free to ask a maintainer to remove the Stale tag!
Re: [PR] [Feature](index) support token_bf index for token search [doris]
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed pull request #19021: [Feature](index) support token_bf index for token search
URL: https://github.com/apache/doris/pull/19021