Posted to commits@doris.apache.org by "ZhangYu0123 (via GitHub)" <gi...@apache.org> on 2023/04/24 16:05:10 UTC

[GitHub] [doris] ZhangYu0123 opened a new pull request, #19021: [Feature](index) support token_bf index for token search

ZhangYu0123 opened a new pull request, #19021:
URL: https://github.com/apache/doris/pull/19021

   # Proposed changes
   **Support token_bf index for token search:** 
   
   1. The token_bf index is mainly intended to speed up exact token search in English text. It splits each string into tokens at every character that is neither a letter nor a digit and builds a bloom filter over the tokens. It can accelerate predicates such as like, not like, startsWith, endsWith, in, and not in. This PR only supports like.
   2. Compared with the ngram_bf index, on English text:
      (1) The token_bf index is about 100% faster.
      (2) It does not need a gram_size parameter.
   3. Compared with the inverted index:
       The token_bf index is case sensitive.
   4. Limitation:
    The token_bf index is not used for like '%xxx%', because the bloom filter records whole tokens and cannot match part of a token. Use like '% xxx %' or the hastoken(xxx) function instead.
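
The splitting and filtering described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `split_tokens` and `ToyBloomFilter` are hypothetical names, the filter uses a fixed 1024 bits with two probes derived from `std::hash`, and only ASCII letters and digits are treated as token characters.

```cpp
#include <bitset>
#include <cctype>
#include <cstddef>
#include <functional>
#include <string>
#include <string_view>
#include <vector>

// Split text into tokens at every character that is not an ASCII letter or digit.
// (Hypothetical helper for illustration only.)
std::vector<std::string> split_tokens(std::string_view text) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char c : text) {
        if (std::isalnum(c)) {
            current.push_back(static_cast<char>(c));
        } else if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) {
        tokens.push_back(current);
    }
    return tokens;
}

// Toy bloom filter over whole tokens (assumed sizes, not the real index format).
class ToyBloomFilter {
public:
    void add(const std::string& token) {
        bits_.set(pos1(token));
        bits_.set(pos2(token));
    }

    // false means "definitely absent"; true means "possibly present".
    bool may_contain(const std::string& token) const {
        return bits_.test(pos1(token)) && bits_.test(pos2(token));
    }

private:
    static constexpr std::size_t kBits = 1024;

    static std::size_t pos1(const std::string& t) {
        return std::hash<std::string>{}(t) % kBits;
    }
    // Second probe: hash of the token with a suffix appended.
    static std::size_t pos2(const std::string& t) {
        return std::hash<std::string>{}(t + "#") % kBits;
    }

    std::bitset<kBits> bits_;
};
```

With this sketch, `split_tokens("http://www.example.com/a_b?id=42")` yields the tokens `http`, `www`, `example`, `com`, `a`, `b`, `id`, and `42`. A filter built from them always answers "possibly present" for `example`, but only whole tokens were hashed, so a lookup for a fragment such as `exam` tells the engine nothing reliable. That is why a pattern like '%xxx%' cannot be pruned this way.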
   
    **Test:**
   20 million rows (2kw), BUCKETS 1
   ```
   CREATE TABLE IF NOT EXISTS hits_url4 (
       UserID int,
       url text DEFAULT '',
       url_ngram3 text DEFAULT '',
       url_ngram6 text DEFAULT '',
       url_token text DEFAULT '',
       url_inverted text DEFAULT '',
       INDEX idx_ngrambf (`url_ngram3`) USING NGRAM_BF PROPERTIES("gram_size"="3", "bf_size"="1024") COMMENT 'url_ngram ngram_bf index',
       INDEX idx_ngrambf2 (`url_ngram6`) USING NGRAM_BF PROPERTIES("gram_size"="6", "bf_size"="1024") COMMENT 'url_ngram ngram_bf index',
       INDEX url_token (`url_token`) USING TOKEN_BF PROPERTIES("bf_size"="1024") COMMENT 'url_token_bf index',
       INDEX idx_inverted (`url_inverted`) USING INVERTED PROPERTIES("parser"="english") COMMENT 'url_inverted index'
   )
   DUPLICATE KEY(UserID)
   DISTRIBUTED BY HASH(UserID) BUCKETS 1
   PROPERTIES("replication_num" = "1");
   ```
   
   | index type | query time | speedup vs. no index |
   |--------|--------|--------|
   | none | 0.76s <img width="618" alt="image" src="https://user-images.githubusercontent.com/67053339/233016348-dca7b81d-1ff8-4fb2-811a-02c09d7f8ce3.png"> | - | 
   | ngram_bf gram=6 | 0.56s <img width="656" alt="image" src="https://user-images.githubusercontent.com/67053339/233034418-9b304548-b1c4-429d-8321-ef8c56fdc8f1.png"> | 36% | 
   | ngram_bf gram=3 | 0.17s <img width="666" alt="image" src="https://user-images.githubusercontent.com/67053339/233015812-a425c8b5-cfd2-48b1-9f32-0cbe0bc34409.png"> | 347% | 
   | token_bf | 0.08s <img width="667" alt="image" src="https://user-images.githubusercontent.com/67053339/233014026-8f969ecf-b2ba-4c8f-9c7e-381a434a5bc6.png"> | 850% | 
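
The percentages in the last column appear to be relative to the run with no index. Assuming the formula is (baseline time / indexed time - 1) × 100, rounded to the nearest integer, the figures can be reproduced with a small helper. The name `speedup_percent` is hypothetical.

```cpp
#include <cmath>

// Percentage speedup of an indexed query relative to the unindexed baseline.
// Assumed formula: (baseline / indexed - 1) * 100, rounded to the nearest integer.
long speedup_percent(double baseline_seconds, double indexed_seconds) {
    return std::lround((baseline_seconds / indexed_seconds - 1.0) * 100.0);
}
```

With the 0.76s baseline, this gives 36 for 0.56s, 347 for 0.17s, and 850 for 0.08s, matching the table.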
   
   
   Issue Number: close #xxx
   
   ## Problem summary
   
   Describe your changes.
   
   ## Checklist(Required)
   
   * [ ] Does it affect the original behavior
   * [ ] Has unit tests been added
   * [ ] Has document been added or modified
   * [ ] Does it need to update dependencies
   * [ ] Is this PR support rollback (If NO, please explain WHY)
   
   ## Further comments
   
   If this is a relatively large or complex change, kick off the discussion at [dev@doris.apache.org](mailto:dev@doris.apache.org) by explaining why you chose the solution you did and what alternatives you considered, etc...
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] ZhangYu0123 commented on pull request #19021: [Feature](index) support token_bf index for token search

Posted by "ZhangYu0123 (via GitHub)" <gi...@apache.org>.
ZhangYu0123 commented on PR #19021:
URL: https://github.com/apache/doris/pull/19021#issuecomment-1524756553

   run buildall




[GitHub] [doris] hello-stephen commented on pull request #19021: [Feature](index) support token_bf index for token search

Posted by "hello-stephen (via GitHub)" <gi...@apache.org>.
hello-stephen commented on PR #19021:
URL: https://github.com/apache/doris/pull/19021#issuecomment-1520525811

   TeamCity pipeline, clickbench performance test result:
    the sum of best hot time: 33.74 seconds
    stream load tsv:          428 seconds loaded 74807831229 Bytes, about 166 MB/s
    stream load json:         24 seconds loaded 2358488459 Bytes, about 93 MB/s
    stream load orc:          60 seconds loaded 1101869774 Bytes, about 17 MB/s
    stream load parquet:          30 seconds loaded 861443392 Bytes, about 27 MB/s
    https://doris-community-test-1308700295.cos.ap-hongkong.myqcloud.com/tmp/20230424170038_clickbench_pr_134221.html




[GitHub] [doris] ZhangYu0123 commented on pull request #19021: [Feature](index) support token_bf index for token search

Posted by "ZhangYu0123 (via GitHub)" <gi...@apache.org>.
ZhangYu0123 commented on PR #19021:
URL: https://github.com/apache/doris/pull/19021#issuecomment-1521685727

   run buildall




[GitHub] [doris] github-actions[bot] commented on pull request #19021: [Feature](index) support token_bf index for token search

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19021:
URL: https://github.com/apache/doris/pull/19021#issuecomment-1520473455

   clang-tidy review says "All clean, LGTM! :+1:"




[GitHub] [doris] ZhangYu0123 commented on pull request #19021: [Feature](index) support token_bf index for token search

Posted by "ZhangYu0123 (via GitHub)" <gi...@apache.org>.
ZhangYu0123 commented on PR #19021:
URL: https://github.com/apache/doris/pull/19021#issuecomment-1520820145

   run buildall




[GitHub] [doris] ZhangYu0123 commented on pull request #19021: [Feature](index) support token_bf index for token search

Posted by "ZhangYu0123 (via GitHub)" <gi...@apache.org>.
ZhangYu0123 commented on PR #19021:
URL: https://github.com/apache/doris/pull/19021#issuecomment-1527066400

   run p1




[GitHub] [doris] ZhangYu0123 commented on pull request #19021: [Feature](index) support token_bf index for token search

Posted by "ZhangYu0123 (via GitHub)" <gi...@apache.org>.
ZhangYu0123 commented on PR #19021:
URL: https://github.com/apache/doris/pull/19021#issuecomment-1527053767

   run p1




[GitHub] [doris] ZhangYu0123 commented on pull request #19021: [Feature](index) support token_bf index for token search

Posted by "ZhangYu0123 (via GitHub)" <gi...@apache.org>.
ZhangYu0123 commented on PR #19021:
URL: https://github.com/apache/doris/pull/19021#issuecomment-1520465043

   run buildall




[GitHub] [doris] ZhangYu0123 commented on pull request #19021: [Feature](index) support token_bf index for token search

Posted by "ZhangYu0123 (via GitHub)" <gi...@apache.org>.
ZhangYu0123 commented on PR #19021:
URL: https://github.com/apache/doris/pull/19021#issuecomment-1520462449

   run buildall




[GitHub] [doris] ZhangYu0123 commented on pull request #19021: [Feature](index) support token_bf index for token search

Posted by "ZhangYu0123 (via GitHub)" <gi...@apache.org>.
ZhangYu0123 commented on PR #19021:
URL: https://github.com/apache/doris/pull/19021#issuecomment-1521042656

   run buildall




[GitHub] [doris] github-actions[bot] commented on a diff in pull request #19021: [Feature](index) support token_bf index for token search

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on code in PR #19021:
URL: https://github.com/apache/doris/pull/19021#discussion_r1175513933


##########
be/src/olap/itoken_extractor.h:
##########
@@ -93,6 +93,19 @@ struct NgramTokenExtractor final : public ITokenExtractorHelper<NgramTokenExtrac
 private:
     size_t n;
 };
+
+/// Parser extracting all splits from string.
+struct SplitTokenExtractor final : public ITokenExtractorHelper<SplitTokenExtractor> {
+public:
+    SplitTokenExtractor() {}

Review Comment:
   warning: use '= default' to define a trivial default constructor [modernize-use-equals-default]
   
   ```suggestion
       SplitTokenExtractor() = default;
   ```
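
As a side note on why clang-tidy flags this: an empty-bodied constructor counts as user-provided, while `= default` on the first declaration does not. The sketch below uses two hypothetical stand-in structs, not the PR's class, to show the observable difference. For a class with virtual functions or non-trivial bases the change is likely to be mostly stylistic.

```cpp
#include <type_traits>

// Hypothetical stand-ins for illustration; not the PR's SplitTokenExtractor.
struct WithEmptyBody {
    WithEmptyBody() {}  // user-provided: type is not trivially default constructible
};

struct WithDefault {
    WithDefault() = default;  // not user-provided: type stays trivially default constructible
};

static_assert(!std::is_trivially_default_constructible<WithEmptyBody>::value,
              "an empty-bodied constructor is user-provided");
static_assert(std::is_trivially_default_constructible<WithDefault>::value,
              "a defaulted constructor keeps the type trivial");
```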
   





[GitHub] [doris] ZhangYu0123 commented on pull request #19021: [Feature](index) support token_bf index for token search

Posted by "ZhangYu0123 (via GitHub)" <gi...@apache.org>.
ZhangYu0123 commented on PR #19021:
URL: https://github.com/apache/doris/pull/19021#issuecomment-1521336748

   run p0




[GitHub] [doris] ZhangYu0123 commented on pull request #19021: [Feature](index) support token_bf index for token search

Posted by "ZhangYu0123 (via GitHub)" <gi...@apache.org>.
ZhangYu0123 commented on PR #19021:
URL: https://github.com/apache/doris/pull/19021#issuecomment-1522508551

   run buildall




[GitHub] [doris] ZhangYu0123 commented on pull request #19021: [Feature](index) support token_bf index for token search

Posted by "ZhangYu0123 (via GitHub)" <gi...@apache.org>.
ZhangYu0123 commented on PR #19021:
URL: https://github.com/apache/doris/pull/19021#issuecomment-1521192435

   run p0




[GitHub] [doris] ZhangYu0123 commented on pull request #19021: [Feature](index) support token_bf index for token search

Posted by "ZhangYu0123 (via GitHub)" <gi...@apache.org>.
ZhangYu0123 commented on PR #19021:
URL: https://github.com/apache/doris/pull/19021#issuecomment-1521025368

   run p0




[GitHub] [doris] ZhangYu0123 commented on pull request #19021: [Feature](index) support token_bf index for token search

Posted by "ZhangYu0123 (via GitHub)" <gi...@apache.org>.
ZhangYu0123 commented on PR #19021:
URL: https://github.com/apache/doris/pull/19021#issuecomment-1527145002

   run p1




[GitHub] [doris] ZhangYu0123 commented on pull request #19021: [Feature](index) support token_bf index for token search

Posted by "ZhangYu0123 (via GitHub)" <gi...@apache.org>.
ZhangYu0123 commented on PR #19021:
URL: https://github.com/apache/doris/pull/19021#issuecomment-1526906134

   run buildall




[GitHub] [doris] ZhangYu0123 commented on pull request #19021: [Feature](index) support token_bf index for token search

Posted by "ZhangYu0123 (via GitHub)" <gi...@apache.org>.
ZhangYu0123 commented on PR #19021:
URL: https://github.com/apache/doris/pull/19021#issuecomment-1525881139

   run p0




[GitHub] [doris] ZhangYu0123 commented on pull request #19021: [Feature](index) support token_bf index for token search

Posted by "ZhangYu0123 (via GitHub)" <gi...@apache.org>.
ZhangYu0123 commented on PR #19021:
URL: https://github.com/apache/doris/pull/19021#issuecomment-1526510299

   run p1




[GitHub] [doris] ZhangYu0123 commented on pull request #19021: [Feature](index) support token_bf index for token search

Posted by "ZhangYu0123 (via GitHub)" <gi...@apache.org>.
ZhangYu0123 commented on PR #19021:
URL: https://github.com/apache/doris/pull/19021#issuecomment-1523517890

   run buildall




[GitHub] [doris] ZhangYu0123 commented on pull request #19021: [Feature](index) support token_bf index for token search

Posted by "ZhangYu0123 (via GitHub)" <gi...@apache.org>.
ZhangYu0123 commented on PR #19021:
URL: https://github.com/apache/doris/pull/19021#issuecomment-1527019675

   run p1




Re: [PR] [Feature](index) support token_bf index for token search [doris]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19021:
URL: https://github.com/apache/doris/pull/19021#issuecomment-1780221841

   We're closing this PR because it hasn't been updated in a while.
   This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and feel free to ask a maintainer to remove the Stale tag!




Re: [PR] [Feature](index) support token_bf index for token search [doris]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed pull request #19021: [Feature](index) support token_bf index for token search
URL: https://github.com/apache/doris/pull/19021

