Posted to commits@doris.apache.org by "mrhhsg (via GitHub)" <gi...@apache.org> on 2023/06/27 07:36:33 UTC

[GitHub] [doris] mrhhsg opened a new pull request, #21239: [improvement](olap) filter the whole segment by dictionary

mrhhsg opened a new pull request, #21239:
URL: https://github.com/apache/doris/pull/21239

   ## Proposed changes
   
   If a column in a segment is dictionary-encoded, the dictionary alone can show whether any value in that segment can satisfy a predicate. When none can, the entire segment is filtered out, which can avoid reading a large amount of data.
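
As a minimal sketch of the idea for an equality predicate: the function below scans only the dictionary and reports whether the segment could contain a matching row. `segment_may_match_eq` and the use of `std::string_view` in place of Doris's `StringRef` are assumptions made for illustration, not the code in this PR. The sketch also assumes every data page of the column is dictionary-encoded, since otherwise the dictionary would not cover all values.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper, not the Doris API.
// Returns true when at least one dictionary word equals `value`, meaning the
// segment may contain matching rows. Returns false when no word matches, so
// the whole segment can be skipped without reading its data pages.
bool segment_may_match_eq(const std::string_view* dict_words, std::size_t count,
                          std::string_view value) {
    assert(dict_words != nullptr || count == 0);
    for (std::size_t i = 0; i < count; ++i) {
        if (dict_words[i] == value) {
            return true;
        }
    }
    return false;
}
```

The cost is proportional to the dictionary size rather than the row count, which is where the saving would come from on large segments with few distinct values.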
   
   ## Further comments
   
   If this is a relatively large or complex change, kick off the discussion at [dev@doris.apache.org](mailto:dev@doris.apache.org) by explaining why you chose the solution you did and what alternatives you considered, etc...
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] hello-stephen commented on pull request #21239: [improvement](olap) filter the whole segment by dictionary

Posted by "hello-stephen (via GitHub)" <gi...@apache.org>.
hello-stephen commented on PR #21239:
URL: https://github.com/apache/doris/pull/21239#issuecomment-1609069054

   TeamCity pipeline, clickbench performance test result:
    the sum of best hot time: 42.56 seconds
    stream load tsv:          455 seconds loaded 74807831229 Bytes, about 156 MB/s
    stream load json:         22 seconds loaded 2358488459 Bytes, about 102 MB/s
    stream load orc:          58 seconds loaded 1101869774 Bytes, about 18 MB/s
    stream load parquet:          29 seconds loaded 861443392 Bytes, about 28 MB/s
    insert into select:          69.6 seconds inserted 10000000 Rows, about 143K ops/s
    https://doris-community-test-1308700295.cos.ap-hongkong.myqcloud.com/tmp/20230627084918_clickbench_pr_168603.html




[GitHub] [doris] mrhhsg commented on pull request #21239: [improvement](olap) filter the whole segment by dictionary

Posted by "mrhhsg (via GitHub)" <gi...@apache.org>.
mrhhsg commented on PR #21239:
URL: https://github.com/apache/doris/pull/21239#issuecomment-1609167673

   run buildall




[GitHub] [doris] yiguolei commented on a diff in pull request #21239: [improvement](olap) filter the whole segment by dictionary

Posted by "yiguolei (via GitHub)" <gi...@apache.org>.
yiguolei commented on code in PR #21239:
URL: https://github.com/apache/doris/pull/21239#discussion_r1245069734


##########
be/src/olap/in_list_predicate.h:
##########
@@ -346,6 +346,17 @@ class InListPredicateBase : public ColumnPredicate {
         }
     }
 
+    bool evaluate_and(const StringRef* dict_words, const size_t count) const override {

Review Comment:
   Consider this scenario:
   1. The dict is [1,2,3].
   2. The query condition is `where a not in [1]`.
   3. opposite = false
   Then found = true and PT != IN LIST, so the return value is false.
   That is wrong: the values 2 and 3 satisfy the condition.
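
One way to express the semantics the reviewer is asking for, as a hedged sketch rather than the fix that was merged: for IN, keep the segment if any dictionary word is in the list; for NOT IN, keep it if any dictionary word is absent from the list. The name `segment_may_match_in_list`, the `is_not_in` flag, and `std::string_view` standing in for `StringRef` are all assumptions, and the `opposite` flag from the comment is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <unordered_set>

// Hypothetical helper, not the Doris API.
// Returns true when the segment may contain a row satisfying the predicate.
bool segment_may_match_in_list(const std::string_view* dict_words, std::size_t count,
                               const std::unordered_set<std::string_view>& values,
                               bool is_not_in) {
    assert(dict_words != nullptr || count == 0);
    for (std::size_t i = 0; i < count; ++i) {
        const bool found = values.count(dict_words[i]) > 0;
        // IN: a word found in the list is a possible match.
        // NOT IN: a word missing from the list is a possible match.
        if (found != is_not_in) {
            return true;
        }
    }
    return false;
}
```

With the dict [1,2,3] and `not in [1]`, the loop returns true on the word 2, so the segment is kept.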





[GitHub] [doris] github-actions[bot] commented on pull request #21239: [improvement](olap) filter the whole segment by dictionary

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #21239:
URL: https://github.com/apache/doris/pull/21239#issuecomment-1612286037

   PR approved by anyone and no changes requested.




[GitHub] [doris] yiguolei commented on a diff in pull request #21239: [improvement](olap) filter the whole segment by dictionary

Posted by "yiguolei (via GitHub)" <gi...@apache.org>.
yiguolei commented on code in PR #21239:
URL: https://github.com/apache/doris/pull/21239#discussion_r1244817687


##########
be/src/olap/rowset/segment_v2/segment_iterator.cpp:
##########
@@ -490,6 +490,21 @@ Status SegmentIterator::_get_row_ranges_from_conditions(RowRanges* condition_row
     RowRanges::ranges_intersection(*condition_row_ranges, zone_map_row_ranges,
                                    condition_row_ranges);
     _opts.stats->rows_stats_filtered += (pre_size - condition_row_ranges->count());
+
+    RowRanges dict_row_ranges = RowRanges::create_single(num_rows());

Review Comment:
   Add a check here so that this path is taken only for read queries and not for compaction, in case our logic has a bug and corrupts the data.
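
A guard of the kind requested might look like the sketch below. `ReaderKind` and `should_apply_dict_filter` are hypothetical names introduced for illustration; they are not taken from the Doris source.

```cpp
#include <cassert>

// Hypothetical reader kinds for illustration only.
enum class ReaderKind { Query, Compaction, SchemaChange };

// Segment-level dictionary filtering is applied to read queries only.
// Compaction rewrites stored data, so a bug in the filter on that path could
// drop rows permanently, while a bug on the query path affects one result.
bool should_apply_dict_filter(ReaderKind kind) {
    return kind == ReaderKind::Query;
}
```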





[GitHub] [doris] mrhhsg commented on pull request #21239: [improvement](olap) filter the whole segment by dictionary

Posted by "mrhhsg (via GitHub)" <gi...@apache.org>.
mrhhsg commented on PR #21239:
URL: https://github.com/apache/doris/pull/21239#issuecomment-1611024824

   run buildall




[GitHub] [doris] github-actions[bot] commented on pull request #21239: [improvement](olap) filter the whole segment by dictionary

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #21239:
URL: https://github.com/apache/doris/pull/21239#issuecomment-1612286013

   PR approved by at least one committer and no changes requested.






[GitHub] [doris] hello-stephen commented on pull request #21239: [improvement](olap) filter the whole segment by dictionary

Posted by "hello-stephen (via GitHub)" <gi...@apache.org>.
hello-stephen commented on PR #21239:
URL: https://github.com/apache/doris/pull/21239#issuecomment-1610562276

   TeamCity pipeline, clickbench performance test result:
    the sum of best hot time: 40.48 seconds
    stream load tsv:          461 seconds loaded 74807831229 Bytes, about 154 MB/s
    stream load json:         20 seconds loaded 2358488459 Bytes, about 112 MB/s
    stream load orc:          57 seconds loaded 1101869774 Bytes, about 18 MB/s
    stream load parquet:          29 seconds loaded 861443392 Bytes, about 28 MB/s
    insert into select:          69.6 seconds inserted 10000000 Rows, about 143K ops/s
    https://doris-community-test-1308700295.cos.ap-hongkong.myqcloud.com/tmp/20230628023234_clickbench_pr_169031.html




[GitHub] [doris] github-actions[bot] commented on pull request #21239: [improvement](olap) filter the whole segment by dictionary

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #21239:
URL: https://github.com/apache/doris/pull/21239#issuecomment-1608965936

   clang-tidy review says "All clean, LGTM! :+1:"




[GitHub] [doris] yiguolei commented on a diff in pull request #21239: [improvement](olap) filter the whole segment by dictionary

Posted by "yiguolei (via GitHub)" <gi...@apache.org>.
yiguolei commented on code in PR #21239:
URL: https://github.com/apache/doris/pull/21239#discussion_r1243497498


##########
be/src/olap/block_column_predicate.h:
##########
@@ -81,6 +81,12 @@ class BlockColumnPredicate {
         LOG(FATAL) << "should not reach here";
         return true;
     }
+
+    virtual bool evaluate_and(const StringRef* dict_words, const size_t dict_num) const {

Review Comment:
   Why not pass in `std::vector<StringRef>&`?





[GitHub] [doris] github-actions[bot] commented on pull request #21239: [improvement](olap) filter the whole segment by dictionary

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #21239:
URL: https://github.com/apache/doris/pull/21239#issuecomment-1611032316

   clang-tidy review says "All clean, LGTM! :+1:"




[GitHub] [doris] github-actions[bot] commented on pull request #21239: [improvement](olap) filter the whole segment by dictionary

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #21239:
URL: https://github.com/apache/doris/pull/21239#issuecomment-1609179230

   clang-tidy review says "All clean, LGTM! :+1:"




[GitHub] [doris] mrhhsg commented on pull request #21239: [improvement](olap) filter the whole segment by dictionary

Posted by "mrhhsg (via GitHub)" <gi...@apache.org>.
mrhhsg commented on PR #21239:
URL: https://github.com/apache/doris/pull/21239#issuecomment-1608968178

   run buildall




[GitHub] [doris] yiguolei commented on a diff in pull request #21239: [improvement](olap) filter the whole segment by dictionary

Posted by "yiguolei (via GitHub)" <gi...@apache.org>.
yiguolei commented on code in PR #21239:
URL: https://github.com/apache/doris/pull/21239#discussion_r1244825881


##########
be/src/olap/rowset/segment_v2/segment_iterator.cpp:
##########
@@ -490,6 +490,21 @@ Status SegmentIterator::_get_row_ranges_from_conditions(RowRanges* condition_row
     RowRanges::ranges_intersection(*condition_row_ranges, zone_map_row_ranges,
                                    condition_row_ranges);
     _opts.stats->rows_stats_filtered += (pre_size - condition_row_ranges->count());
+
+    RowRanges dict_row_ranges = RowRanges::create_single(num_rows());

Review Comment:
   Also add a comment explaining why, so that nobody changes it later.





[GitHub] [doris] mrhhsg commented on pull request #21239: [improvement](olap) filter the whole segment by dictionary

Posted by "mrhhsg (via GitHub)" <gi...@apache.org>.
mrhhsg commented on PR #21239:
URL: https://github.com/apache/doris/pull/21239#issuecomment-1611023121

   run buildall




[GitHub] [doris] mrhhsg commented on a diff in pull request #21239: [improvement](olap) filter the whole segment by dictionary

Posted by "mrhhsg (via GitHub)" <gi...@apache.org>.
mrhhsg commented on code in PR #21239:
URL: https://github.com/apache/doris/pull/21239#discussion_r1243847583


##########
be/src/olap/block_column_predicate.h:
##########
@@ -81,6 +81,12 @@ class BlockColumnPredicate {
         LOG(FATAL) << "should not reach here";
         return true;
     }
+
+    virtual bool evaluate_and(const StringRef* dict_words, const size_t dict_num) const {

Review Comment:
   Because `dict_words` comes from the `std::unique_ptr<StringRef[]> _dict_word_info` member of `FileColumnIterator`.
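
To illustrate the point about the parameter type: a buffer owned by `std::unique_ptr<T[]>` can be handed to a pointer-plus-count interface through `get()` with no copy, whereas a `std::vector<T>&` parameter would require building a vector first. The names below (`count_non_empty`, `make_words`) are hypothetical, and `std::string_view` stands in for `StringRef`.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string_view>

// Takes a raw pointer and a length, so any contiguous buffer can be passed.
std::size_t count_non_empty(const std::string_view* words, std::size_t count) {
    assert(words != nullptr || count == 0);
    std::size_t n = 0;
    for (std::size_t i = 0; i < count; ++i) {
        if (!words[i].empty()) {
            ++n;
        }
    }
    return n;
}

// Builds a small owned array, similar in shape to an iterator holding its
// dictionary words in a std::unique_ptr<T[]>.
std::unique_ptr<std::string_view[]> make_words() {
    auto words = std::make_unique<std::string_view[]>(3);
    words[0] = "x";
    words[1] = "";
    words[2] = "yz";
    return words;
}
```

A caller would write `count_non_empty(words.get(), 3)`; the array itself is never copied.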





[GitHub] [doris] yiguolei merged pull request #21239: [improvement](olap) filter the whole segment by dictionary

Posted by "yiguolei (via GitHub)" <gi...@apache.org>.
yiguolei merged PR #21239:
URL: https://github.com/apache/doris/pull/21239

