Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2022/06/06 08:23:35 UTC

[GitHub] [incubator-doris] englefly opened a new pull request, #9969: [enhance] improve dict in-predicate evaluate

englefly opened a new pull request, #9969:
URL: https://github.com/apache/incubator-doris/pull/9969

   # Proposed changes
   When a column is dict encoded, in_predicate::evaluate() puts the dict codes of the IN_LIST values into a set, and then compares the selected codes with the column cells. Because the selected dict codes are kept in a set, evaluating each cell takes O(logN).
   It is better to use a std::vector<bool> to indicate whether a dict word is in the IN_LIST, which shrinks the evaluation time per cell to O(1).
   
   test env: ssb_flat 5G
   test sql: simplified from SSB Q3.3
   SELECT count(c_city) FROM lineorder_flat WHERE C_CITY in ('UNITED KI1','UNITED KI5');
   
   ShortPredEvalTime decreases from 240ms to 170ms on average.
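
The idea described above can be sketched in a few lines. This is a minimal illustration and not the Doris implementation: `DictColumn`, `build_selected` and `count_in_list` are hypothetical names, and `std::unordered_set` stands in for the `phmap::flat_hash_set` that appears in the diff.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a dictionary-encoded column: `dict` holds the
// distinct words, `codes` holds one dictionary index per row.
struct DictColumn {
    std::vector<std::string> dict;
    std::vector<std::int32_t> codes;
};

// Build one flag per dictionary word: selected[code] is true when the word
// with that code appears in the IN list. This costs one set probe per
// dictionary word, paid once per evaluation rather than once per row.
std::vector<bool> build_selected(const DictColumn& col,
                                 const std::unordered_set<std::string>& in_list) {
    std::vector<bool> selected(col.dict.size(), false);
    for (std::size_t code = 0; code < col.dict.size(); ++code) {
        selected[code] = in_list.count(col.dict[code]) > 0;
    }
    return selected;
}

// Count the rows whose word is in the IN list. The per-row work is a single
// indexed read of `selected`.
std::size_t count_in_list(const DictColumn& col,
                          const std::unordered_set<std::string>& in_list) {
    const std::vector<bool> selected = build_selected(col, in_list);
    std::size_t matched = 0;
    for (std::int32_t code : col.codes) {
        // Debug-only bounds check; compiled out when NDEBUG is defined.
        assert(code >= 0 && static_cast<std::size_t>(code) < selected.size());
        if (selected[static_cast<std::size_t>(code)]) {
            ++matched;
        }
    }
    return matched;
}
```

The flag vector is sized by the dictionary, not by the IN list, so it stays small whenever the column has few distinct values.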
    
   
   
   
   
   Issue Number: close #xxx
   
   ## Problem Summary:
   
   Describe the overview of changes.
   
   ## Checklist(Required)
   
   1. Does it affect the original behavior: (Yes/No/I Don't know)
   2. Has unit tests been added: (Yes/No/No Need)
   3. Has document been added or modified: (Yes/No/No Need)
   4. Does it need to update dependencies: (Yes/No)
   5. Are there any changes that cannot be rolled back: (Yes/No)
   
   ## Further comments
   
   If this is a relatively large or complex change, kick off the discussion at [dev@doris.apache.org](mailto:dev@doris.apache.org) by explaining why you chose the solution you did and what alternatives you considered, etc...
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] englefly commented on a diff in pull request #9969: [enhance] improve dict in-predicate evaluate

Posted by GitBox <gi...@apache.org>.
englefly commented on code in PR #9969:
URL: https://github.com/apache/incubator-doris/pull/9969#discussion_r891877855


##########
be/src/olap/in_list_predicate.cpp:
##########
@@ -161,12 +162,14 @@ IN_LIST_PRED_COLUMN_BLOCK_EVALUATE(NotInListPredicate, ==)
                         reinterpret_cast<vectorized::ColumnDictionary<vectorized::Int32>&>(      \
                                 column);                                                         \
                 auto& data_array = dict_col.get_data();                                          \
-                auto dict_codes = dict_col.find_codes(_values);                                  \
+                std::vector<bool> selected;                                                      \
+                dict_col.find_codes(_values, selected);                                          \
                 for (uint16_t i = 0; i < *size; i++) {                                           \
                     uint16_t idx = sel[i];                                                       \
                     sel[new_size] = idx;                                                         \
                     const auto& cell_value = data_array[idx];                                    \
-                    auto result = (dict_codes.find(cell_value) OP dict_codes.end());             \
+                    assert(cell_value < selected.size());                                        \

Review Comment:
   `assert` only takes effect in debug builds, so it will not affect the performance of the release version.





[GitHub] [incubator-doris] yiguolei commented on a diff in pull request #9969: [enhance] improve dict in-predicate evaluate

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #9969:
URL: https://github.com/apache/incubator-doris/pull/9969#discussion_r891876811


##########
be/src/olap/in_list_predicate.cpp:
##########
@@ -161,12 +162,14 @@ IN_LIST_PRED_COLUMN_BLOCK_EVALUATE(NotInListPredicate, ==)
                         reinterpret_cast<vectorized::ColumnDictionary<vectorized::Int32>&>(      \
                                 column);                                                         \
                 auto& data_array = dict_col.get_data();                                          \
-                auto dict_codes = dict_col.find_codes(_values);                                  \
+                std::vector<bool> selected;                                                      \
+                dict_col.find_codes(_values, selected);                                          \
                 for (uint16_t i = 0; i < *size; i++) {                                           \
                     uint16_t idx = sel[i];                                                       \
                     sel[new_size] = idx;                                                         \
                     const auto& cell_value = data_array[idx];                                    \
-                    auto result = (dict_codes.find(cell_value) OP dict_codes.end());             \
+                    assert(cell_value < selected.size());                                        \

Review Comment:
   Yes, Doris uses DCHECK instead of assert.





[GitHub] [incubator-doris] yiguolei commented on a diff in pull request #9969: [enhance] improve dict in-predicate evaluate

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #9969:
URL: https://github.com/apache/incubator-doris/pull/9969#discussion_r889946993


##########
be/src/olap/in_list_predicate.cpp:
##########
@@ -161,12 +162,14 @@ IN_LIST_PRED_COLUMN_BLOCK_EVALUATE(NotInListPredicate, ==)
                         reinterpret_cast<vectorized::ColumnDictionary<vectorized::Int32>&>(      \
                                 column);                                                         \
                 auto& data_array = dict_col.get_data();                                          \
-                auto dict_codes = dict_col.find_codes(_values);                                  \
+                std::vector<bool> selected;                                                  \
+                size_t dict_word_num = 0;                                                    \

Review Comment:
   dict_word_num is never used?





[GitHub] [incubator-doris] github-actions[bot] commented on pull request #9969: [enhance] improve dict in-predicate evaluate

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #9969:
URL: https://github.com/apache/incubator-doris/pull/9969#issuecomment-1148082047

   PR approved by at least one committer and no changes requested.




[GitHub] [incubator-doris] HappenLee commented on a diff in pull request #9969: [enhance] improve dict in-predicate evaluate

Posted by GitBox <gi...@apache.org>.
HappenLee commented on code in PR #9969:
URL: https://github.com/apache/incubator-doris/pull/9969#discussion_r889975221


##########
be/src/olap/in_list_predicate.cpp:
##########
@@ -132,13 +132,14 @@ IN_LIST_PRED_COLUMN_BLOCK_EVALUATE(NotInListPredicate, ==)
                     auto* nested_col_ptr = vectorized::check_and_get_column<                     \
                             vectorized::ColumnDictionary<vectorized::Int32>>(nested_col);        \
                     auto& data_array = nested_col_ptr->get_data();                               \
-                    auto dict_codes = nested_col_ptr->find_codes(_values);                       \
+                    std::vector<bool> selected;                                                  \
+                    size_t dict_word_num = 0;                                                    \
+                    nested_col_ptr->find_codes(_values, selected, dict_word_num);                \
                     for (uint16_t i = 0; i < *size; i++) {                                       \
                         uint16_t idx = sel[i];                                                   \
                         sel[new_size] = idx;                                                     \
                         const auto& cell_value = data_array[idx];                                \
-                        bool ret = !null_bitmap[idx] &&                                          \
-                                   (dict_codes.find(cell_value) OP dict_codes.end());            \
+                        bool ret = !null_bitmap[idx] && (selected[cell_value] OP false);                    \

Review Comment:
   Please fix the code format of the trailing `\`.





[GitHub] [incubator-doris] englefly commented on a diff in pull request #9969: [enhance] improve dict in-predicate evaluate

Posted by GitBox <gi...@apache.org>.
englefly commented on code in PR #9969:
URL: https://github.com/apache/incubator-doris/pull/9969#discussion_r890258845


##########
be/src/olap/in_list_predicate.cpp:
##########
@@ -132,13 +132,14 @@ IN_LIST_PRED_COLUMN_BLOCK_EVALUATE(NotInListPredicate, ==)
                     auto* nested_col_ptr = vectorized::check_and_get_column<                     \
                             vectorized::ColumnDictionary<vectorized::Int32>>(nested_col);        \
                     auto& data_array = nested_col_ptr->get_data();                               \
-                    auto dict_codes = nested_col_ptr->find_codes(_values);                       \
+                    std::vector<bool> selected;                                                  \
+                    size_t dict_word_num = 0;                                                    \
+                    nested_col_ptr->find_codes(_values, selected, dict_word_num);                \
                     for (uint16_t i = 0; i < *size; i++) {                                       \
                         uint16_t idx = sel[i];                                                   \
                         sel[new_size] = idx;                                                     \
                         const auto& cell_value = data_array[idx];                                \
-                        bool ret = !null_bitmap[idx] &&                                          \
-                                   (dict_codes.find(cell_value) OP dict_codes.end());            \
+                        bool ret = !null_bitmap[idx] && (selected[cell_value] OP false);                    \

Review Comment:
   done





[GitHub] [incubator-doris] zenoyang commented on a diff in pull request #9969: [enhance] improve dict in-predicate evaluate

Posted by GitBox <gi...@apache.org>.
zenoyang commented on code in PR #9969:
URL: https://github.com/apache/incubator-doris/pull/9969#discussion_r890749380


##########
be/src/vec/columns/column_dictionary.h:
##########
@@ -258,9 +258,9 @@ class ColumnDictionary final : public COWHelper<IColumn, ColumnDictionary<T>> {
 
     uint32_t get_hash_value(uint32_t idx) const { return _dict.get_hash_value(_codes[idx]); }
 
-    phmap::flat_hash_set<int32_t> find_codes(
-            const phmap::flat_hash_set<StringValue>& values) const {
-        return _dict.find_codes(values);
+    void find_codes(const phmap::flat_hash_set<StringValue>& values,
+                    std::vector<bool>& selected) const {
+        return _dict.find_codes(values, selected);

Review Comment:
   `find_codes` hands back the result `selected` through a reference parameter rather than by `return`. This is not consistent with other interfaces, such as `find_code` and `find_code_by_bound`. I think it's better to keep them uniform.
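
For comparison, the two interface shapes discussed in this comment could look like the sketch below. The free functions `find_codes_out` and `find_codes_ret` and the aliases `Words` and `Values` are hypothetical and exist only for illustration; the real method lives on `ColumnDictionary` and takes a `phmap::flat_hash_set<StringValue>`.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

using Words = std::vector<std::string>;          // hypothetical dictionary type
using Values = std::unordered_set<std::string>;  // hypothetical IN-list type

// Out-parameter style, the shape used in the diff under review.
// The caller must declare `selected` before the call.
void find_codes_out(const Words& dict, const Values& values,
                    std::vector<bool>& selected) {
    selected.assign(dict.size(), false);
    for (std::size_t code = 0; code < dict.size(); ++code) {
        if (values.count(dict[code]) > 0) {
            selected[code] = true;
        }
    }
}

// Return-value style. The local vector is moved or elided on return, so the
// flags are not copied element by element.
std::vector<bool> find_codes_ret(const Words& dict, const Values& values) {
    std::vector<bool> selected;
    find_codes_out(dict, values, selected);
    return selected;
}
```

With the return-value form the call site collapses to a single line such as `auto selected = find_codes_ret(dict, values);`, which matches interfaces that return their result directly.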





[GitHub] [incubator-doris] englefly commented on a diff in pull request #9969: [enhance] improve dict in-predicate evaluate

Posted by GitBox <gi...@apache.org>.
englefly commented on code in PR #9969:
URL: https://github.com/apache/incubator-doris/pull/9969#discussion_r890116541


##########
be/src/olap/in_list_predicate.cpp:
##########
@@ -161,12 +162,14 @@ IN_LIST_PRED_COLUMN_BLOCK_EVALUATE(NotInListPredicate, ==)
                         reinterpret_cast<vectorized::ColumnDictionary<vectorized::Int32>&>(      \
                                 column);                                                         \
                 auto& data_array = dict_col.get_data();                                          \
-                auto dict_codes = dict_col.find_codes(_values);                                  \
+                std::vector<bool> selected;                                                  \
+                size_t dict_word_num = 0;                                                    \

Review Comment:
   The dict_word_num parameter has already been removed.





[GitHub] [incubator-doris] englefly commented on a diff in pull request #9969: [enhance] improve dict in-predicate evaluate

Posted by GitBox <gi...@apache.org>.
englefly commented on code in PR #9969:
URL: https://github.com/apache/incubator-doris/pull/9969#discussion_r889961022


##########
be/src/olap/in_list_predicate.cpp:
##########
@@ -161,12 +162,14 @@ IN_LIST_PRED_COLUMN_BLOCK_EVALUATE(NotInListPredicate, ==)
                         reinterpret_cast<vectorized::ColumnDictionary<vectorized::Int32>&>(      \
                                 column);                                                         \
                 auto& data_array = dict_col.get_data();                                          \
-                auto dict_codes = dict_col.find_codes(_values);                                  \
+                std::vector<bool> selected;                                                  \
+                size_t dict_word_num = 0;                                                    \

Review Comment:
   I am not sure if there should be an assert(cell_value<dict_word_num).
   I will add this assert.





[GitHub] [incubator-doris] zenoyang commented on a diff in pull request #9969: [enhance] improve dict in-predicate evaluate

Posted by GitBox <gi...@apache.org>.
zenoyang commented on code in PR #9969:
URL: https://github.com/apache/incubator-doris/pull/9969#discussion_r890971920


##########
be/src/olap/in_list_predicate.cpp:
##########
@@ -161,12 +162,14 @@ IN_LIST_PRED_COLUMN_BLOCK_EVALUATE(NotInListPredicate, ==)
                         reinterpret_cast<vectorized::ColumnDictionary<vectorized::Int32>&>(      \
                                 column);                                                         \
                 auto& data_array = dict_col.get_data();                                          \
-                auto dict_codes = dict_col.find_codes(_values);                                  \
+                std::vector<bool> selected;                                                      \
+                dict_col.find_codes(_values, selected);                                          \
                 for (uint16_t i = 0; i < *size; i++) {                                           \
                     uint16_t idx = sel[i];                                                       \
                     sel[new_size] = idx;                                                         \
                     const auto& cell_value = data_array[idx];                                    \
-                    auto result = (dict_codes.find(cell_value) OP dict_codes.end());             \
+                    assert(cell_value < selected.size());                                        \

Review Comment:
   Calling `assert` in a loop can affect performance. `selected.size()` is actually the size of the dict, and `cell_value` must be smaller than the size of the dict, which is guaranteed when writing, so I don't think there is any need to add an assertion.
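
The argument that the bound is guaranteed when writing can be illustrated with a small sketch. It assumes a hypothetical `DictWriter` that hands out codes only as indexes into its own dictionary; this is not how Doris necessarily builds its dictionary pages, only an example of a write-side invariant that makes a read-side check redundant.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical writer for a dictionary-encoded column. Codes are only ever
// produced by add(), which hands out indexes into `dict_`, so every stored
// code is smaller than dict_.size() by construction.
class DictWriter {
public:
    void add(const std::string& word) {
        auto it = index_.find(word);
        if (it == index_.end()) {
            // New word: its code is the next free dictionary slot.
            it = index_.emplace(word, static_cast<std::int32_t>(dict_.size())).first;
            dict_.push_back(word);
        }
        // Debug-only statement of the invariant, placed on the write path.
        assert(static_cast<std::size_t>(it->second) < dict_.size());
        codes_.push_back(it->second);
    }

    const std::vector<std::string>& dict() const { return dict_; }
    const std::vector<std::int32_t>& codes() const { return codes_; }

private:
    std::unordered_map<std::string, std::int32_t> index_;
    std::vector<std::string> dict_;
    std::vector<std::int32_t> codes_;
};

// One-off validation that can run outside any hot loop, for example in a test.
bool codes_in_range(const DictWriter& w) {
    for (std::int32_t code : w.codes()) {
        if (code < 0 || static_cast<std::size_t>(code) >= w.dict().size()) {
            return false;
        }
    }
    return true;
}
```

Under this assumption a reader can index a per-dictionary flag vector with any stored code without a bounds check inside the evaluation loop.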





[GitHub] [incubator-doris] yiguolei closed pull request #9969: [enhance] improve dict in-predicate evaluate

Posted by GitBox <gi...@apache.org>.
yiguolei closed pull request #9969: [enhance] improve dict in-predicate evaluate
URL: https://github.com/apache/incubator-doris/pull/9969

