You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2022/06/23 08:44:00 UTC

[GitHub] [doris] mrhhsg opened a new pull request, #10370: [improvement]Support vectorized predicates for dict columns

mrhhsg opened a new pull request, #10370:
URL: https://github.com/apache/doris/pull/10370

   # Proposed changes
   
   Issue Number: close #10369
   
   ## Problem Summary:
   
   String (char/varchar/string) columns with dict encoding can be used as int in EQ/NE/LE/LT/GE/GE predicates.
   
   Test with SSB flat 100G, q2.3:
   ```sql
   SELECT
       SUM(LO_REVENUE), (LO_ORDERDATE DIV 10000) AS YEAR,
       P_BRAND
   FROM lineorder_flat
   WHERE
       P_BRAND = 'MFGR#2239'
       AND S_REGION = 'EUROPE'
   GROUP BY YEAR, P_BRAND
   ORDER BY YEAR, P_BRAND;
   ```
   ||master|dict_vec_opt|
   |-|-|-|
   |time|2.03 sec|1.65 sec|
   |profile|<img width="540" alt="image" src="https://user-images.githubusercontent.com/1179834/175255057-2284ffb1-a8ae-41c9-9093-c88b82d28916.png">|<img width="518" alt="image" src="https://user-images.githubusercontent.com/1179834/175255286-c9e6c71a-4bb2-4a46-84d4-55939b08e073.png">|
   
   ## Checklist(Required)
   
   1. Does it affect the original behavior: (Yes/No/I Don't know)
   2. Has unit tests been added: (Yes/No/No Need)
   3. Has document been added or modified: (Yes/No/No Need)
   4. Does it need to update dependencies: (Yes/No)
   5. Are there any changes that cannot be rolled back: (Yes/No)
   
   ## Further comments
   
   If this is a relatively large or complex change, kick off the discussion at [dev@doris.apache.org](mailto:dev@doris.apache.org) by explaining why you chose the solution you did and what alternatives you considered, etc...
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] mrhhsg commented on a diff in pull request #10370: [improvement]Support vectorized predicates for dict columns

Posted by GitBox <gi...@apache.org>.
mrhhsg commented on code in PR #10370:
URL: https://github.com/apache/doris/pull/10370#discussion_r905013078


##########
be/src/olap/comparison_predicate.cpp:
##########
@@ -251,8 +249,28 @@ COMPARISON_PRED_COLUMN_EVALUATE(GreaterEqualPredicate, >=, true)
                     flags[i] = (data_array_uint32_t[i] OP int32_val) && (!null_bitmap[i]);       \
                 }                                                                                \
             } else {                                                                             \
-                for (uint16_t i = 0; i < size; i++) {                                            \
-                    flags[i] = (data_array[i] OP _value) && (!null_bitmap[i]);                   \
+                if (nested_col.is_column_dictionary()) {                                         \
+                    if constexpr (IS_RANGE) column.convert_dict_codes_if_necessary();            \
+                    if constexpr (std::is_same_v<T, StringValue>) {                              \

Review Comment:
   To avoid compiling with other types.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] Gabriel39 commented on a diff in pull request #10370: [improvement]Support vectorized predicates for dict columns

Posted by GitBox <gi...@apache.org>.
Gabriel39 commented on code in PR #10370:
URL: https://github.com/apache/doris/pull/10370#discussion_r904902780


##########
be/src/olap/comparison_predicate.cpp:
##########
@@ -251,8 +249,28 @@ COMPARISON_PRED_COLUMN_EVALUATE(GreaterEqualPredicate, >=, true)
                     flags[i] = (data_array_uint32_t[i] OP int32_val) && (!null_bitmap[i]);       \
                 }                                                                                \
             } else {                                                                             \
-                for (uint16_t i = 0; i < size; i++) {                                            \
-                    flags[i] = (data_array[i] OP _value) && (!null_bitmap[i]);                   \
+                if (nested_col.is_column_dictionary()) {                                         \
+                    if constexpr (IS_RANGE) column.convert_dict_codes_if_necessary();            \
+                    if constexpr (std::is_same_v<T, StringValue>) {                              \

Review Comment:
   `T` can only be StringValue and this condition should be ignored



##########
be/src/olap/rowset/segment_v2/segment_iterator.h:
##########
@@ -127,6 +129,7 @@ class SegmentIterator : public RowwiseIterator {
     // _column_iterators.size() == _schema.num_columns()
     // _column_iterators[cid] == nullptr if cid is not in _schema
     std::vector<ColumnIterator*> _column_iterators;
+

Review Comment:
   ```suggestion
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] yiguolei commented on a diff in pull request #10370: [improvement]Support vectorized predicates for dict columns

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #10370:
URL: https://github.com/apache/doris/pull/10370#discussion_r904812475


##########
be/src/olap/rowset/segment_v2/column_reader.h:
##########
@@ -90,6 +90,8 @@ class ColumnReader {
                          uint64_t num_rows, const FilePathDesc& path_desc,
                          std::unique_ptr<ColumnReader>* reader);
 
+    enum DictEncodingType { UNKNOWN_DICT_ENCODING, NOT_ALL_DICT_ENCODING, ALL_DICT_ENCODING };

Review Comment:
   NOT_ALL_DICT_ENCODING  --> PartialDictEncoding is better



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] yiguolei merged pull request #10370: [improvement]Support vectorized predicates for dict columns

Posted by GitBox <gi...@apache.org>.
yiguolei merged PR #10370:
URL: https://github.com/apache/doris/pull/10370


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] yiguolei commented on a diff in pull request #10370: [improvement]Support vectorized predicates for dict columns

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #10370:
URL: https://github.com/apache/doris/pull/10370#discussion_r904819870


##########
be/src/olap/rowset/segment_v2/parsed_page.h:
##########
@@ -56,6 +57,10 @@ struct ParsedPage {
         RETURN_IF_ERROR(encoding->create_page_decoder(data_slice, opts, &page->data_decoder));
         RETURN_IF_ERROR(page->data_decoder->init());
 
+        if (auto dict_decoder = dynamic_cast<BinaryDictPageDecoder*>(page->data_decoder)) {

Review Comment:
   I am not sure we could use dynamic_cast to check if the decoder is dict page decoder.
   I think we'd better check EncodingInfo first and then cast the data decoder to dict decoder.
   
   }



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] yiguolei commented on a diff in pull request #10370: [improvement]Support vectorized predicates for dict columns

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #10370:
URL: https://github.com/apache/doris/pull/10370#discussion_r904839481


##########
be/src/olap/rowset/segment_v2/segment_iterator.cpp:
##########
@@ -652,6 +652,10 @@ void SegmentIterator::_vec_init_lazy_materialization() {
     std::set<ColumnId> del_cond_id_set;
     _opts.delete_condition_predicates->get_all_column_ids(del_cond_id_set);
 
+    if (config::enable_low_cardinality_optimize) {
+        _check_dict_columns_can_be_used_in_vec_predicates();

Review Comment:
   This method is not related with lazy materialization, should not in _vec_init_lazy_materialization.
   Pls add this in 
   if (is_vec) {
   _vec_init_lazy_materialization();
   _vec_init_char_column_id();
   } 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] github-actions[bot] commented on pull request #10370: [improvement]Support vectorized predicates for dict columns

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #10370:
URL: https://github.com/apache/doris/pull/10370#issuecomment-1164513141

   PR approved by at least one committer and no changes requested.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] yiguolei commented on a diff in pull request #10370: [improvement]Support vectorized predicates for dict columns

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #10370:
URL: https://github.com/apache/doris/pull/10370#discussion_r904809492


##########
be/src/olap/rowset/segment_v2/column_reader.cpp:
##########
@@ -510,6 +514,18 @@ FileColumnIterator::FileColumnIterator(ColumnReader* reader) : _reader(reader) {
 Status FileColumnIterator::init(const ColumnIteratorOptions& opts) {
     _opts = opts;
     RETURN_IF_ERROR(get_block_compression_codec(_reader->get_compression(), _compress_codec));
+    if (config::enable_low_cardinality_optimize) {
+        auto dict_encoding_type = _reader->get_dict_encoding_type();
+        if (dict_encoding_type != ColumnReader::UNKNOWN_DICT_ENCODING) {

Review Comment:
   better  if (xxx == xxx) {} else {}
   if ( xxx != xxx)  is very hard to understand.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] yiguolei commented on a diff in pull request #10370: [improvement]Support vectorized predicates for dict columns

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #10370:
URL: https://github.com/apache/doris/pull/10370#discussion_r904908531


##########
be/src/olap/rowset/segment_v2/segment_iterator.cpp:
##########
@@ -881,44 +898,36 @@ uint16_t SegmentIterator::_evaluate_vectorization_predicate(uint16_t* sel_rowid_
     _pre_eval_block_predicate->evaluate_vec(_current_return_columns, selected_size, ret_flags);
 
     uint16_t new_size = 0;
-    size_t num_zeros = simd::count_zero_num(reinterpret_cast<int8_t*>(ret_flags), original_size);
-    if (0 == num_zeros) {
-        for (uint16_t i = 0; i < original_size; i++) {

Review Comment:
   why do you remove these optimizations?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] yiguolei commented on a diff in pull request #10370: [improvement]Support vectorized predicates for dict columns

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #10370:
URL: https://github.com/apache/doris/pull/10370#discussion_r904839481


##########
be/src/olap/rowset/segment_v2/segment_iterator.cpp:
##########
@@ -652,6 +652,10 @@ void SegmentIterator::_vec_init_lazy_materialization() {
     std::set<ColumnId> del_cond_id_set;
     _opts.delete_condition_predicates->get_all_column_ids(del_cond_id_set);
 
+    if (config::enable_low_cardinality_optimize) {
+        _check_dict_columns_can_be_used_in_vec_predicates();

Review Comment:
   This method is not related with lazy materialization, should not in _vec_init_lazy_materialization.
   Pls add this in 
   Status SegmentIterator::_init(bool is_vec) {
       SCOPED_RAW_TIMER(&_opts.stats->block_init_ns);
       DorisMetrics::instance()->segment_read_total->increment(1);
       // get file handle from file descriptor of segment
       fs::BlockManager* block_mgr = fs::fs_util::block_manager(_segment->_path_desc);
       RETURN_IF_ERROR(block_mgr->open_block(_segment->_path_desc, &_rblock));
       _row_bitmap.addRange(0, _segment->num_rows());
       RETURN_IF_ERROR(_init_return_column_iterators());
       RETURN_IF_ERROR(_init_bitmap_index_iterators());
       // z-order can not use prefix index
       if (_segment->_tablet_schema->sort_type() != SortType::ZORDER) {
           RETURN_IF_ERROR(_get_row_ranges_by_keys());
       }
       RETURN_IF_ERROR(_get_row_ranges_by_column_conditions());
       if (is_vec) {
           _vec_init_lazy_materialization();
           _vec_init_char_column_id();
       } else {
           _init_lazy_materialization();
       }
       _range_iter.reset(new BitmapRangeIterator(_row_bitmap));
       return Status::OK();
   }
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] mrhhsg commented on a diff in pull request #10370: [improvement]Support vectorized predicates for dict columns

Posted by GitBox <gi...@apache.org>.
mrhhsg commented on code in PR #10370:
URL: https://github.com/apache/doris/pull/10370#discussion_r905023241


##########
be/src/olap/rowset/segment_v2/segment_iterator.cpp:
##########
@@ -881,44 +898,36 @@ uint16_t SegmentIterator::_evaluate_vectorization_predicate(uint16_t* sel_rowid_
     _pre_eval_block_predicate->evaluate_vec(_current_return_columns, selected_size, ret_flags);
 
     uint16_t new_size = 0;
-    size_t num_zeros = simd::count_zero_num(reinterpret_cast<int8_t*>(ret_flags), original_size);
-    if (0 == num_zeros) {
-        for (uint16_t i = 0; i < original_size; i++) {

Review Comment:
   The calling of `count_zero_num`(will consume a non-negligible amount of time) is useless  when `num_zeros != 0 && num_zeros != original_size`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] yiguolei commented on a diff in pull request #10370: [improvement]Support vectorized predicates for dict columns

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #10370:
URL: https://github.com/apache/doris/pull/10370#discussion_r904837824


##########
be/src/olap/rowset/segment_v2/segment_iterator.cpp:
##########
@@ -774,6 +780,39 @@ void SegmentIterator::_vec_init_lazy_materialization() {
     }
 }
 
+void SegmentIterator::_check_dict_columns_can_be_used_in_vec_predicates() {
+    std::vector<ColumnPredicate*> converted_predicate(_col_predicates.size());
+    for (size_t i = 0; i < _col_predicates.size(); ++i) {
+        auto predicate = _col_predicates[i];
+        auto cid = predicate->column_id();
+        FieldType type = _schema.column(cid)->type();
+
+        if (type == OLAP_FIELD_TYPE_VARCHAR || type == OLAP_FIELD_TYPE_CHAR ||
+            type == OLAP_FIELD_TYPE_STRING) {
+            if (_column_iterators[cid]->is_all_dict_encoding()) {
+                switch (predicate->type()) {
+                case PredicateType::EQ:
+                case PredicateType::NE:
+                case PredicateType::LE:
+                case PredicateType::LT:
+                case PredicateType::GE:
+                case PredicateType::GT: {
+                    if (_dict_columns_for_vec_predicates.contains(cid)) {
+                        _dict_columns_for_vec_predicates[cid] &= true;

Review Comment:
   what does this mean?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] yiguolei commented on a diff in pull request #10370: [improvement]Support vectorized predicates for dict columns

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #10370:
URL: https://github.com/apache/doris/pull/10370#discussion_r904820542


##########
be/src/olap/rowset/segment_v2/parsed_page.h:
##########
@@ -93,6 +98,8 @@ struct ParsedPage {
     // this means next row we will read
     ordinal_t offset_in_page = 0;
 
+    bool is_dict_encoding;

Review Comment:
   set the default value=false.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org