You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2022/06/30 03:23:25 UTC

[GitHub] [doris] mrhhsg opened a new pull request, #10506: [improvement]Add reading by rowids to speed up lazy materialization

mrhhsg opened a new pull request, #10506:
URL: https://github.com/apache/doris/pull/10506

   # Proposed changes
   
   Issue Number: close #10505
   
   ## Problem Summary:
   
   ### Run ssb-flat(100g) q2.3(1be + 1 fe)
   |version|master|lazy_materialization_opt|
   |-|-|-|
   |total|1.12 sec|1.29 sec|
   |profile(cached)|<img width="294" alt="image" src="https://user-images.githubusercontent.com/1179834/176581643-2b2a870c-8543-4e25-97de-cc8ac1493a63.png">|<img width="300" alt="image" src="https://user-images.githubusercontent.com/1179834/176582871-56ab041f-b3e6-48a0-85fd-3d0b1f0bf0ff.png">|
   |profile(no cache)|<img width="302" alt="image" src="https://user-images.githubusercontent.com/1179834/176582632-85ec60b5-f602-426f-a759-425d250eb8a8.png">|<img width="293" alt="image" src="https://user-images.githubusercontent.com/1179834/176582443-a74ec61c-8b7f-4b5a-ba9c-594d5caa5dc0.png">|
   
   ### Run all ssb-flat(100g) queries(3be + 1fe)
   |query|master|lazy_materialization_opt|
   |-|-|-|
   |q1.1|59|60|
   |q1.2|56|53|
   |q1.3|64|71|
   |q2.1|159|108|
   |q2.2|151|101|
   |q2.3|141|79|
   |q3.1|314|198|
   |q3.2|115|106|
   |q3.3|136|101|
   |q3.4|42|35|
   |q4.1|270|155|
   |q4.2|103|68|
   |q4.3|58|52|
   |total|1668|1187
   
   ## Checklist(Required)
   
   1. Does it affect the original behavior: (Yes/No/I Don't know)
   2. Has unit tests been added: (Yes/No/No Need)
   3. Has document been added or modified: (Yes/No/No Need)
   4. Does it need to update dependencies: (Yes/No)
   5. Are there any changes that cannot be rolled back: (Yes/No)
   
   ## Further comments
   
   If this is a relatively large or complex change, kick off the discussion at [dev@doris.apache.org](mailto:dev@doris.apache.org) by explaining why you chose the solution you did and what alternatives you considered, etc...
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] yiguolei commented on a diff in pull request #10506: [improvement]Add reading by rowids to speed up lazy materialization

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #10506:
URL: https://github.com/apache/doris/pull/10506#discussion_r910591186


##########
be/src/olap/rowset/segment_v2/binary_plain_page.h:
##########
@@ -254,6 +254,33 @@ class BinaryPlainPageDecoder : public PageDecoder {
         return Status::OK();
     };
 
+    Status read_by_rowids(const rowid_t* rowids, ordinal_t page_first_ordinal, size_t* n,
+                          vectorized::MutableColumnPtr& dst) override {
+        DCHECK(_parsed);
+        if (PREDICT_FALSE(*n == 0)) {
+            *n = 0;
+            return Status::OK();
+        }
+
+        auto data = _data.mutable_data();
+        auto total = *n;
+        size_t read_count = 0;
+        for (size_t i = 0; i < total; ++i) {
+            ordinal_t ord = rowids[i] - page_first_ordinal;
+            if (UNLIKELY(ord >= _num_elems)) {
+                *n = read_count;

Review Comment:
   break here. Keep the logic the same as other decoders.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] wangbo commented on a diff in pull request #10506: [improvement]Add reading by rowids to speed up lazy materialization

Posted by GitBox <gi...@apache.org>.
wangbo commented on code in PR #10506:
URL: https://github.com/apache/doris/pull/10506#discussion_r910854608


##########
be/src/olap/rowset/segment_v2/bitshuffle_page.h:
##########
@@ -408,6 +408,32 @@ class BitShufflePageDecoder : public PageDecoder {
         return Status::OK();
     };
 
+    Status read_by_rowids(const rowid_t* rowids, ordinal_t page_first_ordinal, size_t* n,
+                          vectorized::MutableColumnPtr& dst) override {
+        DCHECK(_parsed);
+        if (PREDICT_FALSE(*n == 0)) {
+            *n = 0;
+            return Status::OK();
+        }
+
+        auto total = *n;
+        auto read_count = 0;
+        CppType data[total];
+        for (size_t i = 0; i < total; ++i) {
+            ordinal_t ord = rowids[i] - page_first_ordinal;
+            if (UNLIKELY(ord >= _num_elements)) {
+                break;
+            }
+
+            data[read_count++] = *reinterpret_cast<CppType*>(get_data(ord));

Review Comment:
   Why not using ```read_count``` instead of ```i```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] wangbo commented on a diff in pull request #10506: [improvement]Add reading by rowids to speed up lazy materialization

Posted by GitBox <gi...@apache.org>.
wangbo commented on code in PR #10506:
URL: https://github.com/apache/doris/pull/10506#discussion_r910893618


##########
be/src/olap/rowset/segment_v2/segment_iterator.cpp:
##########
@@ -699,28 +699,12 @@ void SegmentIterator::_vec_init_lazy_materialization() {
 
     // Step 2: check non-predicate read costs to determine whether need lazy materialization
     // fill _non_predicate_columns.
-    // note(wb) For block schema, query layer and storage layer may have some diff
-    //   query layer block schema not contains delete column, but storage layer appends delete column to end of block schema
-    //   When output block to query layer, delete column can be skipped.
-    //  _schema.column_ids() stands for storage layer block schema, so it contains delete columnid
-    //  we just regard delete column as common pred column here.
+    // After some optimization, we suppose lazy materialization is better performance.
     if (_schema.column_ids().size() > pred_column_ids.size()) {
         for (auto cid : _schema.column_ids()) {
             if (!_is_pred_column[cid]) {
                 _non_predicate_columns.push_back(cid);
-                FieldType type = _schema.column(cid)->type();
-
-                // todo(wb) maybe we can make read char type faster
-                // todo(wb) support map/array type
-                // todo(wb) consider multiple integer columns cost, such as 1000 columns, maybe lazy materialization faster
-                if (!_lazy_materialization_read &&
-                    (_is_need_vec_eval ||
-                     _is_need_short_eval) && // only when pred exists, we need to consider lazy materialization
-                    (type == OLAP_FIELD_TYPE_HLL || type == OLAP_FIELD_TYPE_OBJECT ||
-                     type == OLAP_FIELD_TYPE_VARCHAR || type == OLAP_FIELD_TYPE_CHAR ||
-                     type == OLAP_FIELD_TYPE_STRING || type == OLAP_FIELD_TYPE_BOOL ||
-                     type == OLAP_FIELD_TYPE_DATE || type == OLAP_FIELD_TYPE_DATETIME ||
-                     type == OLAP_FIELD_TYPE_DECIMAL)) {
+                if (_is_need_vec_eval || _is_need_short_eval) {

Review Comment:
   This check can be moved out of the foreach.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] yiguolei commented on a diff in pull request #10506: [improvement]Add reading by rowids to speed up lazy materialization

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #10506:
URL: https://github.com/apache/doris/pull/10506#discussion_r910590598


##########
be/src/olap/rowset/segment_v2/binary_plain_page.h:
##########
@@ -254,6 +254,33 @@ class BinaryPlainPageDecoder : public PageDecoder {
         return Status::OK();
     };
 
+    Status read_by_rowids(const rowid_t* rowids, ordinal_t page_first_ordinal, size_t* n,
+                          vectorized::MutableColumnPtr& dst) override {
+        DCHECK(_parsed);
+        if (PREDICT_FALSE(*n == 0)) {
+            *n = 0;
+            return Status::OK();
+        }
+
+        auto data = _data.mutable_data();
+        auto total = *n;
+        size_t read_count = 0;
+        for (size_t i = 0; i < total; ++i) {
+            ordinal_t ord = rowids[i] - page_first_ordinal;
+            if (UNLIKELY(ord >= _num_elems)) {
+                *n = read_count;
+                return Status::OK();
+            }
+
+            const uint32_t start_offset = offset(ord);
+            dst->insert_data(data + start_offset, offset(ord + 1) - start_offset);

Review Comment:
   There is performance issues.  For column string, it will call resize and memcpy too many times.
   You can do a seperate loop to calculate the string size and resize the column and then insert data.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] yiguolei commented on a diff in pull request #10506: [improvement]Add reading by rowids to speed up lazy materialization

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #10506:
URL: https://github.com/apache/doris/pull/10506#discussion_r910580765


##########
be/src/olap/rowset/segment_v2/segment_iterator.cpp:
##########
@@ -971,21 +955,13 @@ void SegmentIterator::_read_columns_by_rowids(std::vector<ColumnId>& read_column
                                               uint16_t* sel_rowid_idx, size_t select_size,
                                               vectorized::MutableColumns* mutable_columns) {
     SCOPED_RAW_TIMER(&_opts.stats->lazy_read_ns);
-    size_t start_idx = 0;
-    while (start_idx < select_size) {
-        size_t end_idx = start_idx + 1;
-        while (end_idx < select_size && (rowid_vector[sel_rowid_idx[end_idx - 1]] ==
-                                         rowid_vector[sel_rowid_idx[end_idx]] - 1)) {
-            end_idx++;
-        }
-        size_t range = end_idx - start_idx;
-        {
-            _opts.stats->block_lazy_read_seek_num += 1;
-            SCOPED_RAW_TIMER(&_opts.stats->block_lazy_read_seek_ns);
-            _seek_columns(read_column_ids, rowid_vector[sel_rowid_idx[start_idx]]);
-        }
-        _read_columns(read_column_ids, *mutable_columns, range);
-        start_idx += range;
+    rowid_t rowids[select_size];

Review Comment:
   Not sure dynamic array's performace. Better use vector.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] yiguolei commented on a diff in pull request #10506: [improvement]Add reading by rowids to speed up lazy materialization

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #10506:
URL: https://github.com/apache/doris/pull/10506#discussion_r910590598


##########
be/src/olap/rowset/segment_v2/binary_plain_page.h:
##########
@@ -254,6 +254,33 @@ class BinaryPlainPageDecoder : public PageDecoder {
         return Status::OK();
     };
 
+    Status read_by_rowids(const rowid_t* rowids, ordinal_t page_first_ordinal, size_t* n,
+                          vectorized::MutableColumnPtr& dst) override {
+        DCHECK(_parsed);
+        if (PREDICT_FALSE(*n == 0)) {
+            *n = 0;
+            return Status::OK();
+        }
+
+        auto data = _data.mutable_data();
+        auto total = *n;
+        size_t read_count = 0;
+        for (size_t i = 0; i < total; ++i) {
+            ordinal_t ord = rowids[i] - page_first_ordinal;
+            if (UNLIKELY(ord >= _num_elems)) {
+                *n = read_count;
+                return Status::OK();
+            }
+
+            const uint32_t start_offset = offset(ord);
+            dst->insert_data(data + start_offset, offset(ord + 1) - start_offset);

Review Comment:
   There is performance issues.  For column string, it will call resize and memcpy too many times.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] yiguolei merged pull request #10506: [improvement]Add reading by rowids to speed up lazy materialization

Posted by GitBox <gi...@apache.org>.
yiguolei merged PR #10506:
URL: https://github.com/apache/doris/pull/10506


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] yiguolei commented on a diff in pull request #10506: [improvement]Add reading by rowids to speed up lazy materialization

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #10506:
URL: https://github.com/apache/doris/pull/10506#discussion_r910576756


##########
be/src/olap/rowset/segment_v2/binary_dict_page.cpp:
##########
@@ -261,6 +261,38 @@ Status BinaryDictPageDecoder::next_batch(size_t* n, vectorized::MutableColumnPtr
     return Status::OK();
 }
 
+Status BinaryDictPageDecoder::read_by_rowids(const rowid_t* rowids, ordinal_t page_first_ordinal,

Review Comment:
   read_by_rowids ---> next_batch



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] yiguolei commented on a diff in pull request #10506: [improvement]Add reading by rowids to speed up lazy materialization

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #10506:
URL: https://github.com/apache/doris/pull/10506#discussion_r910758161


##########
be/src/olap/rowset/segment_v2/rle_page.h:
##########
@@ -254,6 +254,38 @@ class RlePageDecoder : public PageDecoder {
         return Status::OK();
     };
 
+    Status read_by_rowids(const rowid_t* rowids, ordinal_t page_first_ordinal, size_t* n,
+                          vectorized::MutableColumnPtr& dst) override {
+        DCHECK(_parsed);
+        if (PREDICT_FALSE(*n == 0 || _cur_index >= _num_elements)) {
+            *n = 0;
+            return Status::OK();
+        }
+
+        auto total = *n;
+        bool result = false;
+        size_t read_count = 0;
+        CppType value;
+        for (size_t i = 0; i < total; ++i) {
+            ordinal_t ord = rowids[i] - page_first_ordinal;
+            if (UNLIKELY(ord >= _num_elements)) {
+                *n = read_count;
+                return Status::OK();
+            }
+
+            _rle_decoder.Skip(ord - _cur_index);
+            _cur_index = ord;
+
+            result = _rle_decoder.Get(&value);
+            _cur_index++;

Review Comment:
   Not sure it is right?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] yiguolei commented on a diff in pull request #10506: [improvement]Add reading by rowids to speed up lazy materialization

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #10506:
URL: https://github.com/apache/doris/pull/10506#discussion_r910581436


##########
be/src/olap/rowset/segment_v2/column_reader.cpp:
##########
@@ -704,6 +704,80 @@ Status FileColumnIterator::next_batch(size_t* n, vectorized::MutableColumnPtr& d
     return Status::OK();
 }
 
+Status FileColumnIterator::read_by_rowids(const rowid_t* rowids, const size_t count,

Review Comment:
   use vector<rowid_t>& as input, avoid using raw pointer if possible.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org