You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2022/06/08 07:59:59 UTC

[GitHub] [incubator-doris] mrhhsg opened a new pull request, #10016: [enhancement]Cache parsed page data for bitshuffle encoding

mrhhsg opened a new pull request, #10016:
URL: https://github.com/apache/incubator-doris/pull/10016

   # Proposed changes
   
   Issue Number: close #10015
   
   ## Problem Summary:
   
   Describe the overview of changes.
   
   ## Checklist(Required)
   
   1. Does it affect the original behavior: (Yes/No/I Don't know)
   2. Has unit tests been added: (Yes/No/No Need)
   3. Has document been added or modified: (Yes/No/No Need)
   4. Does it need to update dependencies: (Yes/No)
   5. Are there any changes that cannot be rolled back: (Yes/No)
   
   ## Further comments
   
   If this is a relatively large or complex change, kick off the discussion at [dev@doris.apache.org](mailto:dev@doris.apache.org) by explaining why you chose the solution you did and what alternatives you considered, etc...
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] yiguolei commented on a diff in pull request #10016: [enhancement]Cache parsed page data for bitshuffle encoding

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #10016:
URL: https://github.com/apache/incubator-doris/pull/10016#discussion_r892186809


##########
be/src/olap/rowset/segment_v2/binary_dict_page.cpp:
##########
@@ -313,5 +315,23 @@ Status BinaryDictPageDecoder::next_batch(size_t* n, ColumnBlockView* dst) {
     return Status::OK();
 }
 
+BinaryDictPageDecoder* BinaryDictPageDecoder::clone_for_cache() const {

Review Comment:
   doris's code style is  Status BinaryDictPageDecoder::clone_for_cache(std::unique_ptr< BinaryDictPageDecoder >& BinaryDictPageDecoder) 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] yiguolei commented on a diff in pull request #10016: [enhancement]Cache parsed page data for bitshuffle encoding

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #10016:
URL: https://github.com/apache/incubator-doris/pull/10016#discussion_r892190895


##########
be/src/olap/rowset/segment_v2/page_decoder.h:
##########
@@ -102,6 +102,8 @@ class PageDecoder {
 
     bool has_remaining() const { return current_index() < count(); }
 
+    virtual PageDecoder* clone_for_cache() const = 0;

Review Comment:
   add a default implementation here, just return __builtin_unreachable, and other page decoder's impelementation need not implement a default __builtin_unreachable again.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] yiguolei commented on a diff in pull request #10016: [enhancement]Cache parsed page data for bitshuffle encoding

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #10016:
URL: https://github.com/apache/incubator-doris/pull/10016#discussion_r893055237


##########
be/src/olap/rowset/segment_v2/parsed_page.h:
##########
@@ -67,17 +70,38 @@ struct ParsedPage {
         return Status::OK();
     }
 
-    ~ParsedPage() {
-        delete data_decoder;
-        data_decoder = nullptr;
+    ParsedPage() = default;
+
+    ParsedPage(const ParsedPage& other) {
+        page_handle = other.page_handle;
+        has_null = other.has_null;
+        null_decoder = other.null_decoder;
+        null_bitmap = other.null_bitmap;
+        other.data_decoder->clone_for_cache(data_decoder);
+        first_ordinal = other.first_ordinal;
+        num_rows = other.num_rows;
+        first_array_item_ordinal = other.first_array_item_ordinal;
+        page_pointer = other.page_pointer;
+        page_index = other.page_index;
+        offset_in_page = other.offset_in_page;
     }
 
-    PageHandle page_handle;
+    ParsedPage& operator=(ParsedPage&& other) {
+        this->~ParsedPage();
+        new (this) ParsedPage(other);
+        return *this;
+    }
+
+    ~ParsedPage() { data_decoder = nullptr; }

Review Comment:
   why set it to nullptr explictly?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] yiguolei commented on pull request #10016: [enhancement]Cache parsed page data for bitshuffle encoding

Posted by GitBox <gi...@apache.org>.
yiguolei commented on PR #10016:
URL: https://github.com/apache/incubator-doris/pull/10016#issuecomment-1152017066

   duplicate with https://github.com/apache/incubator-doris/pull/10036, I will close it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] yiguolei closed pull request #10016: [enhancement]Cache parsed page data for bitshuffle encoding

Posted by GitBox <gi...@apache.org>.
yiguolei closed pull request #10016: [enhancement]Cache parsed page data for bitshuffle encoding
URL: https://github.com/apache/incubator-doris/pull/10016


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] yiguolei commented on a diff in pull request #10016: [enhancement]Cache parsed page data for bitshuffle encoding

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #10016:
URL: https://github.com/apache/incubator-doris/pull/10016#discussion_r893055475


##########
be/src/olap/rowset/segment_v2/parsed_page.h:
##########
@@ -67,17 +70,38 @@ struct ParsedPage {
         return Status::OK();
     }
 
-    ~ParsedPage() {
-        delete data_decoder;
-        data_decoder = nullptr;
+    ParsedPage() = default;
+
+    ParsedPage(const ParsedPage& other) {
+        page_handle = other.page_handle;
+        has_null = other.has_null;
+        null_decoder = other.null_decoder;
+        null_bitmap = other.null_bitmap;
+        other.data_decoder->clone_for_cache(data_decoder);
+        first_ordinal = other.first_ordinal;
+        num_rows = other.num_rows;
+        first_array_item_ordinal = other.first_array_item_ordinal;
+        page_pointer = other.page_pointer;
+        page_index = other.page_index;
+        offset_in_page = other.offset_in_page;
     }
 
-    PageHandle page_handle;
+    ParsedPage& operator=(ParsedPage&& other) {
+        this->~ParsedPage();
+        new (this) ParsedPage(other);
+        return *this;
+    }
+
+    ~ParsedPage() { data_decoder = nullptr; }
+
+    size_t size() const { return sizeof(ParsedPage) + page_handle->data().size; }

Review Comment:
   should also add chunk size in decoder?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] Gabriel39 commented on a diff in pull request #10016: [enhancement]Cache parsed page data for bitshuffle encoding

Posted by GitBox <gi...@apache.org>.
Gabriel39 commented on code in PR #10016:
URL: https://github.com/apache/incubator-doris/pull/10016#discussion_r892117835


##########
be/src/olap/rowset/segment_v2/column_reader.h:
##########
@@ -298,10 +298,10 @@ class FileColumnIterator final : public ColumnIterator {
     ParsedPage _page;
 
     // keep dict page decoder
-    std::unique_ptr<PageDecoder> _dict_decoder;
+    std::shared_ptr<BinaryPlainPageDecoder> _dict_decoder;
 
     // keep dict page handle to avoid released
-    PageHandle _dict_page_handle;
+    std::shared_ptr<PageHandle> _dict_page_handle;

Review Comment:
   Could this variable be deleted safely? I notice that this variable will be set to a new value each time so I think it's unnecessary to use as a member variable.



##########
be/src/olap/rowset/segment_v2/column_reader.cpp:
##########
@@ -679,22 +702,29 @@ Status FileColumnIterator::_read_data_page(const OrdinalPageIndexIterator& iter)
                 Slice dict_data;
                 PageFooterPB dict_footer;
                 _opts.type = INDEX_PAGE;
+                _dict_page_handle = std::make_shared<PageHandle>();
                 RETURN_IF_ERROR(_reader->read_page(_opts, _reader->get_dict_page_pointer(),
-                                                   &_dict_page_handle, &dict_data, &dict_footer,
-                                                   _compress_codec.get()));
+                                                   _dict_page_handle.get(), &dict_data,
+                                                   &dict_footer, _compress_codec.get()));
                 // ignore dict_footer.dict_page_footer().encoding() due to only
                 // PLAIN_ENCODING is supported for dict page right now
-                _dict_decoder = std::make_unique<BinaryPlainPageDecoder>(dict_data);
+                _dict_decoder = std::make_shared<BinaryPlainPageDecoder>(dict_data);
                 RETURN_IF_ERROR(_dict_decoder->init());
 
-                auto* pd_decoder = (BinaryPlainPageDecoder*)_dict_decoder.get();
+                auto* pd_decoder = _dict_decoder.get();
                 _dict_word_info.reset(new StringRef[pd_decoder->_num_elems]);
                 pd_decoder->get_dict_word_info(_dict_word_info.get());
+                _page.dict_page_handle = _dict_page_handle;
             }
 
-            dict_page_decoder->set_dict_decoder(_dict_decoder.get(), _dict_word_info.get());
+            dict_page_decoder->set_dict_decoder(_dict_decoder, _dict_word_info);
         }
     }
+
+    if (_need_cache_parsed_page) {
+        ParsedPage* cache_page = new ParsedPage(_page);

Review Comment:
   Why we need to clone a cache page? Looks like we have already cached this page and don't need to duplicate another one



##########
be/src/olap/rowset/segment_v2/column_reader.cpp:
##########
@@ -468,6 +468,12 @@ FileColumnIterator::FileColumnIterator(ColumnReader* reader) : _reader(reader) {
 Status FileColumnIterator::init(const ColumnIteratorOptions& opts) {
     _opts = opts;
     RETURN_IF_ERROR(get_block_compression_codec(_reader->get_compression(), _compress_codec));
+    auto encoding = _reader->encoding_info()->encoding();
+    _need_cache_parsed_page =

Review Comment:
   should we add a new config to enable or disable this feature here? If budget for page cache is poor, I'm worried that it will lead to serious performance fallback as we cache a totally parsed page.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] yiguolei commented on a diff in pull request #10016: [enhancement]Cache parsed page data for bitshuffle encoding

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #10016:
URL: https://github.com/apache/incubator-doris/pull/10016#discussion_r893059900


##########
be/src/olap/rowset/segment_v2/column_reader.cpp:
##########
@@ -659,20 +662,33 @@ Status FileColumnIterator::_read_data_page(const OrdinalPageIndexIterator& iter)
     Slice page_body;
     PageFooterPB footer;
     _opts.type = DATA_PAGE;
-    RETURN_IF_ERROR(_reader->read_page(_opts, iter.page(), &handle, &page_body, &footer,
-                                       _compress_codec.get()));
-    // parse data page
-    RETURN_IF_ERROR(ParsedPage::create(std::move(handle), page_body, footer.data_page_footer(),
-                                       _reader->encoding_info(), iter.page(), iter.page_index(),
-                                       &_page));
+    auto reading_data_opts = _opts;

Review Comment:
   You'd better add a comment here. Because you copied opts and opts has stats, if it is not a ptr, then something may wrong. But currently, it is right.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] yiguolei commented on a diff in pull request #10016: [enhancement]Cache parsed page data for bitshuffle encoding

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #10016:
URL: https://github.com/apache/incubator-doris/pull/10016#discussion_r893055097


##########
be/src/olap/rowset/segment_v2/parsed_page.h:
##########
@@ -67,17 +70,38 @@ struct ParsedPage {
         return Status::OK();
     }
 
-    ~ParsedPage() {
-        delete data_decoder;
-        data_decoder = nullptr;
+    ParsedPage() = default;
+
+    ParsedPage(const ParsedPage& other) {
+        page_handle = other.page_handle;
+        has_null = other.has_null;
+        null_decoder = other.null_decoder;
+        null_bitmap = other.null_bitmap;
+        other.data_decoder->clone_for_cache(data_decoder);
+        first_ordinal = other.first_ordinal;
+        num_rows = other.num_rows;
+        first_array_item_ordinal = other.first_array_item_ordinal;
+        page_pointer = other.page_pointer;
+        page_index = other.page_index;
+        offset_in_page = other.offset_in_page;
     }
 
-    PageHandle page_handle;
+    ParsedPage& operator=(ParsedPage&& other) {
+        this->~ParsedPage();
+        new (this) ParsedPage(other);
+        return *this;
+    }
+
+    ~ParsedPage() { data_decoder = nullptr; }

Review Comment:
   这么写的目的是啥?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] yiguolei commented on a diff in pull request #10016: [enhancement]Cache parsed page data for bitshuffle encoding

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #10016:
URL: https://github.com/apache/incubator-doris/pull/10016#discussion_r892256439


##########
be/src/olap/rowset/segment_v2/column_reader.cpp:
##########
@@ -679,22 +702,29 @@ Status FileColumnIterator::_read_data_page(const OrdinalPageIndexIterator& iter)
                 Slice dict_data;
                 PageFooterPB dict_footer;
                 _opts.type = INDEX_PAGE;
+                _dict_page_handle = std::make_shared<PageHandle>();

Review Comment:
   这里如果不改动,那么相当于每个iterator 会读取一次dict page,然后生成新的word info 与 dict decoder, 这个代价很低,但是我们的代码结构会很简单,我觉得这次先可以采取这个策略。



##########
be/src/olap/rowset/segment_v2/binary_dict_page.cpp:
##########
@@ -313,5 +315,23 @@ Status BinaryDictPageDecoder::next_batch(size_t* n, ColumnBlockView* dst) {
     return Status::OK();
 }
 
+BinaryDictPageDecoder* BinaryDictPageDecoder::clone_for_cache() const {
+    BinaryDictPageDecoder* new_one = new BinaryDictPageDecoder(_data, _options);
+    new_one->_parsed = true;
+    new_one->_encoding_type = _encoding_type;
+    if (_encoding_type == DICT_ENCODING) {
+        DCHECK(_bit_shuffle_ptr != nullptr);
+        ColumnVectorBatch::create(0, false, get_scalar_type_info<OLAP_FIELD_TYPE_INT>(), nullptr,

Review Comment:
   _batch is only used in non-vec engine, I think we could ignore it. And we could add a check in column_reader only open parsed page cache during vec-engine.



##########
be/src/olap/rowset/segment_v2/page_decoder.h:
##########
@@ -102,6 +102,8 @@ class PageDecoder {
 
     bool has_remaining() const { return current_index() < count(); }
 
+    virtual PageDecoder* clone_for_cache() const = 0;

Review Comment:
   可能 unreachable 是不对的,他只是一个hint, 我觉得我们得直接 LOG(FATAL)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org