You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@doris.apache.org by "cnissnzg (via GitHub)" <gi...@apache.org> on 2023/04/26 06:44:19 UTC

[GitHub] [doris] cnissnzg opened a new pull request, #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

cnissnzg opened a new pull request, #19099:
URL: https://github.com/apache/doris/pull/19099

   # Proposed changes
   
   Issue Number: #18917 
   
   ## Problem summary
   
   add sort and move aggregate code from _insert_one_row_fron_block to _collect_vskiplist_results
   
   ## Checklist(Required)
   
   * [x] Does it affect the original behavior
   * [x] Has unit tests been added
   * [x] Has document been added or modified
   * [x] Does it need to update dependencies
   * [x] Is this PR support rollback (If NO, please explain WHY)
   
   ## Further comments
   
   If this is a relatively large or complex change, kick off the discussion at [dev@doris.apache.org](mailto:dev@doris.apache.org) by explaining why you chose the solution you did and what alternatives you considered, etc...
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1522919470

   clang-tidy review says "All clean, LGTM! :+1:"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1548165105

   clang-tidy review says "All clean, LGTM! :+1:"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] hello-stephen commented on pull request #19099: [performance](load) use vector instead of skiplist when insert agg keys

Posted by "hello-stephen (via GitHub)" <gi...@apache.org>.
hello-stephen commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1558730023

   TeamCity pipeline, clickbench performance test result:
    the sum of best hot time: 32.98 seconds
    stream load tsv:          429 seconds loaded 74807831229 Bytes, about 166 MB/s
    stream load json:         22 seconds loaded 2358488459 Bytes, about 102 MB/s
    stream load orc:          59 seconds loaded 1101869774 Bytes, about 17 MB/s
    stream load parquet:          30 seconds loaded 861443392 Bytes, about 27 MB/s
    insert into select:          76.1 seconds inserted 10000000 Rows, about 131K ops/s
    https://doris-community-test-1308700295.cos.ap-hongkong.myqcloud.com/tmp/20230523075531_clickbench_pr_148803.html


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] kaijchen commented on a diff in pull request #19099: [performance](load) use vector instead of skiplist when insert agg keys

Posted by "kaijchen (via GitHub)" <gi...@apache.org>.
kaijchen commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1201597555


##########
be/src/olap/memtable.cpp:
##########
@@ -425,6 +443,7 @@ Status MemTable::flush() {
     // and use the ids to load segment data file for calc delete bitmap.
     int64_t atomic_num_segments_before_flush = _rowset_writer->get_atomic_num_segment();
     RETURN_IF_ERROR(_do_flush(duration_ns));
+    _delta_writer_callback(_merged_rows);

Review Comment:
   Why do we need this callback?
   
   AFAIK, `_merged_rows` is already increased in `DeltaWriter::_flush_memtable_async()`.
   I suppose  `_mem_table->merged_rows()` has not been updated here?
   
   ```cpp
   Status DeltaWriter::_flush_memtable_async() {
       _merged_rows += _mem_table->merged_rows();
       return _flush_token->submit(std::move(_mem_table));
   }
   ```
   
   The callstack (one of the code paths):
   
   ```cpp
   DeltaWriter::_flush_memtable_async();
   => FlushToken::submit();
   => FlushToken::_flush_memtable();
   => new MemtableFlushTask();
   => MemtableFlushTask::run();
   => MemTable::flush(); // in the latest master, equivalent to MemTable::need_flush()
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1552683256

   clang-tidy review says "All clean, LGTM! :+1:"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] dataroaring commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1558318382

   run buildall


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] cnissnzg commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "cnissnzg (via GitHub)" <gi...@apache.org>.
cnissnzg commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1556192310

   run buildall


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1554601001

   clang-tidy review says "All clean, LGTM! :+1:"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1538544445

   clang-tidy review says "All clean, LGTM! :+1:"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] dataroaring commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1555901649

   run buildall


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] kaijchen commented on a diff in pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "kaijchen (via GitHub)" <gi...@apache.org>.
kaijchen commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1200211199


##########
be/src/olap/memtable.cpp:
##########
@@ -262,108 +228,168 @@ void MemTable::_aggregate_two_row_in_block(RowInBlock* new_row, RowInBlock* row_
     }
     // dst is non-sequence row, or dst sequence is smaller
     for (uint32_t cid = _schema->num_key_columns(); cid < _num_columns; ++cid) {
-        auto col_ptr = _input_mutable_block.mutable_columns()[cid].get();
+        auto col_ptr = mutable_block.mutable_columns()[cid].get();
         _agg_functions[cid]->add(row_in_skiplist->agg_places(cid),
                                  const_cast<const doris::vectorized::IColumn**>(&col_ptr),
                                  new_row->_row_pos, nullptr);
     }
 }
-template <bool is_final>
-void MemTable::_collect_vskiplist_results() {
-    if (_keys_type == KeysType::DUP_KEYS) {
-        if (_schema->num_key_columns() > 0) {
-            _collect_dup_table_with_keys();
-        } else {
-            // skip sort if the table is dup table without keys
-            _collect_dup_table_without_keys();
-        }
-    } else {
-        VecTable::Iterator it(_vec_skip_list.get());
-        vectorized::Block in_block = _input_mutable_block.to_block();
-        size_t idx = 0;
-        for (it.SeekToFirst(); it.Valid(); it.Next()) {
-            auto& block_data = in_block.get_columns_with_type_and_name();
-            // move key columns
-            for (size_t i = 0; i < _schema->num_key_columns(); ++i) {
-                _output_mutable_block.get_column_by_position(i)->insert_from(
-                        *block_data[i].column.get(), it.key()->_row_pos);
-            }
-            // get value columns from agg_places
-            for (size_t i = _schema->num_key_columns(); i < _num_columns; ++i) {
-                auto function = _agg_functions[i];
-                auto agg_place = it.key()->agg_places(i);
-                auto col_ptr = _output_mutable_block.get_column_by_position(i).get();
-                function->insert_result_into(agg_place, *col_ptr);
-                if constexpr (is_final) {
-                    function->destroy(agg_place);
+void MemTable::prepare_block_for_flush(vectorized::Block& in_block) {
+    if (_keys_type == KeysType::DUP_KEYS && _schema->num_key_columns() == 0) {
+        // skip sort if the table is dup table without keys
+        _output_mutable_block.swap(_input_mutable_block);
+        return;
+    }
+    std::vector<int> row_pos_vec;
+    DCHECK(in_block.rows() <= std::numeric_limits<int>::max());
+    row_pos_vec.reserve(in_block.rows());
+    for (int i = 0; i < _row_in_blocks.size(); i++) {
+        row_pos_vec.emplace_back(_row_in_blocks[i]->_row_pos);
+    }
+    _output_mutable_block.add_rows(&in_block, row_pos_vec.data(),
+                                   row_pos_vec.data() + in_block.rows());
+}
+int MemTable::_sort() {
+    _vec_row_comparator->set_block(&_input_mutable_block);
+    auto new_row_it = std::next(_row_in_blocks.begin(), _last_sorted_pos);
+    size_t same_keys_num = 0;
+    bool is_dup = (_keys_type == KeysType::DUP_KEYS);
+    // sort new rows
+    std::sort(new_row_it, _row_in_blocks.end(),
+              [this, is_dup, &same_keys_num](const RowInBlock* l, const RowInBlock* r) -> bool {
+                  auto value = (*(this->_vec_row_comparator))(l, r);
+                  if (value == 0) {
+                      same_keys_num++;
+                      return is_dup ? l->_row_pos > r->_row_pos : l->_row_pos < r->_row_pos;
+                  } else {
+                      return value < 0;
+                  }
+              });

Review Comment:
   Save this lambda or use a named function to avoid repeating it twice?
   Just for better readability.



##########
be/src/olap/memtable.h:
##########
@@ -178,12 +186,18 @@ class MemTable {
     //for vectorized
     vectorized::MutableBlock _input_mutable_block;
     vectorized::MutableBlock _output_mutable_block;
+    size_t _last_sorted_pos = 0;
 
+    //return number of same keys
+    int _sort();
+    template <bool is_final>
+    void finalize_one_row(RowInBlock* row, const vectorized::ColumnsWithTypeAndName& block_data,
+                          int row_pos);
     template <bool is_final>
-    void _collect_vskiplist_results();
-    void _collect_dup_table_with_keys();
-    void _collect_dup_table_without_keys();
+    void _aggregate();
+    void prepare_block_for_flush(vectorized::Block& in_block);

Review Comment:
   ```suggestion
       void _prepare_block_for_flush(vectorized::Block& in_block);
   ```



##########
be/src/olap/memtable.h:
##########
@@ -178,12 +186,18 @@ class MemTable {
     //for vectorized
     vectorized::MutableBlock _input_mutable_block;
     vectorized::MutableBlock _output_mutable_block;
+    size_t _last_sorted_pos = 0;
 
+    //return number of same keys
+    int _sort();
+    template <bool is_final>
+    void finalize_one_row(RowInBlock* row, const vectorized::ColumnsWithTypeAndName& block_data,

Review Comment:
   Private method name should starts with an underscore
    
   ```suggestion
       void _finalize_one_row(RowInBlock* row, const vectorized::ColumnsWithTypeAndName& block_data,
   ```



##########
be/src/olap/memtable.h:
##########
@@ -178,12 +186,18 @@ class MemTable {
     //for vectorized
     vectorized::MutableBlock _input_mutable_block;
     vectorized::MutableBlock _output_mutable_block;
+    size_t _last_sorted_pos = 0;
 
+    //return number of same keys
+    int _sort();
+    template <bool is_final>
+    void finalize_one_row(RowInBlock* row, const vectorized::ColumnsWithTypeAndName& block_data,
+                          int row_pos);
     template <bool is_final>
-    void _collect_vskiplist_results();
-    void _collect_dup_table_with_keys();
-    void _collect_dup_table_without_keys();
+    void _aggregate();
+    void prepare_block_for_flush(vectorized::Block& in_block);
     bool _is_first_insertion;
+    std::function<void(int64_t)> delta_writer_callback;

Review Comment:
   ```suggestion
       std::function<void(int64_t)> _delta_writer_callback;
   ```



##########
be/src/olap/memtable.h:
##########
@@ -113,8 +123,8 @@ class MemTable {
 

Review Comment:
   This can be removed.
   
   ```cpp
   private:
       using VecTable = SkipList<RowInBlock*, RowInBlockComparator>;
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1556228998

   clang-tidy review says "All clean, LGTM! :+1:"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1556191671

   clang-tidy review says "All clean, LGTM! :+1:"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] cnissnzg commented on pull request #19099: [performance](load) use vector instead of skiplist when insert agg keys

Posted by "cnissnzg (via GitHub)" <gi...@apache.org>.
cnissnzg commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1558696513

   run buildall


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] dataroaring commented on a diff in pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1191833319


##########
be/src/olap/memtable.cpp:
##########
@@ -267,79 +218,137 @@ void MemTable::_aggregate_two_row_in_block(RowInBlock* new_row, RowInBlock* row_
                                  new_row->_row_pos, nullptr);
     }
 }
+void MemTable::_add_rows_from_block(vectorized::Block& in_block) {
+    std::vector<int> row_pos_vec;
+    DCHECK(in_block.rows() <= std::numeric_limits<int>::max());
+    row_pos_vec.reserve(in_block.rows());
+    for (int i = 0; i < _row_in_blocks.size(); i++) {
+        row_pos_vec.emplace_back(_row_in_blocks[i]->_row_pos);
+    }
+    _output_mutable_block.add_rows(&in_block, row_pos_vec.data(),
+                                   row_pos_vec.data() + in_block.rows());
+}
+int MemTable::_sort_row_in_blocks() {
+    vectorized::Block in_block = _input_mutable_block.to_block();
+    vectorized::MutableBlock mutable_block =
+            vectorized::MutableBlock::build_mutable_block(&in_block);
+    _vec_row_comparator->set_block(&mutable_block);
+    auto new_row_it = std::next(_row_in_blocks.begin(), _last_sorted_pos);
+    size_t same_keys_num = 0;
+    // sort new rows
+    std::sort(new_row_it, _row_in_blocks.end(),
+              [this, &same_keys_num](const RowInBlock* l, const RowInBlock* r) -> bool {
+                  auto value = (*(this->_vec_row_comparator))(l, r);
+                  if (value == 0) {
+                      same_keys_num++;
+                      return l->_row_pos > r->_row_pos;
+                  } else {
+                      return value < 0;
+                  }
+              });
+    // merge new rows and old rows
+    std::inplace_merge(_row_in_blocks.begin(), new_row_it, _row_in_blocks.end(),
+                       [this, &same_keys_num](const RowInBlock* l, const RowInBlock* r) -> bool {
+                           auto value = (*(this->_vec_row_comparator))(l, r);
+                           if (value == 0) {
+                               same_keys_num++;
+                               return l->_row_pos > r->_row_pos;
+                           } else {
+                               return value < 0;
+                           }
+                       });
+    _last_sorted_pos = _row_in_blocks.size();
+    return same_keys_num;
+}
+
 template <bool is_final>
-void MemTable::_collect_vskiplist_results() {
-    VecTable::Iterator it(_vec_skip_list.get());
+void MemTable::_merge_row_in_blocks() {
     vectorized::Block in_block = _input_mutable_block.to_block();
-    if (_keys_type == KeysType::DUP_KEYS) {
-        vectorized::MutableBlock mutable_block =
-                vectorized::MutableBlock::build_mutable_block(&in_block);
-        _vec_row_comparator->set_block(&mutable_block);
-        std::sort(_row_in_blocks.begin(), _row_in_blocks.end(),
-                  [this](const RowInBlock* l, const RowInBlock* r) -> bool {
-                      auto value = (*(this->_vec_row_comparator))(l, r);
-                      if (value == 0) {
-                          return l->_row_pos > r->_row_pos;
-                      } else {
-                          return value < 0;
-                      }
-                  });
-        std::vector<int> row_pos_vec;
-        DCHECK(in_block.rows() <= std::numeric_limits<int>::max());
-        row_pos_vec.reserve(in_block.rows());
-        for (int i = 0; i < _row_in_blocks.size(); i++) {
-            row_pos_vec.emplace_back(_row_in_blocks[i]->_row_pos);
-        }
-        _output_mutable_block.add_rows(&in_block, row_pos_vec.data(),
-                                       row_pos_vec.data() + in_block.rows());
-    } else {
-        size_t idx = 0;
-        for (it.SeekToFirst(); it.Valid(); it.Next()) {
-            auto& block_data = in_block.get_columns_with_type_and_name();
-            // move key columns
-            for (size_t i = 0; i < _schema->num_key_columns(); ++i) {
-                _output_mutable_block.get_column_by_position(i)->insert_from(
-                        *block_data[i].column.get(), it.key()->_row_pos);
+    vectorized::MutableBlock mutable_block =
+            vectorized::MutableBlock::build_mutable_block(&in_block);
+    _vec_row_comparator->set_block(&mutable_block);
+    std::vector<RowInBlock*> temp_row_in_blocks;
+    std::vector<RowInBlock*> agg_row_in_temp;
+    RowInBlock* prev_row;
+    temp_row_in_blocks.reserve(_last_sorted_pos);
+    agg_row_in_temp.reserve(_last_sorted_pos);
+    bool need_init_agg = true;
+    //only init agg if needed
+    for (int i = 0; i < _row_in_blocks.size(); i++) {
+        if (!temp_row_in_blocks.empty() &&
+            (*_vec_row_comparator)(prev_row, _row_in_blocks[i]) == 0) {
+            if (need_init_agg) {
+                prev_row->init_agg_places(
+                        _arena->aligned_alloc(_total_size_of_aggregate_states, 16),
+                        _offsets_of_aggregate_states.data());
+                for (auto cid = _schema->num_key_columns(); cid < _schema->num_columns(); cid++) {
+                    auto col_ptr = _input_mutable_block.mutable_columns()[cid].get();
+                    auto data = prev_row->agg_places(cid);
+                    _agg_functions[cid]->create(data);
+                    _agg_functions[cid]->add(

Review Comment:
   _agg_functions may be inited in last call of merge_row_in_block. we should push init_agg as a member of MemTable.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] dataroaring commented on a diff in pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1191833860


##########
be/src/olap/memtable.cpp:
##########
@@ -267,79 +218,137 @@ void MemTable::_aggregate_two_row_in_block(RowInBlock* new_row, RowInBlock* row_
                                  new_row->_row_pos, nullptr);
     }
 }
+void MemTable::_add_rows_from_block(vectorized::Block& in_block) {
+    std::vector<int> row_pos_vec;
+    DCHECK(in_block.rows() <= std::numeric_limits<int>::max());
+    row_pos_vec.reserve(in_block.rows());
+    for (int i = 0; i < _row_in_blocks.size(); i++) {
+        row_pos_vec.emplace_back(_row_in_blocks[i]->_row_pos);
+    }
+    _output_mutable_block.add_rows(&in_block, row_pos_vec.data(),
+                                   row_pos_vec.data() + in_block.rows());
+}
+int MemTable::_sort_row_in_blocks() {
+    vectorized::Block in_block = _input_mutable_block.to_block();
+    vectorized::MutableBlock mutable_block =
+            vectorized::MutableBlock::build_mutable_block(&in_block);
+    _vec_row_comparator->set_block(&mutable_block);
+    auto new_row_it = std::next(_row_in_blocks.begin(), _last_sorted_pos);
+    size_t same_keys_num = 0;
+    // sort new rows
+    std::sort(new_row_it, _row_in_blocks.end(),
+              [this, &same_keys_num](const RowInBlock* l, const RowInBlock* r) -> bool {
+                  auto value = (*(this->_vec_row_comparator))(l, r);
+                  if (value == 0) {
+                      same_keys_num++;
+                      return l->_row_pos > r->_row_pos;
+                  } else {
+                      return value < 0;
+                  }
+              });
+    // merge new rows and old rows
+    std::inplace_merge(_row_in_blocks.begin(), new_row_it, _row_in_blocks.end(),
+                       [this, &same_keys_num](const RowInBlock* l, const RowInBlock* r) -> bool {
+                           auto value = (*(this->_vec_row_comparator))(l, r);
+                           if (value == 0) {
+                               same_keys_num++;
+                               return l->_row_pos > r->_row_pos;
+                           } else {
+                               return value < 0;
+                           }
+                       });
+    _last_sorted_pos = _row_in_blocks.size();
+    return same_keys_num;
+}
+
 template <bool is_final>
-void MemTable::_collect_vskiplist_results() {
-    VecTable::Iterator it(_vec_skip_list.get());
+void MemTable::_merge_row_in_blocks() {

Review Comment:
   aggregate is a better name.



##########
be/src/olap/memtable.cpp:
##########
@@ -267,79 +218,137 @@ void MemTable::_aggregate_two_row_in_block(RowInBlock* new_row, RowInBlock* row_
                                  new_row->_row_pos, nullptr);
     }
 }
+void MemTable::_add_rows_from_block(vectorized::Block& in_block) {
+    std::vector<int> row_pos_vec;
+    DCHECK(in_block.rows() <= std::numeric_limits<int>::max());
+    row_pos_vec.reserve(in_block.rows());
+    for (int i = 0; i < _row_in_blocks.size(); i++) {
+        row_pos_vec.emplace_back(_row_in_blocks[i]->_row_pos);
+    }
+    _output_mutable_block.add_rows(&in_block, row_pos_vec.data(),
+                                   row_pos_vec.data() + in_block.rows());
+}
+int MemTable::_sort_row_in_blocks() {
+    vectorized::Block in_block = _input_mutable_block.to_block();
+    vectorized::MutableBlock mutable_block =
+            vectorized::MutableBlock::build_mutable_block(&in_block);
+    _vec_row_comparator->set_block(&mutable_block);
+    auto new_row_it = std::next(_row_in_blocks.begin(), _last_sorted_pos);
+    size_t same_keys_num = 0;
+    // sort new rows
+    std::sort(new_row_it, _row_in_blocks.end(),
+              [this, &same_keys_num](const RowInBlock* l, const RowInBlock* r) -> bool {
+                  auto value = (*(this->_vec_row_comparator))(l, r);
+                  if (value == 0) {
+                      same_keys_num++;
+                      return l->_row_pos > r->_row_pos;
+                  } else {
+                      return value < 0;
+                  }
+              });
+    // merge new rows and old rows
+    std::inplace_merge(_row_in_blocks.begin(), new_row_it, _row_in_blocks.end(),
+                       [this, &same_keys_num](const RowInBlock* l, const RowInBlock* r) -> bool {
+                           auto value = (*(this->_vec_row_comparator))(l, r);
+                           if (value == 0) {
+                               same_keys_num++;
+                               return l->_row_pos > r->_row_pos;
+                           } else {
+                               return value < 0;
+                           }
+                       });
+    _last_sorted_pos = _row_in_blocks.size();
+    return same_keys_num;
+}
+
 template <bool is_final>
-void MemTable::_collect_vskiplist_results() {
-    VecTable::Iterator it(_vec_skip_list.get());
+void MemTable::_merge_row_in_blocks() {

Review Comment:
   _aggregate is a better name.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] dataroaring commented on a diff in pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1194735773


##########
be/src/olap/memtable.cpp:
##########
@@ -267,102 +230,152 @@ void MemTable::_aggregate_two_row_in_block(RowInBlock* new_row, RowInBlock* row_
                                  new_row->_row_pos, nullptr);
     }
 }
-template <bool is_final>
-void MemTable::_collect_vskiplist_results() {
-    if (_keys_type == KeysType::DUP_KEYS) {
-        if (_schema->num_key_columns() > 0) {
-            _collect_dup_table_with_keys();
-        } else {
-            // skip sort if the table is dup table without keys
-            _collect_dup_table_without_keys();
-        }
-    } else {
-        VecTable::Iterator it(_vec_skip_list.get());
-        vectorized::Block in_block = _input_mutable_block.to_block();
-        size_t idx = 0;
-        for (it.SeekToFirst(); it.Valid(); it.Next()) {
-            auto& block_data = in_block.get_columns_with_type_and_name();
-            // move key columns
-            for (size_t i = 0; i < _schema->num_key_columns(); ++i) {
-                _output_mutable_block.get_column_by_position(i)->insert_from(
-                        *block_data[i].column.get(), it.key()->_row_pos);
-            }
-            // get value columns from agg_places
-            for (size_t i = _schema->num_key_columns(); i < _num_columns; ++i) {
-                auto function = _agg_functions[i];
-                auto agg_place = it.key()->agg_places(i);
-                auto col_ptr = _output_mutable_block.get_column_by_position(i).get();
-                function->insert_result_into(agg_place, *col_ptr);
-                if constexpr (is_final) {
-                    function->destroy(agg_place);
-                } else {
-                    function->reset(agg_place);
-                    function->add(agg_place,
-                                  const_cast<const doris::vectorized::IColumn**>(&col_ptr), idx,
-                                  nullptr);
-                }
-            }
-            if constexpr (!is_final) {
-                // re-index the row_pos in VSkipList
-                it.key()->_row_pos = idx;
-                idx++;
-            }
-        }
-        if constexpr (!is_final) {
-            // if is not final, we collect the agg results to input_block and then continue to insert
-            size_t shrunked_after_agg = _output_mutable_block.allocated_bytes();
-            // flush will not run here, so will not duplicate `_flush_mem_tracker`
-            _insert_mem_tracker->consume(shrunked_after_agg - _mem_usage);
-            _mem_usage = shrunked_after_agg;
-            _input_mutable_block.swap(_output_mutable_block);
-            //TODO(weixang):opt here.
-            std::unique_ptr<vectorized::Block> empty_input_block =
-                    in_block.create_same_struct_block(0);
-            _output_mutable_block =
-                    vectorized::MutableBlock::build_mutable_block(empty_input_block.get());
-            _output_mutable_block.clear_column_data();
-        }
-    }
-
-    if (is_final) {
-        _vec_skip_list.reset();
+void MemTable::prepare_block_for_flush(vectorized::Block& in_block) {
+    std::vector<int> row_pos_vec;
+    DCHECK(in_block.rows() <= std::numeric_limits<int>::max());
+    row_pos_vec.reserve(in_block.rows());
+    for (int i = 0; i < _row_in_blocks.size(); i++) {
+        row_pos_vec.emplace_back(_row_in_blocks[i]->_row_pos);
     }
+    _output_mutable_block.add_rows(&in_block, row_pos_vec.data(),
+                                   row_pos_vec.data() + in_block.rows());

Review Comment:
   we should skip memcopy for dupliate without keys like _collect_dup_table_without_keys does.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] hello-stephen commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "hello-stephen (via GitHub)" <gi...@apache.org>.
hello-stephen commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1558362444

   TeamCity pipeline, clickbench performance test result:
    the sum of best hot time: 33.55 seconds
    stream load tsv:          421 seconds loaded 74807831229 Bytes, about 169 MB/s
    stream load json:         22 seconds loaded 2358488459 Bytes, about 102 MB/s
    stream load orc:          59 seconds loaded 1101869774 Bytes, about 17 MB/s
    stream load parquet:          30 seconds loaded 861443392 Bytes, about 27 MB/s
    insert into select:          75.3 seconds inserted 10000000 Rows, about 132K ops/s
    https://doris-community-test-1308700295.cos.ap-hongkong.myqcloud.com/tmp/20230523014327_clickbench_pr_148572.html


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1554304747

   clang-tidy review says "All clean, LGTM! :+1:"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] cnissnzg commented on a diff in pull request #19099: [performance](load) use vector instead of skiplist when insert agg keys

Posted by "cnissnzg (via GitHub)" <gi...@apache.org>.
cnissnzg commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1201660560


##########
be/src/olap/memtable.cpp:
##########
@@ -425,6 +443,7 @@ Status MemTable::flush() {
     // and use the ids to load segment data file for calc delete bitmap.
     int64_t atomic_num_segments_before_flush = _rowset_writer->get_atomic_num_segment();
     RETURN_IF_ERROR(_do_flush(duration_ns));
+    _delta_writer_callback(_merged_rows);

Review Comment:
   Oh yes, I think we should remove this line in _flush_memtable_async.
   Now we only use _merged_rows to check correctness, and we can't get correct _merged_rows until flush has finished.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] dataroaring merged pull request #19099: [performance](load) use vector instead of skiplist when insert agg keys

Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring merged PR #19099:
URL: https://github.com/apache/doris/pull/19099


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1527108149

   clang-tidy review says "All clean, LGTM! :+1:"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1543595640

   clang-tidy review says "All clean, LGTM! :+1:"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1552679735

   clang-tidy review says "All clean, LGTM! :+1:"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] kaijchen commented on a diff in pull request #19099: [performance](load) use vector instead of skiplist when insert agg keys

Posted by "kaijchen (via GitHub)" <gi...@apache.org>.
kaijchen commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1201615144


##########
be/src/olap/memtable.cpp:
##########
@@ -425,6 +443,7 @@ Status MemTable::flush() {
     // and use the ids to load segment data file for calc delete bitmap.
     int64_t atomic_num_segments_before_flush = _rowset_writer->get_atomic_num_segment();
     RETURN_IF_ERROR(_do_flush(duration_ns));
+    _delta_writer_callback(_merged_rows);

Review Comment:
   OK I got it, `_mem_table->merged_rows()` was updated in `MemTable::_do_flush()`.
   
   ```cpp
       RETURN_IF_ERROR(_do_flush(duration_ns));
       delta_writer_callback(_merged_rows);
   ```
   
   The question is, will it be added twice?
   
   ```cpp
   // in DeltaWriter::_flush_memtable_async()
       _merged_rows += _mem_table->merged_rows(); // what if merged_rows() is non-zero here?
   
   // in MemTable::flush() or MemTable::need_flush()
       _delta_writer_callback(_merged_rows)
   ```
   
   I wonder if there is a better way to make this more clear to read.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] hello-stephen commented on pull request #19099: [performance](load) use vector instead of skiplist when insert agg keys

Posted by "hello-stephen (via GitHub)" <gi...@apache.org>.
hello-stephen commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1559034498

   TeamCity pipeline, clickbench performance test result:
    the sum of best hot time: 32.96 seconds
    stream load tsv:          429 seconds loaded 74807831229 Bytes, about 166 MB/s
    stream load json:         23 seconds loaded 2358488459 Bytes, about 97 MB/s
    stream load orc:          59 seconds loaded 1101869774 Bytes, about 17 MB/s
    stream load parquet:          30 seconds loaded 861443392 Bytes, about 27 MB/s
    insert into select:          76.2 seconds inserted 10000000 Rows, about 131K ops/s
    https://doris-community-test-1308700295.cos.ap-hongkong.myqcloud.com/tmp/20230523104117_clickbench_pr_148933.html


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1557025530

   clang-tidy review says "All clean, LGTM! :+1:"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] dataroaring commented on a diff in pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1191831840


##########
be/src/olap/memtable.cpp:
##########
@@ -267,79 +218,137 @@ void MemTable::_aggregate_two_row_in_block(RowInBlock* new_row, RowInBlock* row_
                                  new_row->_row_pos, nullptr);
     }
 }
+void MemTable::_add_rows_from_block(vectorized::Block& in_block) {
+    std::vector<int> row_pos_vec;
+    DCHECK(in_block.rows() <= std::numeric_limits<int>::max());
+    row_pos_vec.reserve(in_block.rows());
+    for (int i = 0; i < _row_in_blocks.size(); i++) {
+        row_pos_vec.emplace_back(_row_in_blocks[i]->_row_pos);
+    }
+    _output_mutable_block.add_rows(&in_block, row_pos_vec.data(),
+                                   row_pos_vec.data() + in_block.rows());
+}
+int MemTable::_sort_row_in_blocks() {

Review Comment:
   Name _sort is enough.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] dataroaring commented on a diff in pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1191835438


##########
be/src/olap/memtable.cpp:
##########
@@ -144,19 +144,6 @@ void MemTable::_init_agg_functions(const vectorized::Block* block) {
 }
 
 MemTable::~MemTable() {
-    if (_vec_skip_list != nullptr && _keys_type != KeysType::DUP_KEYS) {
-        VecTable::Iterator it(_vec_skip_list.get());
-        for (it.SeekToFirst(); it.Valid(); it.Next()) {
-            // We should release agg_places here, because they are not released when a
-            // load is canceled.
-            for (size_t i = _schema->num_key_columns(); i < _num_columns; ++i) {
-                auto function = _agg_functions[i];
-                DCHECK(function != nullptr);
-                DCHECK(it.key()->agg_places(i) != nullptr);
-                function->destroy(it.key()->agg_places(i));

Review Comment:
   we should destroy agg functions, otherwise memleak happens.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] dataroaring commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1553971330

   run buildall


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1556109222

   clang-tidy review says "All clean, LGTM! :+1:"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](load) use vector instead of skiplist when insert agg keys

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1558688433

   clang-tidy review says "All clean, LGTM! :+1:"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] cnissnzg commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "cnissnzg (via GitHub)" <gi...@apache.org>.
cnissnzg commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1556217650

   run buildall


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1549725821

   clang-tidy review says "All clean, LGTM! :+1:"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1527113691

   clang-tidy review says "All clean, LGTM! :+1:"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1536978322

   clang-tidy review says "All clean, LGTM! :+1:"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1548938109

   clang-tidy review says "All clean, LGTM! :+1:"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] dataroaring commented on a diff in pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1194725701


##########
be/src/olap/memtable.cpp:
##########
@@ -440,7 +453,12 @@ Status MemTable::flush() {
 Status MemTable::_do_flush(int64_t& duration_ns) {
     SCOPED_CONSUME_MEM_TRACKER(_flush_mem_tracker);
     SCOPED_RAW_TIMER(&duration_ns);
-    _collect_vskiplist_results<true>();
+    _sort();
+    if (_keys_type == KeysType::DUP_KEYS) {
+        vectorized::Block in_block = _input_mutable_block.to_block();
+        prepare_block_for_flush(in_block);
+    }
+    { _aggregate<true>(); }

Review Comment:
   else is expected?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] dataroaring commented on a diff in pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1194722810


##########
be/src/olap/memtable.cpp:
##########
@@ -144,16 +144,15 @@ void MemTable::_init_agg_functions(const vectorized::Block* block) {
 }
 
 MemTable::~MemTable() {
-    if (_vec_skip_list != nullptr && _keys_type != KeysType::DUP_KEYS) {
-        VecTable::Iterator it(_vec_skip_list.get());
-        for (it.SeekToFirst(); it.Valid(); it.Next()) {
+    if (!_row_in_blocks.empty() && _keys_type != KeysType::DUP_KEYS) {

Review Comment:
   _row_in_blocks.emtry is redunt.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] dataroaring commented on a diff in pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1194722810


##########
be/src/olap/memtable.cpp:
##########
@@ -144,16 +144,15 @@ void MemTable::_init_agg_functions(const vectorized::Block* block) {
 }
 
 MemTable::~MemTable() {
-    if (_vec_skip_list != nullptr && _keys_type != KeysType::DUP_KEYS) {
-        VecTable::Iterator it(_vec_skip_list.get());
-        for (it.SeekToFirst(); it.Valid(); it.Next()) {
+    if (!_row_in_blocks.empty() && _keys_type != KeysType::DUP_KEYS) {

Review Comment:
   _row_in_blocks.emtry is redudant.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] dataroaring commented on a diff in pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1194769944


##########
be/src/olap/memtable.cpp:
##########
@@ -267,102 +230,152 @@ void MemTable::_aggregate_two_row_in_block(RowInBlock* new_row, RowInBlock* row_
                                  new_row->_row_pos, nullptr);
     }
 }
-template <bool is_final>
-void MemTable::_collect_vskiplist_results() {
-    if (_keys_type == KeysType::DUP_KEYS) {
-        if (_schema->num_key_columns() > 0) {
-            _collect_dup_table_with_keys();
-        } else {
-            // skip sort if the table is dup table without keys
-            _collect_dup_table_without_keys();
-        }
-    } else {
-        VecTable::Iterator it(_vec_skip_list.get());
-        vectorized::Block in_block = _input_mutable_block.to_block();
-        size_t idx = 0;
-        for (it.SeekToFirst(); it.Valid(); it.Next()) {
-            auto& block_data = in_block.get_columns_with_type_and_name();
-            // move key columns
-            for (size_t i = 0; i < _schema->num_key_columns(); ++i) {
-                _output_mutable_block.get_column_by_position(i)->insert_from(
-                        *block_data[i].column.get(), it.key()->_row_pos);
-            }
-            // get value columns from agg_places
-            for (size_t i = _schema->num_key_columns(); i < _num_columns; ++i) {
-                auto function = _agg_functions[i];
-                auto agg_place = it.key()->agg_places(i);
-                auto col_ptr = _output_mutable_block.get_column_by_position(i).get();
-                function->insert_result_into(agg_place, *col_ptr);
-                if constexpr (is_final) {
-                    function->destroy(agg_place);
-                } else {
-                    function->reset(agg_place);
-                    function->add(agg_place,
-                                  const_cast<const doris::vectorized::IColumn**>(&col_ptr), idx,
-                                  nullptr);
-                }
-            }
-            if constexpr (!is_final) {
-                // re-index the row_pos in VSkipList
-                it.key()->_row_pos = idx;
-                idx++;
-            }
-        }
-        if constexpr (!is_final) {
-            // if is not final, we collect the agg results to input_block and then continue to insert
-            size_t shrunked_after_agg = _output_mutable_block.allocated_bytes();
-            // flush will not run here, so will not duplicate `_flush_mem_tracker`
-            _insert_mem_tracker->consume(shrunked_after_agg - _mem_usage);
-            _mem_usage = shrunked_after_agg;
-            _input_mutable_block.swap(_output_mutable_block);
-            //TODO(weixang):opt here.
-            std::unique_ptr<vectorized::Block> empty_input_block =
-                    in_block.create_same_struct_block(0);
-            _output_mutable_block =
-                    vectorized::MutableBlock::build_mutable_block(empty_input_block.get());
-            _output_mutable_block.clear_column_data();
-        }
-    }
-
-    if (is_final) {
-        _vec_skip_list.reset();
+void MemTable::prepare_block_for_flush(vectorized::Block& in_block) {
+    std::vector<int> row_pos_vec;
+    DCHECK(in_block.rows() <= std::numeric_limits<int>::max());
+    row_pos_vec.reserve(in_block.rows());
+    for (int i = 0; i < _row_in_blocks.size(); i++) {
+        row_pos_vec.emplace_back(_row_in_blocks[i]->_row_pos);
     }
+    _output_mutable_block.add_rows(&in_block, row_pos_vec.data(),
+                                   row_pos_vec.data() + in_block.rows());
 }
-
-void MemTable::_collect_dup_table_with_keys() {
+int MemTable::_sort() {
     vectorized::Block in_block = _input_mutable_block.to_block();
     vectorized::MutableBlock mutable_block =
             vectorized::MutableBlock::build_mutable_block(&in_block);
     _vec_row_comparator->set_block(&mutable_block);
-    std::sort(_row_in_blocks.begin(), _row_in_blocks.end(),
-              [this](const RowInBlock* l, const RowInBlock* r) -> bool {
+    auto new_row_it = std::next(_row_in_blocks.begin(), _last_sorted_pos);
+    size_t same_keys_num = 0;
+    // sort new rows
+    std::sort(new_row_it, _row_in_blocks.end(),
+              [this, &same_keys_num](const RowInBlock* l, const RowInBlock* r) -> bool {
                   auto value = (*(this->_vec_row_comparator))(l, r);
                   if (value == 0) {
+                      same_keys_num++;
                       return l->_row_pos > r->_row_pos;
                   } else {
                       return value < 0;
                   }
               });
-    std::vector<int> row_pos_vec;
-    DCHECK(in_block.rows() <= std::numeric_limits<int>::max());
-    row_pos_vec.reserve(in_block.rows());
-    for (int i = 0; i < _row_in_blocks.size(); i++) {
-        row_pos_vec.emplace_back(_row_in_blocks[i]->_row_pos);
+    // merge new rows and old rows
+    std::inplace_merge(_row_in_blocks.begin(), new_row_it, _row_in_blocks.end(),
+                       [this, &same_keys_num](const RowInBlock* l, const RowInBlock* r) -> bool {
+                           auto value = (*(this->_vec_row_comparator))(l, r);
+                           if (value == 0) {
+                               same_keys_num++;
+                               return l->_row_pos > r->_row_pos;
+                           } else {
+                               return value < 0;
+                           }
+                       });
+    _last_sorted_pos = _row_in_blocks.size();
+    return same_keys_num;
+}
+
+template <bool is_final>
+void MemTable::_aggregate_one_row(RowInBlock* row,
+                                  const vectorized::ColumnsWithTypeAndName& block_data,

Review Comment:
   finalize_one_row is a better name.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] dataroaring commented on a diff in pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1195870191


##########
be/src/olap/memtable.cpp:
##########
@@ -144,16 +144,14 @@ void MemTable::_init_agg_functions(const vectorized::Block* block) {
 }
 
 MemTable::~MemTable() {
-    if (_vec_skip_list != nullptr && _keys_type != KeysType::DUP_KEYS) {
-        VecTable::Iterator it(_vec_skip_list.get());
-        for (it.SeekToFirst(); it.Valid(); it.Next()) {
+    if (_keys_type != KeysType::DUP_KEYS) {
+        for (auto it = _row_in_blocks.begin(); it != _row_in_blocks.end(); it++) {

Review Comment:
   if (!it->has_init_agg()) {
       continue;
   }



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1556227736

   clang-tidy review says "All clean, LGTM! :+1:"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] hello-stephen commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "hello-stephen (via GitHub)" <gi...@apache.org>.
hello-stephen commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1555905698

   TeamCity pipeline, clickbench performance test result:
    the sum of best hot time: 36.15 seconds
    stream load tsv:          433 seconds loaded 74807831229 Bytes, about 164 MB/s
    stream load json:         24 seconds loaded 2358488459 Bytes, about 93 MB/s
    stream load orc:          60 seconds loaded 1101869774 Bytes, about 17 MB/s
    stream load parquet:          31 seconds loaded 861443392 Bytes, about 26 MB/s
    insert into select:          78.4 seconds inserted 10000000 Rows, about 127K ops/s
    https://doris-community-test-1308700295.cos.ap-hongkong.myqcloud.com/tmp/20230520125426_clickbench_pr_147718.html


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1527114742

   clang-tidy review says "All clean, LGTM! :+1:"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1536985861

   clang-tidy review says "All clean, LGTM! :+1:"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] dataroaring commented on a diff in pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1191832160


##########
be/src/olap/memtable.cpp:
##########
@@ -267,79 +218,137 @@ void MemTable::_aggregate_two_row_in_block(RowInBlock* new_row, RowInBlock* row_
                                  new_row->_row_pos, nullptr);
     }
 }
+void MemTable::_add_rows_from_block(vectorized::Block& in_block) {

Review Comment:
   prepare_block_for_flush is a better name



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] dataroaring commented on a diff in pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1191833538


##########
be/src/olap/memtable.cpp:
##########
@@ -267,79 +218,137 @@ void MemTable::_aggregate_two_row_in_block(RowInBlock* new_row, RowInBlock* row_
                                  new_row->_row_pos, nullptr);
     }
 }
+void MemTable::_add_rows_from_block(vectorized::Block& in_block) {
+    std::vector<int> row_pos_vec;
+    DCHECK(in_block.rows() <= std::numeric_limits<int>::max());
+    row_pos_vec.reserve(in_block.rows());
+    for (int i = 0; i < _row_in_blocks.size(); i++) {
+        row_pos_vec.emplace_back(_row_in_blocks[i]->_row_pos);
+    }
+    _output_mutable_block.add_rows(&in_block, row_pos_vec.data(),
+                                   row_pos_vec.data() + in_block.rows());
+}
+int MemTable::_sort_row_in_blocks() {
+    vectorized::Block in_block = _input_mutable_block.to_block();
+    vectorized::MutableBlock mutable_block =
+            vectorized::MutableBlock::build_mutable_block(&in_block);
+    _vec_row_comparator->set_block(&mutable_block);
+    auto new_row_it = std::next(_row_in_blocks.begin(), _last_sorted_pos);
+    size_t same_keys_num = 0;
+    // sort new rows
+    std::sort(new_row_it, _row_in_blocks.end(),
+              [this, &same_keys_num](const RowInBlock* l, const RowInBlock* r) -> bool {
+                  auto value = (*(this->_vec_row_comparator))(l, r);
+                  if (value == 0) {
+                      same_keys_num++;
+                      return l->_row_pos > r->_row_pos;
+                  } else {
+                      return value < 0;
+                  }
+              });
+    // merge new rows and old rows
+    std::inplace_merge(_row_in_blocks.begin(), new_row_it, _row_in_blocks.end(),
+                       [this, &same_keys_num](const RowInBlock* l, const RowInBlock* r) -> bool {
+                           auto value = (*(this->_vec_row_comparator))(l, r);
+                           if (value == 0) {
+                               same_keys_num++;
+                               return l->_row_pos > r->_row_pos;
+                           } else {
+                               return value < 0;
+                           }
+                       });
+    _last_sorted_pos = _row_in_blocks.size();
+    return same_keys_num;
+}
+
 template <bool is_final>
-void MemTable::_collect_vskiplist_results() {
-    VecTable::Iterator it(_vec_skip_list.get());
+void MemTable::_merge_row_in_blocks() {
     vectorized::Block in_block = _input_mutable_block.to_block();
-    if (_keys_type == KeysType::DUP_KEYS) {
-        vectorized::MutableBlock mutable_block =
-                vectorized::MutableBlock::build_mutable_block(&in_block);
-        _vec_row_comparator->set_block(&mutable_block);
-        std::sort(_row_in_blocks.begin(), _row_in_blocks.end(),
-                  [this](const RowInBlock* l, const RowInBlock* r) -> bool {
-                      auto value = (*(this->_vec_row_comparator))(l, r);
-                      if (value == 0) {
-                          return l->_row_pos > r->_row_pos;
-                      } else {
-                          return value < 0;
-                      }
-                  });
-        std::vector<int> row_pos_vec;
-        DCHECK(in_block.rows() <= std::numeric_limits<int>::max());
-        row_pos_vec.reserve(in_block.rows());
-        for (int i = 0; i < _row_in_blocks.size(); i++) {
-            row_pos_vec.emplace_back(_row_in_blocks[i]->_row_pos);
-        }
-        _output_mutable_block.add_rows(&in_block, row_pos_vec.data(),
-                                       row_pos_vec.data() + in_block.rows());
-    } else {
-        size_t idx = 0;
-        for (it.SeekToFirst(); it.Valid(); it.Next()) {
-            auto& block_data = in_block.get_columns_with_type_and_name();
-            // move key columns
-            for (size_t i = 0; i < _schema->num_key_columns(); ++i) {
-                _output_mutable_block.get_column_by_position(i)->insert_from(
-                        *block_data[i].column.get(), it.key()->_row_pos);
+    vectorized::MutableBlock mutable_block =
+            vectorized::MutableBlock::build_mutable_block(&in_block);
+    _vec_row_comparator->set_block(&mutable_block);
+    std::vector<RowInBlock*> temp_row_in_blocks;
+    std::vector<RowInBlock*> agg_row_in_temp;
+    RowInBlock* prev_row;
+    temp_row_in_blocks.reserve(_last_sorted_pos);
+    agg_row_in_temp.reserve(_last_sorted_pos);
+    bool need_init_agg = true;
+    //only init agg if needed
+    for (int i = 0; i < _row_in_blocks.size(); i++) {
+        if (!temp_row_in_blocks.empty() &&
+            (*_vec_row_comparator)(prev_row, _row_in_blocks[i]) == 0) {
+            if (need_init_agg) {
+                prev_row->init_agg_places(
+                        _arena->aligned_alloc(_total_size_of_aggregate_states, 16),
+                        _offsets_of_aggregate_states.data());
+                for (auto cid = _schema->num_key_columns(); cid < _schema->num_columns(); cid++) {
+                    auto col_ptr = _input_mutable_block.mutable_columns()[cid].get();
+                    auto data = prev_row->agg_places(cid);
+                    _agg_functions[cid]->create(data);
+                    _agg_functions[cid]->add(

Review Comment:
   And we put init code in a function called init_agg_functions()



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1556820324

   PR approved by at least one committer and no changes requested.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] dataroaring commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1556125190

   run buildall


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1550616348

   clang-tidy review says "All clean, LGTM! :+1:"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1550769631

   clang-tidy review says "All clean, LGTM! :+1:"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1550895246

   PR approved by anyone and no changes requested.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1554172274

   clang-tidy review says "All clean, LGTM! :+1:"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] cnissnzg commented on pull request #19099: [performance](load) use vector instead of skiplist when insert agg keys

Posted by "cnissnzg (via GitHub)" <gi...@apache.org>.
cnissnzg commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1558989286

   run buildall


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1539318519

   clang-tidy review says "All clean, LGTM! :+1:"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] dataroaring commented on a diff in pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1191834809


##########
be/src/olap/memtable.cpp:
##########
@@ -267,79 +218,137 @@ void MemTable::_aggregate_two_row_in_block(RowInBlock* new_row, RowInBlock* row_
                                  new_row->_row_pos, nullptr);
     }
 }
+void MemTable::_add_rows_from_block(vectorized::Block& in_block) {
+    std::vector<int> row_pos_vec;
+    DCHECK(in_block.rows() <= std::numeric_limits<int>::max());
+    row_pos_vec.reserve(in_block.rows());
+    for (int i = 0; i < _row_in_blocks.size(); i++) {
+        row_pos_vec.emplace_back(_row_in_blocks[i]->_row_pos);
+    }
+    _output_mutable_block.add_rows(&in_block, row_pos_vec.data(),
+                                   row_pos_vec.data() + in_block.rows());
+}
+int MemTable::_sort_row_in_blocks() {
+    vectorized::Block in_block = _input_mutable_block.to_block();
+    vectorized::MutableBlock mutable_block =
+            vectorized::MutableBlock::build_mutable_block(&in_block);
+    _vec_row_comparator->set_block(&mutable_block);
+    auto new_row_it = std::next(_row_in_blocks.begin(), _last_sorted_pos);
+    size_t same_keys_num = 0;
+    // sort new rows
+    std::sort(new_row_it, _row_in_blocks.end(),
+              [this, &same_keys_num](const RowInBlock* l, const RowInBlock* r) -> bool {
+                  auto value = (*(this->_vec_row_comparator))(l, r);
+                  if (value == 0) {
+                      same_keys_num++;
+                      return l->_row_pos > r->_row_pos;
+                  } else {
+                      return value < 0;
+                  }
+              });
+    // merge new rows and old rows
+    std::inplace_merge(_row_in_blocks.begin(), new_row_it, _row_in_blocks.end(),
+                       [this, &same_keys_num](const RowInBlock* l, const RowInBlock* r) -> bool {
+                           auto value = (*(this->_vec_row_comparator))(l, r);
+                           if (value == 0) {
+                               same_keys_num++;
+                               return l->_row_pos > r->_row_pos;
+                           } else {
+                               return value < 0;
+                           }
+                       });
+    _last_sorted_pos = _row_in_blocks.size();
+    return same_keys_num;
+}
+
 template <bool is_final>
-void MemTable::_collect_vskiplist_results() {
-    VecTable::Iterator it(_vec_skip_list.get());
+void MemTable::_merge_row_in_blocks() {
     vectorized::Block in_block = _input_mutable_block.to_block();
-    if (_keys_type == KeysType::DUP_KEYS) {
-        vectorized::MutableBlock mutable_block =
-                vectorized::MutableBlock::build_mutable_block(&in_block);
-        _vec_row_comparator->set_block(&mutable_block);
-        std::sort(_row_in_blocks.begin(), _row_in_blocks.end(),
-                  [this](const RowInBlock* l, const RowInBlock* r) -> bool {
-                      auto value = (*(this->_vec_row_comparator))(l, r);
-                      if (value == 0) {
-                          return l->_row_pos > r->_row_pos;
-                      } else {
-                          return value < 0;
-                      }
-                  });
-        std::vector<int> row_pos_vec;
-        DCHECK(in_block.rows() <= std::numeric_limits<int>::max());
-        row_pos_vec.reserve(in_block.rows());
-        for (int i = 0; i < _row_in_blocks.size(); i++) {
-            row_pos_vec.emplace_back(_row_in_blocks[i]->_row_pos);
-        }
-        _output_mutable_block.add_rows(&in_block, row_pos_vec.data(),
-                                       row_pos_vec.data() + in_block.rows());
-    } else {
-        size_t idx = 0;
-        for (it.SeekToFirst(); it.Valid(); it.Next()) {
-            auto& block_data = in_block.get_columns_with_type_and_name();
-            // move key columns
-            for (size_t i = 0; i < _schema->num_key_columns(); ++i) {
-                _output_mutable_block.get_column_by_position(i)->insert_from(
-                        *block_data[i].column.get(), it.key()->_row_pos);
+    vectorized::MutableBlock mutable_block =
+            vectorized::MutableBlock::build_mutable_block(&in_block);
+    _vec_row_comparator->set_block(&mutable_block);
+    std::vector<RowInBlock*> temp_row_in_blocks;
+    std::vector<RowInBlock*> agg_row_in_temp;
+    RowInBlock* prev_row;
+    temp_row_in_blocks.reserve(_last_sorted_pos);
+    agg_row_in_temp.reserve(_last_sorted_pos);
+    bool need_init_agg = true;
+    //only init agg if needed
+    for (int i = 0; i < _row_in_blocks.size(); i++) {
+        if (!temp_row_in_blocks.empty() &&
+            (*_vec_row_comparator)(prev_row, _row_in_blocks[i]) == 0) {
+            if (need_init_agg) {
+                prev_row->init_agg_places(
+                        _arena->aligned_alloc(_total_size_of_aggregate_states, 16),
+                        _offsets_of_aggregate_states.data());
+                for (auto cid = _schema->num_key_columns(); cid < _schema->num_columns(); cid++) {
+                    auto col_ptr = _input_mutable_block.mutable_columns()[cid].get();
+                    auto data = prev_row->agg_places(cid);
+                    _agg_functions[cid]->create(data);
+                    _agg_functions[cid]->add(
+                            data, const_cast<const doris::vectorized::IColumn**>(&col_ptr),
+                            prev_row->_row_pos, nullptr);
+                }
+                agg_row_in_temp.push_back(prev_row);
+                need_init_agg = false;
             }
+            _merged_rows++;
+            _aggregate_two_row_in_block(_row_in_blocks[i], prev_row);
+        } else {
+            prev_row = _row_in_blocks[i];
+            temp_row_in_blocks.push_back(prev_row);
+            need_init_agg = true;
+        }
+    }

Review Comment:
   we should stat merged rows in one run and we can optimize move operations below.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] dataroaring commented on a diff in pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1194759204


##########
be/src/olap/memtable.cpp:
##########
@@ -144,16 +144,15 @@ void MemTable::_init_agg_functions(const vectorized::Block* block) {
 }
 
 MemTable::~MemTable() {
-    if (_vec_skip_list != nullptr && _keys_type != KeysType::DUP_KEYS) {
-        VecTable::Iterator it(_vec_skip_list.get());
-        for (it.SeekToFirst(); it.Valid(); it.Next()) {
+    if (!_row_in_blocks.empty() && _keys_type != KeysType::DUP_KEYS) {
+        for (auto it = _row_in_blocks.begin(); it != _row_in_blocks.end(); it++) {
             // We should release agg_places here, because they are not released when a
             // load is canceled.
             for (size_t i = _schema->num_key_columns(); i < _num_columns; ++i) {
                 auto function = _agg_functions[i];
                 DCHECK(function != nullptr);
-                DCHECK(it.key()->agg_places(i) != nullptr);
-                function->destroy(it.key()->agg_places(i));
+                DCHECK((*it)->agg_places(i) != nullptr);
+                function->destroy((*it)->agg_places(i));

Review Comment:
   If writes stop and just destruct then, there is no agg_places, dcheck here does not work.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1549716280

   clang-tidy review says "All clean, LGTM! :+1:"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] cnissnzg commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "cnissnzg (via GitHub)" <gi...@apache.org>.
cnissnzg commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1556229080

   run buildall


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] kaijchen commented on a diff in pull request #19099: [performance](load) use vector instead of skiplist when insert agg keys

Posted by "kaijchen (via GitHub)" <gi...@apache.org>.
kaijchen commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1201597555


##########
be/src/olap/memtable.cpp:
##########
@@ -425,6 +443,7 @@ Status MemTable::flush() {
     // and use the ids to load segment data file for calc delete bitmap.
     int64_t atomic_num_segments_before_flush = _rowset_writer->get_atomic_num_segment();
     RETURN_IF_ERROR(_do_flush(duration_ns));
+    _delta_writer_callback(_merged_rows);

Review Comment:
   Why do we need this callback?
   
   AFAIK, `_merged_rows` is already increased in 
   
   ```cpp
   Status DeltaWriter::_flush_memtable_async() {
       _merged_rows += _mem_table->merged_rows();
       return _flush_token->submit(std::move(_mem_table));
   }
   ```
   
   The callstack (one of the code paths):
   
   ```cpp
   DeltaWriter::_flush_memtable_async();
   => FlushToken::submit();
   => FlushToken::_flush_memtable();
   => new MemtableFlushTask();
   => MemtableFlushTask::run();
   => MemTable::flush(); // in the latest master, equivalent to MemTable::need_flush()
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1553910586

   clang-tidy review says "All clean, LGTM! :+1:"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1550895105

   PR approved by at least one committer and no changes requested.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] dataroaring commented on a diff in pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1197710765


##########
be/src/olap/memtable.cpp:
##########
@@ -440,7 +463,13 @@ Status MemTable::flush() {
 Status MemTable::_do_flush(int64_t& duration_ns) {
     SCOPED_CONSUME_MEM_TRACKER(_flush_mem_tracker);
     SCOPED_RAW_TIMER(&duration_ns);
-    _collect_vskiplist_results<true>();
+    _sort();
+    if (_keys_type == KeysType::DUP_KEYS) {

Review Comment:
   int same_keys_num = _sort();
   if (_keys_type == KeysType::DUP_KEYS || same_keys_num == 0)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] hello-stephen commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "hello-stephen (via GitHub)" <gi...@apache.org>.
hello-stephen commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1556129856

   TeamCity pipeline, clickbench performance test result:
    the sum of best hot time: 36.22 seconds
    stream load tsv:          430 seconds loaded 74807831229 Bytes, about 165 MB/s
    stream load json:         21 seconds loaded 2358488459 Bytes, about 107 MB/s
    stream load orc:          61 seconds loaded 1101869774 Bytes, about 17 MB/s
    stream load parquet:          30 seconds loaded 861443392 Bytes, about 27 MB/s
    insert into select:          78.0 seconds inserted 10000000 Rows, about 128K ops/s
    https://doris-community-test-1308700295.cos.ap-hongkong.myqcloud.com/tmp/20230521092756_clickbench_pr_147843.html


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] dataroaring commented on a diff in pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1184600386


##########
be/src/olap/memtable.cpp:
##########
@@ -286,32 +292,90 @@ void MemTable::_collect_vskiplist_results() {
         _output_mutable_block.add_rows(&in_block, row_pos_vec.data(),
                                        row_pos_vec.data() + in_block.rows());
     } else {
-        size_t idx = 0;
-        for (it.SeekToFirst(); it.Valid(); it.Next()) {
+        size_t idx = 0, temp_idx = 0;
+        vectorized::MutableBlock mutable_block =
+                vectorized::MutableBlock::build_mutable_block(&in_block);
+        _vec_row_comparator->set_block(&mutable_block);
+        auto new_row_it = std::next(_row_in_blocks.begin(), _last_sorted_pos);
+        // sort new rows
+        std::sort(new_row_it, _row_in_blocks.end(),
+                  [this](const RowInBlock* l, const RowInBlock* r) -> bool {
+                      auto value = (*(this->_vec_row_comparator))(l, r);
+                      if (value == 0) {
+                          return l->_row_pos > r->_row_pos;

Review Comment:
   add a stats here, e.g. same_keys_num++;



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] dataroaring commented on a diff in pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1184600830


##########
be/src/olap/memtable.cpp:
##########
@@ -286,32 +292,90 @@ void MemTable::_collect_vskiplist_results() {
         _output_mutable_block.add_rows(&in_block, row_pos_vec.data(),
                                        row_pos_vec.data() + in_block.rows());
     } else {
-        size_t idx = 0;
-        for (it.SeekToFirst(); it.Valid(); it.Next()) {
+        size_t idx = 0, temp_idx = 0;
+        vectorized::MutableBlock mutable_block =
+                vectorized::MutableBlock::build_mutable_block(&in_block);
+        _vec_row_comparator->set_block(&mutable_block);
+        auto new_row_it = std::next(_row_in_blocks.begin(), _last_sorted_pos);
+        // sort new rows
+        std::sort(new_row_it, _row_in_blocks.end(),
+                  [this](const RowInBlock* l, const RowInBlock* r) -> bool {
+                      auto value = (*(this->_vec_row_comparator))(l, r);
+                      if (value == 0) {
+                          return l->_row_pos > r->_row_pos;
+                      } else {
+                          return value < 0;
+                      }
+                  });
+        // merge new rows and old rows
+        std::inplace_merge(_row_in_blocks.begin(), new_row_it, _row_in_blocks.end());
+        _last_sorted_pos = _row_in_blocks.size();
+        std::vector<RowInBlock*> temp_row_in_blocks(_last_sorted_pos);
+        std::vector<size_t> need_merge_idxs; // use bitset?
+        bool need_init_agg = true;

Review Comment:
   if (same_keys_num == 0) skip agg.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] dataroaring commented on a diff in pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1184606638


##########
be/src/olap/memtable.cpp:
##########
@@ -21,10 +21,13 @@
 #include <gen_cpp/olap_file.pb.h>
 
 #include <algorithm>
+#include <bitset>

Review Comment:
   bitset is not used.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1556218016

   clang-tidy review says "All clean, LGTM! :+1:"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] kaijchen commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "kaijchen (via GitHub)" <gi...@apache.org>.
kaijchen commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1556617784

   Test result of stream load into unique key table with 3 replicas.
   
   | test suite | before this change | after this change |
   | --- | --- | --- |
   | 7.5GB, 1 bucket, single_replica_read = false | 160s | 130s |
   | 96GB, 960 bucket, single_replica_read = false | 6189s | 1360s |
   | 96GB, 960 bucket, single_replica_read = true | 2318s | 919s |


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1554500638

   clang-tidy review says "All clean, LGTM! :+1:"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] kaijchen commented on a diff in pull request #19099: [performance](load) use vector instead of skiplist when insert agg keys

Posted by "kaijchen (via GitHub)" <gi...@apache.org>.
kaijchen commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1201597555


##########
be/src/olap/memtable.cpp:
##########
@@ -425,6 +443,7 @@ Status MemTable::flush() {
     // and use the ids to load segment data file for calc delete bitmap.
     int64_t atomic_num_segments_before_flush = _rowset_writer->get_atomic_num_segment();
     RETURN_IF_ERROR(_do_flush(duration_ns));
+    _delta_writer_callback(_merged_rows);

Review Comment:
   Why do we need this callback?
   
   AFAIK, `_merged_rows` is already increased here.
   I suppose  `_mem_table->merged_rows()` has not been updated here?
   
   ```cpp
   Status DeltaWriter::_flush_memtable_async() {
       _merged_rows += _mem_table->merged_rows();
       return _flush_token->submit(std::move(_mem_table));
   }
   ```
   
   The callstack (one of the code paths):
   
   ```cpp
   DeltaWriter::_flush_memtable_async();
   => FlushToken::submit();
   => FlushToken::_flush_memtable();
   => new MemtableFlushTask();
   => MemtableFlushTask::run();
   => MemTable::flush(); // in the latest master, equivalent to MemTable::need_flush()
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] dataroaring commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1550687207

   run buildall


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1552685795

   clang-tidy review says "All clean, LGTM! :+1:"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] dataroaring commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1550608252

   run buildall


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1522883703

   clang-tidy review says "All clean, LGTM! :+1:"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org