You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@doris.apache.org by "cnissnzg (via GitHub)" <gi...@apache.org> on 2023/04/26 06:44:19 UTC
[GitHub] [doris] cnissnzg opened a new pull request, #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
cnissnzg opened a new pull request, #19099:
URL: https://github.com/apache/doris/pull/19099
# Proposed changes
Issue Number: #18917
## Problem summary
add sort and move aggregate code from _insert_one_row_fron_block to _collect_vskiplist_results
## Checklist(Required)
* [x] Does it affect the original behavior
* [x] Has unit tests been added
* [x] Has document been added or modified
* [x] Does it need to update dependencies
* [x] Is this PR support rollback (If NO, please explain WHY)
## Further comments
If this is a relatively large or complex change, kick off the discussion at [dev@doris.apache.org](mailto:dev@doris.apache.org) by explaining why you chose the solution you did and what alternatives you considered, etc...
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1522919470
clang-tidy review says "All clean, LGTM! :+1:"
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1548165105
clang-tidy review says "All clean, LGTM! :+1:"
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] hello-stephen commented on pull request #19099: [performance](load) use vector instead of skiplist when insert agg keys
Posted by "hello-stephen (via GitHub)" <gi...@apache.org>.
hello-stephen commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1558730023
TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 32.98 seconds
stream load tsv: 429 seconds loaded 74807831229 Bytes, about 166 MB/s
stream load json: 22 seconds loaded 2358488459 Bytes, about 102 MB/s
stream load orc: 59 seconds loaded 1101869774 Bytes, about 17 MB/s
stream load parquet: 30 seconds loaded 861443392 Bytes, about 27 MB/s
insert into select: 76.1 seconds inserted 10000000 Rows, about 131K ops/s
https://doris-community-test-1308700295.cos.ap-hongkong.myqcloud.com/tmp/20230523075531_clickbench_pr_148803.html
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] kaijchen commented on a diff in pull request #19099: [performance](load) use vector instead of skiplist when insert agg keys
Posted by "kaijchen (via GitHub)" <gi...@apache.org>.
kaijchen commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1201597555
##########
be/src/olap/memtable.cpp:
##########
@@ -425,6 +443,7 @@ Status MemTable::flush() {
// and use the ids to load segment data file for calc delete bitmap.
int64_t atomic_num_segments_before_flush = _rowset_writer->get_atomic_num_segment();
RETURN_IF_ERROR(_do_flush(duration_ns));
+ _delta_writer_callback(_merged_rows);
Review Comment:
Why do we need this callback?
AFAIK, `_merged_rows` is already increased in `DeltaWriter::_flush_memtable_async()`.
I suppose `_mem_table->merged_rows()` has not been updated here?
```cpp
Status DeltaWriter::_flush_memtable_async() {
_merged_rows += _mem_table->merged_rows();
return _flush_token->submit(std::move(_mem_table));
}
```
The callstack (one of the code paths):
```cpp
DeltaWriter::_flush_memtable_async();
=> FlushToken::submit();
=> FlushToken::_flush_memtable();
=> new MemtableFlushTask();
=> MemtableFlushTask::run();
=> MemTable::flush(); // in the latest master, equivalent to MemTable::need_flush()
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1552683256
clang-tidy review says "All clean, LGTM! :+1:"
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] dataroaring commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1558318382
run buildall
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] cnissnzg commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "cnissnzg (via GitHub)" <gi...@apache.org>.
cnissnzg commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1556192310
run buildall
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1554601001
clang-tidy review says "All clean, LGTM! :+1:"
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1538544445
clang-tidy review says "All clean, LGTM! :+1:"
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] dataroaring commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1555901649
run buildall
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] kaijchen commented on a diff in pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "kaijchen (via GitHub)" <gi...@apache.org>.
kaijchen commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1200211199
##########
be/src/olap/memtable.cpp:
##########
@@ -262,108 +228,168 @@ void MemTable::_aggregate_two_row_in_block(RowInBlock* new_row, RowInBlock* row_
}
// dst is non-sequence row, or dst sequence is smaller
for (uint32_t cid = _schema->num_key_columns(); cid < _num_columns; ++cid) {
- auto col_ptr = _input_mutable_block.mutable_columns()[cid].get();
+ auto col_ptr = mutable_block.mutable_columns()[cid].get();
_agg_functions[cid]->add(row_in_skiplist->agg_places(cid),
const_cast<const doris::vectorized::IColumn**>(&col_ptr),
new_row->_row_pos, nullptr);
}
}
-template <bool is_final>
-void MemTable::_collect_vskiplist_results() {
- if (_keys_type == KeysType::DUP_KEYS) {
- if (_schema->num_key_columns() > 0) {
- _collect_dup_table_with_keys();
- } else {
- // skip sort if the table is dup table without keys
- _collect_dup_table_without_keys();
- }
- } else {
- VecTable::Iterator it(_vec_skip_list.get());
- vectorized::Block in_block = _input_mutable_block.to_block();
- size_t idx = 0;
- for (it.SeekToFirst(); it.Valid(); it.Next()) {
- auto& block_data = in_block.get_columns_with_type_and_name();
- // move key columns
- for (size_t i = 0; i < _schema->num_key_columns(); ++i) {
- _output_mutable_block.get_column_by_position(i)->insert_from(
- *block_data[i].column.get(), it.key()->_row_pos);
- }
- // get value columns from agg_places
- for (size_t i = _schema->num_key_columns(); i < _num_columns; ++i) {
- auto function = _agg_functions[i];
- auto agg_place = it.key()->agg_places(i);
- auto col_ptr = _output_mutable_block.get_column_by_position(i).get();
- function->insert_result_into(agg_place, *col_ptr);
- if constexpr (is_final) {
- function->destroy(agg_place);
+void MemTable::prepare_block_for_flush(vectorized::Block& in_block) {
+ if (_keys_type == KeysType::DUP_KEYS && _schema->num_key_columns() == 0) {
+ // skip sort if the table is dup table without keys
+ _output_mutable_block.swap(_input_mutable_block);
+ return;
+ }
+ std::vector<int> row_pos_vec;
+ DCHECK(in_block.rows() <= std::numeric_limits<int>::max());
+ row_pos_vec.reserve(in_block.rows());
+ for (int i = 0; i < _row_in_blocks.size(); i++) {
+ row_pos_vec.emplace_back(_row_in_blocks[i]->_row_pos);
+ }
+ _output_mutable_block.add_rows(&in_block, row_pos_vec.data(),
+ row_pos_vec.data() + in_block.rows());
+}
+int MemTable::_sort() {
+ _vec_row_comparator->set_block(&_input_mutable_block);
+ auto new_row_it = std::next(_row_in_blocks.begin(), _last_sorted_pos);
+ size_t same_keys_num = 0;
+ bool is_dup = (_keys_type == KeysType::DUP_KEYS);
+ // sort new rows
+ std::sort(new_row_it, _row_in_blocks.end(),
+ [this, is_dup, &same_keys_num](const RowInBlock* l, const RowInBlock* r) -> bool {
+ auto value = (*(this->_vec_row_comparator))(l, r);
+ if (value == 0) {
+ same_keys_num++;
+ return is_dup ? l->_row_pos > r->_row_pos : l->_row_pos < r->_row_pos;
+ } else {
+ return value < 0;
+ }
+ });
Review Comment:
Save this lambda or use a named function to avoid repeating it twice?
Just for better readability.
##########
be/src/olap/memtable.h:
##########
@@ -178,12 +186,18 @@ class MemTable {
//for vectorized
vectorized::MutableBlock _input_mutable_block;
vectorized::MutableBlock _output_mutable_block;
+ size_t _last_sorted_pos = 0;
+ //return number of same keys
+ int _sort();
+ template <bool is_final>
+ void finalize_one_row(RowInBlock* row, const vectorized::ColumnsWithTypeAndName& block_data,
+ int row_pos);
template <bool is_final>
- void _collect_vskiplist_results();
- void _collect_dup_table_with_keys();
- void _collect_dup_table_without_keys();
+ void _aggregate();
+ void prepare_block_for_flush(vectorized::Block& in_block);
Review Comment:
```suggestion
void _prepare_block_for_flush(vectorized::Block& in_block);
```
##########
be/src/olap/memtable.h:
##########
@@ -178,12 +186,18 @@ class MemTable {
//for vectorized
vectorized::MutableBlock _input_mutable_block;
vectorized::MutableBlock _output_mutable_block;
+ size_t _last_sorted_pos = 0;
+ //return number of same keys
+ int _sort();
+ template <bool is_final>
+ void finalize_one_row(RowInBlock* row, const vectorized::ColumnsWithTypeAndName& block_data,
Review Comment:
Private method name should starts with an underscore
```suggestion
void _finalize_one_row(RowInBlock* row, const vectorized::ColumnsWithTypeAndName& block_data,
```
##########
be/src/olap/memtable.h:
##########
@@ -178,12 +186,18 @@ class MemTable {
//for vectorized
vectorized::MutableBlock _input_mutable_block;
vectorized::MutableBlock _output_mutable_block;
+ size_t _last_sorted_pos = 0;
+ //return number of same keys
+ int _sort();
+ template <bool is_final>
+ void finalize_one_row(RowInBlock* row, const vectorized::ColumnsWithTypeAndName& block_data,
+ int row_pos);
template <bool is_final>
- void _collect_vskiplist_results();
- void _collect_dup_table_with_keys();
- void _collect_dup_table_without_keys();
+ void _aggregate();
+ void prepare_block_for_flush(vectorized::Block& in_block);
bool _is_first_insertion;
+ std::function<void(int64_t)> delta_writer_callback;
Review Comment:
```suggestion
std::function<void(int64_t)> _delta_writer_callback;
```
##########
be/src/olap/memtable.h:
##########
@@ -113,8 +123,8 @@ class MemTable {
Review Comment:
This can be removed.
```cpp
private:
using VecTable = SkipList<RowInBlock*, RowInBlockComparator>;
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1556228998
clang-tidy review says "All clean, LGTM! :+1:"
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1556191671
clang-tidy review says "All clean, LGTM! :+1:"
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] cnissnzg commented on pull request #19099: [performance](load) use vector instead of skiplist when insert agg keys
Posted by "cnissnzg (via GitHub)" <gi...@apache.org>.
cnissnzg commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1558696513
run buildall
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] dataroaring commented on a diff in pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1191833319
##########
be/src/olap/memtable.cpp:
##########
@@ -267,79 +218,137 @@ void MemTable::_aggregate_two_row_in_block(RowInBlock* new_row, RowInBlock* row_
new_row->_row_pos, nullptr);
}
}
+void MemTable::_add_rows_from_block(vectorized::Block& in_block) {
+ std::vector<int> row_pos_vec;
+ DCHECK(in_block.rows() <= std::numeric_limits<int>::max());
+ row_pos_vec.reserve(in_block.rows());
+ for (int i = 0; i < _row_in_blocks.size(); i++) {
+ row_pos_vec.emplace_back(_row_in_blocks[i]->_row_pos);
+ }
+ _output_mutable_block.add_rows(&in_block, row_pos_vec.data(),
+ row_pos_vec.data() + in_block.rows());
+}
+int MemTable::_sort_row_in_blocks() {
+ vectorized::Block in_block = _input_mutable_block.to_block();
+ vectorized::MutableBlock mutable_block =
+ vectorized::MutableBlock::build_mutable_block(&in_block);
+ _vec_row_comparator->set_block(&mutable_block);
+ auto new_row_it = std::next(_row_in_blocks.begin(), _last_sorted_pos);
+ size_t same_keys_num = 0;
+ // sort new rows
+ std::sort(new_row_it, _row_in_blocks.end(),
+ [this, &same_keys_num](const RowInBlock* l, const RowInBlock* r) -> bool {
+ auto value = (*(this->_vec_row_comparator))(l, r);
+ if (value == 0) {
+ same_keys_num++;
+ return l->_row_pos > r->_row_pos;
+ } else {
+ return value < 0;
+ }
+ });
+ // merge new rows and old rows
+ std::inplace_merge(_row_in_blocks.begin(), new_row_it, _row_in_blocks.end(),
+ [this, &same_keys_num](const RowInBlock* l, const RowInBlock* r) -> bool {
+ auto value = (*(this->_vec_row_comparator))(l, r);
+ if (value == 0) {
+ same_keys_num++;
+ return l->_row_pos > r->_row_pos;
+ } else {
+ return value < 0;
+ }
+ });
+ _last_sorted_pos = _row_in_blocks.size();
+ return same_keys_num;
+}
+
template <bool is_final>
-void MemTable::_collect_vskiplist_results() {
- VecTable::Iterator it(_vec_skip_list.get());
+void MemTable::_merge_row_in_blocks() {
vectorized::Block in_block = _input_mutable_block.to_block();
- if (_keys_type == KeysType::DUP_KEYS) {
- vectorized::MutableBlock mutable_block =
- vectorized::MutableBlock::build_mutable_block(&in_block);
- _vec_row_comparator->set_block(&mutable_block);
- std::sort(_row_in_blocks.begin(), _row_in_blocks.end(),
- [this](const RowInBlock* l, const RowInBlock* r) -> bool {
- auto value = (*(this->_vec_row_comparator))(l, r);
- if (value == 0) {
- return l->_row_pos > r->_row_pos;
- } else {
- return value < 0;
- }
- });
- std::vector<int> row_pos_vec;
- DCHECK(in_block.rows() <= std::numeric_limits<int>::max());
- row_pos_vec.reserve(in_block.rows());
- for (int i = 0; i < _row_in_blocks.size(); i++) {
- row_pos_vec.emplace_back(_row_in_blocks[i]->_row_pos);
- }
- _output_mutable_block.add_rows(&in_block, row_pos_vec.data(),
- row_pos_vec.data() + in_block.rows());
- } else {
- size_t idx = 0;
- for (it.SeekToFirst(); it.Valid(); it.Next()) {
- auto& block_data = in_block.get_columns_with_type_and_name();
- // move key columns
- for (size_t i = 0; i < _schema->num_key_columns(); ++i) {
- _output_mutable_block.get_column_by_position(i)->insert_from(
- *block_data[i].column.get(), it.key()->_row_pos);
+ vectorized::MutableBlock mutable_block =
+ vectorized::MutableBlock::build_mutable_block(&in_block);
+ _vec_row_comparator->set_block(&mutable_block);
+ std::vector<RowInBlock*> temp_row_in_blocks;
+ std::vector<RowInBlock*> agg_row_in_temp;
+ RowInBlock* prev_row;
+ temp_row_in_blocks.reserve(_last_sorted_pos);
+ agg_row_in_temp.reserve(_last_sorted_pos);
+ bool need_init_agg = true;
+ //only init agg if needed
+ for (int i = 0; i < _row_in_blocks.size(); i++) {
+ if (!temp_row_in_blocks.empty() &&
+ (*_vec_row_comparator)(prev_row, _row_in_blocks[i]) == 0) {
+ if (need_init_agg) {
+ prev_row->init_agg_places(
+ _arena->aligned_alloc(_total_size_of_aggregate_states, 16),
+ _offsets_of_aggregate_states.data());
+ for (auto cid = _schema->num_key_columns(); cid < _schema->num_columns(); cid++) {
+ auto col_ptr = _input_mutable_block.mutable_columns()[cid].get();
+ auto data = prev_row->agg_places(cid);
+ _agg_functions[cid]->create(data);
+ _agg_functions[cid]->add(
Review Comment:
_agg_functions may be inited in last call of merge_row_in_block. we should push init_agg as a member of MemTable.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] dataroaring commented on a diff in pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1191833860
##########
be/src/olap/memtable.cpp:
##########
@@ -267,79 +218,137 @@ void MemTable::_aggregate_two_row_in_block(RowInBlock* new_row, RowInBlock* row_
new_row->_row_pos, nullptr);
}
}
+void MemTable::_add_rows_from_block(vectorized::Block& in_block) {
+ std::vector<int> row_pos_vec;
+ DCHECK(in_block.rows() <= std::numeric_limits<int>::max());
+ row_pos_vec.reserve(in_block.rows());
+ for (int i = 0; i < _row_in_blocks.size(); i++) {
+ row_pos_vec.emplace_back(_row_in_blocks[i]->_row_pos);
+ }
+ _output_mutable_block.add_rows(&in_block, row_pos_vec.data(),
+ row_pos_vec.data() + in_block.rows());
+}
+int MemTable::_sort_row_in_blocks() {
+ vectorized::Block in_block = _input_mutable_block.to_block();
+ vectorized::MutableBlock mutable_block =
+ vectorized::MutableBlock::build_mutable_block(&in_block);
+ _vec_row_comparator->set_block(&mutable_block);
+ auto new_row_it = std::next(_row_in_blocks.begin(), _last_sorted_pos);
+ size_t same_keys_num = 0;
+ // sort new rows
+ std::sort(new_row_it, _row_in_blocks.end(),
+ [this, &same_keys_num](const RowInBlock* l, const RowInBlock* r) -> bool {
+ auto value = (*(this->_vec_row_comparator))(l, r);
+ if (value == 0) {
+ same_keys_num++;
+ return l->_row_pos > r->_row_pos;
+ } else {
+ return value < 0;
+ }
+ });
+ // merge new rows and old rows
+ std::inplace_merge(_row_in_blocks.begin(), new_row_it, _row_in_blocks.end(),
+ [this, &same_keys_num](const RowInBlock* l, const RowInBlock* r) -> bool {
+ auto value = (*(this->_vec_row_comparator))(l, r);
+ if (value == 0) {
+ same_keys_num++;
+ return l->_row_pos > r->_row_pos;
+ } else {
+ return value < 0;
+ }
+ });
+ _last_sorted_pos = _row_in_blocks.size();
+ return same_keys_num;
+}
+
template <bool is_final>
-void MemTable::_collect_vskiplist_results() {
- VecTable::Iterator it(_vec_skip_list.get());
+void MemTable::_merge_row_in_blocks() {
Review Comment:
aggregate is a better name.
##########
be/src/olap/memtable.cpp:
##########
@@ -267,79 +218,137 @@ void MemTable::_aggregate_two_row_in_block(RowInBlock* new_row, RowInBlock* row_
new_row->_row_pos, nullptr);
}
}
+void MemTable::_add_rows_from_block(vectorized::Block& in_block) {
+ std::vector<int> row_pos_vec;
+ DCHECK(in_block.rows() <= std::numeric_limits<int>::max());
+ row_pos_vec.reserve(in_block.rows());
+ for (int i = 0; i < _row_in_blocks.size(); i++) {
+ row_pos_vec.emplace_back(_row_in_blocks[i]->_row_pos);
+ }
+ _output_mutable_block.add_rows(&in_block, row_pos_vec.data(),
+ row_pos_vec.data() + in_block.rows());
+}
+int MemTable::_sort_row_in_blocks() {
+ vectorized::Block in_block = _input_mutable_block.to_block();
+ vectorized::MutableBlock mutable_block =
+ vectorized::MutableBlock::build_mutable_block(&in_block);
+ _vec_row_comparator->set_block(&mutable_block);
+ auto new_row_it = std::next(_row_in_blocks.begin(), _last_sorted_pos);
+ size_t same_keys_num = 0;
+ // sort new rows
+ std::sort(new_row_it, _row_in_blocks.end(),
+ [this, &same_keys_num](const RowInBlock* l, const RowInBlock* r) -> bool {
+ auto value = (*(this->_vec_row_comparator))(l, r);
+ if (value == 0) {
+ same_keys_num++;
+ return l->_row_pos > r->_row_pos;
+ } else {
+ return value < 0;
+ }
+ });
+ // merge new rows and old rows
+ std::inplace_merge(_row_in_blocks.begin(), new_row_it, _row_in_blocks.end(),
+ [this, &same_keys_num](const RowInBlock* l, const RowInBlock* r) -> bool {
+ auto value = (*(this->_vec_row_comparator))(l, r);
+ if (value == 0) {
+ same_keys_num++;
+ return l->_row_pos > r->_row_pos;
+ } else {
+ return value < 0;
+ }
+ });
+ _last_sorted_pos = _row_in_blocks.size();
+ return same_keys_num;
+}
+
template <bool is_final>
-void MemTable::_collect_vskiplist_results() {
- VecTable::Iterator it(_vec_skip_list.get());
+void MemTable::_merge_row_in_blocks() {
Review Comment:
_aggregate is a better name.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] dataroaring commented on a diff in pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1194735773
##########
be/src/olap/memtable.cpp:
##########
@@ -267,102 +230,152 @@ void MemTable::_aggregate_two_row_in_block(RowInBlock* new_row, RowInBlock* row_
new_row->_row_pos, nullptr);
}
}
-template <bool is_final>
-void MemTable::_collect_vskiplist_results() {
- if (_keys_type == KeysType::DUP_KEYS) {
- if (_schema->num_key_columns() > 0) {
- _collect_dup_table_with_keys();
- } else {
- // skip sort if the table is dup table without keys
- _collect_dup_table_without_keys();
- }
- } else {
- VecTable::Iterator it(_vec_skip_list.get());
- vectorized::Block in_block = _input_mutable_block.to_block();
- size_t idx = 0;
- for (it.SeekToFirst(); it.Valid(); it.Next()) {
- auto& block_data = in_block.get_columns_with_type_and_name();
- // move key columns
- for (size_t i = 0; i < _schema->num_key_columns(); ++i) {
- _output_mutable_block.get_column_by_position(i)->insert_from(
- *block_data[i].column.get(), it.key()->_row_pos);
- }
- // get value columns from agg_places
- for (size_t i = _schema->num_key_columns(); i < _num_columns; ++i) {
- auto function = _agg_functions[i];
- auto agg_place = it.key()->agg_places(i);
- auto col_ptr = _output_mutable_block.get_column_by_position(i).get();
- function->insert_result_into(agg_place, *col_ptr);
- if constexpr (is_final) {
- function->destroy(agg_place);
- } else {
- function->reset(agg_place);
- function->add(agg_place,
- const_cast<const doris::vectorized::IColumn**>(&col_ptr), idx,
- nullptr);
- }
- }
- if constexpr (!is_final) {
- // re-index the row_pos in VSkipList
- it.key()->_row_pos = idx;
- idx++;
- }
- }
- if constexpr (!is_final) {
- // if is not final, we collect the agg results to input_block and then continue to insert
- size_t shrunked_after_agg = _output_mutable_block.allocated_bytes();
- // flush will not run here, so will not duplicate `_flush_mem_tracker`
- _insert_mem_tracker->consume(shrunked_after_agg - _mem_usage);
- _mem_usage = shrunked_after_agg;
- _input_mutable_block.swap(_output_mutable_block);
- //TODO(weixang):opt here.
- std::unique_ptr<vectorized::Block> empty_input_block =
- in_block.create_same_struct_block(0);
- _output_mutable_block =
- vectorized::MutableBlock::build_mutable_block(empty_input_block.get());
- _output_mutable_block.clear_column_data();
- }
- }
-
- if (is_final) {
- _vec_skip_list.reset();
+void MemTable::prepare_block_for_flush(vectorized::Block& in_block) {
+ std::vector<int> row_pos_vec;
+ DCHECK(in_block.rows() <= std::numeric_limits<int>::max());
+ row_pos_vec.reserve(in_block.rows());
+ for (int i = 0; i < _row_in_blocks.size(); i++) {
+ row_pos_vec.emplace_back(_row_in_blocks[i]->_row_pos);
}
+ _output_mutable_block.add_rows(&in_block, row_pos_vec.data(),
+ row_pos_vec.data() + in_block.rows());
Review Comment:
we should skip memcopy for dupliate without keys like _collect_dup_table_without_keys does.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] hello-stephen commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "hello-stephen (via GitHub)" <gi...@apache.org>.
hello-stephen commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1558362444
TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 33.55 seconds
stream load tsv: 421 seconds loaded 74807831229 Bytes, about 169 MB/s
stream load json: 22 seconds loaded 2358488459 Bytes, about 102 MB/s
stream load orc: 59 seconds loaded 1101869774 Bytes, about 17 MB/s
stream load parquet: 30 seconds loaded 861443392 Bytes, about 27 MB/s
insert into select: 75.3 seconds inserted 10000000 Rows, about 132K ops/s
https://doris-community-test-1308700295.cos.ap-hongkong.myqcloud.com/tmp/20230523014327_clickbench_pr_148572.html
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1554304747
clang-tidy review says "All clean, LGTM! :+1:"
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] cnissnzg commented on a diff in pull request #19099: [performance](load) use vector instead of skiplist when insert agg keys
Posted by "cnissnzg (via GitHub)" <gi...@apache.org>.
cnissnzg commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1201660560
##########
be/src/olap/memtable.cpp:
##########
@@ -425,6 +443,7 @@ Status MemTable::flush() {
// and use the ids to load segment data file for calc delete bitmap.
int64_t atomic_num_segments_before_flush = _rowset_writer->get_atomic_num_segment();
RETURN_IF_ERROR(_do_flush(duration_ns));
+ _delta_writer_callback(_merged_rows);
Review Comment:
Oh yes, I think we should remove this line in _flush_memtable_async.
Now we only use _merged_rows to check correctness, and we can't get correct _merged_rows until flush has finished.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] dataroaring merged pull request #19099: [performance](load) use vector instead of skiplist when insert agg keys
Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring merged PR #19099:
URL: https://github.com/apache/doris/pull/19099
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1527108149
clang-tidy review says "All clean, LGTM! :+1:"
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1543595640
clang-tidy review says "All clean, LGTM! :+1:"
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1552679735
clang-tidy review says "All clean, LGTM! :+1:"
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] kaijchen commented on a diff in pull request #19099: [performance](load) use vector instead of skiplist when insert agg keys
Posted by "kaijchen (via GitHub)" <gi...@apache.org>.
kaijchen commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1201615144
##########
be/src/olap/memtable.cpp:
##########
@@ -425,6 +443,7 @@ Status MemTable::flush() {
// and use the ids to load segment data file for calc delete bitmap.
int64_t atomic_num_segments_before_flush = _rowset_writer->get_atomic_num_segment();
RETURN_IF_ERROR(_do_flush(duration_ns));
+ _delta_writer_callback(_merged_rows);
Review Comment:
OK I got it, `_mem_table->merged_rows()` was updated in `MemTable::_do_flush()`.
```cpp
RETURN_IF_ERROR(_do_flush(duration_ns));
delta_writer_callback(_merged_rows);
```
The question is, will it be added twice?
```cpp
// in DeltaWriter::_flush_memtable_async()
_merged_rows += _mem_table->merged_rows(); // what if merged_rows() is non-zero here?
// in MemTable::flush() or MemTable::need_flush()
_delta_writer_callback(_merged_rows)
```
I wonder if there is a better way to make this more clear to read.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] hello-stephen commented on pull request #19099: [performance](load) use vector instead of skiplist when insert agg keys
Posted by "hello-stephen (via GitHub)" <gi...@apache.org>.
hello-stephen commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1559034498
TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 32.96 seconds
stream load tsv: 429 seconds loaded 74807831229 Bytes, about 166 MB/s
stream load json: 23 seconds loaded 2358488459 Bytes, about 97 MB/s
stream load orc: 59 seconds loaded 1101869774 Bytes, about 17 MB/s
stream load parquet: 30 seconds loaded 861443392 Bytes, about 27 MB/s
insert into select: 76.2 seconds inserted 10000000 Rows, about 131K ops/s
https://doris-community-test-1308700295.cos.ap-hongkong.myqcloud.com/tmp/20230523104117_clickbench_pr_148933.html
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1557025530
clang-tidy review says "All clean, LGTM! :+1:"
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] dataroaring commented on a diff in pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1191831840
##########
be/src/olap/memtable.cpp:
##########
@@ -267,79 +218,137 @@ void MemTable::_aggregate_two_row_in_block(RowInBlock* new_row, RowInBlock* row_
new_row->_row_pos, nullptr);
}
}
+void MemTable::_add_rows_from_block(vectorized::Block& in_block) {
+ std::vector<int> row_pos_vec;
+ DCHECK(in_block.rows() <= std::numeric_limits<int>::max());
+ row_pos_vec.reserve(in_block.rows());
+ for (int i = 0; i < _row_in_blocks.size(); i++) {
+ row_pos_vec.emplace_back(_row_in_blocks[i]->_row_pos);
+ }
+ _output_mutable_block.add_rows(&in_block, row_pos_vec.data(),
+ row_pos_vec.data() + in_block.rows());
+}
+int MemTable::_sort_row_in_blocks() {
Review Comment:
Name _sort is enough.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] dataroaring commented on a diff in pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1191835438
##########
be/src/olap/memtable.cpp:
##########
@@ -144,19 +144,6 @@ void MemTable::_init_agg_functions(const vectorized::Block* block) {
}
MemTable::~MemTable() {
- if (_vec_skip_list != nullptr && _keys_type != KeysType::DUP_KEYS) {
- VecTable::Iterator it(_vec_skip_list.get());
- for (it.SeekToFirst(); it.Valid(); it.Next()) {
- // We should release agg_places here, because they are not released when a
- // load is canceled.
- for (size_t i = _schema->num_key_columns(); i < _num_columns; ++i) {
- auto function = _agg_functions[i];
- DCHECK(function != nullptr);
- DCHECK(it.key()->agg_places(i) != nullptr);
- function->destroy(it.key()->agg_places(i));
Review Comment:
we should destroy agg functions, otherwise memleak happens.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] dataroaring commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1553971330
run buildall
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1556109222
clang-tidy review says "All clean, LGTM! :+1:"
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](load) use vector instead of skiplist when insert agg keys
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1558688433
clang-tidy review says "All clean, LGTM! :+1:"
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] cnissnzg commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "cnissnzg (via GitHub)" <gi...@apache.org>.
cnissnzg commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1556217650
run buildall
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1549725821
clang-tidy review says "All clean, LGTM! :+1:"
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1527113691
clang-tidy review says "All clean, LGTM! :+1:"
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1536978322
clang-tidy review says "All clean, LGTM! :+1:"
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1548938109
clang-tidy review says "All clean, LGTM! :+1:"
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] dataroaring commented on a diff in pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1194725701
##########
be/src/olap/memtable.cpp:
##########
@@ -440,7 +453,12 @@ Status MemTable::flush() {
Status MemTable::_do_flush(int64_t& duration_ns) {
SCOPED_CONSUME_MEM_TRACKER(_flush_mem_tracker);
SCOPED_RAW_TIMER(&duration_ns);
- _collect_vskiplist_results<true>();
+ _sort();
+ if (_keys_type == KeysType::DUP_KEYS) {
+ vectorized::Block in_block = _input_mutable_block.to_block();
+ prepare_block_for_flush(in_block);
+ }
+ { _aggregate<true>(); }
Review Comment:
else is expected?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] dataroaring commented on a diff in pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1194722810
##########
be/src/olap/memtable.cpp:
##########
@@ -144,16 +144,15 @@ void MemTable::_init_agg_functions(const vectorized::Block* block) {
}
MemTable::~MemTable() {
- if (_vec_skip_list != nullptr && _keys_type != KeysType::DUP_KEYS) {
- VecTable::Iterator it(_vec_skip_list.get());
- for (it.SeekToFirst(); it.Valid(); it.Next()) {
+ if (!_row_in_blocks.empty() && _keys_type != KeysType::DUP_KEYS) {
Review Comment:
_row_in_blocks.emtry is redunt.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] dataroaring commented on a diff in pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1194722810
##########
be/src/olap/memtable.cpp:
##########
@@ -144,16 +144,15 @@ void MemTable::_init_agg_functions(const vectorized::Block* block) {
}
MemTable::~MemTable() {
- if (_vec_skip_list != nullptr && _keys_type != KeysType::DUP_KEYS) {
- VecTable::Iterator it(_vec_skip_list.get());
- for (it.SeekToFirst(); it.Valid(); it.Next()) {
+ if (!_row_in_blocks.empty() && _keys_type != KeysType::DUP_KEYS) {
Review Comment:
_row_in_blocks.emtry is redudant.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] dataroaring commented on a diff in pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1194769944
##########
be/src/olap/memtable.cpp:
##########
@@ -267,102 +230,152 @@ void MemTable::_aggregate_two_row_in_block(RowInBlock* new_row, RowInBlock* row_
new_row->_row_pos, nullptr);
}
}
-template <bool is_final>
-void MemTable::_collect_vskiplist_results() {
- if (_keys_type == KeysType::DUP_KEYS) {
- if (_schema->num_key_columns() > 0) {
- _collect_dup_table_with_keys();
- } else {
- // skip sort if the table is dup table without keys
- _collect_dup_table_without_keys();
- }
- } else {
- VecTable::Iterator it(_vec_skip_list.get());
- vectorized::Block in_block = _input_mutable_block.to_block();
- size_t idx = 0;
- for (it.SeekToFirst(); it.Valid(); it.Next()) {
- auto& block_data = in_block.get_columns_with_type_and_name();
- // move key columns
- for (size_t i = 0; i < _schema->num_key_columns(); ++i) {
- _output_mutable_block.get_column_by_position(i)->insert_from(
- *block_data[i].column.get(), it.key()->_row_pos);
- }
- // get value columns from agg_places
- for (size_t i = _schema->num_key_columns(); i < _num_columns; ++i) {
- auto function = _agg_functions[i];
- auto agg_place = it.key()->agg_places(i);
- auto col_ptr = _output_mutable_block.get_column_by_position(i).get();
- function->insert_result_into(agg_place, *col_ptr);
- if constexpr (is_final) {
- function->destroy(agg_place);
- } else {
- function->reset(agg_place);
- function->add(agg_place,
- const_cast<const doris::vectorized::IColumn**>(&col_ptr), idx,
- nullptr);
- }
- }
- if constexpr (!is_final) {
- // re-index the row_pos in VSkipList
- it.key()->_row_pos = idx;
- idx++;
- }
- }
- if constexpr (!is_final) {
- // if is not final, we collect the agg results to input_block and then continue to insert
- size_t shrunked_after_agg = _output_mutable_block.allocated_bytes();
- // flush will not run here, so will not duplicate `_flush_mem_tracker`
- _insert_mem_tracker->consume(shrunked_after_agg - _mem_usage);
- _mem_usage = shrunked_after_agg;
- _input_mutable_block.swap(_output_mutable_block);
- //TODO(weixang):opt here.
- std::unique_ptr<vectorized::Block> empty_input_block =
- in_block.create_same_struct_block(0);
- _output_mutable_block =
- vectorized::MutableBlock::build_mutable_block(empty_input_block.get());
- _output_mutable_block.clear_column_data();
- }
- }
-
- if (is_final) {
- _vec_skip_list.reset();
+void MemTable::prepare_block_for_flush(vectorized::Block& in_block) {
+ std::vector<int> row_pos_vec;
+ DCHECK(in_block.rows() <= std::numeric_limits<int>::max());
+ row_pos_vec.reserve(in_block.rows());
+ for (int i = 0; i < _row_in_blocks.size(); i++) {
+ row_pos_vec.emplace_back(_row_in_blocks[i]->_row_pos);
}
+ _output_mutable_block.add_rows(&in_block, row_pos_vec.data(),
+ row_pos_vec.data() + in_block.rows());
}
-
-void MemTable::_collect_dup_table_with_keys() {
+int MemTable::_sort() {
vectorized::Block in_block = _input_mutable_block.to_block();
vectorized::MutableBlock mutable_block =
vectorized::MutableBlock::build_mutable_block(&in_block);
_vec_row_comparator->set_block(&mutable_block);
- std::sort(_row_in_blocks.begin(), _row_in_blocks.end(),
- [this](const RowInBlock* l, const RowInBlock* r) -> bool {
+ auto new_row_it = std::next(_row_in_blocks.begin(), _last_sorted_pos);
+ size_t same_keys_num = 0;
+ // sort new rows
+ std::sort(new_row_it, _row_in_blocks.end(),
+ [this, &same_keys_num](const RowInBlock* l, const RowInBlock* r) -> bool {
auto value = (*(this->_vec_row_comparator))(l, r);
if (value == 0) {
+ same_keys_num++;
return l->_row_pos > r->_row_pos;
} else {
return value < 0;
}
});
- std::vector<int> row_pos_vec;
- DCHECK(in_block.rows() <= std::numeric_limits<int>::max());
- row_pos_vec.reserve(in_block.rows());
- for (int i = 0; i < _row_in_blocks.size(); i++) {
- row_pos_vec.emplace_back(_row_in_blocks[i]->_row_pos);
+ // merge new rows and old rows
+ std::inplace_merge(_row_in_blocks.begin(), new_row_it, _row_in_blocks.end(),
+ [this, &same_keys_num](const RowInBlock* l, const RowInBlock* r) -> bool {
+ auto value = (*(this->_vec_row_comparator))(l, r);
+ if (value == 0) {
+ same_keys_num++;
+ return l->_row_pos > r->_row_pos;
+ } else {
+ return value < 0;
+ }
+ });
+ _last_sorted_pos = _row_in_blocks.size();
+ return same_keys_num;
+}
+
+template <bool is_final>
+void MemTable::_aggregate_one_row(RowInBlock* row,
+ const vectorized::ColumnsWithTypeAndName& block_data,
Review Comment:
finalize_one_row is a better name.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] dataroaring commented on a diff in pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1195870191
##########
be/src/olap/memtable.cpp:
##########
@@ -144,16 +144,14 @@ void MemTable::_init_agg_functions(const vectorized::Block* block) {
}
MemTable::~MemTable() {
- if (_vec_skip_list != nullptr && _keys_type != KeysType::DUP_KEYS) {
- VecTable::Iterator it(_vec_skip_list.get());
- for (it.SeekToFirst(); it.Valid(); it.Next()) {
+ if (_keys_type != KeysType::DUP_KEYS) {
+ for (auto it = _row_in_blocks.begin(); it != _row_in_blocks.end(); it++) {
Review Comment:
if (!it->has_init_agg()) {
continue;
}
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1556227736
clang-tidy review says "All clean, LGTM! :+1:"
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] hello-stephen commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "hello-stephen (via GitHub)" <gi...@apache.org>.
hello-stephen commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1555905698
TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 36.15 seconds
stream load tsv: 433 seconds loaded 74807831229 Bytes, about 164 MB/s
stream load json: 24 seconds loaded 2358488459 Bytes, about 93 MB/s
stream load orc: 60 seconds loaded 1101869774 Bytes, about 17 MB/s
stream load parquet: 31 seconds loaded 861443392 Bytes, about 26 MB/s
insert into select: 78.4 seconds inserted 10000000 Rows, about 127K ops/s
https://doris-community-test-1308700295.cos.ap-hongkong.myqcloud.com/tmp/20230520125426_clickbench_pr_147718.html
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1527114742
clang-tidy review says "All clean, LGTM! :+1:"
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1536985861
clang-tidy review says "All clean, LGTM! :+1:"
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] dataroaring commented on a diff in pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1191832160
##########
be/src/olap/memtable.cpp:
##########
@@ -267,79 +218,137 @@ void MemTable::_aggregate_two_row_in_block(RowInBlock* new_row, RowInBlock* row_
new_row->_row_pos, nullptr);
}
}
+void MemTable::_add_rows_from_block(vectorized::Block& in_block) {
Review Comment:
prepare_block_for_flush is a better name
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] dataroaring commented on a diff in pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1191833538
##########
be/src/olap/memtable.cpp:
##########
@@ -267,79 +218,137 @@ void MemTable::_aggregate_two_row_in_block(RowInBlock* new_row, RowInBlock* row_
new_row->_row_pos, nullptr);
}
}
+void MemTable::_add_rows_from_block(vectorized::Block& in_block) {
+ std::vector<int> row_pos_vec;
+ DCHECK(in_block.rows() <= std::numeric_limits<int>::max());
+ row_pos_vec.reserve(in_block.rows());
+ for (int i = 0; i < _row_in_blocks.size(); i++) {
+ row_pos_vec.emplace_back(_row_in_blocks[i]->_row_pos);
+ }
+ _output_mutable_block.add_rows(&in_block, row_pos_vec.data(),
+ row_pos_vec.data() + in_block.rows());
+}
+int MemTable::_sort_row_in_blocks() {
+ vectorized::Block in_block = _input_mutable_block.to_block();
+ vectorized::MutableBlock mutable_block =
+ vectorized::MutableBlock::build_mutable_block(&in_block);
+ _vec_row_comparator->set_block(&mutable_block);
+ auto new_row_it = std::next(_row_in_blocks.begin(), _last_sorted_pos);
+ size_t same_keys_num = 0;
+ // sort new rows
+ std::sort(new_row_it, _row_in_blocks.end(),
+ [this, &same_keys_num](const RowInBlock* l, const RowInBlock* r) -> bool {
+ auto value = (*(this->_vec_row_comparator))(l, r);
+ if (value == 0) {
+ same_keys_num++;
+ return l->_row_pos > r->_row_pos;
+ } else {
+ return value < 0;
+ }
+ });
+ // merge new rows and old rows
+ std::inplace_merge(_row_in_blocks.begin(), new_row_it, _row_in_blocks.end(),
+ [this, &same_keys_num](const RowInBlock* l, const RowInBlock* r) -> bool {
+ auto value = (*(this->_vec_row_comparator))(l, r);
+ if (value == 0) {
+ same_keys_num++;
+ return l->_row_pos > r->_row_pos;
+ } else {
+ return value < 0;
+ }
+ });
+ _last_sorted_pos = _row_in_blocks.size();
+ return same_keys_num;
+}
+
template <bool is_final>
-void MemTable::_collect_vskiplist_results() {
- VecTable::Iterator it(_vec_skip_list.get());
+void MemTable::_merge_row_in_blocks() {
vectorized::Block in_block = _input_mutable_block.to_block();
- if (_keys_type == KeysType::DUP_KEYS) {
- vectorized::MutableBlock mutable_block =
- vectorized::MutableBlock::build_mutable_block(&in_block);
- _vec_row_comparator->set_block(&mutable_block);
- std::sort(_row_in_blocks.begin(), _row_in_blocks.end(),
- [this](const RowInBlock* l, const RowInBlock* r) -> bool {
- auto value = (*(this->_vec_row_comparator))(l, r);
- if (value == 0) {
- return l->_row_pos > r->_row_pos;
- } else {
- return value < 0;
- }
- });
- std::vector<int> row_pos_vec;
- DCHECK(in_block.rows() <= std::numeric_limits<int>::max());
- row_pos_vec.reserve(in_block.rows());
- for (int i = 0; i < _row_in_blocks.size(); i++) {
- row_pos_vec.emplace_back(_row_in_blocks[i]->_row_pos);
- }
- _output_mutable_block.add_rows(&in_block, row_pos_vec.data(),
- row_pos_vec.data() + in_block.rows());
- } else {
- size_t idx = 0;
- for (it.SeekToFirst(); it.Valid(); it.Next()) {
- auto& block_data = in_block.get_columns_with_type_and_name();
- // move key columns
- for (size_t i = 0; i < _schema->num_key_columns(); ++i) {
- _output_mutable_block.get_column_by_position(i)->insert_from(
- *block_data[i].column.get(), it.key()->_row_pos);
+ vectorized::MutableBlock mutable_block =
+ vectorized::MutableBlock::build_mutable_block(&in_block);
+ _vec_row_comparator->set_block(&mutable_block);
+ std::vector<RowInBlock*> temp_row_in_blocks;
+ std::vector<RowInBlock*> agg_row_in_temp;
+ RowInBlock* prev_row;
+ temp_row_in_blocks.reserve(_last_sorted_pos);
+ agg_row_in_temp.reserve(_last_sorted_pos);
+ bool need_init_agg = true;
+ //only init agg if needed
+ for (int i = 0; i < _row_in_blocks.size(); i++) {
+ if (!temp_row_in_blocks.empty() &&
+ (*_vec_row_comparator)(prev_row, _row_in_blocks[i]) == 0) {
+ if (need_init_agg) {
+ prev_row->init_agg_places(
+ _arena->aligned_alloc(_total_size_of_aggregate_states, 16),
+ _offsets_of_aggregate_states.data());
+ for (auto cid = _schema->num_key_columns(); cid < _schema->num_columns(); cid++) {
+ auto col_ptr = _input_mutable_block.mutable_columns()[cid].get();
+ auto data = prev_row->agg_places(cid);
+ _agg_functions[cid]->create(data);
+ _agg_functions[cid]->add(
Review Comment:
And we put init code in a function called init_agg_functions()
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1556820324
PR approved by at least one committer and no changes requested.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] dataroaring commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1556125190
run buildall
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1550616348
clang-tidy review says "All clean, LGTM! :+1:"
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1550769631
clang-tidy review says "All clean, LGTM! :+1:"
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1550895246
PR approved by anyone and no changes requested.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1554172274
clang-tidy review says "All clean, LGTM! :+1:"
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] cnissnzg commented on pull request #19099: [performance](load) use vector instead of skiplist when insert agg keys
Posted by "cnissnzg (via GitHub)" <gi...@apache.org>.
cnissnzg commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1558989286
run buildall
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1539318519
clang-tidy review says "All clean, LGTM! :+1:"
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] dataroaring commented on a diff in pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1191834809
##########
be/src/olap/memtable.cpp:
##########
@@ -267,79 +218,137 @@ void MemTable::_aggregate_two_row_in_block(RowInBlock* new_row, RowInBlock* row_
new_row->_row_pos, nullptr);
}
}
+void MemTable::_add_rows_from_block(vectorized::Block& in_block) {
+ std::vector<int> row_pos_vec;
+ DCHECK(in_block.rows() <= std::numeric_limits<int>::max());
+ row_pos_vec.reserve(in_block.rows());
+ for (int i = 0; i < _row_in_blocks.size(); i++) {
+ row_pos_vec.emplace_back(_row_in_blocks[i]->_row_pos);
+ }
+ _output_mutable_block.add_rows(&in_block, row_pos_vec.data(),
+ row_pos_vec.data() + in_block.rows());
+}
+int MemTable::_sort_row_in_blocks() {
+ vectorized::Block in_block = _input_mutable_block.to_block();
+ vectorized::MutableBlock mutable_block =
+ vectorized::MutableBlock::build_mutable_block(&in_block);
+ _vec_row_comparator->set_block(&mutable_block);
+ auto new_row_it = std::next(_row_in_blocks.begin(), _last_sorted_pos);
+ size_t same_keys_num = 0;
+ // sort new rows
+ std::sort(new_row_it, _row_in_blocks.end(),
+ [this, &same_keys_num](const RowInBlock* l, const RowInBlock* r) -> bool {
+ auto value = (*(this->_vec_row_comparator))(l, r);
+ if (value == 0) {
+ same_keys_num++;
+ return l->_row_pos > r->_row_pos;
+ } else {
+ return value < 0;
+ }
+ });
+ // merge new rows and old rows
+ std::inplace_merge(_row_in_blocks.begin(), new_row_it, _row_in_blocks.end(),
+ [this, &same_keys_num](const RowInBlock* l, const RowInBlock* r) -> bool {
+ auto value = (*(this->_vec_row_comparator))(l, r);
+ if (value == 0) {
+ same_keys_num++;
+ return l->_row_pos > r->_row_pos;
+ } else {
+ return value < 0;
+ }
+ });
+ _last_sorted_pos = _row_in_blocks.size();
+ return same_keys_num;
+}
+
template <bool is_final>
-void MemTable::_collect_vskiplist_results() {
- VecTable::Iterator it(_vec_skip_list.get());
+void MemTable::_merge_row_in_blocks() {
vectorized::Block in_block = _input_mutable_block.to_block();
- if (_keys_type == KeysType::DUP_KEYS) {
- vectorized::MutableBlock mutable_block =
- vectorized::MutableBlock::build_mutable_block(&in_block);
- _vec_row_comparator->set_block(&mutable_block);
- std::sort(_row_in_blocks.begin(), _row_in_blocks.end(),
- [this](const RowInBlock* l, const RowInBlock* r) -> bool {
- auto value = (*(this->_vec_row_comparator))(l, r);
- if (value == 0) {
- return l->_row_pos > r->_row_pos;
- } else {
- return value < 0;
- }
- });
- std::vector<int> row_pos_vec;
- DCHECK(in_block.rows() <= std::numeric_limits<int>::max());
- row_pos_vec.reserve(in_block.rows());
- for (int i = 0; i < _row_in_blocks.size(); i++) {
- row_pos_vec.emplace_back(_row_in_blocks[i]->_row_pos);
- }
- _output_mutable_block.add_rows(&in_block, row_pos_vec.data(),
- row_pos_vec.data() + in_block.rows());
- } else {
- size_t idx = 0;
- for (it.SeekToFirst(); it.Valid(); it.Next()) {
- auto& block_data = in_block.get_columns_with_type_and_name();
- // move key columns
- for (size_t i = 0; i < _schema->num_key_columns(); ++i) {
- _output_mutable_block.get_column_by_position(i)->insert_from(
- *block_data[i].column.get(), it.key()->_row_pos);
+ vectorized::MutableBlock mutable_block =
+ vectorized::MutableBlock::build_mutable_block(&in_block);
+ _vec_row_comparator->set_block(&mutable_block);
+ std::vector<RowInBlock*> temp_row_in_blocks;
+ std::vector<RowInBlock*> agg_row_in_temp;
+ RowInBlock* prev_row;
+ temp_row_in_blocks.reserve(_last_sorted_pos);
+ agg_row_in_temp.reserve(_last_sorted_pos);
+ bool need_init_agg = true;
+ //only init agg if needed
+ for (int i = 0; i < _row_in_blocks.size(); i++) {
+ if (!temp_row_in_blocks.empty() &&
+ (*_vec_row_comparator)(prev_row, _row_in_blocks[i]) == 0) {
+ if (need_init_agg) {
+ prev_row->init_agg_places(
+ _arena->aligned_alloc(_total_size_of_aggregate_states, 16),
+ _offsets_of_aggregate_states.data());
+ for (auto cid = _schema->num_key_columns(); cid < _schema->num_columns(); cid++) {
+ auto col_ptr = _input_mutable_block.mutable_columns()[cid].get();
+ auto data = prev_row->agg_places(cid);
+ _agg_functions[cid]->create(data);
+ _agg_functions[cid]->add(
+ data, const_cast<const doris::vectorized::IColumn**>(&col_ptr),
+ prev_row->_row_pos, nullptr);
+ }
+ agg_row_in_temp.push_back(prev_row);
+ need_init_agg = false;
}
+ _merged_rows++;
+ _aggregate_two_row_in_block(_row_in_blocks[i], prev_row);
+ } else {
+ prev_row = _row_in_blocks[i];
+ temp_row_in_blocks.push_back(prev_row);
+ need_init_agg = true;
+ }
+ }
Review Comment:
we should stat merged rows in one run and we can optimize move operations below.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] dataroaring commented on a diff in pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1194759204
##########
be/src/olap/memtable.cpp:
##########
@@ -144,16 +144,15 @@ void MemTable::_init_agg_functions(const vectorized::Block* block) {
}
MemTable::~MemTable() {
- if (_vec_skip_list != nullptr && _keys_type != KeysType::DUP_KEYS) {
- VecTable::Iterator it(_vec_skip_list.get());
- for (it.SeekToFirst(); it.Valid(); it.Next()) {
+ if (!_row_in_blocks.empty() && _keys_type != KeysType::DUP_KEYS) {
+ for (auto it = _row_in_blocks.begin(); it != _row_in_blocks.end(); it++) {
// We should release agg_places here, because they are not released when a
// load is canceled.
for (size_t i = _schema->num_key_columns(); i < _num_columns; ++i) {
auto function = _agg_functions[i];
DCHECK(function != nullptr);
- DCHECK(it.key()->agg_places(i) != nullptr);
- function->destroy(it.key()->agg_places(i));
+ DCHECK((*it)->agg_places(i) != nullptr);
+ function->destroy((*it)->agg_places(i));
Review Comment:
If writes stop and just destruct then, there is no agg_places, dcheck here does not work.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1549716280
clang-tidy review says "All clean, LGTM! :+1:"
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] cnissnzg commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "cnissnzg (via GitHub)" <gi...@apache.org>.
cnissnzg commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1556229080
run buildall
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] kaijchen commented on a diff in pull request #19099: [performance](load) use vector instead of skiplist when insert agg keys
Posted by "kaijchen (via GitHub)" <gi...@apache.org>.
kaijchen commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1201597555
##########
be/src/olap/memtable.cpp:
##########
@@ -425,6 +443,7 @@ Status MemTable::flush() {
// and use the ids to load segment data file for calc delete bitmap.
int64_t atomic_num_segments_before_flush = _rowset_writer->get_atomic_num_segment();
RETURN_IF_ERROR(_do_flush(duration_ns));
+ _delta_writer_callback(_merged_rows);
Review Comment:
Why do we need this callback?
AFAIK, `_merged_rows` is already increased in
```cpp
Status DeltaWriter::_flush_memtable_async() {
_merged_rows += _mem_table->merged_rows();
return _flush_token->submit(std::move(_mem_table));
}
```
The callstack (one of the code paths):
```cpp
DeltaWriter::_flush_memtable_async();
=> FlushToken::submit();
=> FlushToken::_flush_memtable();
=> new MemtableFlushTask();
=> MemtableFlushTask::run();
=> MemTable::flush(); // in the latest master, equivalent to MemTable::need_flush()
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1553910586
clang-tidy review says "All clean, LGTM! :+1:"
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1550895105
PR approved by at least one committer and no changes requested.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] dataroaring commented on a diff in pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1197710765
##########
be/src/olap/memtable.cpp:
##########
@@ -440,7 +463,13 @@ Status MemTable::flush() {
Status MemTable::_do_flush(int64_t& duration_ns) {
SCOPED_CONSUME_MEM_TRACKER(_flush_mem_tracker);
SCOPED_RAW_TIMER(&duration_ns);
- _collect_vskiplist_results<true>();
+ _sort();
+ if (_keys_type == KeysType::DUP_KEYS) {
Review Comment:
int same_keys_num = _sort();
if (_keys_type == KeysType::DUP_KEYS || same_keys_num == 0)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] hello-stephen commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "hello-stephen (via GitHub)" <gi...@apache.org>.
hello-stephen commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1556129856
TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 36.22 seconds
stream load tsv: 430 seconds loaded 74807831229 Bytes, about 165 MB/s
stream load json: 21 seconds loaded 2358488459 Bytes, about 107 MB/s
stream load orc: 61 seconds loaded 1101869774 Bytes, about 17 MB/s
stream load parquet: 30 seconds loaded 861443392 Bytes, about 27 MB/s
insert into select: 78.0 seconds inserted 10000000 Rows, about 128K ops/s
https://doris-community-test-1308700295.cos.ap-hongkong.myqcloud.com/tmp/20230521092756_clickbench_pr_147843.html
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] dataroaring commented on a diff in pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1184600386
##########
be/src/olap/memtable.cpp:
##########
@@ -286,32 +292,90 @@ void MemTable::_collect_vskiplist_results() {
_output_mutable_block.add_rows(&in_block, row_pos_vec.data(),
row_pos_vec.data() + in_block.rows());
} else {
- size_t idx = 0;
- for (it.SeekToFirst(); it.Valid(); it.Next()) {
+ size_t idx = 0, temp_idx = 0;
+ vectorized::MutableBlock mutable_block =
+ vectorized::MutableBlock::build_mutable_block(&in_block);
+ _vec_row_comparator->set_block(&mutable_block);
+ auto new_row_it = std::next(_row_in_blocks.begin(), _last_sorted_pos);
+ // sort new rows
+ std::sort(new_row_it, _row_in_blocks.end(),
+ [this](const RowInBlock* l, const RowInBlock* r) -> bool {
+ auto value = (*(this->_vec_row_comparator))(l, r);
+ if (value == 0) {
+ return l->_row_pos > r->_row_pos;
Review Comment:
add a stats here, e.g. same_keys_num++;
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] dataroaring commented on a diff in pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1184600830
##########
be/src/olap/memtable.cpp:
##########
@@ -286,32 +292,90 @@ void MemTable::_collect_vskiplist_results() {
_output_mutable_block.add_rows(&in_block, row_pos_vec.data(),
row_pos_vec.data() + in_block.rows());
} else {
- size_t idx = 0;
- for (it.SeekToFirst(); it.Valid(); it.Next()) {
+ size_t idx = 0, temp_idx = 0;
+ vectorized::MutableBlock mutable_block =
+ vectorized::MutableBlock::build_mutable_block(&in_block);
+ _vec_row_comparator->set_block(&mutable_block);
+ auto new_row_it = std::next(_row_in_blocks.begin(), _last_sorted_pos);
+ // sort new rows
+ std::sort(new_row_it, _row_in_blocks.end(),
+ [this](const RowInBlock* l, const RowInBlock* r) -> bool {
+ auto value = (*(this->_vec_row_comparator))(l, r);
+ if (value == 0) {
+ return l->_row_pos > r->_row_pos;
+ } else {
+ return value < 0;
+ }
+ });
+ // merge new rows and old rows
+ std::inplace_merge(_row_in_blocks.begin(), new_row_it, _row_in_blocks.end());
+ _last_sorted_pos = _row_in_blocks.size();
+ std::vector<RowInBlock*> temp_row_in_blocks(_last_sorted_pos);
+ std::vector<size_t> need_merge_idxs; // use bitset?
+ bool need_init_agg = true;
Review Comment:
if (same_keys_num == 0) skip agg.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] dataroaring commented on a diff in pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1184606638
##########
be/src/olap/memtable.cpp:
##########
@@ -21,10 +21,13 @@
#include <gen_cpp/olap_file.pb.h>
#include <algorithm>
+#include <bitset>
Review Comment:
bitset is not used.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1556218016
clang-tidy review says "All clean, LGTM! :+1:"
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] kaijchen commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "kaijchen (via GitHub)" <gi...@apache.org>.
kaijchen commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1556617784
Test result of stream load into unique key table with 3 replicas.
| test suite | before this change | after this change |
| --- | --- | --- |
| 7.5GB, 1 bucket, single_replica_read = false | 160s | 130s |
| 96GB, 960 bucket, single_replica_read = false | 6189s | 1360s |
| 96GB, 960 bucket, single_replica_read = true | 2318s | 919s |
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1554500638
clang-tidy review says "All clean, LGTM! :+1:"
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] kaijchen commented on a diff in pull request #19099: [performance](load) use vector instead of skiplist when insert agg keys
Posted by "kaijchen (via GitHub)" <gi...@apache.org>.
kaijchen commented on code in PR #19099:
URL: https://github.com/apache/doris/pull/19099#discussion_r1201597555
##########
be/src/olap/memtable.cpp:
##########
@@ -425,6 +443,7 @@ Status MemTable::flush() {
// and use the ids to load segment data file for calc delete bitmap.
int64_t atomic_num_segments_before_flush = _rowset_writer->get_atomic_num_segment();
RETURN_IF_ERROR(_do_flush(duration_ns));
+ _delta_writer_callback(_merged_rows);
Review Comment:
Why do we need this callback?
AFAIK, `_merged_rows` is already increased here.
I suppose `_mem_table->merged_rows()` has not been updated here?
```cpp
Status DeltaWriter::_flush_memtable_async() {
_merged_rows += _mem_table->merged_rows();
return _flush_token->submit(std::move(_mem_table));
}
```
The callstack (one of the code paths):
```cpp
DeltaWriter::_flush_memtable_async();
=> FlushToken::submit();
=> FlushToken::_flush_memtable();
=> new MemtableFlushTask();
=> MemtableFlushTask::run();
=> MemTable::flush(); // in the latest master, equivalent to MemTable::need_flush()
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] dataroaring commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1550687207
run buildall
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1552685795
clang-tidy review says "All clean, LGTM! :+1:"
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] dataroaring commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "dataroaring (via GitHub)" <gi...@apache.org>.
dataroaring commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1550608252
run buildall
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org
[GitHub] [doris] github-actions[bot] commented on pull request #19099: [performance](stream-load) use vector instead of skiplist when insert agg keys
Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #19099:
URL: https://github.com/apache/doris/pull/19099#issuecomment-1522883703
clang-tidy review says "All clean, LGTM! :+1:"
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org