You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@doris.apache.org by "mrhhsg (via GitHub)" <gi...@apache.org> on 2023/06/30 02:26:20 UTC

[GitHub] [doris] mrhhsg opened a new pull request, #21361: [improvement](join) Serialize build keys in a vectorized (columnar) way

mrhhsg opened a new pull request, #21361:
URL: https://github.com/apache/doris/pull/21361

   ## Proposed changes
   
   
   ## Further comments
   
   If this is a relatively large or complex change, kick off the discussion at [dev@doris.apache.org](mailto:dev@doris.apache.org) by explaining why you chose the solution you did and what alternatives you considered, etc...
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] yiguolei merged pull request #21361: [improvement](join) Serialize probe keys in a vectorized (columnar) way

Posted by "yiguolei (via GitHub)" <gi...@apache.org>.
yiguolei merged PR #21361:
URL: https://github.com/apache/doris/pull/21361


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] github-actions[bot] commented on pull request #21361: [improvement](join) Serialize probe keys in a vectorized (columnar) way

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #21361:
URL: https://github.com/apache/doris/pull/21361#issuecomment-1616255109

   PR approved by at least one committer and no changes requested.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] github-actions[bot] commented on pull request #21361: [improvement](join) Serialize build keys in a vectorized (columnar) way

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #21361:
URL: https://github.com/apache/doris/pull/21361#issuecomment-1614030031

   clang-tidy review says "All clean, LGTM! :+1:"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] mrhhsg commented on pull request #21361: [improvement](join) Serialize probe keys in a vectorized (columnar) way

Posted by "mrhhsg (via GitHub)" <gi...@apache.org>.
mrhhsg commented on PR #21361:
URL: https://github.com/apache/doris/pull/21361#issuecomment-1615966828

   run buildall


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] yiguolei commented on a diff in pull request #21361: [improvement](join) Serialize probe keys in a vectorized (columnar) way

Posted by "yiguolei (via GitHub)" <gi...@apache.org>.
yiguolei commented on code in PR #21361:
URL: https://github.com/apache/doris/pull/21361#discussion_r1248372162


##########
be/src/vec/exec/join/process_hash_table_probe_impl.h:
##########
@@ -177,22 +177,51 @@ Status ProcessHashTableProbe<JoinOpType>::do_process(HashTableType& hash_table_c
     KeyGetter key_getter(probe_raw_ptrs, _join_node->_probe_key_sz, nullptr);
 
     if (probe_index == 0) {
-        size_t old_probe_keys_memory_usage = 0;
-        if (_arena) {
-            old_probe_keys_memory_usage = _arena->size();
+        if (!_arena) {
+            _arena.reset(new Arena());
         }
-        _arena.reset(new Arena()); // TODO arena reuse by clear()?
         if constexpr (ColumnsHashing::IsPreSerializedKeysHashMethodTraits<KeyGetter>::value) {
             if (_probe_keys.size() < probe_rows) {
                 _probe_keys.resize(probe_rows);
             }
-            size_t keys_size = probe_raw_ptrs.size();
-            for (size_t i = 0; i < probe_rows; ++i) {
-                _probe_keys[i] =
-                        serialize_keys_to_pool_contiguous(i, keys_size, probe_raw_ptrs, *_arena);
+            size_t max_one_row_byte_size = 0;
+            for (const auto column : probe_raw_ptrs) {
+                max_one_row_byte_size += column->get_max_row_byte_size();
+            }
+            size_t total_bytes = max_one_row_byte_size * probe_rows;
+
+            if (total_bytes > config::pre_serialize_keys_limit_bytes) {
+                // reach mem limit, don't serialize in batch
+                _arena->clear();
+                size_t keys_size = probe_raw_ptrs.size();
+                for (size_t i = 0; i < probe_rows; ++i) {
+                    _probe_keys[i] = serialize_keys_to_pool_contiguous(i, keys_size, probe_raw_ptrs,
+                                                                       *_arena);
+                }
+                _join_node->_probe_arena_memory_usage->add(_arena->size());
+            } else {
+                _arena->clear();
+                if (!_serialize_key_arena) {
+                    _serialize_key_arena.reset(new Arena);
+                }
+                if (total_bytes > _serialized_key_buffer_size) {
+                    _serialized_key_buffer_size = total_bytes;
+                    _serialize_key_arena->clear();
+                    _serialized_key_buffer = reinterpret_cast<uint8_t*>(
+                            _serialize_key_arena->alloc(_serialized_key_buffer_size));
+                }
+
+                for (size_t i = 0; i < probe_rows; ++i) {
+                    _probe_keys[i].data = reinterpret_cast<char*>(_serialized_key_buffer +
+                                                                  i * max_one_row_byte_size);
+                    _probe_keys[i].size = 0;
+                }
+
+                for (const auto column : probe_raw_ptrs) {
+                    column->serialize_vec(_probe_keys, probe_rows, max_one_row_byte_size);
+                }
+                _join_node->_probe_arena_memory_usage->add(_serialized_key_buffer_size);
             }
-            _join_node->_probe_arena_memory_usage->add(_arena->size() -
-                                                       old_probe_keys_memory_usage);
         }

Review Comment:
   should release arena's memory in close method to release memory earlier.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] mrhhsg commented on pull request #21361: [improvement](join) Serialize build keys in a vectorized (columnar) way

Posted by "mrhhsg (via GitHub)" <gi...@apache.org>.
mrhhsg commented on PR #21361:
URL: https://github.com/apache/doris/pull/21361#issuecomment-1614030328

   run buildall


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] github-actions[bot] commented on pull request #21361: [improvement](join) Serialize probe keys in a vectorized (columnar) way

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #21361:
URL: https://github.com/apache/doris/pull/21361#issuecomment-1616255114

   PR approved by anyone and no changes requested.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] hello-stephen commented on pull request #21361: [improvement](join) Serialize probe keys in a vectorized (columnar) way

Posted by "hello-stephen (via GitHub)" <gi...@apache.org>.
hello-stephen commented on PR #21361:
URL: https://github.com/apache/doris/pull/21361#issuecomment-1615978838

   TeamCity pipeline, clickbench performance test result:
    the sum of best hot time: 42.82 seconds
    stream load tsv:          465 seconds loaded 74807831229 Bytes, about 153 MB/s
    stream load json:         20 seconds loaded 2358488459 Bytes, about 112 MB/s
    stream load orc:          57 seconds loaded 1101869774 Bytes, about 18 MB/s
    stream load parquet:          28 seconds loaded 861443392 Bytes, about 29 MB/s
    insert into select:          69.7 seconds inserted 10000000 Rows, about 143K ops/s
    https://doris-community-test-1308700295.cos.ap-hongkong.myqcloud.com/tmp/20230701161640_clickbench_pr_170877.html


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] github-actions[bot] commented on pull request #21361: [improvement](join) Serialize probe keys in a vectorized (columnar) way

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #21361:
URL: https://github.com/apache/doris/pull/21361#issuecomment-1615968278

   clang-tidy review says "All clean, LGTM! :+1:"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] yiguolei commented on a diff in pull request #21361: [improvement](join) Serialize probe keys in a vectorized (columnar) way

Posted by "yiguolei (via GitHub)" <gi...@apache.org>.
yiguolei commented on code in PR #21361:
URL: https://github.com/apache/doris/pull/21361#discussion_r1248371936


##########
be/src/vec/exec/join/process_hash_table_probe_impl.h:
##########
@@ -177,22 +177,51 @@ Status ProcessHashTableProbe<JoinOpType>::do_process(HashTableType& hash_table_c
     KeyGetter key_getter(probe_raw_ptrs, _join_node->_probe_key_sz, nullptr);
 
     if (probe_index == 0) {
-        size_t old_probe_keys_memory_usage = 0;
-        if (_arena) {
-            old_probe_keys_memory_usage = _arena->size();
+        if (!_arena) {
+            _arena.reset(new Arena());
         }
-        _arena.reset(new Arena()); // TODO arena reuse by clear()?
         if constexpr (ColumnsHashing::IsPreSerializedKeysHashMethodTraits<KeyGetter>::value) {
             if (_probe_keys.size() < probe_rows) {
                 _probe_keys.resize(probe_rows);
             }
-            size_t keys_size = probe_raw_ptrs.size();
-            for (size_t i = 0; i < probe_rows; ++i) {
-                _probe_keys[i] =
-                        serialize_keys_to_pool_contiguous(i, keys_size, probe_raw_ptrs, *_arena);
+            size_t max_one_row_byte_size = 0;
+            for (const auto column : probe_raw_ptrs) {
+                max_one_row_byte_size += column->get_max_row_byte_size();
+            }
+            size_t total_bytes = max_one_row_byte_size * probe_rows;
+
+            if (total_bytes > config::pre_serialize_keys_limit_bytes) {
+                // reach mem limit, don't serialize in batch
+                _arena->clear();
+                size_t keys_size = probe_raw_ptrs.size();
+                for (size_t i = 0; i < probe_rows; ++i) {
+                    _probe_keys[i] = serialize_keys_to_pool_contiguous(i, keys_size, probe_raw_ptrs,
+                                                                       *_arena);
+                }
+                _join_node->_probe_arena_memory_usage->add(_arena->size());
+            } else {
+                _arena->clear();
+                if (!_serialize_key_arena) {
+                    _serialize_key_arena.reset(new Arena);
+                }
+                if (total_bytes > _serialized_key_buffer_size) {
+                    _serialized_key_buffer_size = total_bytes;
+                    _serialize_key_arena->clear();
+                    _serialized_key_buffer = reinterpret_cast<uint8_t*>(
+                            _serialize_key_arena->alloc(_serialized_key_buffer_size));
+                }
+
+                for (size_t i = 0; i < probe_rows; ++i) {
+                    _probe_keys[i].data = reinterpret_cast<char*>(_serialized_key_buffer +

Review Comment:
   using config pre_serialize_keys_limit_bytes to control not use this way. For example, if there is a very long string in the value, 100k, and other string is very small 100bytes. Then the max value is 100k, using too many memory.
   And please add this to comment so that we could know why add this config.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] yiguolei commented on a diff in pull request #21361: [improvement](join) Serialize probe keys in a vectorized (columnar) way

Posted by "yiguolei (via GitHub)" <gi...@apache.org>.
yiguolei commented on code in PR #21361:
URL: https://github.com/apache/doris/pull/21361#discussion_r1248372069


##########
be/src/vec/exec/join/process_hash_table_probe_impl.h:
##########
@@ -177,22 +177,51 @@ Status ProcessHashTableProbe<JoinOpType>::do_process(HashTableType& hash_table_c
     KeyGetter key_getter(probe_raw_ptrs, _join_node->_probe_key_sz, nullptr);
 
     if (probe_index == 0) {
-        size_t old_probe_keys_memory_usage = 0;
-        if (_arena) {
-            old_probe_keys_memory_usage = _arena->size();
+        if (!_arena) {
+            _arena.reset(new Arena());

Review Comment:
   maybe we could not arena memory reuse? To avoid allocate memory too frequently and many page fault.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org