You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2021/01/26 05:49:59 UTC

[GitHub] [incubator-doris] stdpain opened a new pull request #5301: [optimization] avoid extra memory copy while build hash table

stdpain opened a new pull request #5301:
URL: https://github.com/apache/incubator-doris/pull/5301


   ## Proposed changes
   
   avoid extra memory copy while build hash table
   
   reference to #5300 
   
   ## Types of changes
   
   - [] Bugfix (non-breaking change which fixes an issue)
   - [] New feature (non-breaking change which adds functionality)
   - [] Breaking change (fix or feature that would cause existing functionality to not work as expected)
   - [] Documentation Update (if none of the other choices apply)
   - [x] Code refactor (Modify the code structure, format the code, etc...)
   
   ## Checklist
   
   - [x] I have created an issue on (Fix #ISSUE) and described the bug/feature there in detail
   - [x] Compiling and unit tests pass locally with my changes
   - [x] I have added tests that prove my fix is effective or that my feature works
   - [x] Any dependent changes have been merged
   
   ## Further comments
   
   This modification may have a performance improvement of more than 25% during the hash table construction phase.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] stdpain commented on a change in pull request #5301: [optimization] avoid extra memory copy while build hash table

Posted by GitBox <gi...@apache.org>.
stdpain commented on a change in pull request #5301:
URL: https://github.com/apache/incubator-doris/pull/5301#discussion_r565785779



##########
File path: be/src/exec/hash_table.hpp
##########
@@ -72,6 +70,7 @@ inline HashTable::Bucket* HashTable::next_bucket(int64_t* bucket_idx) {
     return NULL;
 }
 
+// TODO:

Review comment:
       Sorry, this is a useless comment. I'll fix it




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] wangbo commented on a change in pull request #5301: [optimization] avoid extra memory copy while build hash table

Posted by GitBox <gi...@apache.org>.
wangbo commented on a change in pull request #5301:
URL: https://github.com/apache/incubator-doris/pull/5301#discussion_r565279504



##########
File path: be/src/exec/hash_table.h
##########
@@ -354,14 +344,16 @@ class HashTable {
     const int _node_byte_size;
     // Number of non-empty buckets.  Used to determine when to grow and rehash
     int64_t _num_filled_buckets;
-    // Memory to store node data.  This is not allocated from a pool to take advantage
-    // of realloc.
-    // TODO: integrate with mem pools
-    uint8_t* _nodes;
+    // Buffer to store node data.
+    uint8_t* _current_nodes;
     // number of nodes stored (i.e. size of hash table)
     int64_t _num_nodes;

Review comment:
       ```suggestion
       int64_t _total_used;
   ```
   Easier to understand 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] stdpain commented on a change in pull request #5301: [optimization] avoid extra memory copy while build hash table

Posted by GitBox <gi...@apache.org>.
stdpain commented on a change in pull request #5301:
URL: https://github.com/apache/incubator-doris/pull/5301#discussion_r564410490



##########
File path: be/src/exec/hash_table.h
##########
@@ -354,14 +344,16 @@ class HashTable {
     const int _node_byte_size;
     // Number of non-empty buckets.  Used to determine when to grow and rehash
     int64_t _num_filled_buckets;
-    // Memory to store node data.  This is not allocated from a pool to take advantage
-    // of realloc.
-    // TODO: integrate with mem pools
-    uint8_t* _nodes;
+    // Buffer to store node data.
+    uint8_t* _current_nodes;
     // number of nodes stored (i.e. size of hash table)
     int64_t _num_nodes;
-    // max number of nodes that can be stored in '_nodes' before realloc
-    int64_t _nodes_capacity;
+    // current nodes buffer capacity
+    int64_t _current_capacity;
+    // current used
+    int64_t _current_used;
+    // total capacity
+    int64_t _total_capacity;

Review comment:
       I hope Node's memory allocation strategy is like STL's vector. Increase by half of the current capcity each time. Of course, it is feasible to allocate fixed-length memory each time, but it is difficult to find a suitable value.
   
   _total_capacity is record current total capacity




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] stdpain commented on a change in pull request #5301: [optimization] avoid extra memory copy while build hash table

Posted by GitBox <gi...@apache.org>.
stdpain commented on a change in pull request #5301:
URL: https://github.com/apache/incubator-doris/pull/5301#discussion_r564406631



##########
File path: be/src/exec/hash_table.hpp
##########
@@ -82,77 +81,78 @@ inline void HashTable::insert_impl(TupleRow* row) {
     uint32_t hash = hash_current_row();
     int64_t bucket_idx = hash & (_num_buckets - 1);
 
-    if (_num_nodes == _nodes_capacity) {
+    if (_current_used == _current_capacity) {
         grow_node_array();
     }
+    // get a node from memory pool
+    Node* node = reinterpret_cast<Node*>(_current_nodes + _node_byte_size * _current_used++);
 
-    Node* node = get_node(_num_nodes);
     TupleRow* data = node->data();
     node->_hash = hash;
     memcpy(data, row, sizeof(Tuple*) * _num_build_tuples);
-    add_to_bucket(&_buckets[bucket_idx], _num_nodes, node);
+    add_to_bucket(&_buckets[bucket_idx], node);
     ++_num_nodes;

Review comment:
       _num_nodes is used by HashTable::size()




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] wangbo commented on a change in pull request #5301: [optimization] avoid extra memory copy while build hash table

Posted by GitBox <gi...@apache.org>.
wangbo commented on a change in pull request #5301:
URL: https://github.com/apache/incubator-doris/pull/5301#discussion_r564365947



##########
File path: be/src/exec/hash_table.hpp
##########
@@ -82,77 +81,78 @@ inline void HashTable::insert_impl(TupleRow* row) {
     uint32_t hash = hash_current_row();
     int64_t bucket_idx = hash & (_num_buckets - 1);
 
-    if (_num_nodes == _nodes_capacity) {
+    if (_current_used == _current_capacity) {
         grow_node_array();
     }
+    // get a node from memory pool
+    Node* node = reinterpret_cast<Node*>(_current_nodes + _node_byte_size * _current_used++);
 
-    Node* node = get_node(_num_nodes);
     TupleRow* data = node->data();
     node->_hash = hash;
     memcpy(data, row, sizeof(Tuple*) * _num_build_tuples);
-    add_to_bucket(&_buckets[bucket_idx], _num_nodes, node);
+    add_to_bucket(&_buckets[bucket_idx], node);
     ++_num_nodes;

Review comment:
       _num_nodes can be removed




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] stdpain commented on a change in pull request #5301: [optimization] avoid extra memory copy while build hash table

Posted by GitBox <gi...@apache.org>.
stdpain commented on a change in pull request #5301:
URL: https://github.com/apache/incubator-doris/pull/5301#discussion_r564410490



##########
File path: be/src/exec/hash_table.h
##########
@@ -354,14 +344,16 @@ class HashTable {
     const int _node_byte_size;
     // Number of non-empty buckets.  Used to determine when to grow and rehash
     int64_t _num_filled_buckets;
-    // Memory to store node data.  This is not allocated from a pool to take advantage
-    // of realloc.
-    // TODO: integrate with mem pools
-    uint8_t* _nodes;
+    // Buffer to store node data.
+    uint8_t* _current_nodes;
     // number of nodes stored (i.e. size of hash table)
     int64_t _num_nodes;
-    // max number of nodes that can be stored in '_nodes' before realloc
-    int64_t _nodes_capacity;
+    // current nodes buffer capacity
+    int64_t _current_capacity;
+    // current used
+    int64_t _current_used;
+    // total capacity
+    int64_t _total_capacity;

Review comment:
       I hope Node's memory allocation strategy is like STL's vector. Increase by half of the current capcity each time. Of course, it is feasible to allocate fixed-length memory each time, but it is difficult to find a suitable value.
   
   _total_capacity is record current total capcity




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] wangbo commented on a change in pull request #5301: [optimization] avoid extra memory copy while build hash table

Posted by GitBox <gi...@apache.org>.
wangbo commented on a change in pull request #5301:
URL: https://github.com/apache/incubator-doris/pull/5301#discussion_r564369483



##########
File path: be/src/exec/hash_table.cpp
##########
@@ -233,19 +239,19 @@ void HashTable::resize_buckets(int64_t num_buckets) {
 }
 
 void HashTable::grow_node_array() {
-    int64_t old_size = _nodes_capacity * _node_byte_size;
-    _nodes_capacity = _nodes_capacity + _nodes_capacity / 2;
-    int64_t new_size = _nodes_capacity * _node_byte_size;
-
-    uint8_t* new_nodes = reinterpret_cast<uint8_t*>(malloc(new_size));
-    memset(new_nodes, 0, new_size);
-    memcpy(new_nodes, _nodes, old_size);
-    free(_nodes);
-    _nodes = new_nodes;
-
-    _mem_tracker->Consume(new_size - old_size);
+    _current_capacity = _total_capacity / 2;

Review comment:
       It seems that ```_current_capacity```  become smaller after grow_node_array.
   Why design like this?
   It may cause more  ```malloc``` calls.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] wangbo commented on a change in pull request #5301: [optimization] avoid extra memory copy while build hash table

Posted by GitBox <gi...@apache.org>.
wangbo commented on a change in pull request #5301:
URL: https://github.com/apache/incubator-doris/pull/5301#discussion_r565265346



##########
File path: be/src/exec/hash_table.cpp
##########
@@ -233,19 +239,19 @@ void HashTable::resize_buckets(int64_t num_buckets) {
 }
 
 void HashTable::grow_node_array() {
-    int64_t old_size = _nodes_capacity * _node_byte_size;
-    _nodes_capacity = _nodes_capacity + _nodes_capacity / 2;
-    int64_t new_size = _nodes_capacity * _node_byte_size;
-
-    uint8_t* new_nodes = reinterpret_cast<uint8_t*>(malloc(new_size));
-    memset(new_nodes, 0, new_size);
-    memcpy(new_nodes, _nodes, old_size);
-    free(_nodes);
-    _nodes = new_nodes;
-
-    _mem_tracker->Consume(new_size - old_size);
+    _current_capacity = _total_capacity / 2;
+    _total_capacity += _current_capacity;
+    int64_t alloc_size = _current_capacity * _node_byte_size;
+    _current_nodes = reinterpret_cast<uint8_t*>(malloc(alloc_size));
+    _current_used = 0;
+    // TODO: remote memset later

Review comment:
       ```suggestion
       // TODO: remove memset later
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] wangbo commented on a change in pull request #5301: [optimization] avoid extra memory copy while build hash table

Posted by GitBox <gi...@apache.org>.
wangbo commented on a change in pull request #5301:
URL: https://github.com/apache/incubator-doris/pull/5301#discussion_r565274289



##########
File path: be/src/exec/hash_table.hpp
##########
@@ -72,6 +70,7 @@ inline HashTable::Bucket* HashTable::next_bucket(int64_t* bucket_idx) {
     return NULL;
 }
 
+// TODO:

Review comment:
       todo what here




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] wangbo commented on a change in pull request #5301: [optimization] avoid extra memory copy while build hash table

Posted by GitBox <gi...@apache.org>.
wangbo commented on a change in pull request #5301:
URL: https://github.com/apache/incubator-doris/pull/5301#discussion_r564372911



##########
File path: be/src/exec/hash_table.h
##########
@@ -354,14 +344,16 @@ class HashTable {
     const int _node_byte_size;
     // Number of non-empty buckets.  Used to determine when to grow and rehash
     int64_t _num_filled_buckets;
-    // Memory to store node data.  This is not allocated from a pool to take advantage
-    // of realloc.
-    // TODO: integrate with mem pools
-    uint8_t* _nodes;
+    // Buffer to store node data.
+    uint8_t* _current_nodes;
     // number of nodes stored (i.e. size of hash table)
     int64_t _num_nodes;
-    // max number of nodes that can be stored in '_nodes' before realloc
-    int64_t _nodes_capacity;
+    // current nodes buffer capacity
+    int64_t _current_capacity;
+    // current used
+    int64_t _current_used;
+    // total capacity
+    int64_t _total_capacity;

Review comment:
       What is the meaning of ```_total_capacity```?
   It neither affect memory allocation , nor ```grow_node_array```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] morningman merged pull request #5301: [optimization] avoid extra memory copy while build hash table

Posted by GitBox <gi...@apache.org>.
morningman merged pull request #5301:
URL: https://github.com/apache/incubator-doris/pull/5301


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org