Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2022/07/08 09:43:14 UTC

[GitHub] [doris] mrhhsg opened a new pull request, #10700: [improvement]pre-serialize aggregation keys

mrhhsg opened a new pull request, #10700:
URL: https://github.com/apache/doris/pull/10700

   # Proposed changes
   
   Issue Number: close #xxx
   
   ## Problem Summary:
   
   Tested with ssb-flat 100g using the following SQL:
   ```sql
   SELECT count() FROM (SELECT C_CITY, SUM(LO_REVENUE) AS revenue FROM lineorder_flat GROUP BY C_CITY, S_CITY) a;
   ```
   
   ||without pre-serialization|with pre-serialization|
   |-|-|-|
   |profile|<img width="446" alt="image" src="https://user-images.githubusercontent.com/1179834/177964945-8803ad98-923b-4468-848d-2dd83c31ebb8.png">|<img width="454" alt="image" src="https://user-images.githubusercontent.com/1179834/177964545-899a4045-179c-47ea-8ec7-18fefc1d7e71.png">|
   
   ## Checklist(Required)
   
   1. Does it affect the original behavior: (Yes/No/I Don't know)
   2. Has unit tests been added: (Yes/No/No Need)
   3. Has document been added or modified: (Yes/No/No Need)
   4. Does it need to update dependencies: (Yes/No)
   5. Are there any changes that cannot be rolled back: (Yes/No)
   
   ## Further comments
   
   If this is a relatively large or complex change, kick off the discussion at [dev@doris.apache.org](mailto:dev@doris.apache.org) by explaining why you chose the solution you did and what alternatives you considered, etc...
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] yiguolei commented on a diff in pull request #10700: [improvement]pre-serialize aggregation keys

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #10700:
URL: https://github.com/apache/doris/pull/10700#discussion_r916899062


##########
be/src/vec/exec/vaggregation_node.h:
##########
@@ -50,13 +50,41 @@ struct AggregationMethodSerialized {
     Data data;
     Iterator iterator;
     bool inited = false;
+    std::vector<StringRef> keys;
+    AggregationMethodSerialized()
+            : _serialized_key_buffer_size(0),
+              _serialized_key_buffer(nullptr),
+              _mem_pool(new MemPool) {}
 
-    AggregationMethodSerialized() = default;
+    using State = ColumnsHashing::HashMethodSerialized<typename Data::value_type, Mapped, true>;
 
     template <typename Other>
     explicit AggregationMethodSerialized(const Other& other) : data(other.data) {}
 
-    using State = ColumnsHashing::HashMethodSerialized<typename Data::value_type, Mapped>;
+    void serialize_keys(const ColumnRawPtrs& key_columns, const size_t num_rows) {
+        size_t max_one_row_byte_size = 0;
+        for (const auto& column : key_columns) {
+            max_one_row_byte_size += column->get_max_row_byte_size();

Review Comment:
   Maybe not; the memory is allocated block by block.





[GitHub] [doris] github-actions[bot] commented on pull request #10700: [improvement]pre-serialize aggregation keys

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #10700:
URL: https://github.com/apache/doris/pull/10700#issuecomment-1179148094

   PR approved by anyone and no changes requested.




[GitHub] [doris] yiguolei merged pull request #10700: [improvement]pre-serialize aggregation keys

Posted by GitBox <gi...@apache.org>.
yiguolei merged PR #10700:
URL: https://github.com/apache/doris/pull/10700




[GitHub] [doris] github-actions[bot] commented on pull request #10700: [improvement]pre-serialize aggregation keys

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #10700:
URL: https://github.com/apache/doris/pull/10700#issuecomment-1179148054

   PR approved by at least one committer and no changes requested.




[GitHub] [doris] yiguolei commented on a diff in pull request #10700: [improvement]pre-serialize aggregation keys

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #10700:
URL: https://github.com/apache/doris/pull/10700#discussion_r916681902


##########
be/src/vec/exec/vaggregation_node.h:
##########
@@ -50,13 +50,42 @@ struct AggregationMethodSerialized {
     Data data;
     Iterator iterator;
     bool inited = false;
+    std::vector<StringRef> keys;
+    AggregationMethodSerialized()
+            : _serialized_key_buffer_size(0),
+              _serialized_key_buffer(nullptr),
+              _mem_pool(new MemPool) {}
 
-    AggregationMethodSerialized() = default;
+    using State = ColumnsHashing::HashMethodSerialized<typename Data::value_type, Mapped, true>;
 
     template <typename Other>
     explicit AggregationMethodSerialized(const Other& other) : data(other.data) {}
 
-    using State = ColumnsHashing::HashMethodSerialized<typename Data::value_type, Mapped>;
+    void serialize_keys(const ColumnRawPtrs& key_columns, const size_t num_rows) {
+        size_t max_one_row_byte_size = 0;
+        for (const auto& column : key_columns) {
+            max_one_row_byte_size +=
+                    std::max(max_one_row_byte_size, column->get_max_row_byte_size());

Review Comment:
   Should this be `max_one_row_byte_size += column->get_max_row_byte_size()`?
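
A small sketch of the difference between the two accumulation forms, using hypothetical per-column maxima rather than real Doris columns. The function names are illustrative only. The `std::max` form never underestimates, but it can reserve far more than the plain sum whenever a wide column precedes narrower ones.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Suggested form: the row bound is the sum of the per-column bounds.
std::size_t sum_of_column_maxima(const std::vector<std::size_t>& column_max) {
    std::size_t total = 0;
    for (std::size_t m : column_max) {
        total += m;
    }
    return total;
}

// Form from the diff under review: total += std::max(total, m).
// Each step adds at least m, so the result is still an upper bound,
// but it can roughly double on every column.
std::size_t max_accumulated(const std::vector<std::size_t>& column_max) {
    std::size_t total = 0;
    for (std::size_t m : column_max) {
        total += std::max(total, m);
    }
    return total;
}
```

With column maxima of 16, 8, and 4 bytes, the plain sum is 28 while the `std::max` form yields 64.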





[GitHub] [doris] BiteTheDDDDt commented on a diff in pull request #10700: [improvement]pre-serialize aggregation keys

Posted by GitBox <gi...@apache.org>.
BiteTheDDDDt commented on code in PR #10700:
URL: https://github.com/apache/doris/pull/10700#discussion_r916864491


##########
be/src/vec/columns/column_nullable.cpp:
##########
@@ -134,6 +134,24 @@ const char* ColumnNullable::deserialize_and_insert_from_arena(const char* pos) {
     return pos;
 }
 
+size_t ColumnNullable::get_max_row_byte_size() const {
+    constexpr auto flag_size = sizeof(get_null_map_data()[0]);
+    return flag_size + get_nested_column().get_max_row_byte_size();
+}
+
+void ColumnNullable::serialize_vec(std::vector<StringRef>& keys, size_t num_rows,
+                                   size_t max_row_byte_size) const {
+    const auto& arr = get_null_map_data();
+    static constexpr auto s = sizeof(arr[0]);
+    for (size_t i = 0; i < num_rows; ++i) {
+        auto* val = const_cast<char*>(keys[i].data + keys[i].size);
+        *val = (arr[i] ? 1 : 0);

Review Comment:
   Can we just use `*val = arr[i]`?





[GitHub] [doris] BiteTheDDDDt commented on a diff in pull request #10700: [improvement]pre-serialize aggregation keys

Posted by GitBox <gi...@apache.org>.
BiteTheDDDDt commented on code in PR #10700:
URL: https://github.com/apache/doris/pull/10700#discussion_r916821559


##########
be/src/vec/common/columns_hashing.h:
##########
@@ -111,29 +111,48 @@ struct HashMethodString : public columns_hashing_impl::HashMethodBase<
   * That is, for example, for strings, it contains first the serialized length of the string, and then the bytes.
   * Therefore, when aggregating by several strings, there is no ambiguity.
   */
-template <typename Value, typename Mapped>
+template <typename Value, typename Mapped, bool keys_pre_serialized = false>
 struct HashMethodSerialized
-        : public columns_hashing_impl::HashMethodBase<HashMethodSerialized<Value, Mapped>, Value,
-                                                      Mapped, false> {
-    using Self = HashMethodSerialized<Value, Mapped>;
+        : public columns_hashing_impl::HashMethodBase<
+                  HashMethodSerialized<Value, Mapped, keys_pre_serialized>, Value, Mapped, false> {
+    using Self = HashMethodSerialized<Value, Mapped, keys_pre_serialized>;
     using Base = columns_hashing_impl::HashMethodBase<Self, Value, Mapped, false>;
+    using KeyHolderType =
+            std::conditional_t<keys_pre_serialized, ArenaKeyHolder, SerializedKeyHolder>;
 
     ColumnRawPtrs key_columns;
     size_t keys_size;
+    const StringRef* keys;
 
     HashMethodSerialized(const ColumnRawPtrs& key_columns_, const Sizes& /*key_sizes*/,
                          const HashMethodContextPtr&)
             : key_columns(key_columns_), keys_size(key_columns_.size()) {}
 
+    void set_serialized_keys(StringRef* keys_) { keys = keys_; }

Review Comment:
   Maybe we can add `const` here.
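
A minimal sketch of what a `const` parameter buys here, assuming a simplified stand-in for `StringRef`. `KeyRef`, `StateSketch`, and `total_key_bytes` are hypothetical names, not Doris types. Since the member is already declared `const StringRef* keys`, a `const` parameter lets callers pass read-only key arrays without a cast.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a (pointer, length) key reference.
struct KeyRef {
    const char* data;
    std::size_t size;
};

struct StateSketch {
    const KeyRef* keys = nullptr;

    // Taking a pointer-to-const matches the member type and accepts
    // both mutable and read-only arrays.
    void set_serialized_keys(const KeyRef* keys_) { keys = keys_; }
};

// Reads the keys through the const pointer; nothing is modified.
std::size_t total_key_bytes(const StateSketch& state, std::size_t num_keys) {
    std::size_t total = 0;
    for (std::size_t i = 0; i < num_keys; ++i) {
        total += state.keys[i].size;
    }
    return total;
}
```

With a non-const `KeyRef*` parameter, passing `data()` of a `const std::vector<KeyRef>` would not compile.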





[GitHub] [doris] mrhhsg commented on a diff in pull request #10700: [improvement]pre-serialize aggregation keys

Posted by GitBox <gi...@apache.org>.
mrhhsg commented on code in PR #10700:
URL: https://github.com/apache/doris/pull/10700#discussion_r916913762


##########
be/src/vec/columns/column_nullable.cpp:
##########
@@ -134,6 +134,24 @@ const char* ColumnNullable::deserialize_and_insert_from_arena(const char* pos) {
     return pos;
 }
 
+size_t ColumnNullable::get_max_row_byte_size() const {
+    constexpr auto flag_size = sizeof(get_null_map_data()[0]);
+    return flag_size + get_nested_column().get_max_row_byte_size();
+}
+
+void ColumnNullable::serialize_vec(std::vector<StringRef>& keys, size_t num_rows,
+                                   size_t max_row_byte_size) const {
+    const auto& arr = get_null_map_data();
+    static constexpr auto s = sizeof(arr[0]);
+    for (size_t i = 0; i < num_rows; ++i) {
+        auto* val = const_cast<char*>(keys[i].data + keys[i].size);
+        *val = (arr[i] ? 1 : 0);

Review Comment:
   The null-map value for `NULL` may be 1 or `JOIN_NULL_HINT` (2).
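
A hedged sketch of why the flag is normalized before it is written into the key. The layout below (one flag byte, then the value only for non-null rows) is an assumption for illustration, and `kJoinNullHint` and `serialize_nullable_int` are hypothetical names. Serialized keys are compared byte by byte, so two NULL rows with different raw flag values would otherwise land in different groups.

```cpp
#include <cstdint>
#include <string>

// Hypothetical constant mirroring the JOIN_NULL_HINT(2) value mentioned above.
constexpr std::uint8_t kJoinNullHint = 2;

// Serializes one nullable int32 key into a byte string.
// When `normalize` is true, any non-zero flag is written as 1.
std::string serialize_nullable_int(std::uint8_t null_flag, std::int32_t value, bool normalize) {
    std::string out;
    const std::uint8_t flag =
            normalize ? static_cast<std::uint8_t>(null_flag ? 1 : 0) : null_flag;
    out.push_back(static_cast<char>(flag));
    if (null_flag == 0) {
        // Only non-null rows carry a payload in this sketch.
        out.append(reinterpret_cast<const char*>(&value), sizeof(value));
    }
    return out;
}
```

Without normalization, a NULL flagged as 1 and a NULL flagged as 2 serialize to different bytes; with it, both produce the same key.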





[GitHub] [doris] yiguolei commented on a diff in pull request #10700: [improvement]pre-serialize aggregation keys

Posted by GitBox <gi...@apache.org>.
yiguolei commented on code in PR #10700:
URL: https://github.com/apache/doris/pull/10700#discussion_r916669026


##########
be/src/vec/columns/column.h:
##########
@@ -246,6 +246,14 @@ class IColumn : public COW<IColumn> {
     /// Returns pointer to the position after the read data.
     virtual const char* deserialize_and_insert_from_arena(const char* pos) = 0;
 
+    virtual size_t get_max_row_byte_size() const { return 0; }

Review Comment:
   Please add comments for the new method so that other people can read the code more easily.
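
One possible shape for such a comment, as a sketch only. The wording of the contract is inferred from how the value is summed across key columns in this PR, not taken from the Doris source, and `ColumnSketch` and `Int32ColumnSketch` are hypothetical types.

```cpp
#include <cstddef>
#include <cstdint>

struct ColumnSketch {
    virtual ~ColumnSketch() = default;

    /// Returns an upper bound, in bytes, on the serialized size of any
    /// single row of this column. Callers sum this over all key columns
    /// to size a per-row serialization buffer.
    /// The default of 0 mirrors the diff; column types that support
    /// pre-serialization are expected to override it (assumption).
    virtual std::size_t get_max_row_byte_size() const { return 0; }
};

// A fixed-width column can report its element size directly.
struct Int32ColumnSketch : ColumnSketch {
    std::size_t get_max_row_byte_size() const override { return sizeof(std::int32_t); }
};
```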





[GitHub] [doris] BiteTheDDDDt commented on a diff in pull request #10700: [improvement]pre-serialize aggregation keys

Posted by GitBox <gi...@apache.org>.
BiteTheDDDDt commented on code in PR #10700:
URL: https://github.com/apache/doris/pull/10700#discussion_r916847546


##########
be/src/vec/columns/column_nullable.cpp:
##########
@@ -134,6 +134,24 @@ const char* ColumnNullable::deserialize_and_insert_from_arena(const char* pos) {
     return pos;
 }
 
+size_t ColumnNullable::get_max_row_byte_size() const {
+    constexpr auto flag_size = sizeof(get_null_map_data()[0]);

Review Comment:
   Maybe we can just use `NullMap::T`.
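
A sketch of the two ways to obtain the flag size, assuming the null map exposes an element type alias. Here `std::vector<std::uint8_t>` and its `value_type` stand in for the real container and its alias, and all `*Sketch` names are hypothetical. Taking `sizeof` of the type avoids spelling out an indexing expression on an accessor.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for the null map container (assumption for this sketch).
using NullMapSketch = std::vector<std::uint8_t>;

struct NullableColumnSketch {
    NullMapSketch null_map;

    const NullMapSketch& get_null_map_data() const { return null_map; }

    // Form used in the diff: sizeof applied to an (unevaluated) element expression.
    std::size_t flag_size_from_expr() const { return sizeof(get_null_map_data()[0]); }

    // Suggested form: sizeof applied to the element type alias.
    static constexpr std::size_t flag_size_from_type() {
        return sizeof(NullMapSketch::value_type);
    }
};
```

Both forms give the same value; the type-based one also works in a static context with no column instance.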





[GitHub] [doris] BiteTheDDDDt commented on a diff in pull request #10700: [improvement]pre-serialize aggregation keys

Posted by GitBox <gi...@apache.org>.
BiteTheDDDDt commented on code in PR #10700:
URL: https://github.com/apache/doris/pull/10700#discussion_r916872115


##########
be/src/vec/exec/vaggregation_node.h:
##########
@@ -50,13 +50,41 @@ struct AggregationMethodSerialized {
     Data data;
     Iterator iterator;
     bool inited = false;
+    std::vector<StringRef> keys;
+    AggregationMethodSerialized()
+            : _serialized_key_buffer_size(0),
+              _serialized_key_buffer(nullptr),
+              _mem_pool(new MemPool) {}
 
-    AggregationMethodSerialized() = default;
+    using State = ColumnsHashing::HashMethodSerialized<typename Data::value_type, Mapped, true>;
 
     template <typename Other>
     explicit AggregationMethodSerialized(const Other& other) : data(other.data) {}
 
-    using State = ColumnsHashing::HashMethodSerialized<typename Data::value_type, Mapped>;
+    void serialize_keys(const ColumnRawPtrs& key_columns, const size_t num_rows) {
+        size_t max_one_row_byte_size = 0;
+        for (const auto& column : key_columns) {
+            max_one_row_byte_size += column->get_max_row_byte_size();

Review Comment:
   Should we consider the case where a string column has a few long strings? This may greatly increase memory allocation.
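
The concern can be quantified with a small sketch, assuming every row reserves the column's maximum row size. Length prefixes and other key columns are ignored, and the function names are hypothetical. It compares that worst-case reservation with the bytes the strings actually occupy.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Bytes reserved if every row gets a slot as large as the longest string.
std::size_t reserved_bytes(const std::vector<std::string>& column) {
    std::size_t max_len = 0;
    for (const auto& s : column) {
        max_len = std::max(max_len, s.size());
    }
    return max_len * column.size();
}

// Bytes the strings actually occupy.
std::size_t used_bytes(const std::vector<std::string>& column) {
    std::size_t total = 0;
    for (const auto& s : column) {
        total += s.size();
    }
    return total;
}
```

For a column holding "a", "bb", and one 1000-byte string, the reservation is 3000 bytes against 1003 bytes of data, so a single outlier dominates the allocation.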





[GitHub] [doris] BiteTheDDDDt commented on a diff in pull request #10700: [improvement]pre-serialize aggregation keys

Posted by GitBox <gi...@apache.org>.
BiteTheDDDDt commented on code in PR #10700:
URL: https://github.com/apache/doris/pull/10700#discussion_r916867517


##########
be/src/vec/exec/vaggregation_node.cpp:
##########
@@ -1034,6 +1049,12 @@ Status AggregationNode::_merge_with_serialized_key(Block* block) {
                 using HashMethodType = std::decay_t<decltype(agg_method)>;
                 using AggState = typename HashMethodType::State;
                 AggState state(key_columns, _probe_key_sz, nullptr);
+                if constexpr (ColumnsHashing::IsPreSerializedKeysHashMethodTraits<
+                                      AggState>::value) {
+                    SCOPED_TIMER(_serialize_key_timer);

Review Comment:
   Maybe we can abstract this duplicated code.
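
One way the repeated `if constexpr` block could be factored out, as a sketch under stated assumptions. The trait `IsPreSerializedSketch`, the helper `prepare_keys`, and both state types are hypothetical and use `int` keys for brevity. The real code would also need the timer and the `serialize_keys` call. Requires C++17.

```cpp
#include <type_traits>
#include <utility>
#include <vector>

// Hash-method state that does not use pre-serialized keys.
struct PlainStateSketch {};

// Hash-method state that reads keys serialized ahead of time.
struct PreSerializedStateSketch {
    const int* keys = nullptr;
    void set_serialized_keys(const int* keys_) { keys = keys_; }
};

// Detects whether a state type offers set_serialized_keys(const int*).
template <typename State, typename = void>
struct IsPreSerializedSketch : std::false_type {};

template <typename State>
struct IsPreSerializedSketch<
        State, std::void_t<decltype(std::declval<State&>().set_serialized_keys(
                       static_cast<const int*>(nullptr)))>> : std::true_type {};

// Single helper replacing the branch repeated at each call site.
// Returns true when keys were handed to the state.
template <typename State>
bool prepare_keys(State& state, const std::vector<int>& serialized) {
    if constexpr (IsPreSerializedSketch<State>::value) {
        state.set_serialized_keys(serialized.data());
        return true;
    } else {
        (void)state;
        (void)serialized;
        return false;
    }
}
```

Each call site then reduces to one `prepare_keys(state, keys)` call, and the branch is resolved at compile time.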



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org