You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@doris.apache.org by ya...@apache.org on 2021/03/21 09:39:07 UTC
[incubator-doris] 01/03: [Bug] Fix the memory expand 10~1000x of compression algorithm (#5504)

This is an automated email from the ASF dual-hosted git repository.

yangzhg pushed a commit to branch branch-0.14
in repository https://gitbox.apache.org/repos/asf/incubator-doris.git

commit f1860f7258338a665750fb27c3079146ed3a45db
Author: HappenLee <ha...@hotmail.com>
AuthorDate: Fri Mar 12 23:04:07 2021 +0800

    [Bug] Fix the memory expand 10~1000x of compression algorithm (#5504)
    
    Fix the memory expand 10~1000x of compression algorithm in load and compaction
    
    (cherry picked from commit e9a73ee278250ef09fec1963dff91ed29fdd42dd)
---
 be/src/olap/rowset/segment_v2/page_io.cpp          |   3 +
 docs/en/administrator-guide/bucket-shuffle-join.md | 105 ++++++++++++++++++++
 .../administrator-guide/bucket-shuffle-join.md     | 106 +++++++++++++++++++++
 3 files changed, 214 insertions(+)

diff --git a/be/src/olap/rowset/segment_v2/page_io.cpp b/be/src/olap/rowset/segment_v2/page_io.cpp
index f59174c..16037da 100644
--- a/be/src/olap/rowset/segment_v2/page_io.cpp
+++ b/be/src/olap/rowset/segment_v2/page_io.cpp
@@ -50,6 +50,9 @@ Status PageIO::compress_page_body(const BlockCompressionCodec* codec, double min
         double space_saving = 1.0 - static_cast<double>(buf.size()) / uncompressed_size;
         // return compressed body only when it saves more than min_space_saving
         if (space_saving > 0 && space_saving >= min_space_saving) {
+            // shrink the buf to fit the len size to avoid taking
+            // up the memory of the size MAX_COMPRESSED_SIZE
+            buf.shrink_to_fit();
             *compressed_body = buf.build();
             return Status::OK();
         }
diff --git a/docs/en/administrator-guide/bucket-shuffle-join.md b/docs/en/administrator-guide/bucket-shuffle-join.md
new file mode 100644
index 0000000..0e67268
--- /dev/null
+++ b/docs/en/administrator-guide/bucket-shuffle-join.md
@@ -0,0 +1,105 @@
+---
+{
+    "title": "Bucket Shuffle Join",
+    "language": "en"
+}
+---
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Bucket Shuffle Join
+
+Bucket Shuffle Join is a new function officially added in Doris 0.14. The purpose is to provide local optimization for some join queries to reduce the time-consuming of data transmission between nodes and speed up the query.
+
+It's design, implementation can be referred to [ISSUE 4394](https://github.com/apache/incubator-doris/issues/4394)。
+
+## Noun Interpretation
+
+* FE: Frontend, the front-end node of Doris. Responsible for metadata management and request access.
+* BE: Backend, Doris's back-end node. Responsible for query execution and data storage.
+* Left table: the left table in join query. Perform probe expr. The order can be adjusted by join reorder.
+* Right table: the right table in join query. Perform build expr The order can be adjusted by join reorder.
+
+## Principle
+The conventional distributed join methods supported by Doris is: `Shuffle Join, Broadcast Join`. Both of these join will lead to some network overhead.
+
+For example, there are join queries for table A and table B. the join method is hashjoin. The cost of different join types is as follows：
+* **Broadcast Join**: If table a has three executing hashjoinnodes according to the data distribution, table B needs to be sent to the three HashJoinNode. Its network overhead is `3B `, and its memory overhead is `3B`. 
+* **Shuffle Join**: Shuffle join will distribute the data of tables A and B to the nodes of the cluster according to hash calculation, so its network overhead is `A + B` and memory overhead is `B`.
+
+The data distribution information of each Doris table is saved in FE. If the join statement hits the data distribution column of the left table, we should use the data distribution information to reduce the network and memory overhead of the join query. This is the source of the idea of bucket shuffle join.
+
+![image.png](/images/bucket_shuffle_join.png)
+
+The picture above shows how the Bucket Shuffle Join works. The SQL query is A table join B table. The equivalent expression of join hits the data distribution column of A. According to the data distribution information of table A. Bucket Shuffle Join sends the data of table B to the corresponding data storage and calculation node of table A. The cost of Bucket Shuffle Join is as follows:
+
+* network cost： ``` B < min(3B, A + B) ```
+
+* memory cost： ``` B <= min(3B, B) ```
+
+Therefore, compared with Broadcast Join and Shuffle Join, Bucket shuffle join has obvious performance advantages. It reduces the time-consuming of data transmission between nodes and the memory cost of join. Compared with Doris's original join method, it has the following advantages
+
+* First of all, Bucket Shuffle Join reduces the network and memory cost which makes some join queries have better performance. Especially when FE can perform partition clipping and bucket clipping of the left table.
+* Secondly, unlike Colorate Join, it is not intrusive to the data distribution of tables, which is transparent to users. There is no mandatory requirement for the data distribution of the table, which is not easy to lead to the problem of data skew.
+* Finally, it can provide more optimization space for join reorder.
+
+## Usage
+
+### Set session variable
+
+Set session variable `enable_bucket_shuffle_join` to `true`, FE will automatically plan queries that can be converted to Bucket Shuffle Join.
+
+```
+set enable_bucket_shuffle_join = true;
+```
+
+In FE's distributed query planning, the priority order is Colorate Join -> Bucket Shuffle Join -> Broadcast Join -> Shuffle Join. However, if the user explicitly hints the type of join, for example: 
+
+```
+select * from test join [shuffle] baseall on test.k1 = baseall.k1;
+```
+the above order of preference will not take effect.
+
+The session variable is set to `true` by default in version 0.14, while it needs to be set to `true` manually in version 0.13.
+
+### View the type of join
+
+You can use the `explain` command to check whether the join is a Bucket Shuffle Join
+
+```
+|   2:HASH JOIN                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              [...]
+|   |  join op: INNER JOIN (BUCKET_SHUFFLE)                                                                                                                                                                                                                                                                                                                                                                                                                                                                  [...]
+|   |  hash predicates:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      [...]
+|   |  colocate: false, reason: table not in the same group                                                                                                                                                                                                                                                                                                                                                                                                                                                  [...]
+|   |  equal join conjunct: `test`.`k1` = `baseall`.`k1`                                         
+```
+
+The join type indicates that the join method to be used is：`BUCKET_SHUFFLE`。
+
+## Planning rules of Bucket Shuffle Join
+
+In most scenarios, users only need to turn on the seesion variable by default to transparently use the performance improvement brought by this join method. However, if we understand the planning rules of Bucket Shuffle Join, we can use it to write more efficient SQL.
+
+* Bucket Shuffle Join only works when the join condition is equivalent. The reason is similar to Colorate Join. They all rely on hash to calculate the determined data distribution.
+* The bucket column of two tables is included in the equivalent join condition. When the bucket column of the left table is an equivalent join condition, it has a high probability of being planned as a Bucket Shuffle Join.
+* Because the hash values of different data types have different calculation results. Bucket Shuffle Join requires that the bucket column type of the left table and the equivalent join column type of the right table should be consistent, otherwise the corresponding planning cannot be carried out.
+* Bucket Shuffle Join only works on Doris native OLAP tables. For ODBC, MySQL, ES External Table, when they are used as left tables, they cannot be planned as Bucket Shuffle Join.
+* For partitioned tables, because the data distribution rules of each partition may be different, the Bucket Shuffle Join can only guarantee that the left table is a single partition. Therefore, in SQL execution, we need to use the `where` condition as far as possible to make the partition clipping policy effective.
+* If the left table is a colorate table, the data distribution rules of each partition are determined. So the bucket shuffle join can perform better on the colorate table.
diff --git a/docs/zh-CN/administrator-guide/bucket-shuffle-join.md b/docs/zh-CN/administrator-guide/bucket-shuffle-join.md
new file mode 100644
index 0000000..38cd33a
--- /dev/null
+++ b/docs/zh-CN/administrator-guide/bucket-shuffle-join.md
@@ -0,0 +1,106 @@
+---
+{
+    "title": "Bucket Shuffle Join",
+    "language": "zh-CN"
+}
+---
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Bucket Shuffle Join
+
+Bucket Shuffle Join 是在 Doris 0.14 版本中正式加入的新功能。旨在为某些 Join 查询提供本地性优化，来减少数据在节点间的传输耗时，来加速查询。
+
+它的设计、实现和效果可以参阅 [ISSUE 4394](https://github.com/apache/incubator-doris/issues/4394)。
+
+## 名词解释
+
+* FE：Frontend，Doris 的前端节点。负责元数据管理和请求接入。
+* BE：Backend，Doris 的后端节点。负责查询执行和数据存储。
+* 左表：Join查询时，左边的表。进行Probe操作。可被Join Reorder调整顺序。
+* 右表：Join查询时，右边的表。进行Build操作。可被Join Reorder调整顺序。
+
+## 原理
+Doris支持的常规分布式Join方式包括了shuffle join 和broadcast join。这两种join都会导致不小的网络开销:
+
+举个例子，当前存在A表与B表的Join查询，它的Join方式为HashJoin，不同Join类型的开销如下：
+* **Broadcast Join**: 如果根据数据分布，查询规划出A表有3个执行的HashJoinNode，那么需要将B表全量的发送到3个HashJoinNode，那么它的网络开销是`3B`，它的内存开销也是`3B`。 
+* **Shuffle Join**: Shuffle Join会将A，B两张表的数据根据哈希计算分散到集群的节点之中，所以它的网络开销为 ```A + B```，内存开销为`B`。
+
+在FE之中保存了Doris每个表的数据分布信息，如果join语句命中了表的数据分布列，我们应该使用数据分布信息来减少join语句的网络与内存开销，这就是Bucket Shuffle Join的思路来源。
+
+![image.png](/images/bucket_shuffle_join.png)
+
+上面的图片展示了Bucket Shuffle Join的工作原理。SQL语句为 A表 join B表，并且join的等值表达式命中了A的数据分布列。而Bucket Shuffle Join会根据A表的数据分布信息，将B表的数据发送到对应的A表的数据存储计算节点。Bucket Shuffle Join开销如下：
+
+* 网络开销： ``` B < min(3B, A + B) ```
+
+* 内存开销： ``` B <= min(3B, B) ```
+
+可见，相比于Broadcast Join与Shuffle Join， Bucket Shuffle Join有着较为明显的性能优势。减少数据在节点间的传输耗时和Join时的内存开销。相对于Doris原有的Join方式，它有着下面的优点
+
+* 首先，Bucket-Shuffle-Join降低了网络与内存开销，使一些Join查询具有了更好的性能。尤其是当FE能够执行左表的分区裁剪与桶裁剪时。
+* 其次，同时与Colocate Join不同，它对于表的数据分布方式并没有侵入性，这对于用户来说是透明的。对于表的数据分布没有强制性的要求，不容易导致数据倾斜的问题。
+* 最后，它可以为Join Reorder提供更多可能的优化空间。
+
+## 使用方式
+
+### 设置Session变量
+
+将session变量`enable_bucket_shuffle_join`设置为`true`，则FE在进行查询规划时就会默认将能够转换为Bucket Shuffle Join的查询自动规划为Bucket Shuffle Join。
+
+```
+set enable_bucket_shuffle_join = true;
+```
+
+在FE进行分布式查询规划时，优先选择的顺序为 Colocate Join -> Bucket Shuffle Join -> Brocast Join -> Shuffle Join。但是如果用户显式hint了Join的类型，如：    
+
+```
+select * from test join [shuffle] baseall on test.k1 = baseall.k1;
+```
+
+则上述的选择优先顺序则不生效。
+
+该session变量在0.14版本默认为`true`, 而0.13版本需要手动设置为`true`。
+
+### 查看Join的类型
+
+可以通过`explain`命令来查看Join是否为Bucket Shuffle Join：
+
+```
+|   2:HASH JOIN                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              [...]
+|   |  join op: INNER JOIN (BUCKET_SHUFFLE)                                                                                                                                                                                                                                                                                                                                                                                                                                                                  [...]
+|   |  hash predicates:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      [...]
+|   |  colocate: false, reason: table not in the same group                                                                                                                                                                                                                                                                                                                                                                                                                                                  [...]
+|   |  equal join conjunct: `test`.`k1` = `baseall`.`k1`                                         
+```
+
+在Join类型之中会指明使用的Join方式为：`BUCKET_SHUFFLE`。
+
+## Bucket Shuffle Join的规划规则
+
+在绝大多数场景之中，用户只需要默认打开seesion变量的开关就可以透明的使用这种Join方式带来的性能提升，但是如果了解Bucket Shuffle Join的规划规则，可以帮助我们利用它写出更加高效的SQL。
+
+* Bucket Shuffle Join只生效于Join条件为等值的场景，原因与Colocate Join类似，它们都依赖hash来计算确定的数据分布。
+* 在等值Join条件之中包含两张表的分桶列，当左表的分桶列为等值的Join条件时，它有很大概率会被规划为Bucket Shuffle Join。
+* 由于不同的数据类型的hash值计算结果不同，所以Bucket Shuffle Join要求左表的分桶列的类型与右表等值join列的类型需要保持一致，否则无法进行对应的规划。
+* Bucket Shuffle Join只作用于Doris原生的OLAP表，对于ODBC，MySQL，ES等外表，当其作为左表时是无法规划生效的。
+* 对于分区表，由于每一个分区的数据分布规则可能不同，所以Bucket Shuffle Join只能保证左表为单分区时生效。所以在SQL执行之中，需要尽量使用`where`条件使分区裁剪的策略能够生效。
+* 假如左表为Colocate的表，那么它每个分区的数据分布规则是确定的，Bucket Shuffle Join能在Colocate表上表现更好。

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org