Posted to commits@doris.apache.org by ji...@apache.org on 2022/07/29 05:05:35 UTC
[doris] branch master updated: [doc]Added auto_broadcast_join_threshold variable description (#11323)

This is an automated email from the ASF dual-hosted git repository.

jiafengzheng pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris.git


The following commit(s) were added to refs/heads/master by this push:
     new 6d0c59d4f0 [doc]Added auto_broadcast_join_threshold variable description  (#11323)
6d0c59d4f0 is described below

commit 6d0c59d4f07be4eb0fa82d056c2ac5b6c0679ec7
Author: jiafeng.zhang <zh...@gmail.com>
AuthorDate: Fri Jul 29 13:05:30 2022 +0800

    [doc]Added auto_broadcast_join_threshold variable description  (#11323)
    
    Add auto_broadcast_join_threshold variable description
---
 docs/en/docs/advanced/variables.md    | 21 +++++++++++++++++++++
 docs/zh-CN/docs/advanced/variables.md | 20 ++++++++++++++++++++
 2 files changed, 41 insertions(+)

diff --git a/docs/en/docs/advanced/variables.md b/docs/en/docs/advanced/variables.md
index a8960c4150..016ab933c1 100644
--- a/docs/en/docs/advanced/variables.md
+++ b/docs/en/docs/advanced/variables.md
@@ -114,6 +114,27 @@ Note that the comment must start with /*+ and can only follow the SELECT.
 
     Used for compatibility with MySQL clients. No practical effect.
     
+* `auto_broadcast_join_threshold`
+
+    The maximum size, in bytes, of a table that will be broadcast to all nodes when a join is performed. Broadcast can be disabled by setting this value to -1.
+
+    The system provides two join implementation methods, `broadcast join` and `shuffle join`.
+
+    `broadcast join` filters the small table by the query conditions, then broadcasts it to each node where the large table is located to build an in-memory hash table. The data of the large table is then streamed through for the hash join.
+
+    `shuffle join` hashes both the small and the large table on the join key, and then performs a distributed join.
+
+    `broadcast join` performs better when the small table holds little data. Otherwise, `shuffle join` performs better.
+
+    The system automatically tries to use a `broadcast join`, or you can explicitly specify the implementation of each join operator. The configurable parameter `auto_broadcast_join_threshold` specifies the upper limit on the ratio of the memory used by the hash table to the overall execution memory when `broadcast join` is used. The value ranges from 0 to 1, and the default value is 0.8. When the memory used by the system to calculate the hash table exceeds this limit,  [...]
+
+    The overall execution memory here is a fraction of what the query optimizer estimates.
+
+    > Note:
+    >
+    > Adjusting this parameter is not recommended. If you must force a particular join method, use a hint instead, such as `join[shuffle]`.
+
+
 * `batch_size`
 
     Used to specify the number of rows of a single packet transmitted by each node during query execution. By default, the number of rows of a packet is 1024 rows. That is, after the source node generates 1024 rows of data, it is packaged and sent to the destination node.
diff --git a/docs/zh-CN/docs/advanced/variables.md b/docs/zh-CN/docs/advanced/variables.md
index 5670c43120..3b938c42a2 100644
--- a/docs/zh-CN/docs/advanced/variables.md
+++ b/docs/zh-CN/docs/advanced/variables.md
@@ -113,6 +113,26 @@ SELECT /*+ SET_VAR(query_timeout = 1, enable_partition_cache=true) */ sleep(3);
 
   Used for compatibility with MySQL clients. No practical effect.
 
+- `auto_broadcast_join_threshold`
+
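+  The maximum size, in bytes, of a table that will be broadcast to all nodes when a join is performed. Broadcast can be disabled by setting this value to -1.
+
+  The system provides two join implementation methods, `broadcast join` and `shuffle join`.
+
+  `broadcast join` filters the small table by the query conditions, then broadcasts it to each node where the large table is located to build an in-memory hash table. The data of the large table is then streamed through for the hash join.
+
+  `shuffle join` hashes both the small and the large table on the join key, and then performs a distributed join.
+
+  `broadcast join` performs better when the small table holds little data. Otherwise, `shuffle join` performs better.
+
+  The system automatically tries to use a `broadcast join`, and the implementation of each join operator can also be specified explicitly. The configurable parameter `auto_broadcast_join_threshold` specifies the upper limit on the ratio of the memory used by the hash table to the overall execution memory when `broadcast join` is used. The value ranges from 0 to 1, and the default value is 0.8. When the system calculates that the memory used by the hash table would exceed this limit, it automatically switches to `shuffle join`.
+
+  The overall execution memory here is a ratio estimated by the query optimizer.
+
+  > Note:
+  >
+  > Adjusting this parameter is not recommended. If you must force a particular join method, use a hint instead, such as `join[shuffle]`.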
+
 - `batch_size`
 
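   Used to specify the number of rows in a single packet transmitted by each node during query execution. By default, a packet contains 1024 rows. That is, after the source node produces 1024 rows of data, it packages them and sends them to the destination node.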

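For readers of this commit, the ratio rule described in the added `auto_broadcast_join_threshold` entry can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Doris's planner code: the function name `choose_join_strategy` and its parameters are hypothetical, and the memory figures stand in for whatever the query optimizer actually estimates.

```python
def choose_join_strategy(hash_table_bytes, exec_mem_bytes, threshold=0.8):
    """Pick a join method from estimated memory use (illustrative only).

    hash_table_bytes: estimated memory for the small table's hash table.
    exec_mem_bytes:   estimated overall execution memory (assumed > 0).
    threshold:        stand-in for auto_broadcast_join_threshold, in [0, 1].
    """
    if not 0 <= threshold <= 1:
        raise ValueError("threshold must be between 0 and 1")
    if exec_mem_bytes <= 0:
        raise ValueError("exec_mem_bytes must be positive")

    # Share of the overall execution memory the hash table would occupy.
    ratio = hash_table_bytes / exec_mem_bytes

    # Broadcast while the hash table stays within the limit;
    # fall back to shuffle once the limit is exceeded.
    return "broadcast" if ratio <= threshold else "shuffle"


if __name__ == "__main__":
    print(choose_join_strategy(100, 1000))  # small hash table
    print(choose_join_strategy(900, 1000))  # hash table above the default limit
```

The sketch treats a ratio exactly equal to the threshold as still eligible for broadcast, since the documentation only describes switching once the limit is exceeded; the real planner may draw that boundary differently.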
