Posted to commits@doris.apache.org by mo...@apache.org on 2022/10/23 14:52:04 UTC

[doris] branch master updated: [doc](random_sink) Add some doc content about random sink (#13577)

This is an automated email from the ASF dual-hosted git repository.

morningman pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris.git


The following commit(s) were added to refs/heads/master by this push:
     new 87864e40bf [doc](random_sink) Add some doc content about random sink (#13577)
87864e40bf is described below

commit 87864e40bfc450ca81cadf4f8784a9b1ac8bb23c
Author: caiconghui <55...@users.noreply.github.com>
AuthorDate: Sun Oct 23 22:51:56 2022 +0800

    [doc](random_sink) Add some doc content about random sink (#13577)
    
    1. Add some doc content about random sink
    2. Fix bug of showing missing rowsets info
---
 be/src/olap/tablet.cpp                             |  2 +-
 docs/en/docs/data-table/data-partition.md          |  9 ++++--
 .../Create/CREATE-TABLE.md                         | 11 ++++++-
 .../Load/BROKER-LOAD.md                            |  8 +++++
 .../Load/CREATE-ROUTINE-LOAD.md                    |  6 ++++
 .../Load/STREAM-LOAD.md                            |  3 +-
 docs/zh-CN/docs/data-table/data-partition.md       |  8 ++++-
 .../Create/CREATE-TABLE.md                         | 13 ++++++--
 .../Load/BROKER-LOAD.md                            | 36 +++++++++++++---------
 .../Load/CREATE-ROUTINE-LOAD.md                    |  8 +++++
 .../Load/STREAM-LOAD.md                            |  3 +-
 11 files changed, 84 insertions(+), 23 deletions(-)

diff --git a/be/src/olap/tablet.cpp b/be/src/olap/tablet.cpp
index 2798b82f74..4524f1e89f 100644
--- a/be/src/olap/tablet.cpp
+++ b/be/src/olap/tablet.cpp
@@ -1246,7 +1246,7 @@ void Tablet::get_compaction_status(std::string* json_result) {
         if (ver.first != last_version + 1) {
             rapidjson::Value miss_value;
             miss_value.SetString(
-                    strings::Substitute("[$0-$1]", last_version + 1, ver.first).c_str(),
+                    strings::Substitute("[$0-$1]", last_version + 1, ver.first - 1).c_str(),
                     missing_versions_arr.GetAllocator());
             missing_versions_arr.PushBack(miss_value, missing_versions_arr.GetAllocator());
         }
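
The `tablet.cpp` hunk above corrects an off-by-one in the missing-rowsets report: when the previous rowset ends at `last_version` and the next one starts at `ver.first`, the versions actually absent are `last_version + 1` through `ver.first - 1` inclusive. A minimal Python sketch of the corrected range computation (illustrative only; the function name and the list-of-ranges model are assumptions, not Doris code):

```python
def missing_version_ranges(rowset_versions):
    """Return the inclusive [lo, hi] version gaps between consecutive
    rowset version ranges, using the corrected upper bound (start - 1).

    rowset_versions: sorted list of (start, end) version ranges.
    """
    missing = []
    last_version = rowset_versions[0][1]
    for start, end in rowset_versions[1:]:
        if start != last_version + 1:
            # The gap spans last_version+1 .. start-1 inclusive; reporting
            # `start` instead of `start - 1` was the bug fixed above.
            missing.append((last_version + 1, start - 1))
        last_version = end
    return missing

# Rowsets [0-5], [6-9], [12-12]: versions 10 and 11 are missing.
print(missing_version_ranges([(0, 5), (6, 9), (12, 12)]))  # [(10, 11)]
```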
diff --git a/docs/en/docs/data-table/data-partition.md b/docs/en/docs/data-table/data-partition.md
index d5bebc709f..8cc4102433 100644
--- a/docs/en/docs/data-table/data-partition.md
+++ b/docs/en/docs/data-table/data-partition.md
@@ -332,14 +332,19 @@ It is also possible to use only one layer of partitioning. When using a layer pa
    * Some examples: suppose there are 10 BEs, each with one disk. For a table with a total size of 500MB, 4-8 tablets are enough. 5GB: 8-16 tablets. 50GB: 32 tablets. 500GB: partitioning is recommended, with each partition about 50GB in size and 16-32 tablets per partition. 5TB: partitioning is recommended, with each partition about 50GB in size and 16-32 tablets per partition.
    
    > Note: The amount of data in a table can be viewed with the [show data](../sql-manual/sql-reference/Show-Statements/SHOW-DATA.md) command. Dividing the result by the number of replicas gives the amount of data in the table.
-    
+
+4. About the settings and usage scenarios of Random Distribution.
+
+    * If the OLAP table has no columns of REPLACE aggregation type, setting the table's data bucketing mode to RANDOM avoids severe data skew (when data is loaded into a partition of the table, each batch of a single load task randomly selects a tablet to write to).
+    * When the bucketing mode of the table is set to RANDOM, there is no bucketing column, so queries cannot be pruned to a few buckets based on bucketing column values; a query on the table scans all buckets of the hit partitions at once. This setting is suitable for aggregate analysis of the table data as a whole, but not for high-concurrency point queries.
+    * If the data distribution of the OLAP table is Random Distribution, you can enable single-tablet load mode (set `load_to_single_tablet` to true) when importing data. When importing large amounts of data, each task then writes to only one tablet of the corresponding partition, which improves the concurrency and throughput of data import, reduces the write amplification caused by loading and compaction, and helps keep the cluster stable.
 
 #### Compound Partitions vs Single Partitions
 
 Compound Partitions
 
 - The first level is called Partition, which is partition. Users can specify a dimension column as a partition column (currently only columns of integer and time types are supported), and specify the value range of each partition.
-- The second level is called Distribution, which means bucketing. Users can specify one or more dimension columns and the number of buckets to perform HASH distribution on the data.
+- The second level is called Distribution, which means bucketing. Users can specify one or more dimension columns and a bucket count to distribute the data by HASH, or specify no bucketing column and use Random Distribution to distribute the data randomly.
 
 Composite partitions are recommended for the following scenarios
 
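
The skew point in the random-distribution notes above can be sketched with a toy simulation. This is a simplified, hypothetical model (random selection per row rather than per batch, and Python's built-in `hash` standing in for the real bucketing function), not Doris code:

```python
import random
from collections import Counter

NUM_BUCKETS = 4

def hash_bucket(key: str) -> int:
    # Hash distribution: every row with the same key lands in the same bucket.
    return hash(key) % NUM_BUCKETS

def random_bucket(rng: random.Random) -> int:
    # Random distribution: the bucket is chosen independently of the data.
    return rng.randrange(NUM_BUCKETS)

# A skewed workload: 90% of the rows share one hot key.
rows = ["hot_key"] * 90 + [f"key_{i}" for i in range(10)]
rng = random.Random(0)

hash_counts = Counter(hash_bucket(k) for k in rows)
rand_counts = Counter(random_bucket(rng) for _ in rows)

# The hot key pins at least 90 rows onto a single bucket under hashing...
assert max(hash_counts.values()) >= 90
# ...while random selection spreads the load roughly evenly (about 100/4 each).
print(sorted(rand_counts.values()))
```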
diff --git a/docs/en/docs/sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-TABLE.md b/docs/en/docs/sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-TABLE.md
index 007d56c408..f1043c8644 100644
--- a/docs/en/docs/sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-TABLE.md
+++ b/docs/en/docs/sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-TABLE.md
@@ -230,7 +230,16 @@ distribution_desc
 
     Define the data bucketing method.
 
-    `DISTRIBUTED BY HASH (k1[,k2 ...]) [BUCKETS num]`
+    1) Hash
+       Syntax:
+       `DISTRIBUTED BY HASH (k1[,k2 ...]) [BUCKETS num]`
+       Explain:
+       Hash bucketing using the specified key column.
+    2) Random
+       Syntax:
+       `DISTRIBUTED BY RANDOM [BUCKETS num]`
+       Explain:
+       Use random numbers for bucketing.
 
 * `rollup_list`
 
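
To illustrate the practical difference between the two `distribution_desc` forms added above, here is a small Python sketch (a toy model; Doris's actual hash function differs) showing that hash bucketing is deterministic in the key columns while random bucketing ignores them:

```python
import random

def hash_bucket(key_values: tuple, num_buckets: int) -> int:
    # DISTRIBUTED BY HASH(k1[, k2 ...]): the bucket is derived from the key
    # columns, so identical keys always map to the same tablet.
    return hash(key_values) % num_buckets

def random_bucket(num_buckets: int) -> int:
    # DISTRIBUTED BY RANDOM: the bucket is chosen without looking at the row.
    return random.randrange(num_buckets)

b1 = hash_bucket(("user_1", 20221023), 16)
b2 = hash_bucket(("user_1", 20221023), 16)
assert b1 == b2 and 0 <= b1 < 16  # same key columns -> same bucket
assert 0 <= random_bucket(16) < 16
```

The determinism of the hash form is what makes bucket pruning possible for point queries; the random form trades that away for skew-free writes.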
diff --git a/docs/en/docs/sql-manual/sql-reference/Data-Manipulation-Statements/Load/BROKER-LOAD.md b/docs/en/docs/sql-manual/sql-reference/Data-Manipulation-Statements/Load/BROKER-LOAD.md
index 2b6dedb31a..f8901e13b5 100644
--- a/docs/en/docs/sql-manual/sql-reference/Data-Manipulation-Statements/Load/BROKER-LOAD.md
+++ b/docs/en/docs/sql-manual/sql-reference/Data-Manipulation-Statements/Load/BROKER-LOAD.md
@@ -173,6 +173,14 @@ WITH BROKER broker_name
    It allows the user to set the parallelism of the load execution plan
    on a single node when the broker load is submitted. The default value is 1.
 
+  - `send_batch_parallelism`
+  
+    Used to set the parallelism for sending batches. If this value exceeds `max_send_batch_parallelism_per_job` in the BE configuration, the coordinator BE will use the value of `max_send_batch_parallelism_per_job` instead.
+    
+  - `load_to_single_tablet`
+  
+    Boolean type. True means that one task loads data to only one tablet of the corresponding partition at a time. The default value is false. The number of tasks for the job depends on the overall concurrency. This parameter can only be set when loading data into an OLAP table with random bucketing.
+
 ### Example
 
 1. Import a batch of data from HDFS
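
The effect of `load_to_single_tablet` described above can be sketched with a toy cost model. This is an assumption-laden illustration (the batch counts and the notion of a "write stream" are invented for the sketch), not a description of Doris internals:

```python
import random

def distinct_write_streams(num_tasks: int, batches_per_task: int,
                           num_tablets: int, single_tablet: bool,
                           rng: random.Random) -> int:
    """Count distinct (task, tablet) pairs that receive data: each pair is a
    stream of small writes that compaction later has to merge."""
    streams = set()
    for task in range(num_tasks):
        if single_tablet:
            # load_to_single_tablet=true: the task picks one tablet and
            # appends every batch to it.
            streams.add((task, rng.randrange(num_tablets)))
        else:
            # default: every batch independently picks a random tablet,
            # scattering small writes across many tablets.
            for _ in range(batches_per_task):
                streams.add((task, rng.randrange(num_tablets)))
    return len(streams)

rng = random.Random(7)
print(distinct_write_streams(4, 100, 32, single_tablet=False, rng=rng))  # typically well over 100
print(distinct_write_streams(4, 100, 32, single_tablet=True, rng=rng))   # 4
```

Under this model, fewer distinct streams means larger contiguous writes per tablet and less compaction work, which is the write-amplification argument made in the docs.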
diff --git a/docs/en/docs/sql-manual/sql-reference/Data-Manipulation-Statements/Load/CREATE-ROUTINE-LOAD.md b/docs/en/docs/sql-manual/sql-reference/Data-Manipulation-Statements/Load/CREATE-ROUTINE-LOAD.md
index c6ec707cf9..7fc7401853 100644
--- a/docs/en/docs/sql-manual/sql-reference/Data-Manipulation-Statements/Load/CREATE-ROUTINE-LOAD.md
+++ b/docs/en/docs/sql-manual/sql-reference/Data-Manipulation-Statements/Load/CREATE-ROUTINE-LOAD.md
@@ -218,6 +218,12 @@ FROM data_source [data_source_properties]
      When the import data format is json, you can specify the root node of the Json data through json_root. Doris will extract the elements of the root node through json_root for parsing. Default is empty.
 
      `-H "json_root: $.RECORDS"`
+  10. `send_batch_parallelism`
+     
+     Integer. Used to set the parallelism for sending batches. If this value exceeds `max_send_batch_parallelism_per_job` in the BE configuration, the coordinator BE will use the value of `max_send_batch_parallelism_per_job` instead.
+  
+  11. `load_to_single_tablet`
+
+      Boolean type. True means that one task loads data to only one tablet of the corresponding partition at a time. The default value is false. This parameter can only be set when loading data into an OLAP table with random bucketing.
 
 - `FROM data_source [data_source_properties]`
 
diff --git a/docs/en/docs/sql-manual/sql-reference/Data-Manipulation-Statements/Load/STREAM-LOAD.md b/docs/en/docs/sql-manual/sql-reference/Data-Manipulation-Statements/Load/STREAM-LOAD.md
index a976fed117..07efc31ae9 100644
--- a/docs/en/docs/sql-manual/sql-reference/Data-Manipulation-Statements/Load/STREAM-LOAD.md
+++ b/docs/en/docs/sql-manual/sql-reference/Data-Manipulation-Statements/Load/STREAM-LOAD.md
@@ -144,7 +144,8 @@ separated by commas.
           The system will use the order specified by the user. In the case above, the data should end
           with __DORIS_SEQUENCE_COL__.
        ```
-
+23. load_to_single_tablet: Boolean type. True means that one task loads data to only one tablet of the corresponding partition at a time. The default value is false. This parameter can only be set when loading data into an OLAP table with random bucketing.
+    
     RETURN VALUES
         After the import is complete, the related content of this import will be returned in Json format. Currently includes the following fields
         Status: Import the last status.
diff --git a/docs/zh-CN/docs/data-table/data-partition.md b/docs/zh-CN/docs/data-table/data-partition.md
index faa9d9f7ad..fc4ae78acc 100644
--- a/docs/zh-CN/docs/data-table/data-partition.md
+++ b/docs/zh-CN/docs/data-table/data-partition.md
@@ -338,12 +338,18 @@ Doris supports two levels of data partitioning. The first level is Partition, supporting Range and Li
 
    > Note: The amount of data in a table can be viewed with the [`SHOW DATA`](../sql-manual/sql-reference/Show-Statements/SHOW-DATA.md) command. Dividing the result by the number of replicas gives the amount of data in the table.
 
+4. **Settings and usage scenarios of Random Distribution.**
+    - If the OLAP table has no columns of REPLACE aggregation type, setting the table's data bucketing mode to RANDOM avoids severe data skew (when data is loaded into a partition of the table, each batch of a single load task randomly selects a tablet to write to).
+    - When the bucketing mode of the table is set to RANDOM, there is no bucketing column, so queries cannot be pruned to a few buckets based on bucketing column values; a query on the table scans all buckets of the hit partitions at once. This setting is suitable for aggregate analysis of the table data as a whole, but not for high-concurrency point queries.
+    - If the data distribution of the OLAP table is Random Distribution, you can enable single-tablet load mode (set `load_to_single_tablet` to true) when importing data. When importing large amounts of data, each task then writes to only one tablet of the corresponding partition, which improves the concurrency and throughput of data import, reduces the write amplification caused by loading and compaction, and helps keep the cluster stable.
+
#### Compound Partitions vs Single Partitions

Compound Partitions

- The first level is called Partition, i.e. partitioning. Users can specify a dimension column as the partition column (currently only integer and time type columns are supported) and specify the value range of each partition.
-- The second level is called Distribution, i.e. bucketing. Users can specify one or more dimension columns and a bucket count to distribute the data by HASH.
+- The second level is called Distribution, i.e. bucketing. Users can specify one or more dimension columns and a bucket count to distribute the data by HASH, or specify no bucketing column and use Random Distribution to distribute the data randomly.

Compound partitions are recommended for the following scenarios
 
diff --git a/docs/zh-CN/docs/sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-TABLE.md b/docs/zh-CN/docs/sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-TABLE.md
index fea9503fc3..ce32b80816 100644
--- a/docs/zh-CN/docs/sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-TABLE.md
+++ b/docs/zh-CN/docs/sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-TABLE.md
@@ -147,7 +147,7 @@ distribution_desc
         v4 INT SUM NOT NULL DEFAULT "1" COMMENT "This is column v4"
         ```
     
-*  `index_definition_list`
+* `index_definition_list`
 
    Index list definition:
     
@@ -231,7 +231,16 @@ distribution_desc
   
    Define the data bucketing method.
 
-    `DISTRIBUTED BY HASH (k1[,k2 ...]) [BUCKETS num]`
+    1) Hash bucketing
+       Syntax:
+          `DISTRIBUTED BY HASH (k1[,k2 ...]) [BUCKETS num]`
+       Explanation:
+          Hash bucketing using the specified key columns.
+    2) Random bucketing
+       Syntax:
+          `DISTRIBUTED BY RANDOM [BUCKETS num]`
+       Explanation:
+          Use random numbers for bucketing.
 
 * `rollup_list`
 
diff --git a/docs/zh-CN/docs/sql-manual/sql-reference/Data-Manipulation-Statements/Load/BROKER-LOAD.md b/docs/zh-CN/docs/sql-manual/sql-reference/Data-Manipulation-Statements/Load/BROKER-LOAD.md
index be6a861e10..83459c322c 100644
--- a/docs/zh-CN/docs/sql-manual/sql-reference/Data-Manipulation-Statements/Load/BROKER-LOAD.md
+++ b/docs/zh-CN/docs/sql-manual/sql-reference/Data-Manipulation-Statements/Load/BROKER-LOAD.md
@@ -144,33 +144,41 @@ WITH BROKER broker_name
   )
   ```
 
-- `load_properties`
+  - `load_properties`
 
-  Specifies the parameters related to the import. The following parameters are currently supported:
+    Specifies the parameters related to the import. The following parameters are currently supported:
 
-  - `timeout`
+    - `timeout`
 
-    Import timeout. The default is 4 hours. In seconds.
+      Import timeout. The default is 4 hours. In seconds.
 
-  - `max_filter_ratio`
+    - `max_filter_ratio`
 
-    The maximum tolerated ratio of filterable data (filtered for reasons such as irregular data). Zero tolerance by default. The value range is 0 to 1.
+      The maximum tolerated ratio of filterable data (filtered for reasons such as irregular data). Zero tolerance by default. The value range is 0 to 1.
 
-  - `exec_mem_limit`
+    - `exec_mem_limit`
 
-    Import memory limit. The default is 2GB. In bytes.
+      Import memory limit. The default is 2GB. In bytes.
 
-  - `strict_mode`
+    - `strict_mode`
 
-    Whether to apply strict restrictions to the data. The default is false.
+      Whether to apply strict restrictions to the data. The default is false.
 
-  - `timezone`
+    - `timezone`
 
-    Specifies the time zone for functions affected by time zones, such as `strftime/alignment_timestamp/from_unixtime`; see the [time zone](../../../../advanced/time-zone) documentation for details. If not specified, the "Asia/Shanghai" time zone is used
+      Specifies the time zone for functions affected by time zones, such as `strftime/alignment_timestamp/from_unixtime`; see the [time zone](../../../../advanced/time-zone) documentation for details. If not specified, the "Asia/Shanghai" time zone is used
 
-  - `load_parallelism`
+    - `load_parallelism`
 
-    Import parallelism, 1 by default. Increasing it starts multiple execution plans to perform the import task concurrently, speeding up the import.
+      Import parallelism, 1 by default. Increasing it starts multiple execution plans to perform the import task concurrently, speeding up the import.
+
+    - `send_batch_parallelism`
+
+      Used to set the parallelism for sending batches. If this value exceeds `max_send_batch_parallelism_per_job` in the BE configuration, the coordinator BE will use the value of `max_send_batch_parallelism_per_job` instead.
+
+    - `load_to_single_tablet`
+
+      Boolean type. True means that one task loads data to only one tablet of the corresponding partition at a time. The default value is false. The number of tasks for the job depends on the overall concurrency. This parameter can only be set when loading data into an OLAP table with random bucketing.
 
 ### Example
 
diff --git a/docs/zh-CN/docs/sql-manual/sql-reference/Data-Manipulation-Statements/Load/CREATE-ROUTINE-LOAD.md b/docs/zh-CN/docs/sql-manual/sql-reference/Data-Manipulation-Statements/Load/CREATE-ROUTINE-LOAD.md
index 6ced9ee91e..a0d86f0000 100644
--- a/docs/zh-CN/docs/sql-manual/sql-reference/Data-Manipulation-Statements/Load/CREATE-ROUTINE-LOAD.md
+++ b/docs/zh-CN/docs/sql-manual/sql-reference/Data-Manipulation-Statements/Load/CREATE-ROUTINE-LOAD.md
@@ -219,6 +219,14 @@ FROM data_source [data_source_properties]
     When the import data format is json, you can specify the root node of the Json data through json_root. Doris will extract the elements of the root node through json_root for parsing. Default is empty.
 
      `-H "json_root: $.RECORDS"`
+  
+  10. `send_batch_parallelism`
+
+      Integer. Used to set the parallelism for sending batches. If this value exceeds `max_send_batch_parallelism_per_job` in the BE configuration, the coordinator BE will use the value of `max_send_batch_parallelism_per_job` instead.
+
+  11. `load_to_single_tablet`
+
+      Boolean type. True means that one task loads data to only one tablet of the corresponding partition at a time. The default value is false. This parameter can only be set when loading data into an OLAP table with random bucketing.
 
 - `FROM data_source [data_source_properties]`
 
diff --git a/docs/zh-CN/docs/sql-manual/sql-reference/Data-Manipulation-Statements/Load/STREAM-LOAD.md b/docs/zh-CN/docs/sql-manual/sql-reference/Data-Manipulation-Statements/Load/STREAM-LOAD.md
index 444513fdfc..103640934c 100644
--- a/docs/zh-CN/docs/sql-manual/sql-reference/Data-Manipulation-Statements/Load/STREAM-LOAD.md
+++ b/docs/zh-CN/docs/sql-manual/sql-reference/Data-Manipulation-Statements/Load/STREAM-LOAD.md
@@ -126,7 +126,7 @@ curl --location-trusted -u user:passwd [-H ""...] -T data.file -XPUT http://fe_h
17. delete: Only meaningful under MERGE; specifies the deletion condition for the data
        function_column.sequence_col: Only applies to UNIQUE_KEYS tables. Under the same key columns, it ensures that value columns are replaced according to the source_sequence column; source_sequence can be a column in the data source or a column in the table schema.
    
-18. fuzzy_parse: Boolean type. True means the json will be parsed with the first row as the schema. Enabling this option can improve json import efficiency, but requires the keys of all json objects to be in the same order as in the first row. The default is false. Only for the json format
+18. fuzzy_parse: Boolean type. True means the json will be parsed with the first row as the schema. Enabling this option can improve json import efficiency, but requires the keys of all json objects to be in the same order as in the first row. The default is false. Only for the json format
    
19. num_as_string: Boolean type. True means that numeric types are converted to strings when parsing json data; the import then proceeds while ensuring no loss of precision.
    
@@ -139,6 +139,7 @@ curl --location-trusted -u user:passwd [-H ""...] -T data.file -XPUT http://fe_h
           hidden_columns: __DORIS_DELETE_SIGN__,__DORIS_SEQUENCE_COL__
           The system will import data in the order specified by the user. In the case above, the last column of the imported data is __DORIS_SEQUENCE_COL__.
       ```
+23. load_to_single_tablet: Boolean type. True means that one task loads data to only one tablet of the corresponding partition at a time. The default value is false. This parameter can only be set when loading data into an OLAP table with random bucketing.
 
     RETURN VALUES
        After the import is complete, the related content of this import will be returned in Json format. It currently includes the following fields


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org