Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2020/04/30 08:26:40 UTC
[GitHub] [incubator-doris] imay commented on a change in pull request #3418: [Spark load] Add spark etl cluster and cluster manager

imay commented on a change in pull request #3418:
URL: https://github.com/apache/incubator-doris/pull/3418#discussion_r417824889



##########
File path: docs/zh-CN/administrator-guide/load-data/spark-load-manual.md
##########
@@ -0,0 +1,351 @@
+---                                                                                 
+{
+    "title": "Spark Load",

Review comment:
       Is "Spark Load" a good name? We may support Hadoop or Hive later, and they would share the same load framework.
   So we should give this feature a more general name, with Spark being only one of the supported methods.

##########
File path: docs/zh-CN/administrator-guide/load-data/spark-load-manual.md
##########
@@ -0,0 +1,351 @@
+---                                                                                 
+{
+    "title": "Spark Load",
+    "language": "zh-CN"
+}
+---  
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Spark Load
+
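+Spark load uses Spark to preprocess the data being imported, which improves Doris's import performance for large data volumes and saves compute resources on the Doris cluster. It is mainly intended for initial migrations and for importing large volumes of data into Doris.
+
+Spark load is an asynchronous import method. Users create a Spark-type import job over the MySQL protocol and check the result with `SHOW LOAD`.
+
+
+
+## Applicable scenarios
+
+* The source data is in a storage system that Spark can access, such as HDFS.
+* The data volume ranges from tens of GB to the TB level.
+
+
+
+## Glossary
+
+1. Frontend (FE): the metadata and scheduling node of the Doris system. In the import flow it is mainly responsible for scheduling import jobs.
+2. Backend (BE): the compute and storage node of the Doris system. In the import flow it is mainly responsible for writing and storing data.
+3. Spark ETL: mainly responsible for the ETL work in the import flow, including global dictionary construction (for BITMAP types), partitioning, sorting, and aggregation.
+4. Broker: an independent, stateless process. It wraps the file system interface and gives Doris the ability to read files from remote storage systems.
+
+
+## Basic principles
+
+### Basic flow
+
+The user submits a Spark-type import job through a MySQL client. The FE records the metadata and returns a success response for the submission.
+
+Execution of a Spark load job is divided into the following 5 stages.
+
+1. The FE schedules and submits the ETL job to the Spark cluster for execution.
+2. The Spark cluster runs the ETL job to preprocess the import data, including global dictionary construction (for BITMAP types), partitioning, sorting, and aggregation.
+3. After the ETL job finishes, the FE obtains the data path of each preprocessed shard and schedules the relevant BEs to execute Push tasks.
+4. The BEs read the data through the Broker and convert it into Doris's underlying storage format.
+5. The FE schedules the version to take effect, completing the import job.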
+
+```
+                 +
+                 | 0. User create spark load job
+            +----v----+
+            |   FE    |---------------------------------+
+            +----+----+                                 |
+                 | 3. FE send push tasks                |
+                 | 5. FE publish version                |
+    +------------+------------+                         |
+    |            |            |                         |
++---v---+    +---v---+    +---v---+                     |
+|  BE   |    |  BE   |    |  BE   |                     |1. FE submit Spark ETL job
++---^---+    +---^---+    +---^---+                     |
+    |4. BE push with broker   |                         |
++---+---+    +---+---+    +---+---+                     |
+|Broker |    |Broker |    |Broker |                     |
++---^---+    +---^---+    +---^---+                     |
+    |            |            |                         |
++---+------------+------------+---+ 2.ETL +-------------v---------------+
+|               HDFS              +------->       Spark cluster         |
+|                                 <-------+                             |
++---------------------------------+       +-----------------------------+
+
+```
+
+
+
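+### Global dictionary
+
+To be added
+
+
+
+### Data preprocessing (DPP)
+
+To be added
+
+
+
+## Basic operations
+
+### Configuring an ETL cluster
+
+Before submitting a Spark import job, you need to configure the Spark cluster that will run the ETL job.
+
+Syntax: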
+
+```sql
+-- Add an ETL cluster
+ALTER SYSTEM ADD LOAD CLUSTER cluster_name
+PROPERTIES("key1" = "value1", ...)
+
+-- Drop an ETL cluster
+ALTER SYSTEM DROP LOAD CLUSTER cluster_name
+
+-- Show ETL clusters
+SHOW LOAD CLUSTERS
+SHOW PROC "/load_etl_clusters"
+```
+
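+`cluster_name` is the name of the Spark cluster as configured in Doris.
+
+PROPERTIES holds the ETL cluster parameters, as follows:
+
+- `type`: cluster type. Required. Currently only spark is supported.
+
+- Parameters for a Spark ETL cluster:
+  - `master`: required. Currently supports yarn and spark://host:port.
+  - `deploy_mode`: optional, defaults to cluster. Supports cluster and client.
+  - `hdfs_etl_path`: the HDFS directory used by ETL. Required. For example: hdfs://host:port/tmp/doris.
+  - `broker`: broker name. Required. It must be configured in advance with the `ALTER SYSTEM ADD BROKER` command.
+  - `yarn_configs`: HDFS and YARN parameters. Required when master is yarn. yarn.resourcemanager.address and fs.defaultFS must be specified. Separate multiple configs with `;`.
+  - `spark_args`: arguments passed when submitting the Spark job. Optional. See the spark-submit command for details. Each arg must start with `--`, and multiple args are separated with `;`. For example: --files=/file1,/file2;--jars=/a.jar,/b.jar.
+  - `spark_configs`: Spark configuration parameters. Optional. See http://spark.apache.org/docs/latest/configuration.html for the available parameters. Separate multiple configs with `;`.
+
+Example: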
+
+```sql
+-- yarn cluster mode
+ALTER SYSTEM ADD LOAD CLUSTER "cluster0"
+PROPERTIES
+(
+"type" = "spark", 
+"master" = "yarn",
+"hdfs_etl_path" = "hdfs://1.1.1.1:801/tmp/doris",
+"broker" = "broker0",
+"yarn_configs" = "yarn.resourcemanager.address=1.1.1.1:800;fs.defaultFS=hdfs://1.1.1.1:801",
+"spark_args" = "--files=/file1,/file2;--jars=/a.jar,/b.jar",
+"spark_configs" = "spark.driver.memory=1g;spark.executor.memory=1g"

Review comment:
       Prefer `spark.args` and `spark.configs`.
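
For illustration only, the yarn example might read as follows with the dotted names. `spark.args` and `spark.configs` are just the names proposed in this comment; nothing in the PR confirms they are accepted syntax, and every other key and value is copied unchanged from the diff.

```sql
-- Sketch only: "spark.args" / "spark.configs" are the proposed names, not confirmed syntax.
ALTER SYSTEM ADD LOAD CLUSTER "cluster0"
PROPERTIES
(
    "type" = "spark",
    "master" = "yarn",
    "hdfs_etl_path" = "hdfs://1.1.1.1:801/tmp/doris",
    "broker" = "broker0",
    "yarn_configs" = "yarn.resourcemanager.address=1.1.1.1:800;fs.defaultFS=hdfs://1.1.1.1:801",
    "spark.args" = "--files=/file1,/file2;--jars=/a.jar,/b.jar",
    "spark.configs" = "spark.driver.memory=1g;spark.executor.memory=1g"
);
```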

##########
File path: docs/zh-CN/administrator-guide/load-data/spark-load-manual.md
##########
@@ -0,0 +1,351 @@
+```sql
+-- yarn cluster mode
+ALTER SYSTEM ADD LOAD CLUSTER "cluster0"
+PROPERTIES
+(
+"type" = "spark", 
+"master" = "yarn",
+"hdfs_etl_path" = "hdfs://1.1.1.1:801/tmp/doris",
+"broker" = "broker0",
+"yarn_configs" = "yarn.resourcemanager.address=1.1.1.1:800;fs.defaultFS=hdfs://1.1.1.1:801",

Review comment:
       "yarn.configs"

##########
File path: docs/zh-CN/administrator-guide/load-data/spark-load-manual.md
##########
@@ -0,0 +1,351 @@
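+PROPERTIES holds the ETL cluster parameters, as follows:
+
+- `type`: cluster type. Required. Currently only spark is supported.
+
+- Parameters for a Spark ETL cluster:
+  - `master`: required. Currently supports yarn and spark://host:port.
+  - `deploy_mode`: optional, defaults to cluster. Supports cluster and client.
+  - `hdfs_etl_path`: the HDFS directory used by ETL. Required. For example: hdfs://host:port/tmp/doris.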

Review comment:
       Better to explain what this path is used for. Also, the `hdfs_` prefix should be removed, because the files may be located in S3 or another external storage path.

##########
File path: docs/zh-CN/administrator-guide/load-data/spark-load-manual.md
##########
@@ -0,0 +1,351 @@
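+PROPERTIES holds the ETL cluster parameters, as follows:
+
+- `type`: cluster type. Required. Currently only spark is supported.
+
+- Parameters for a Spark ETL cluster:
+  - `master`: required. Currently supports yarn and spark://host:port.
+  - `deploy_mode`: optional, defaults to cluster. Supports cluster and client.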

Review comment:
       Whose deploy_mode is this? If it is Spark's deploy mode, better to call it "spark.deploy_mode".
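
For illustration, the standalone client example from the same document might then look like the sketch below. `spark.deploy_mode` is only the name suggested in this comment, not confirmed syntax; the remaining keys and values are taken as-is from the PR.

```sql
-- Sketch only: "spark.deploy_mode" is the suggested name, not confirmed syntax.
ALTER SYSTEM ADD LOAD CLUSTER "cluster1"
PROPERTIES
(
    "type" = "spark",
    "master" = "spark://1.1.1.1:802",
    "spark.deploy_mode" = "client",
    "hdfs_etl_path" = "hdfs://1.1.1.1:801/tmp/doris",
    "broker" = "broker1"
);
```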

##########
File path: docs/zh-CN/administrator-guide/load-data/spark-load-manual.md
##########
@@ -0,0 +1,351 @@
+```sql
+-- yarn cluster mode
+ALTER SYSTEM ADD LOAD CLUSTER "cluster0"
+PROPERTIES
+(
+"type" = "spark", 
+"master" = "yarn",

Review comment:
       What does `master` mean? It is not very clear.

##########
File path: docs/zh-CN/administrator-guide/load-data/spark-load-manual.md
##########
@@ -0,0 +1,351 @@
+```sql
+-- yarn cluster mode
+ALTER SYSTEM ADD LOAD CLUSTER "cluster0"
+PROPERTIES
+(
+"type" = "spark", 
+"master" = "yarn",
+"hdfs_etl_path" = "hdfs://1.1.1.1:801/tmp/doris",
+"broker" = "broker0",
+"yarn_configs" = "yarn.resourcemanager.address=1.1.1.1:800;fs.defaultFS=hdfs://1.1.1.1:801",
+"spark_args" = "--files=/file1,/file2;--jars=/a.jar,/b.jar",
+"spark_configs" = "spark.driver.memory=1g;spark.executor.memory=1g"
+);
+
+-- spark standalone client mode
+ALTER SYSTEM ADD LOAD CLUSTER "cluster1"
+PROPERTIES
+(
+ "type" = "spark", 
+ "master" = "spark://1.1.1.1:802",
+ "deploy_mode" = "client",
+ "hdfs_etl_path" = "hdfs://1.1.1.1:801/tmp/doris",
+ "broker" = "broker1"
+);
+```
+
+
+
+### Creating an import job
+
+Syntax:
+
+```sql
+LOAD LABEL load_label 
+    (data_desc, ...)
+    WITH CLUSTER cluster_name cluster_properties
+    [PROPERTIES (key1=value1, ... )]
+
+* load_label:
+	db_name.label_name
+
+* data_desc:
+    DATA INFILE ('file_path', ...)
+    [NEGATIVE]
+    INTO TABLE tbl_name
+    [PARTITION (p1, p2)]
+    [COLUMNS TERMINATED BY separator ]
+    [(col1, ...)]
+    [SET (k1=f1(xx), k2=f2(xx))]
+    [WHERE predicate]
+
+* cluster_properties: 
+    (key2=value2, ...)
+```
+Example:
+
+```sql
+LOAD LABEL db1.label1
+(
+    DATA INFILE("hdfs://abc.com:8888/user/palo/test/ml/file1")
+    INTO TABLE tbl1
+    COLUMNS TERMINATED BY ","
+    (tmp_c1,tmp_c2)
+    SET
+    (
+        id=tmp_c2,
+        name=tmp_c1
+    ),
+    DATA INFILE("hdfs://abc.com:8888/user/palo/test/ml/file2")
+    INTO TABLE tbl2
+    COLUMNS TERMINATED BY ","
+    (col1, col2)
+    WHERE col1 > 1
+)
+WITH CLUSTER 'cluster0'
+(
+    "broker.username"="user",
+    "broker.password"="pass"
+)
+PROPERTIES
+(
+    "timeout" = "3600"
+);
+
+```
+
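+For the detailed syntax for creating an import job, run ```HELP SPARK LOAD``` to see the syntax help. This section mainly describes the meaning of the parameters in the Spark load creation syntax and points to note.
+
+#### Label
+
+The identifier of an import job. Each import job has a Label that is unique within a single database. The rules are the same as for `Broker Load`.
+
+#### Data description parameters
+
+Currently supported data sources are CSV and Hive tables. The other rules are the same as for `Broker Load`.
+
+#### Import job parameters
+
+Import job parameters are the parameters in the ```opt_properties``` part of the Spark load creation statement. They apply to the whole import job. The rules are the same as for `Broker Load`.
+
+#### Cluster parameters
+
+An ETL cluster must be configured in the Doris system in advance before Spark load can be used.
+
+When a user has a temporary need, such as changing Spark configs to give a job more resources, the settings can be specified here. They take effect only for the current job and do not affect the configuration already stored in the Doris cluster.
+
+In addition, to pass extra Broker parameters, specify them as "broker.key" = "value". See the [Broker documentation](../broker.md) for the specific parameters. For example, to specify a username and password: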
+
+```sql
+WITH CLUSTER 'cluster0'
+(
+    "spark_configs" = "spark.driver.memory=1g;spark.executor.memory=1g",

Review comment:
       spark.configs
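
For illustration, a per-job override in the load statement might then be written as in the sketch below. `spark.configs` is only the name suggested in this comment, not confirmed syntax; the `broker.username` and `broker.password` keys are the ones used in the load example earlier in the same document.

```sql
-- Sketch only: "spark.configs" is the suggested name, not confirmed syntax.
WITH CLUSTER 'cluster0'
(
    "spark.configs" = "spark.driver.memory=1g;spark.executor.memory=1g",
    "broker.username" = "user",
    "broker.password" = "pass"
)
```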

##########
File path: docs/zh-CN/administrator-guide/load-data/spark-load-manual.md
##########
@@ -0,0 +1,351 @@
+---                                                                                 
+{
+    "title": "Spark Load",
+    "language": "zh-CN"
+}
+---  
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Spark Load
+
+Spark load 通过 Spark 实现对导入数据的预处理，提高 Doris 大数据量的导入性能并且节省 Doris 集群的计算资源。主要用于初次迁移，大数据量导入 Doris 的场景。
+
+Spark load 是一种异步导入方式，用户需要通过 MySQL 协议创建 Spark 类型导入任务，并通过 `SHOW LOAD` 查看导入结果。
+
+
+
+## 适用场景
+
+* 源数据在 Spark 可以访问的存储系统中，如 HDFS。
+* 数据量在 几十 GB 到 TB 级别。
+
+
+
+## 名词解释
+
+1. Frontend（FE）：Doris 系统的元数据和调度节点。在导入流程中主要负责导入任务的调度工作。
+2. Backend（BE）：Doris 系统的计算和存储节点。在导入流程中主要负责数据写入及存储。
+3. Spark ETL：在导入流程中主要负责数据的 ETL 工作，包括全局字典构建（BITMAP类型）、分区、排序、聚合等。
+4. Broker：Broker 为一个独立的无状态进程。封装了文件系统接口，提供 Doris 读取远端存储系统中文件的能力。
+
+
+## 基本原理
+
+### 基本流程
+
+用户通过 MySQL 客户端提交 Spark 类型导入任务，FE记录元数据并返回用户提交成功。
+
+Spark load 任务的执行主要分为以下5个阶段。
+
+1. FE 调度提交 ETL 任务到 Spark 集群执行。
+2. Spark 集群执行 ETL 完成对导入数据的预处理。包括全局字典构建（BITMAP类型）、分区、排序、聚合等。
+3. ETL 任务完成后，FE 获取预处理过的每个分片的数据路径，并调度相关的 BE 执行 Push 任务。
+4. BE 通过 Broker 读取数据，转化为 Doris 底层存储格式。
+5. FE 调度生效版本，完成导入任务。
+
+```
+                 +
+                 | 0. User create spark load job
+            +----v----+
+            |   FE    |---------------------------------+
+            +----+----+                                 |
+                 | 3. FE send push tasks                |
+                 | 5. FE publish version                |
+    +------------+------------+                         |
+    |            |            |                         |
++---v---+    +---v---+    +---v---+                     |
+|  BE   |    |  BE   |    |  BE   |                     |1. FE submit Spark ETL job
++---^---+    +---^---+    +---^---+                     |
+    |4. BE push with broker   |                         |
++---+---+    +---+---+    +---+---+                     |
+|Broker |    |Broker |    |Broker |                     |
++---^---+    +---^---+    +---^---+                     |
+    |            |            |                         |
++---+------------+------------+---+ 2.ETL +-------------v---------------+
+|               HDFS              +------->       Spark cluster         |
+|                                 <-------+                             |
++---------------------------------+       +-----------------------------+
+
+```
+
+
+
+### Global dictionary
+
+To be added
+
+
+
+### Data preprocessing (DPP)
+
+To be added
+
+
+
+## Basic operations
+
+### Configuring the ETL cluster
+
+Before submitting a Spark load job, you need to configure the Spark cluster that runs the ETL job.
+
+Syntax:
+
+```sql
+-- Add an ETL cluster
+ALTER SYSTEM ADD LOAD CLUSTER cluster_name
+PROPERTIES("key1" = "value1", ...)
+
+-- Drop an ETL cluster
+ALTER SYSTEM DROP LOAD CLUSTER cluster_name
+
+-- Show ETL clusters
+SHOW LOAD CLUSTERS
+SHOW PROC "/load_etl_clusters"
+```
+
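+`cluster_name` is the name of the Spark cluster as configured in Doris.
+
+PROPERTIES holds the ETL cluster parameters, as follows:
+
+- `type`: cluster type. Required. Currently only spark is supported.
+
+- Parameters for a Spark ETL cluster:
+  - `master`: required. Currently yarn and spark://host:port are supported.
+  - `deploy_mode`: optional; defaults to cluster. Supports cluster and client.
+  - `hdfs_etl_path`: the HDFS directory used by ETL. Required. For example: hdfs://host:port/tmp/doris.
+  - `broker`: broker name. Required. It must be configured in advance with the `ALTER SYSTEM ADD BROKER` command.
+  - `yarn_configs`: HDFS and YARN parameters. Required when master is yarn. yarn.resourcemanager.address and fs.defaultFS must be specified. Separate multiple configs with `;`.
+  - `spark_args`: arguments passed when the Spark job is submitted. Optional. See the spark-submit command for details. Each arg must start with `--`, and multiple args are separated with `;`. For example: --files=/file1,/file2;--jars=/a.jar,/b.jar.
+  - `spark_configs`: Spark parameters. Optional. See http://spark.apache.org/docs/latest/configuration.html for the available parameters. Separate multiple configs with `;`.
+
+Example: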
+
+```sql
+-- yarn cluster mode
+ALTER SYSTEM ADD LOAD CLUSTER "cluster0"
+PROPERTIES
+(
+"type" = "spark", 
+"master" = "yarn",
+"hdfs_etl_path" = "hdfs://1.1.1.1:801/tmp/doris",
+"broker" = "broker0",

Review comment:
       Is this broker necessary to define a cluster? I think the user can specify it in the `Load` statement, which keeps it consistent with the current "with broker" syntax.
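
For reference, the existing broker load statement names the broker in the load statement itself, roughly as sketched below. The database, label, table, file path, and credential values are placeholders and are not taken from this PR; how a Spark load statement would adopt the same clause is left open here.

```sql
-- Sketch of the existing broker load form; every name and value is a placeholder.
LOAD LABEL example_db.example_label
(
    DATA INFILE("hdfs://host:port/path/to/file")
    INTO TABLE example_tbl
    COLUMNS TERMINATED BY ","
)
-- The broker is chosen per load job here, not as part of a cluster definition.
WITH BROKER "broker0"
(
    "username" = "hdfs_user",
    "password" = "hdfs_password"
)
PROPERTIES
(
    "timeout" = "3600"
);
```

With this form, the broker would stay a per-job choice and the cluster definition would only carry Spark-side settings.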

##########
File path: docs/zh-CN/administrator-guide/load-data/spark-load-manual.md
##########
@@ -0,0 +1,351 @@
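+## Basic operations
+
+### Configuring the ETL cluster
+
+Before submitting a Spark load job, you need to configure the Spark cluster that runs the ETL job.
+
+Syntax:
+
+```sql
+-- Add an ETL cluster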
+ALTER SYSTEM ADD LOAD CLUSTER cluster_name

Review comment:
       Is "cluster" a good name?
   I think Doris will support multiple clusters in the future, and some of those clusters will be used as load clusters. The two meanings of "cluster" would then conflict.
   So we had better choose another name.
   

##########
File path: docs/zh-CN/administrator-guide/load-data/spark-load-manual.md
##########
@@ -0,0 +1,351 @@
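+## Basic operations
+
+### Configuring the ETL cluster
+
+Before submitting a Spark load job, you need to configure the Spark cluster that runs the ETL job.
+
+Syntax:
+
+```sql
+-- Add an ETL cluster
+ALTER SYSTEM ADD LOAD CLUSTER cluster_name
+PROPERTIES("key1" = "value1", ...)
+
+-- Drop an ETL cluster
+ALTER SYSTEM DROP LOAD CLUSTER cluster_name
+
+-- Show ETL clusters
+SHOW LOAD CLUSTERS
+SHOW PROC "/load_etl_clusters"
+```
+
+`cluster_name` is the name of the Spark cluster as configured in Doris.
+
+PROPERTIES holds the ETL cluster parameters, as follows:
+
+- `type`: cluster type. Required. Currently only spark is supported.
+
+- Parameters for a Spark ETL cluster:
+  - `master`: required. Currently yarn and spark://host:port are supported.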

Review comment:
       This option should be explained more clearly; I don't know what `master` stands for.
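
For context, the two values listed for `master` have the shape of Spark master URLs, the kind passed to `spark-submit --master`: `yarn` targets a YARN-managed cluster, and `spark://host:port` targets a Spark standalone master. A minimal sketch of the standalone variant in the syntax proposed by this PR follows; the cluster name, addresses, and ports are hypothetical, and the property set is assumed from the parameter list quoted above.

```sql
-- Hypothetical standalone-mode cluster; names, hosts, and ports are illustrative only.
ALTER SYSTEM ADD LOAD CLUSTER "cluster1"
PROPERTIES
(
"type" = "spark",
-- URL of a Spark standalone master (7077 is Spark's default standalone port)
"master" = "spark://1.1.1.1:7077",
"hdfs_etl_path" = "hdfs://1.1.1.1:801/tmp/doris",
"broker" = "broker0"
);
```

Per the quoted parameter list, `yarn_configs` is only required when `master` is `yarn`, so it is omitted in this sketch.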



