You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2020/04/28 17:40:51 UTC

[GitHub] [incubator-doris] wyb opened a new pull request #3418: Add spark etl cluster and cluster manager

wyb opened a new pull request #3418:
URL: https://github.com/apache/incubator-doris/pull/3418


   Spark clusters are used to preprocess data (partition, sort, aggregation) in spark load, and should be managed by Doris.
   
   Spark load documentation: docs/documentation/cn/administrator-guide/load-data/spark-load-manual.md
   
   1. add spark cluster
   ```sql
   ALTER SYSTEM ADD LOAD CLUSTER cluster_name
   PROPERTIES("key1" = "value1", ...)
   ```
   
   2. drop spark cluster
   ```sql
   ALTER SYSTEM DROP LOAD CLUSTER cluster_name
   ```
   
   3. show spark cluster
   ```sql
   SHOW LOAD CLUSTERS
   SHOW PROC "/load_etl_clusters"
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] wyb commented on a change in pull request #3418: [Spark load] Add spark etl cluster and cluster manager

Posted by GitBox <gi...@apache.org>.
wyb commented on a change in pull request #3418:
URL: https://github.com/apache/incubator-doris/pull/3418#discussion_r427358330



##########
File path: fe/src/main/java/org/apache/doris/analysis/CreateResourceStmt.java
##########
@@ -0,0 +1,100 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+package org.apache.doris.analysis;
+
+import org.apache.doris.catalog.Catalog;
+import org.apache.doris.catalog.Resource.ResourceType;
+import org.apache.doris.common.AnalysisException;
+import org.apache.doris.common.ErrorCode;
+import org.apache.doris.common.ErrorReport;
+import org.apache.doris.common.FeNameFormat;
+import org.apache.doris.common.UserException;
+import org.apache.doris.common.util.PrintableMap;
+import org.apache.doris.mysql.privilege.PrivPredicate;
+import org.apache.doris.qe.ConnectContext;
+
+import java.util.Map;
+
+// CREATE [EXTERNAL] RESOURCE resource_name
+// PROPERTIES (key1 = value1, ...)
+public class CreateResourceStmt extends DdlStmt {
+    private static final String TYPE = "type";
+
+    private final boolean isExternal;
+    private final String resourceName;
+    private final Map<String, String> properties;
+
+    public CreateResourceStmt(boolean isExternal, String resourceName, Map<String, String> properties) {
+        this.isExternal = isExternal;
+        this.resourceName = resourceName;
+        this.properties = properties;
+    }
+
+    public String getResourceName() {
+        return resourceName;
+    }
+
+    public Map<String, String> getProperties() {
+        return properties;
+    }
+
+    public ResourceType getResourceType() {
+        return ResourceType.fromString(properties.get(TYPE));
+    }
+
+    @Override
+    public void analyze(Analyzer analyzer) throws UserException {
+        super.analyze(analyzer);
+
+        // check auth
+        if (!Catalog.getCurrentCatalog().getAuth().checkGlobalPriv(ConnectContext.get(), PrivPredicate.ADMIN)) {
+            ErrorReport.reportAnalysisException(ErrorCode.ERR_SPECIFIC_ACCESS_DENIED_ERROR, "ADMIN");
+        }
+
+        // check name
+        FeNameFormat.checkResourceName(resourceName);
+
+        // check type in properties
+        if (properties == null || properties.isEmpty()) {
+            throw new AnalysisException("Resource properties can't be null");
+        }
+        String type = properties.get(TYPE);
+        if (type == null) {
+            throw new AnalysisException("Resource type can't be null");
+        }
+        ResourceType resourceType = ResourceType.fromString(type);
+        if (resourceType == null) {
+            throw new AnalysisException("Unsupported resource type: " + type);
+        }
+        if (resourceType == ResourceType.SPARK && !isExternal) {
+            throw new AnalysisException("Spark is external resource");
+        }

Review comment:
       already check in SparkResource




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] morningman merged pull request #3418: [Spark load] Add spark etl cluster and cluster manager

Posted by GitBox <gi...@apache.org>.
morningman merged pull request #3418:
URL: https://github.com/apache/incubator-doris/pull/3418


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] wyb commented on a change in pull request #3418: [Spark load] Add spark etl cluster and cluster manager

Posted by GitBox <gi...@apache.org>.
wyb commented on a change in pull request #3418:
URL: https://github.com/apache/incubator-doris/pull/3418#discussion_r428745311



##########
File path: fe/src/main/java/org/apache/doris/mysql/privilege/PaloPrivilege.java
##########
@@ -25,7 +25,8 @@
     LOAD_PRIV("Load_priv", 4, "Privilege for loading data into tables"),
     ALTER_PRIV("Alter_priv", 5, "Privilege for alter database or table"),
     CREATE_PRIV("Create_priv", 6, "Privilege for createing database or table"),
-    DROP_PRIV("Drop_priv", 7, "Privilege for dropping database or table");
+    DROP_PRIV("Drop_priv", 7, "Privilege for dropping database or table"),
+    USAGE_PRIV("Usage_priv", 8, "Privilege for use resource");

Review comment:
       using resource?
   reference snowflake usage privilege




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] wyb commented on pull request #3418: [Spark load] Add spark etl cluster and cluster manager

Posted by GitBox <gi...@apache.org>.
wyb commented on pull request #3418:
URL: https://github.com/apache/incubator-doris/pull/3418#issuecomment-621616614


   > @wyb Hi, why comment the update load cluster code?
   
   Because EtlClusterDesc class is used in the load job process, and is not in this pr. 
   I will remove the comment in the load job process pr.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] wangbo commented on a change in pull request #3418: [Spark load] Add spark etl cluster and cluster manager

Posted by GitBox <gi...@apache.org>.
wangbo commented on a change in pull request #3418:
URL: https://github.com/apache/incubator-doris/pull/3418#discussion_r429148045



##########
File path: docs/zh-CN/administrator-guide/resource-management.md
##########
@@ -0,0 +1,125 @@
+---
+{
+    "title": "资源管理",
+    "language": "zh-CN"
+}
+---
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# 资源管理
+
+为了节省Doris集群内的计算、存储资源,Doris需要引入一些其他外部资源来完成相关的工作,如Spark/GPU用于查询,HDFS/S3用于外部存储,Spark/MapReduce用于ETL等,因此我们引入资源管理机制来管理Doris使用的这些外部资源。
+
+
+
+## 基本概念
+
+一个资源包含名字、类型等基本信息,名字为全局唯一,不同类型的资源包含不同的属性,具体参考各资源的介绍。
+
+资源的创建和删除只能由拥有 `admin` 权限的用户进行操作。一个资源隶属于整个Doris集群。拥有 `admin` 权限的用户可以将使用权限`usage_priv` 赋给普通用户。可参考`HELP GRANT`或者权限文档。
+
+
+
+## 具体操作
+
+资源管理主要有三个命令:`CREATE RESOURCE`,`DROP RESOURCE` 和 `SHOW RESOURCES`,分别为创建、删除和查看资源。这三个命令的具体语法可以通过MySQL客户端连接到 Doris 后,执行 `HELP cmd` 的方式查看帮助。
+
+1. CREATE RESOURCE
+
+   语法
+
+   ```sql
+   CREATE [EXTERNAL] RESOURCE "resource_name"                                  
+     PROPERTIES ("key"="value", ...); 
+   ```
+
+   在创建资源的命令中,用户必须提供以下信息:
+
+   * `resource_name` 为 Doris 中配置的资源的名字。
+   * `PROPERTIES` 是资源相关参数,如下:
+     * `type`:资源类型,必填,目前仅支持 spark。
+     * 其他参数见各资源介绍。
+
+2. DROP RESOURCE
+
+   该命令可以删除一个已存在的资源。具体操作见:`HELP DROP RESOURCE`
+
+3. SHOW RESOURCES
+
+   该命令可以查看用户有使用权限的资源。具体操作见:`HELP SHOW RESOURCES`
+
+
+
+## 支持的资源
+
+目前仅支持Spark资源,完成ETL工作。下面的示例都以Spark资源为例。
+
+### Spark
+
+#### 参数
+
+##### Spark 相关参数如下:
+
+`spark.master`: 必填,目前支持yarn,spark://host:port。
+
+`spark.submit.deployMode`: Spark 程序的部署模式,必填,支持 cluster,client 两种。
+
+`spark.hadoop.yarn.resourcemanager.address`: master为yarn时必填。
+
+`spark.hadoop.fs.defaultFS`: master为yarn时必填。
+
+其他参数为可选,参考http://spark.apache.org/docs/latest/configuration.html。
+
+
+
+##### 如果Spark用于ETL,还需要指定以下参数:
+
+`working_dir`: ETL 使用的目录。spark作为ETL资源使用时必填。例如:hdfs://host:port/tmp/doris。
+
+`broker`: broker 名字。spark作为ETL资源使用时必填。需要使用`ALTER SYSTEM ADD BROKER` 命令提前完成配置。 
+
+  * `broker.property_key`: broker读取ETL生成的中间文件时需要指定的认证信息等。
+
+
+
+#### 示例
+
+创建 yarn cluster 模式,名为 spark0 的 Spark 资源。
+
+```sql
+CREATE EXTERNAL RESOURCE "spark0"
+PROPERTIES
+(
+  "type" = "spark",
+  "spark.master" = "yarn",
+  "spark.submit.deployMode" = "cluster",
+  "spark.jars" = "xxx.jar,yyy.jar",
+  "spark.files" = "/tmp/aaa,/tmp/bbb",
+  "spark.executor.memory" = "1g",
+  "spark.yarn.queue" = "queue0",
+  "spark.hadoop.yarn.resourcemanager.address" = "127.0.0.1:9999",
+  "spark.hadoop.fs.defaultFS" = "hdfs://127.0.0.1:10000",
+  "working_dir" = "hdfs://127.0.0.1:10000/tmp/doris",

Review comment:
       Is "spark.hadoop.fs.defaultFS"  is used for load data from HDFS? Or it seems that only ```working_dir```  is enough




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] wyb commented on a change in pull request #3418: [Spark load] Add spark etl cluster and cluster manager

Posted by GitBox <gi...@apache.org>.
wyb commented on a change in pull request #3418:
URL: https://github.com/apache/incubator-doris/pull/3418#discussion_r428732335



##########
File path: docs/zh-CN/administrator-guide/load-data/spark-load-manual.md
##########
@@ -0,0 +1,397 @@
+---
+{
+    "title": "Spark Load",
+    "language": "zh-CN"
+}
+---  
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Spark Load
+
+Spark load 通过 Spark 实现对导入数据的预处理,提高 Doris 大数据量的导入性能并且节省 Doris 集群的计算资源。主要用于初次迁移,大数据量导入 Doris 的场景。
+
+Spark load 是一种异步导入方式,用户需要通过 MySQL 协议创建 Spark 类型导入任务,并通过 `SHOW LOAD` 查看导入结果。
+
+
+
+## 适用场景
+
+* 源数据在 Spark 可以访问的存储系统中,如 HDFS。
+* 数据量在 几十 GB 到 TB 级别。
+
+
+
+## 名词解释
+
+1. Frontend(FE):Doris 系统的元数据和调度节点。在导入流程中主要负责导入任务的调度工作。
+2. Backend(BE):Doris 系统的计算和存储节点。在导入流程中主要负责数据写入及存储。
+3. Spark ETL:在导入流程中主要负责数据的 ETL 工作,包括全局字典构建(BITMAP类型)、分区、排序、聚合等。
+4. Broker:Broker 为一个独立的无状态进程。封装了文件系统接口,提供 Doris 读取远端存储系统中文件的能力。
+
+
+## 基本原理
+
+### 基本流程
+
+用户通过 MySQL 客户端提交 Spark 类型导入任务,FE记录元数据并返回用户提交成功。
+
+Spark load 任务的执行主要分为以下5个阶段。
+
+1. FE 调度提交 ETL 任务到 Spark 集群执行。
+2. Spark 集群执行 ETL 完成对导入数据的预处理。包括全局字典构建(BITMAP类型)、分区、排序、聚合等。
+3. ETL 任务完成后,FE 获取预处理过的每个分片的数据路径,并调度相关的 BE 执行 Push 任务。
+4. BE 通过 Broker 读取数据,转化为 Doris 底层存储格式。
+5. FE 调度生效版本,完成导入任务。
+
+```
+                 +
+                 | 0. User create spark load job
+            +----v----+
+            |   FE    |---------------------------------+
+            +----+----+                                 |
+                 | 3. FE send push tasks                |
+                 | 5. FE publish version                |
+    +------------+------------+                         |
+    |            |            |                         |
++---v---+    +---v---+    +---v---+                     |
+|  BE   |    |  BE   |    |  BE   |                     |1. FE submit Spark ETL job
++---^---+    +---^---+    +---^---+                     |
+    |4. BE push with broker   |                         |
++---+---+    +---+---+    +---+---+                     |
+|Broker |    |Broker |    |Broker |                     |
++---^---+    +---^---+    +---^---+                     |
+    |            |            |                         |
++---+------------+------------+---+ 2.ETL +-------------v---------------+
+|               HDFS              +------->       Spark cluster         |
+|                                 <-------+                             |
++---------------------------------+       +-----------------------------+
+
+```
+
+
+
+### 全局字典
+
+待补
+
+
+
+### 数据预处理(DPP)
+
+待补
+
+
+
+## 基本操作
+
+### 配置 ETL 集群
+
+Spark作为一种外部计算资源在Doris中用来完成ETL工作,未来可能还有其他的外部资源会加入到Doris中使用,如Spark/GPU用于查询,HDFS/S3用于外部存储,MapReduce用于ETL等,因此我们引入resource management来管理Doris使用的这些外部资源。
+
+提交 Spark 导入任务之前,需要配置执行 ETL 任务的 Spark 集群。
+
+语法:
+
+```sql
+-- create spark resource
+CREATE EXTERNAL RESOURCE resource_name
+PROPERTIES 
+(                 
+  type = spark,
+  spark_conf_key = spark_conf_value,
+  working_dir = path,
+  broker = broker_name,
+  broker.property_key = property_value
+)
+
+-- drop spark resource
+DROP RESOURCE resource_name
+
+-- show resources
+SHOW RESOURCES
+SHOW PROC "/resources"
+
+-- privileges
+GRANT USAGE_PRIV ON RESOURCE resource_name TO user_identity
+GRANT USAGE_PRIV ON RESOURCE resource_name TO ROLE role_name
+
+REVOKE USAGE_PRIV ON RESOURCE resource_name FROM user_identity
+REVOKE USAGE_PRIV ON RESOURCE resource_name FROM ROLE role_name
+```
+
+#### 创建资源
+
+`resource_name` 为 Doris 中配置的 Spark 资源的名字。
+
+`PROPERTIES` 是 Spark 资源相关参数,如下:
+
+- `type`:资源类型,必填,目前仅支持 spark。
+
+- Spark 相关参数如下:
+  - `spark.master`: 必填,目前支持yarn,spark://host:port。
+  - `spark.submit.deployMode`:  Spark 程序的部署模式,必填,支持 cluster,client 两种。
+  - `spark.hadoop.yarn.resourcemanager.address`: master为yarn时必填。
+  - `spark.hadoop.fs.defaultFS`: master为yarn时必填。
+  - 其他参数为可选,参考http://spark.apache.org/docs/latest/configuration.html 
+- `working_dir`: ETL 使用的目录。spark作为ETL资源使用时必填。例如:hdfs://host:port/tmp/doris。

Review comment:
       spark.xxx is the standard format of spark configuration,so i think it is better to use working_dir to distinguish




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] wangbo commented on a change in pull request #3418: [Spark load] Add spark etl cluster and cluster manager

Posted by GitBox <gi...@apache.org>.
wangbo commented on a change in pull request #3418:
URL: https://github.com/apache/incubator-doris/pull/3418#discussion_r429174060



##########
File path: docs/zh-CN/administrator-guide/resource-management.md
##########
@@ -0,0 +1,125 @@
+---
+{
+    "title": "资源管理",
+    "language": "zh-CN"
+}
+---
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# 资源管理
+
+为了节省Doris集群内的计算、存储资源,Doris需要引入一些其他外部资源来完成相关的工作,如Spark/GPU用于查询,HDFS/S3用于外部存储,Spark/MapReduce用于ETL等,因此我们引入资源管理机制来管理Doris使用的这些外部资源。
+
+
+
+## 基本概念
+
+一个资源包含名字、类型等基本信息,名字为全局唯一,不同类型的资源包含不同的属性,具体参考各资源的介绍。
+
+资源的创建和删除只能由拥有 `admin` 权限的用户进行操作。一个资源隶属于整个Doris集群。拥有 `admin` 权限的用户可以将使用权限`usage_priv` 赋给普通用户。可参考`HELP GRANT`或者权限文档。
+
+
+
+## 具体操作
+
+资源管理主要有三个命令:`CREATE RESOURCE`,`DROP RESOURCE` 和 `SHOW RESOURCES`,分别为创建、删除和查看资源。这三个命令的具体语法可以通过MySQL客户端连接到 Doris 后,执行 `HELP cmd` 的方式查看帮助。
+
+1. CREATE RESOURCE
+
+   语法
+
+   ```sql
+   CREATE [EXTERNAL] RESOURCE "resource_name"                                  
+     PROPERTIES ("key"="value", ...); 
+   ```
+
+   在创建资源的命令中,用户必须提供以下信息:
+
+   * `resource_name` 为 Doris 中配置的资源的名字。
+   * `PROPERTIES` 是资源相关参数,如下:
+     * `type`:资源类型,必填,目前仅支持 spark。
+     * 其他参数见各资源介绍。
+
+2. DROP RESOURCE
+
+   该命令可以删除一个已存在的资源。具体操作见:`HELP DROP RESOURCE`
+
+3. SHOW RESOURCES
+
+   该命令可以查看用户有使用权限的资源。具体操作见:`HELP SHOW RESOURCES`
+
+
+
+## 支持的资源
+
+目前仅支持Spark资源,完成ETL工作。下面的示例都以Spark资源为例。
+
+### Spark
+
+#### 参数
+
+##### Spark 相关参数如下:
+
+`spark.master`: 必填,目前支持yarn,spark://host:port。
+
+`spark.submit.deployMode`: Spark 程序的部署模式,必填,支持 cluster,client 两种。
+
+`spark.hadoop.yarn.resourcemanager.address`: master为yarn时必填。
+
+`spark.hadoop.fs.defaultFS`: master为yarn时必填。
+
+其他参数为可选,参考http://spark.apache.org/docs/latest/configuration.html。
+
+
+
+##### 如果Spark用于ETL,还需要指定以下参数:
+
+`working_dir`: ETL 使用的目录。spark作为ETL资源使用时必填。例如:hdfs://host:port/tmp/doris。
+
+`broker`: broker 名字。spark作为ETL资源使用时必填。需要使用`ALTER SYSTEM ADD BROKER` 命令提前完成配置。 
+
+  * `broker.property_key`: broker读取ETL生成的中间文件时需要指定的认证信息等。
+
+
+
+#### 示例
+
+创建 yarn cluster 模式,名为 spark0 的 Spark 资源。
+
+```sql
+CREATE EXTERNAL RESOURCE "spark0"
+PROPERTIES
+(
+  "type" = "spark",
+  "spark.master" = "yarn",
+  "spark.submit.deployMode" = "cluster",
+  "spark.jars" = "xxx.jar,yyy.jar",
+  "spark.files" = "/tmp/aaa,/tmp/bbb",
+  "spark.executor.memory" = "1g",
+  "spark.yarn.queue" = "queue0",
+  "spark.hadoop.yarn.resourcemanager.address" = "127.0.0.1:9999",
+  "spark.hadoop.fs.defaultFS" = "hdfs://127.0.0.1:10000",
+  "working_dir" = "hdfs://127.0.0.1:10000/tmp/doris",

Review comment:
       For the user case which they on has one cluster(one spark cluster, one hdfs) for ETL, spark client may read the ```hdfs-site.xml``` by ```HADOOP_HOME```, so that user needn't specify ```defaultFS ``` every time submit a spark job.
   In this case, spark.hadoop.fs.defaultFS is not a necessary item, it should be a optional item




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] wyb commented on a change in pull request #3418: [Spark load] Add spark etl cluster and cluster manager

Posted by GitBox <gi...@apache.org>.
wyb commented on a change in pull request #3418:
URL: https://github.com/apache/incubator-doris/pull/3418#discussion_r429129008



##########
File path: fe/src/main/java/org/apache/doris/catalog/Resource.java
##########
@@ -0,0 +1,110 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+package org.apache.doris.catalog;
+
+import org.apache.doris.analysis.CreateResourceStmt;
+import org.apache.doris.common.DdlException;
+import org.apache.doris.common.io.Text;
+import org.apache.doris.common.io.Writable;
+import org.apache.doris.common.proc.BaseProcResult;
+import org.apache.doris.persist.gson.GsonUtils;
+
+import com.google.gson.annotations.SerializedName;
+
+import java.io.DataInput;
+import java.io.DataOutput;
+import java.io.IOException;
+import java.util.Map;
+
+public abstract class Resource implements Writable {
+    public enum ResourceType {
+        UNKNOWN,
+        SPARK;
+
+        public static ResourceType fromString(String resourceType) {

Review comment:
       CaseInsensitive in fromString function and return UNKNOWN if resourceType does not exist




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] wangbo commented on a change in pull request #3418: [Spark load] Add spark etl cluster and cluster manager

Posted by GitBox <gi...@apache.org>.
wangbo commented on a change in pull request #3418:
URL: https://github.com/apache/incubator-doris/pull/3418#discussion_r429174060



##########
File path: docs/zh-CN/administrator-guide/resource-management.md
##########
@@ -0,0 +1,125 @@
+---
+{
+    "title": "资源管理",
+    "language": "zh-CN"
+}
+---
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# 资源管理
+
+为了节省Doris集群内的计算、存储资源,Doris需要引入一些其他外部资源来完成相关的工作,如Spark/GPU用于查询,HDFS/S3用于外部存储,Spark/MapReduce用于ETL等,因此我们引入资源管理机制来管理Doris使用的这些外部资源。
+
+
+
+## 基本概念
+
+一个资源包含名字、类型等基本信息,名字为全局唯一,不同类型的资源包含不同的属性,具体参考各资源的介绍。
+
+资源的创建和删除只能由拥有 `admin` 权限的用户进行操作。一个资源隶属于整个Doris集群。拥有 `admin` 权限的用户可以将使用权限`usage_priv` 赋给普通用户。可参考`HELP GRANT`或者权限文档。
+
+
+
+## 具体操作
+
+资源管理主要有三个命令:`CREATE RESOURCE`,`DROP RESOURCE` 和 `SHOW RESOURCES`,分别为创建、删除和查看资源。这三个命令的具体语法可以通过MySQL客户端连接到 Doris 后,执行 `HELP cmd` 的方式查看帮助。
+
+1. CREATE RESOURCE
+
+   语法
+
+   ```sql
+   CREATE [EXTERNAL] RESOURCE "resource_name"                                  
+     PROPERTIES ("key"="value", ...); 
+   ```
+
+   在创建资源的命令中,用户必须提供以下信息:
+
+   * `resource_name` 为 Doris 中配置的资源的名字。
+   * `PROPERTIES` 是资源相关参数,如下:
+     * `type`:资源类型,必填,目前仅支持 spark。
+     * 其他参数见各资源介绍。
+
+2. DROP RESOURCE
+
+   该命令可以删除一个已存在的资源。具体操作见:`HELP DROP RESOURCE`
+
+3. SHOW RESOURCES
+
+   该命令可以查看用户有使用权限的资源。具体操作见:`HELP SHOW RESOURCES`
+
+
+
+## 支持的资源
+
+目前仅支持Spark资源,完成ETL工作。下面的示例都以Spark资源为例。
+
+### Spark
+
+#### 参数
+
+##### Spark 相关参数如下:
+
+`spark.master`: 必填,目前支持yarn,spark://host:port。
+
+`spark.submit.deployMode`: Spark 程序的部署模式,必填,支持 cluster,client 两种。
+
+`spark.hadoop.yarn.resourcemanager.address`: master为yarn时必填。
+
+`spark.hadoop.fs.defaultFS`: master为yarn时必填。
+
+其他参数为可选,参考http://spark.apache.org/docs/latest/configuration.html。
+
+
+
+##### 如果Spark用于ETL,还需要指定以下参数:
+
+`working_dir`: ETL 使用的目录。spark作为ETL资源使用时必填。例如:hdfs://host:port/tmp/doris。
+
+`broker`: broker 名字。spark作为ETL资源使用时必填。需要使用`ALTER SYSTEM ADD BROKER` 命令提前完成配置。 
+
+  * `broker.property_key`: broker读取ETL生成的中间文件时需要指定的认证信息等。
+
+
+
+#### 示例
+
+创建 yarn cluster 模式,名为 spark0 的 Spark 资源。
+
+```sql
+CREATE EXTERNAL RESOURCE "spark0"
+PROPERTIES
+(
+  "type" = "spark",
+  "spark.master" = "yarn",
+  "spark.submit.deployMode" = "cluster",
+  "spark.jars" = "xxx.jar,yyy.jar",
+  "spark.files" = "/tmp/aaa,/tmp/bbb",
+  "spark.executor.memory" = "1g",
+  "spark.yarn.queue" = "queue0",
+  "spark.hadoop.yarn.resourcemanager.address" = "127.0.0.1:9999",
+  "spark.hadoop.fs.defaultFS" = "hdfs://127.0.0.1:10000",
+  "working_dir" = "hdfs://127.0.0.1:10000/tmp/doris",

Review comment:
       For the user case which they only has one cluster(one spark cluster, one hdfs) for ETL, spark client may read the ```hdfs-site.xml``` by ```HADOOP_HOME```, so that user needn't specify ```defaultFS ``` every time submit a spark job.
   In this case, spark.hadoop.fs.defaultFS is not a necessary item, it should be a optional item




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] wyb commented on a change in pull request #3418: [Spark load] Add spark etl cluster and cluster manager

Posted by GitBox <gi...@apache.org>.
wyb commented on a change in pull request #3418:
URL: https://github.com/apache/incubator-doris/pull/3418#discussion_r428734738



##########
File path: docs/zh-CN/administrator-guide/load-data/spark-load-manual.md
##########
@@ -0,0 +1,397 @@
+---
+{
+    "title": "Spark Load",
+    "language": "zh-CN"
+}
+---  
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Spark Load
+
+Spark load 通过 Spark 实现对导入数据的预处理,提高 Doris 大数据量的导入性能并且节省 Doris 集群的计算资源。主要用于初次迁移,大数据量导入 Doris 的场景。
+
+Spark load 是一种异步导入方式,用户需要通过 MySQL 协议创建 Spark 类型导入任务,并通过 `SHOW LOAD` 查看导入结果。
+
+
+
+## 适用场景
+
+* 源数据在 Spark 可以访问的存储系统中,如 HDFS。
+* 数据量在 几十 GB 到 TB 级别。
+
+
+
+## 名词解释
+
+1. Frontend(FE):Doris 系统的元数据和调度节点。在导入流程中主要负责导入任务的调度工作。
+2. Backend(BE):Doris 系统的计算和存储节点。在导入流程中主要负责数据写入及存储。
+3. Spark ETL:在导入流程中主要负责数据的 ETL 工作,包括全局字典构建(BITMAP类型)、分区、排序、聚合等。
+4. Broker:Broker 为一个独立的无状态进程。封装了文件系统接口,提供 Doris 读取远端存储系统中文件的能力。
+
+
+## 基本原理
+
+### 基本流程
+
+用户通过 MySQL 客户端提交 Spark 类型导入任务,FE记录元数据并返回用户提交成功。
+
+Spark load 任务的执行主要分为以下5个阶段。
+
+1. FE 调度提交 ETL 任务到 Spark 集群执行。
+2. Spark 集群执行 ETL 完成对导入数据的预处理。包括全局字典构建(BITMAP类型)、分区、排序、聚合等。
+3. ETL 任务完成后,FE 获取预处理过的每个分片的数据路径,并调度相关的 BE 执行 Push 任务。
+4. BE 通过 Broker 读取数据,转化为 Doris 底层存储格式。
+5. FE 调度生效版本,完成导入任务。
+
+```
+                 +
+                 | 0. User create spark load job
+            +----v----+
+            |   FE    |---------------------------------+
+            +----+----+                                 |
+                 | 3. FE send push tasks                |
+                 | 5. FE publish version                |
+    +------------+------------+                         |
+    |            |            |                         |
++---v---+    +---v---+    +---v---+                     |
+|  BE   |    |  BE   |    |  BE   |                     |1. FE submit Spark ETL job
++---^---+    +---^---+    +---^---+                     |
+    |4. BE push with broker   |                         |
++---+---+    +---+---+    +---+---+                     |
+|Broker |    |Broker |    |Broker |                     |
++---^---+    +---^---+    +---^---+                     |
+    |            |            |                         |
++---+------------+------------+---+ 2.ETL +-------------v---------------+
+|               HDFS              +------->       Spark cluster         |
+|                                 <-------+                             |
++---------------------------------+       +-----------------------------+
+
+```
+
+
+
+### 全局字典
+
+待补
+
+
+
+### 数据预处理(DPP)
+
+待补
+
+
+
+## 基本操作
+
+### 配置 ETL 集群
+
+Spark作为一种外部计算资源在Doris中用来完成ETL工作,未来可能还有其他的外部资源会加入到Doris中使用,如Spark/GPU用于查询,HDFS/S3用于外部存储,MapReduce用于ETL等,因此我们引入resource management来管理Doris使用的这些外部资源。
+
+提交 Spark 导入任务之前,需要配置执行 ETL 任务的 Spark 集群。
+
+语法:
+
+```sql
+-- create spark resource
+CREATE EXTERNAL RESOURCE resource_name
+PROPERTIES 
+(                 
+  type = spark,
+  spark_conf_key = spark_conf_value,
+  working_dir = path,
+  broker = broker_name,
+  broker.property_key = property_value
+)
+
+-- drop spark resource
+DROP RESOURCE resource_name
+
+-- show resources
+SHOW RESOURCES
+SHOW PROC "/resources"
+
+-- privileges
+GRANT USAGE_PRIV ON RESOURCE resource_name TO user_identity
+GRANT USAGE_PRIV ON RESOURCE resource_name TO ROLE role_name
+
+REVOKE USAGE_PRIV ON RESOURCE resource_name FROM user_identity
+REVOKE USAGE_PRIV ON RESOURCE resource_name FROM ROLE role_name
+```
+
+#### 创建资源
+
+`resource_name` 为 Doris 中配置的 Spark 资源的名字。
+
+`PROPERTIES` 是 Spark 资源相关参数,如下:
+
+- `type`:资源类型,必填,目前仅支持 spark。
+
+- Spark 相关参数如下:
+  - `spark.master`: 必填,目前支持yarn,spark://host:port。
+  - `spark.submit.deployMode`:  Spark 程序的部署模式,必填,支持 cluster,client 两种。
+  - `spark.hadoop.yarn.resourcemanager.address`: master为yarn时必填。
+  - `spark.hadoop.fs.defaultFS`: master为yarn时必填。
+  - 其他参数为可选,参考http://spark.apache.org/docs/latest/configuration.html 
+- `working_dir`: ETL 使用的目录。spark作为ETL资源使用时必填。例如:hdfs://host:port/tmp/doris。
+- `broker`: broker 名字。spark作为ETL资源使用时必填。需要使用`ALTER SYSTEM ADD BROKER` 命令提前完成配置。
+  - `broker.property_key`: broker读取ETL生成的中间文件时需要指定的认证信息等。
+
+示例:
+
+```sql
+-- yarn cluster 模式 
+CREATE EXTERNAL RESOURCE "spark0"
+PROPERTIES
+(
+  "type" = "spark",
+  "spark.master" = "yarn",
+  "spark.submit.deployMode" = "cluster",
+  "spark.jars" = "xxx.jar,yyy.jar",
+  "spark.files" = "/tmp/aaa,/tmp/bbb",
+  "spark.executor.memory" = "1g",
+  "spark.yarn.queue" = "queue0",
+  "spark.hadoop.yarn.resourcemanager.address" = "127.0.0.1:9999",
+  "spark.hadoop.fs.defaultFS" = "hdfs://127.0.0.1:10000",
+  "working_dir" = "hdfs://127.0.0.1:10000/tmp/doris",
+  "broker" = "broker0",
+  "broker.username" = "user0",
+  "broker.password" = "password0"
+);
+
+-- spark standalone client 模式
+CREATE EXTERNAL RESOURCE "spark1"
+PROPERTIES
+(
+  "type" = "spark", 
+  "spark.master" = "spark://127.0.0.1:7777",
+  "spark.submit.deployMode" = "client",
+  "working_dir" = "hdfs://127.0.0.1:10000/tmp/doris",
+  "broker" = "broker1"
+);
+```
+
+#### 查看资源
+
+普通账户只能看到自己有USAGE_PRIV使用权限的资源。
+
+root和admin账户可以看到所有的资源。
+
+#### 资源权限
+
+资源权限通过GRANT REVOKE来管理,目前仅支持USAGE_PRIV使用权限。
+
+可以将USAGE_PRIV权限赋予某个用户或者某个角色,角色的使用与之前一致。
+```sql
+-- 授予spark0资源的使用权限给用户user0
+GRANT USAGE_PRIV ON RESOURCE "spark0" TO "user0"@"%";
+-- 授予spark0资源的使用权限给角色role0
+GRANT USAGE_PRIV ON RESOURCE "spark0" TO ROLE "role0";
+-- 授予所有资源的使用权限给用户user0
+GRANT USAGE_PRIV ON RESOURCE * TO "user0"@"%";
+-- 授予所有资源的使用权限给角色role0
+GRANT USAGE_PRIV ON RESOURCE * TO ROLE "role0";
+-- 撤销用户user0的spark0资源使用权限
+REVOKE USAGE_PRIV ON RESOURCE "spark0" FROM "user0"@"%";
+```
+
+
+
+### 创建导入
+
+语法:
+
+```sql
+LOAD LABEL load_label 
+    (data_desc, ...)
+    WITH RESOURCE resource_name resource_properties
+    [PROPERTIES (key1=value1, ... )]
+
+* load_label:
+	db_name.label_name
+
+* data_desc:
+    DATA INFILE ('file_path', ...)
+    [NEGATIVE]
+    INTO TABLE tbl_name
+    [PARTITION (p1, p2)]
+    [COLUMNS TERMINATED BY separator ]
+    [(col1, ...)]
+    [SET (k1=f1(xx), k2=f2(xx))]
+    [WHERE predicate]
+
+* resource_properties: 
+    (key2=value2, ...)
+```
+示例:
+
+```sql
+LOAD LABEL db1.label1
+(
+    DATA INFILE("hdfs://abc.com:8888/user/palo/test/ml/file1")
+    INTO TABLE tbl1
+    COLUMNS TERMINATED BY ","
+    (tmp_c1,tmp_c2)
+    SET
+    (
+        id=tmp_c2,
+        name=tmp_c1
+    ),
+    DATA INFILE("hdfs://abc.com:8888/user/palo/test/ml/file2")
+    INTO TABLE tbl2
+    COLUMNS TERMINATED BY ","
+    (col1, col2)
+    where col1 > 1
+)
+WITH RESOURCE 'spark0'
+(
+    "spark.executor.memory" = "2g",
+    "spark.shuffle.compress" = "true"
+)
+PROPERTIES
+(
+    "timeout" = "3600"
+);
+
+```
+
+创建导入的详细语法执行 ```HELP SPARK LOAD``` 查看语法帮助。这里主要介绍 Spark load 的创建导入语法中参数意义和注意事项。
+
+#### Label
+
+导入任务的标识。每个导入任务,都有一个在单 database 内部唯一的 Label。具体规则与 `Broker Load` 一致。
+
+#### 数据描述类参数
+
+目前支持的数据源有CSV和hive table。其他规则与 `Broker Load` 一致。
+
+#### 导入作业参数
+
+导入作业参数主要指的是 Spark load 创建导入语句中的属于 ```opt_properties```部分的参数。导入作业参数是作用于整个导入作业的。规则与 `Broker Load` 一致。
+
+#### Spark资源参数
+
+Spark资源需要提前配置到 Doris系统中并且赋予用户USAGE_PRIV权限后才能使用 Spark load。
+
+当用户有临时性的需求,比如增加任务使用的资源而修改 Spark configs,可以在这里设置,设置仅对本次任务生效,并不影响 Doris 集群中已有的配置。
+
+```sql
+WITH RESOURCE 'spark0'
+(
+  "spark.driver.memory" = "1g",
+  "spark.executor.memory" = "3g"
+)
+```
+
+
+
+### 查看导入
+
+Spark load 导入方式同 Broker load 一样都是异步的,所以用户必须将创建导入的 Label 记录,并且在**查看导入命令中使用 Label 来查看导入结果**。查看导入命令在所有导入方式中是通用的,具体语法可执行 ```HELP SHOW LOAD``` 查看。
+
+示例:
+
+```
+mysql> show load order by createtime desc limit 1\G
+*************************** 1. row ***************************
+         JobId: 76391
+         Label: label1
+         State: FINISHED
+      Progress: ETL:100%; LOAD:100%
+          Type: SPARK
+       EtlInfo: unselected.rows=4; dpp.abnorm.ALL=15; dpp.norm.ALL=28133376
+      TaskInfo: cluster:cluster0; timeout(s):10800; max_filter_ratio:5.0E-5
+      ErrorMsg: N/A
+    CreateTime: 2019-07-27 11:46:42
+  EtlStartTime: 2019-07-27 11:46:44
+ EtlFinishTime: 2019-07-27 11:49:44
+ LoadStartTime: 2019-07-27 11:49:44
+LoadFinishTime: 2019-07-27 11:50:16
+           URL: http://1.1.1.1:8089/proxy/application_1586619723848_0035/
+    JobDetails: {"ScannedRows":28133395,"TaskNumber":1,"FileNumber":1,"FileSize":200000}
+```
+
+返回结果集中参数意义可以参考 Broker load。不同点如下:
+
++ State
+
+    导入任务当前所处的阶段。任务提交之后状态为 PENDING,提交 Spark ETL 之后状态变为 ETL,ETL 完成之后 FE 调度 BE 执行 push 操作状态变为 LOADING,push 完成并且版本生效后状态变为 FINISHED。
+    
+    导入任务的最终阶段有两个:CANCELLED 和 FINISHED,当 Load job 处于这两个阶段时导入完成。其中 CANCELLED 为导入失败,FINISHED 为导入成功。
+    
++ Progress
+
+    导入任务的进度描述。分为两种进度:ETL 和 LOAD,对应了导入流程的两个阶段 ETL 和 LOADING。
+    
+    LOAD 的进度范围为:0~100%。
+    
+    ```LOAD 进度 = 当前已完成所有replica导入的tablet个数 / 本次导入任务的总tablet个数 * 100%``` 
+    
+    **如果所有导入表均完成导入,此时 LOAD 的进度为 99%** 导入进入到最后生效阶段,整个导入完成后,LOAD 的进度才会改为 100%。
+    
+    导入进度并不是线性的。所以如果一段时间内进度没有变化,并不代表导入没有在执行。
+    
++ Type
+
+    导入任务的类型。Spark load 为 SPARK。    
+
++ CreateTime/EtlStartTime/EtlFinishTime/LoadStartTime/LoadFinishTime
+
+    这几个值分别代表导入创建的时间,ETL 阶段开始的时间,ETL 阶段完成的时间,LOADING 阶段开始的时间和整个导入任务完成的时间。
+
++ JobDetails
+
+    显示一些作业的详细运行状态,ETL 结束的时候更新。包括导入文件的个数、总大小(字节)、子任务个数、已处理的原始行数等。
+
+    ```{"ScannedRows":139264,"TaskNumber":1,"FileNumber":1,"FileSize":940754064}```

Review comment:
       No, FE gets the statis when etl job is finished




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] wyb commented on a change in pull request #3418: [Spark load] Add spark etl cluster and cluster manager

Posted by GitBox <gi...@apache.org>.
wyb commented on a change in pull request #3418:
URL: https://github.com/apache/incubator-doris/pull/3418#discussion_r427361798



##########
File path: docs/zh-CN/administrator-guide/load-data/spark-load-manual.md
##########
@@ -0,0 +1,351 @@
+---                                                                                 
+{
+    "title": "Spark Load",
+    "language": "zh-CN"
+}
+---  
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Spark Load
+
+Spark load 通过 Spark 实现对导入数据的预处理,提高 Doris 大数据量的导入性能并且节省 Doris 集群的计算资源。主要用于初次迁移,大数据量导入 Doris 的场景。
+
+Spark load 是一种异步导入方式,用户需要通过 MySQL 协议创建 Spark 类型导入任务,并通过 `SHOW LOAD` 查看导入结果。
+
+
+
+## 适用场景
+
+* 源数据在 Spark 可以访问的存储系统中,如 HDFS。
+* 数据量在 几十 GB 到 TB 级别。
+
+
+
+## 名词解释
+
+1. Frontend(FE):Doris 系统的元数据和调度节点。在导入流程中主要负责导入任务的调度工作。
+2. Backend(BE):Doris 系统的计算和存储节点。在导入流程中主要负责数据写入及存储。
+3. Spark ETL:在导入流程中主要负责数据的 ETL 工作,包括全局字典构建(BITMAP类型)、分区、排序、聚合等。
+4. Broker:Broker 为一个独立的无状态进程。封装了文件系统接口,提供 Doris 读取远端存储系统中文件的能力。
+
+
+## 基本原理
+
+### 基本流程
+
+用户通过 MySQL 客户端提交 Spark 类型导入任务,FE记录元数据并返回用户提交成功。
+
+Spark load 任务的执行主要分为以下5个阶段。
+
+1. FE 调度提交 ETL 任务到 Spark 集群执行。
+2. Spark 集群执行 ETL 完成对导入数据的预处理。包括全局字典构建(BITMAP类型)、分区、排序、聚合等。
+3. ETL 任务完成后,FE 获取预处理过的每个分片的数据路径,并调度相关的 BE 执行 Push 任务。
+4. BE 通过 Broker 读取数据,转化为 Doris 底层存储格式。
+5. FE 调度生效版本,完成导入任务。
+
+```
+                 +
+                 | 0. User create spark load job
+            +----v----+
+            |   FE    |---------------------------------+
+            +----+----+                                 |
+                 | 3. FE send push tasks                |
+                 | 5. FE publish version                |
+    +------------+------------+                         |
+    |            |            |                         |
++---v---+    +---v---+    +---v---+                     |
+|  BE   |    |  BE   |    |  BE   |                     |1. FE submit Spark ETL job
++---^---+    +---^---+    +---^---+                     |
+    |4. BE push with broker   |                         |
++---+---+    +---+---+    +---+---+                     |
+|Broker |    |Broker |    |Broker |                     |
++---^---+    +---^---+    +---^---+                     |
+    |            |            |                         |
++---+------------+------------+---+ 2.ETL +-------------v---------------+
+|               HDFS              +------->       Spark cluster         |
+|                                 <-------+                             |
++---------------------------------+       +-----------------------------+
+
+```
+
+
+
+### 全局字典
+
+待补
+
+
+
+### 数据预处理(DPP)
+
+待补
+
+
+
+## 基本操作
+
+### 配置 ETL 集群
+
+提交 Spark 导入任务之前,需要配置执行 ETL 任务的 Spark 集群。
+
+语法:
+
+```sql
+-- 添加 ETL 集群
+ALTER SYSTEM ADD LOAD CLUSTER cluster_name
+PROPERTIES("key1" = "value1", ...)
+
+-- 删除 ETL 集群
+ALTER SYSTEM DROP LOAD CLUSTER cluster_name
+
+-- 查看 ETL 集群
+SHOW LOAD CLUSTERS
+SHOW PROC "/load_etl_clusters"
+```
+
+`cluster_name` 为 Doris 中配置的 Spark 集群的名字。
+
+PROPERTIES 是 ETL 集群相关参数,如下:
+
+- `type`:集群类型,必填,目前仅支持 spark。
+
+- Spark ETL 集群相关参数如下:
+  - `master`:必填,目前支持yarn,spark://host:port。
+  - `deploy_mode`: 可选,默认为 cluster。支持 cluster,client 两种。
+  - `hdfs_etl_path`:ETL 使用的 HDFS 目录。必填。例如:hdfs://host:port/tmp/doris。
+  - `broker`:broker 名字。必填。需要使用`ALTER SYSTEM ADD BROKER` 命令提前完成配置。
+  - `yarn_configs`: HDFS YARN 参数,master 为 yarn 时必填。需要指定 yarn.resourcemanager.address 和 fs.defaultFS。不同 configs 之间使用`;`拼接。
+  - `spark_args`: Spark 任务提交时指定的参数,可选。具体可参考 spark-submit 命令,每个 arg  必须以`--`开头,不同 args 之间使用`;`拼接。例如--files=/file1,/file2;--jars=/a.jar,/b.jar。
+  - `spark_configs`: Spark 参数,可选。具体参数可参考http://spark.apache.org/docs/latest/configuration.html。不同 configs 之间使用`;`拼接。
+
+示例:
+
+```sql
+-- yarn cluster 模式 
+ALTER SYSTEM ADD LOAD CLUSTER "cluster0"
+PROPERTIES
+(
+"type" = "spark", 
+"master" = "yarn",
+"hdfs_etl_path" = "hdfs://1.1.1.1:801/tmp/doris",
+"broker" = "broker0",

Review comment:
       the broker is used to read the ETL intermediate results in the working_dir, not user source data.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] imay commented on a change in pull request #3418: [Spark load] Add spark etl cluster and cluster manager

Posted by GitBox <gi...@apache.org>.
imay commented on a change in pull request #3418:
URL: https://github.com/apache/incubator-doris/pull/3418#discussion_r426222831



##########
File path: fe/src/main/java/org/apache/doris/analysis/CreateResourceStmt.java
##########
@@ -0,0 +1,100 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+package org.apache.doris.analysis;
+
+import org.apache.doris.catalog.Catalog;
+import org.apache.doris.catalog.Resource.ResourceType;
+import org.apache.doris.common.AnalysisException;
+import org.apache.doris.common.ErrorCode;
+import org.apache.doris.common.ErrorReport;
+import org.apache.doris.common.FeNameFormat;
+import org.apache.doris.common.UserException;
+import org.apache.doris.common.util.PrintableMap;
+import org.apache.doris.mysql.privilege.PrivPredicate;
+import org.apache.doris.qe.ConnectContext;
+
+import java.util.Map;
+
+// CREATE [EXTERNAL] RESOURCE resource_name
+// PROPERTIES (key1 = value1, ...)
+public class CreateResourceStmt extends DdlStmt {
+    private static final String TYPE = "type";
+
+    private final boolean isExternal;
+    private final String resourceName;
+    private final Map<String, String> properties;
+
+    public CreateResourceStmt(boolean isExternal, String resourceName, Map<String, String> properties) {
+        this.isExternal = isExternal;
+        this.resourceName = resourceName;
+        this.properties = properties;
+    }
+
+    public String getResourceName() {
+        return resourceName;
+    }
+
+    public Map<String, String> getProperties() {
+        return properties;
+    }
+
+    public ResourceType getResourceType() {
+        return ResourceType.fromString(properties.get(TYPE));
+    }
+
+    @Override
+    public void analyze(Analyzer analyzer) throws UserException {
+        super.analyze(analyzer);
+
+        // check auth
+        if (!Catalog.getCurrentCatalog().getAuth().checkGlobalPriv(ConnectContext.get(), PrivPredicate.ADMIN)) {
+            ErrorReport.reportAnalysisException(ErrorCode.ERR_SPECIFIC_ACCESS_DENIED_ERROR, "ADMIN");
+        }
+
+        // check name
+        FeNameFormat.checkResourceName(resourceName);
+
+        // check type in properties
+        if (properties == null || properties.isEmpty()) {
+            throw new AnalysisException("Resource properties can't be null");
+        }
+        String type = properties.get(TYPE);
+        if (type == null) {
+            throw new AnalysisException("Resource type can't be null");
+        }
+        ResourceType resourceType = ResourceType.fromString(type);
+        if (resourceType == null) {
+            throw new AnalysisException("Unsupported resource type: " + type);
+        }
+        if (resourceType == ResourceType.SPARK && !isExternal) {
+            throw new AnalysisException("Spark is external resource");
+        }

Review comment:
       If there are properties which are not valid, better to throw exception to let users know

##########
File path: fe/src/main/java/org/apache/doris/analysis/CreateResourceStmt.java
##########
@@ -0,0 +1,100 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+package org.apache.doris.analysis;
+
+import org.apache.doris.catalog.Catalog;
+import org.apache.doris.catalog.Resource.ResourceType;
+import org.apache.doris.common.AnalysisException;
+import org.apache.doris.common.ErrorCode;
+import org.apache.doris.common.ErrorReport;
+import org.apache.doris.common.FeNameFormat;
+import org.apache.doris.common.UserException;
+import org.apache.doris.common.util.PrintableMap;
+import org.apache.doris.mysql.privilege.PrivPredicate;
+import org.apache.doris.qe.ConnectContext;
+
+import java.util.Map;
+
+// CREATE [EXTERNAL] RESOURCE resource_name
+// PROPERTIES (key1 = value1, ...)
+public class CreateResourceStmt extends DdlStmt {
+    private static final String TYPE = "type";
+
+    private final boolean isExternal;
+    private final String resourceName;
+    private final Map<String, String> properties;
+
+    public CreateResourceStmt(boolean isExternal, String resourceName, Map<String, String> properties) {
+        this.isExternal = isExternal;
+        this.resourceName = resourceName;
+        this.properties = properties;
+    }
+
+    public String getResourceName() {
+        return resourceName;
+    }
+
+    public Map<String, String> getProperties() {
+        return properties;
+    }
+
+    public ResourceType getResourceType() {
+        return ResourceType.fromString(properties.get(TYPE));
+    }
+
+    @Override
+    public void analyze(Analyzer analyzer) throws UserException {
+        super.analyze(analyzer);
+
+        // check auth
+        if (!Catalog.getCurrentCatalog().getAuth().checkGlobalPriv(ConnectContext.get(), PrivPredicate.ADMIN)) {
+            ErrorReport.reportAnalysisException(ErrorCode.ERR_SPECIFIC_ACCESS_DENIED_ERROR, "ADMIN");
+        }
+
+        // check name
+        FeNameFormat.checkResourceName(resourceName);
+
+        // check type in properties
+        if (properties == null || properties.isEmpty()) {
+            throw new AnalysisException("Resource properties can't be null");
+        }
+        String type = properties.get(TYPE);
+        if (type == null) {
+            throw new AnalysisException("Resource type can't be null");
+        }
+        ResourceType resourceType = ResourceType.fromString(type);

Review comment:
       If the resource type has been resolved, better to save it as a class member.

##########
File path: fe/src/main/cup/sql_parser.cup
##########
@@ -1083,6 +1084,11 @@ create_stmt ::=
     {:
         RESULT = new AlterTableStmt(tableName, Lists.newArrayList(new CreateIndexClause(tableName, new IndexDef(indexName, cols, indexType, comment), false)));
     :}
+    /* resource */
+    | KW_CREATE opt_external:isExternal KW_RESOURCE ident_or_text:resourceName opt_properties:properties
+    {:
+        RESULT = new CreateResourceStmt(isExternal, resourceName, properties);

Review comment:
       Should update documents for all changed SQL reference.

##########
File path: fe/src/main/java/org/apache/doris/catalog/Resource.java
##########
@@ -0,0 +1,100 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+package org.apache.doris.catalog;
+
+import org.apache.doris.analysis.CreateResourceStmt;
+import org.apache.doris.common.DdlException;
+import org.apache.doris.common.io.Text;
+import org.apache.doris.common.io.Writable;
+import org.apache.doris.common.proc.BaseProcResult;
+import org.apache.doris.persist.gson.GsonUtils;
+
+import com.google.gson.annotations.SerializedName;
+
+import java.io.DataInput;
+import java.io.DataOutput;
+import java.io.IOException;
+import java.util.Map;
+
+public abstract class Resource implements Writable {
+    public enum ResourceType {
+        SPARK;
+
+        public static ResourceType fromString(String resourceType) {
+            for (ResourceType type : ResourceType.values()) {
+                if (type.name().equalsIgnoreCase(resourceType)) {
+                    return type;
+                }
+            }
+            return null;
+        }
+    }
+
+    @SerializedName(value = "name")
+    protected String name;
+    @SerializedName(value = "type")
+    protected ResourceType type;
+
+    public Resource(String name, ResourceType type) {
+        this.name = name;
+        this.type = type;
+    }
+
+    public static Resource fromStmt(CreateResourceStmt stmt) throws DdlException {
+        Resource resource = null;
+        ResourceType type = stmt.getResourceType();
+        switch (type) {
+            case SPARK:
+                resource = new SparkResource(stmt.getResourceName());
+                break;
+            default:
+                throw new DdlException("Only support Spark resource.");
+        }
+
+        resource.setProperties(stmt.getProperties());
+        return resource;
+    }
+
+    public String getName() {
+        return name;
+    }
+
+    public ResourceType getType() {
+        return type;
+    }
+
+    protected abstract void setProperties(Map<String, String> properties) throws DdlException;
+    protected abstract void getProcNodeData(BaseProcResult result);

Review comment:
       should give comments for these functions to help others know how to implement them

##########
File path: fe/src/main/java/org/apache/doris/catalog/Resource.java
##########
@@ -0,0 +1,100 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+package org.apache.doris.catalog;
+
+import org.apache.doris.analysis.CreateResourceStmt;
+import org.apache.doris.common.DdlException;
+import org.apache.doris.common.io.Text;
+import org.apache.doris.common.io.Writable;
+import org.apache.doris.common.proc.BaseProcResult;
+import org.apache.doris.persist.gson.GsonUtils;
+
+import com.google.gson.annotations.SerializedName;
+
+import java.io.DataInput;
+import java.io.DataOutput;
+import java.io.IOException;
+import java.util.Map;
+
+public abstract class Resource implements Writable {
+    public enum ResourceType {
+        SPARK;

Review comment:
       ```suggestion
           UNKNOWN,
           SPARK;
   ```

##########
File path: fe/src/main/java/org/apache/doris/catalog/Resource.java
##########
@@ -0,0 +1,100 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+package org.apache.doris.catalog;
+
+import org.apache.doris.analysis.CreateResourceStmt;
+import org.apache.doris.common.DdlException;
+import org.apache.doris.common.io.Text;
+import org.apache.doris.common.io.Writable;
+import org.apache.doris.common.proc.BaseProcResult;
+import org.apache.doris.persist.gson.GsonUtils;
+
+import com.google.gson.annotations.SerializedName;
+
+import java.io.DataInput;
+import java.io.DataOutput;
+import java.io.IOException;
+import java.util.Map;
+
+public abstract class Resource implements Writable {
+    public enum ResourceType {
+        SPARK;
+
+        public static ResourceType fromString(String resourceType) {
+            for (ResourceType type : ResourceType.values()) {
+                if (type.name().equalsIgnoreCase(resourceType)) {
+                    return type;
+                }
+            }
+            return null;

Review comment:
       ```suggestion
               return UNKNOWN;
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] wyb commented on a change in pull request #3418: [Spark load] Add spark etl cluster and cluster manager

Posted by GitBox <gi...@apache.org>.
wyb commented on a change in pull request #3418:
URL: https://github.com/apache/incubator-doris/pull/3418#discussion_r429139485



##########
File path: docs/zh-CN/administrator-guide/load-data/spark-load-manual.md
##########
@@ -0,0 +1,397 @@
+---
+{
+    "title": "Spark Load",
+    "language": "zh-CN"
+}
+---  
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Spark Load
+
+Spark load 通过 Spark 实现对导入数据的预处理,提高 Doris 大数据量的导入性能并且节省 Doris 集群的计算资源。主要用于初次迁移,大数据量导入 Doris 的场景。
+
+Spark load 是一种异步导入方式,用户需要通过 MySQL 协议创建 Spark 类型导入任务,并通过 `SHOW LOAD` 查看导入结果。
+
+
+
+## 适用场景
+
+* 源数据在 Spark 可以访问的存储系统中,如 HDFS。
+* 数据量在 几十 GB 到 TB 级别。
+
+
+
+## 名词解释
+
+1. Frontend(FE):Doris 系统的元数据和调度节点。在导入流程中主要负责导入任务的调度工作。
+2. Backend(BE):Doris 系统的计算和存储节点。在导入流程中主要负责数据写入及存储。
+3. Spark ETL:在导入流程中主要负责数据的 ETL 工作,包括全局字典构建(BITMAP类型)、分区、排序、聚合等。
+4. Broker:Broker 为一个独立的无状态进程。封装了文件系统接口,提供 Doris 读取远端存储系统中文件的能力。
+
+
+## 基本原理
+
+### 基本流程
+
+用户通过 MySQL 客户端提交 Spark 类型导入任务,FE记录元数据并返回用户提交成功。
+
+Spark load 任务的执行主要分为以下5个阶段。
+
+1. FE 调度提交 ETL 任务到 Spark 集群执行。
+2. Spark 集群执行 ETL 完成对导入数据的预处理。包括全局字典构建(BITMAP类型)、分区、排序、聚合等。
+3. ETL 任务完成后,FE 获取预处理过的每个分片的数据路径,并调度相关的 BE 执行 Push 任务。
+4. BE 通过 Broker 读取数据,转化为 Doris 底层存储格式。
+5. FE 调度生效版本,完成导入任务。
+
+```
+                 +
+                 | 0. User create spark load job
+            +----v----+
+            |   FE    |---------------------------------+
+            +----+----+                                 |
+                 | 3. FE send push tasks                |
+                 | 5. FE publish version                |
+    +------------+------------+                         |
+    |            |            |                         |
++---v---+    +---v---+    +---v---+                     |
+|  BE   |    |  BE   |    |  BE   |                     |1. FE submit Spark ETL job
++---^---+    +---^---+    +---^---+                     |
+    |4. BE push with broker   |                         |
++---+---+    +---+---+    +---+---+                     |
+|Broker |    |Broker |    |Broker |                     |
++---^---+    +---^---+    +---^---+                     |
+    |            |            |                         |
++---+------------+------------+---+ 2.ETL +-------------v---------------+
+|               HDFS              +------->       Spark cluster         |
+|                                 <-------+                             |
++---------------------------------+       +-----------------------------+
+
+```
+
+
+
+### 全局字典
+
+待补
+
+
+
+### 数据预处理(DPP)
+
+待补
+
+
+
+## 基本操作
+
+### 配置 ETL 集群
+
+Spark作为一种外部计算资源在Doris中用来完成ETL工作,未来可能还有其他的外部资源会加入到Doris中使用,如Spark/GPU用于查询,HDFS/S3用于外部存储,MapReduce用于ETL等,因此我们引入resource management来管理Doris使用的这些外部资源。
+
+提交 Spark 导入任务之前,需要配置执行 ETL 任务的 Spark 集群。
+
+语法:
+
+```sql
+-- create spark resource
+CREATE EXTERNAL RESOURCE resource_name
+PROPERTIES 
+(                 
+  type = spark,
+  spark_conf_key = spark_conf_value,
+  working_dir = path,
+  broker = broker_name,
+  broker.property_key = property_value
+)
+
+-- drop spark resource
+DROP RESOURCE resource_name
+
+-- show resources
+SHOW RESOURCES
+SHOW PROC "/resources"
+
+-- privileges
+GRANT USAGE_PRIV ON RESOURCE resource_name TO user_identity
+GRANT USAGE_PRIV ON RESOURCE resource_name TO ROLE role_name
+
+REVOKE USAGE_PRIV ON RESOURCE resource_name FROM user_identity
+REVOKE USAGE_PRIV ON RESOURCE resource_name FROM ROLE role_name
+```
+
+#### 创建资源
+
+`resource_name` 为 Doris 中配置的 Spark 资源的名字。
+
+`PROPERTIES` 是 Spark 资源相关参数,如下:
+
+- `type`:资源类型,必填,目前仅支持 spark。
+
+- Spark 相关参数如下:
+  - `spark.master`: 必填,目前支持yarn,spark://host:port。
+  - `spark.submit.deployMode`:  Spark 程序的部署模式,必填,支持 cluster,client 两种。
+  - `spark.hadoop.yarn.resourcemanager.address`: master为yarn时必填。
+  - `spark.hadoop.fs.defaultFS`: master为yarn时必填。
+  - 其他参数为可选,参考http://spark.apache.org/docs/latest/configuration.html 
+- `working_dir`: ETL 使用的目录。spark作为ETL资源使用时必填。例如:hdfs://host:port/tmp/doris。
+- `broker`: broker 名字。spark作为ETL资源使用时必填。需要使用`ALTER SYSTEM ADD BROKER` 命令提前完成配置。
+  - `broker.property_key`: broker读取ETL生成的中间文件时需要指定的认证信息等。
+
+示例:
+
+```sql
+-- yarn cluster 模式 
+CREATE EXTERNAL RESOURCE "spark0"
+PROPERTIES
+(
+  "type" = "spark",
+  "spark.master" = "yarn",
+  "spark.submit.deployMode" = "cluster",
+  "spark.jars" = "xxx.jar,yyy.jar",
+  "spark.files" = "/tmp/aaa,/tmp/bbb",
+  "spark.executor.memory" = "1g",
+  "spark.yarn.queue" = "queue0",
+  "spark.hadoop.yarn.resourcemanager.address" = "127.0.0.1:9999",
+  "spark.hadoop.fs.defaultFS" = "hdfs://127.0.0.1:10000",
+  "working_dir" = "hdfs://127.0.0.1:10000/tmp/doris",
+  "broker" = "broker0",
+  "broker.username" = "user0",
+  "broker.password" = "password0"
+);
+
+-- spark standalone client 模式
+CREATE EXTERNAL RESOURCE "spark1"
+PROPERTIES
+(
+  "type" = "spark", 
+  "spark.master" = "spark://127.0.0.1:7777",
+  "spark.submit.deployMode" = "client",
+  "working_dir" = "hdfs://127.0.0.1:10000/tmp/doris",
+  "broker" = "broker1"
+);
+```
+
+#### 查看资源
+
+普通账户只能看到自己有USAGE_PRIV使用权限的资源。
+
+root和admin账户可以看到所有的资源。
+
+#### 资源权限
+
+资源权限通过GRANT REVOKE来管理,目前仅支持USAGE_PRIV使用权限。
+
+可以将USAGE_PRIV权限赋予某个用户或者某个角色,角色的使用与之前一致。
+```sql
+-- 授予spark0资源的使用权限给用户user0
+GRANT USAGE_PRIV ON RESOURCE "spark0" TO "user0"@"%";
+-- 授予spark0资源的使用权限给角色role0
+GRANT USAGE_PRIV ON RESOURCE "spark0" TO ROLE "role0";
+-- 授予所有资源的使用权限给用户user0
+GRANT USAGE_PRIV ON RESOURCE * TO "user0"@"%";
+-- 授予所有资源的使用权限给角色role0
+GRANT USAGE_PRIV ON RESOURCE * TO ROLE "role0";
+-- 撤销用户user0的spark0资源使用权限
+REVOKE USAGE_PRIV ON RESOURCE "spark0" FROM "user0"@"%";
+```
+
+
+
+### 创建导入
+
+语法:
+
+```sql
+LOAD LABEL load_label 
+    (data_desc, ...)
+    WITH RESOURCE resource_name resource_properties
+    [PROPERTIES (key1=value1, ... )]
+
+* load_label:
+	db_name.label_name
+
+* data_desc:
+    DATA INFILE ('file_path', ...)
+    [NEGATIVE]
+    INTO TABLE tbl_name
+    [PARTITION (p1, p2)]
+    [COLUMNS TERMINATED BY separator ]
+    [(col1, ...)]
+    [SET (k1=f1(xx), k2=f2(xx))]
+    [WHERE predicate]
+
+* resource_properties: 
+    (key2=value2, ...)
+```
+示例:
+
+```sql
+LOAD LABEL db1.label1
+(
+    DATA INFILE("hdfs://abc.com:8888/user/palo/test/ml/file1")
+    INTO TABLE tbl1
+    COLUMNS TERMINATED BY ","
+    (tmp_c1,tmp_c2)
+    SET
+    (
+        id=tmp_c2,
+        name=tmp_c1
+    ),
+    DATA INFILE("hdfs://abc.com:8888/user/palo/test/ml/file2")
+    INTO TABLE tbl2
+    COLUMNS TERMINATED BY ","
+    (col1, col2)
+    where col1 > 1
+)
+WITH RESOURCE 'spark0'
+(
+    "spark.executor.memory" = "2g",
+    "spark.shuffle.compress" = "true"
+)
+PROPERTIES
+(
+    "timeout" = "3600"
+);
+
+```
+
+创建导入的详细语法执行 ```HELP SPARK LOAD``` 查看语法帮助。这里主要介绍 Spark load 的创建导入语法中参数意义和注意事项。
+
+#### Label
+
+导入任务的标识。每个导入任务,都有一个在单 database 内部唯一的 Label。具体规则与 `Broker Load` 一致。
+
+#### 数据描述类参数
+
+目前支持的数据源有CSV和hive table。其他规则与 `Broker Load` 一致。
+
+#### 导入作业参数
+
+导入作业参数主要指的是 Spark load 创建导入语句中的属于 ```opt_properties```部分的参数。导入作业参数是作用于整个导入作业的。规则与 `Broker Load` 一致。
+
+#### Spark资源参数
+
+Spark资源需要提前配置到 Doris系统中并且赋予用户USAGE_PRIV权限后才能使用 Spark load。
+
+当用户有临时性的需求,比如增加任务使用的资源而修改 Spark configs,可以在这里设置,设置仅对本次任务生效,并不影响 Doris 集群中已有的配置。
+
+```sql
+WITH RESOURCE 'spark0'
+(
+  "spark.driver.memory" = "1g",
+  "spark.executor.memory" = "3g"
+)
+```
+
+
+
+### 查看导入
+
+Spark load 导入方式同 Broker load 一样都是异步的,所以用户必须将创建导入的 Label 记录,并且在**查看导入命令中使用 Label 来查看导入结果**。查看导入命令在所有导入方式中是通用的,具体语法可执行 ```HELP SHOW LOAD``` 查看。
+
+示例:
+
+```
+mysql> show load order by createtime desc limit 1\G
+*************************** 1. row ***************************
+         JobId: 76391
+         Label: label1
+         State: FINISHED
+      Progress: ETL:100%; LOAD:100%
+          Type: SPARK
+       EtlInfo: unselected.rows=4; dpp.abnorm.ALL=15; dpp.norm.ALL=28133376
+      TaskInfo: cluster:cluster0; timeout(s):10800; max_filter_ratio:5.0E-5
+      ErrorMsg: N/A
+    CreateTime: 2019-07-27 11:46:42
+  EtlStartTime: 2019-07-27 11:46:44
+ EtlFinishTime: 2019-07-27 11:49:44
+ LoadStartTime: 2019-07-27 11:49:44
+LoadFinishTime: 2019-07-27 11:50:16
+           URL: http://1.1.1.1:8089/proxy/application_1586619723848_0035/
+    JobDetails: {"ScannedRows":28133395,"TaskNumber":1,"FileNumber":1,"FileSize":200000}
+```
+
+返回结果集中参数意义可以参考 Broker load。不同点如下:
+
++ State
+
+    导入任务当前所处的阶段。任务提交之后状态为 PENDING,提交 Spark ETL 之后状态变为 ETL,ETL 完成之后 FE 调度 BE 执行 push 操作状态变为 LOADING,push 完成并且版本生效后状态变为 FINISHED。
+    
+    导入任务的最终阶段有两个:CANCELLED 和 FINISHED,当 Load job 处于这两个阶段时导入完成。其中 CANCELLED 为导入失败,FINISHED 为导入成功。
+    
++ Progress
+
+    导入任务的进度描述。分为两种进度:ETL 和 LOAD,对应了导入流程的两个阶段 ETL 和 LOADING。
+    
+    LOAD 的进度范围为:0~100%。
+    
+    ```LOAD 进度 = 当前已完成所有replica导入的tablet个数 / 本次导入任务的总tablet个数 * 100%``` 
+    
+    **如果所有导入表均完成导入,此时 LOAD 的进度为 99%** 导入进入到最后生效阶段,整个导入完成后,LOAD 的进度才会改为 100%。
+    
+    导入进度并不是线性的。所以如果一段时间内进度没有变化,并不代表导入没有在执行。
+    
++ Type
+
+    导入任务的类型。Spark load 为 SPARK。    
+
++ CreateTime/EtlStartTime/EtlFinishTime/LoadStartTime/LoadFinishTime
+
+    这几个值分别代表导入创建的时间,ETL 阶段开始的时间,ETL 阶段完成的时间,LOADING 阶段开始的时间和整个导入任务完成的时间。
+
++ JobDetails
+
+    显示一些作业的详细运行状态,ETL 结束的时候更新。包括导入文件的个数、总大小(字节)、子任务个数、已处理的原始行数等。
+
+    ```{"ScannedRows":139264,"TaskNumber":1,"FileNumber":1,"FileSize":940754064}```

Review comment:
       ok




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] morningman commented on a change in pull request #3418: [Spark load] Add spark etl cluster and cluster manager

Posted by GitBox <gi...@apache.org>.
morningman commented on a change in pull request #3418:
URL: https://github.com/apache/incubator-doris/pull/3418#discussion_r429082195



##########
File path: docs/zh-CN/administrator-guide/load-data/spark-load-manual.md
##########
@@ -0,0 +1,397 @@
+---
+{
+    "title": "Spark Load",
+    "language": "zh-CN"
+}
+---  
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Spark Load
+
+Spark load 通过 Spark 实现对导入数据的预处理,提高 Doris 大数据量的导入性能并且节省 Doris 集群的计算资源。主要用于初次迁移,大数据量导入 Doris 的场景。
+
+Spark load 是一种异步导入方式,用户需要通过 MySQL 协议创建 Spark 类型导入任务,并通过 `SHOW LOAD` 查看导入结果。
+
+
+
+## 适用场景
+
+* 源数据在 Spark 可以访问的存储系统中,如 HDFS。
+* 数据量在 几十 GB 到 TB 级别。
+
+
+
+## 名词解释
+
+1. Frontend(FE):Doris 系统的元数据和调度节点。在导入流程中主要负责导入任务的调度工作。
+2. Backend(BE):Doris 系统的计算和存储节点。在导入流程中主要负责数据写入及存储。
+3. Spark ETL:在导入流程中主要负责数据的 ETL 工作,包括全局字典构建(BITMAP类型)、分区、排序、聚合等。
+4. Broker:Broker 为一个独立的无状态进程。封装了文件系统接口,提供 Doris 读取远端存储系统中文件的能力。
+
+
+## 基本原理
+
+### 基本流程
+
+用户通过 MySQL 客户端提交 Spark 类型导入任务,FE记录元数据并返回用户提交成功。
+
+Spark load 任务的执行主要分为以下5个阶段。
+
+1. FE 调度提交 ETL 任务到 Spark 集群执行。
+2. Spark 集群执行 ETL 完成对导入数据的预处理。包括全局字典构建(BITMAP类型)、分区、排序、聚合等。
+3. ETL 任务完成后,FE 获取预处理过的每个分片的数据路径,并调度相关的 BE 执行 Push 任务。
+4. BE 通过 Broker 读取数据,转化为 Doris 底层存储格式。
+5. FE 调度生效版本,完成导入任务。
+
+```
+                 +
+                 | 0. User create spark load job
+            +----v----+
+            |   FE    |---------------------------------+
+            +----+----+                                 |
+                 | 3. FE send push tasks                |
+                 | 5. FE publish version                |
+    +------------+------------+                         |
+    |            |            |                         |
++---v---+    +---v---+    +---v---+                     |
+|  BE   |    |  BE   |    |  BE   |                     |1. FE submit Spark ETL job
++---^---+    +---^---+    +---^---+                     |
+    |4. BE push with broker   |                         |
++---+---+    +---+---+    +---+---+                     |
+|Broker |    |Broker |    |Broker |                     |
++---^---+    +---^---+    +---^---+                     |
+    |            |            |                         |
++---+------------+------------+---+ 2.ETL +-------------v---------------+
+|               HDFS              +------->       Spark cluster         |
+|                                 <-------+                             |
++---------------------------------+       +-----------------------------+
+
+```
+
+
+
+### 全局字典
+
+待补
+
+
+
+### 数据预处理(DPP)
+
+待补
+
+
+
+## 基本操作
+
+### 配置 ETL 集群
+
+Spark作为一种外部计算资源在Doris中用来完成ETL工作,未来可能还有其他的外部资源会加入到Doris中使用,如Spark/GPU用于查询,HDFS/S3用于外部存储,MapReduce用于ETL等,因此我们引入resource management来管理Doris使用的这些外部资源。
+
+提交 Spark 导入任务之前,需要配置执行 ETL 任务的 Spark 集群。
+
+语法:
+
+```sql
+-- create spark resource
+CREATE EXTERNAL RESOURCE resource_name
+PROPERTIES 
+(                 
+  type = spark,
+  spark_conf_key = spark_conf_value,
+  working_dir = path,
+  broker = broker_name,
+  broker.property_key = property_value
+)
+
+-- drop spark resource
+DROP RESOURCE resource_name
+
+-- show resources
+SHOW RESOURCES
+SHOW PROC "/resources"
+
+-- privileges
+GRANT USAGE_PRIV ON RESOURCE resource_name TO user_identity
+GRANT USAGE_PRIV ON RESOURCE resource_name TO ROLE role_name
+
+REVOKE USAGE_PRIV ON RESOURCE resource_name FROM user_identity
+REVOKE USAGE_PRIV ON RESOURCE resource_name FROM ROLE role_name
+```
+
+#### 创建资源
+
+`resource_name` 为 Doris 中配置的 Spark 资源的名字。
+
+`PROPERTIES` 是 Spark 资源相关参数,如下:
+
+- `type`:资源类型,必填,目前仅支持 spark。
+
+- Spark 相关参数如下:
+  - `spark.master`: 必填,目前支持yarn,spark://host:port。
+  - `spark.submit.deployMode`:  Spark 程序的部署模式,必填,支持 cluster,client 两种。
+  - `spark.hadoop.yarn.resourcemanager.address`: master为yarn时必填。
+  - `spark.hadoop.fs.defaultFS`: master为yarn时必填。
+  - 其他参数为可选,参考http://spark.apache.org/docs/latest/configuration.html 
+- `working_dir`: ETL 使用的目录。spark作为ETL资源使用时必填。例如:hdfs://host:port/tmp/doris。
+- `broker`: broker 名字。spark作为ETL资源使用时必填。需要使用`ALTER SYSTEM ADD BROKER` 命令提前完成配置。
+  - `broker.property_key`: broker读取ETL生成的中间文件时需要指定的认证信息等。
+
+示例:
+
+```sql
+-- yarn cluster 模式 
+CREATE EXTERNAL RESOURCE "spark0"
+PROPERTIES
+(
+  "type" = "spark",
+  "spark.master" = "yarn",
+  "spark.submit.deployMode" = "cluster",
+  "spark.jars" = "xxx.jar,yyy.jar",
+  "spark.files" = "/tmp/aaa,/tmp/bbb",
+  "spark.executor.memory" = "1g",
+  "spark.yarn.queue" = "queue0",
+  "spark.hadoop.yarn.resourcemanager.address" = "127.0.0.1:9999",
+  "spark.hadoop.fs.defaultFS" = "hdfs://127.0.0.1:10000",
+  "working_dir" = "hdfs://127.0.0.1:10000/tmp/doris",
+  "broker" = "broker0",
+  "broker.username" = "user0",
+  "broker.password" = "password0"
+);
+
+-- spark standalone client 模式
+CREATE EXTERNAL RESOURCE "spark1"
+PROPERTIES
+(
+  "type" = "spark", 
+  "spark.master" = "spark://127.0.0.1:7777",
+  "spark.submit.deployMode" = "client",
+  "working_dir" = "hdfs://127.0.0.1:10000/tmp/doris",
+  "broker" = "broker1"
+);
+```
+
+#### 查看资源
+
+普通账户只能看到自己有USAGE_PRIV使用权限的资源。
+
+root和admin账户可以看到所有的资源。
+
+#### 资源权限
+
+资源权限通过GRANT REVOKE来管理,目前仅支持USAGE_PRIV使用权限。
+
+可以将USAGE_PRIV权限赋予某个用户或者某个角色,角色的使用与之前一致。
+```sql
+-- 授予spark0资源的使用权限给用户user0
+GRANT USAGE_PRIV ON RESOURCE "spark0" TO "user0"@"%";
+-- 授予spark0资源的使用权限给角色role0
+GRANT USAGE_PRIV ON RESOURCE "spark0" TO ROLE "role0";
+-- 授予所有资源的使用权限给用户user0
+GRANT USAGE_PRIV ON RESOURCE * TO "user0"@"%";
+-- 授予所有资源的使用权限给角色role0
+GRANT USAGE_PRIV ON RESOURCE * TO ROLE "role0";
+-- 撤销用户user0的spark0资源使用权限
+REVOKE USAGE_PRIV ON RESOURCE "spark0" FROM "user0"@"%";
+```
+
+
+
+### 创建导入
+
+语法:
+
+```sql
+LOAD LABEL load_label 
+    (data_desc, ...)
+    WITH RESOURCE resource_name resource_properties
+    [PROPERTIES (key1=value1, ... )]
+
+* load_label:
+	db_name.label_name
+
+* data_desc:
+    DATA INFILE ('file_path', ...)
+    [NEGATIVE]
+    INTO TABLE tbl_name
+    [PARTITION (p1, p2)]
+    [COLUMNS TERMINATED BY separator ]
+    [(col1, ...)]
+    [SET (k1=f1(xx), k2=f2(xx))]
+    [WHERE predicate]
+
+* resource_properties: 
+    (key2=value2, ...)
+```
+示例:
+
+```sql
+LOAD LABEL db1.label1
+(
+    DATA INFILE("hdfs://abc.com:8888/user/palo/test/ml/file1")
+    INTO TABLE tbl1
+    COLUMNS TERMINATED BY ","
+    (tmp_c1,tmp_c2)
+    SET
+    (
+        id=tmp_c2,
+        name=tmp_c1
+    ),
+    DATA INFILE("hdfs://abc.com:8888/user/palo/test/ml/file2")
+    INTO TABLE tbl2
+    COLUMNS TERMINATED BY ","
+    (col1, col2)
+    where col1 > 1
+)
+WITH RESOURCE 'spark0'
+(
+    "spark.executor.memory" = "2g",
+    "spark.shuffle.compress" = "true"
+)
+PROPERTIES
+(
+    "timeout" = "3600"
+);
+
+```
+
+创建导入的详细语法执行 ```HELP SPARK LOAD``` 查看语法帮助。这里主要介绍 Spark load 的创建导入语法中参数意义和注意事项。
+
+#### Label
+
+导入任务的标识。每个导入任务,都有一个在单 database 内部唯一的 Label。具体规则与 `Broker Load` 一致。
+
+#### 数据描述类参数
+
+目前支持的数据源有CSV和hive table。其他规则与 `Broker Load` 一致。
+
+#### 导入作业参数
+
+导入作业参数主要指的是 Spark load 创建导入语句中的属于 ```opt_properties```部分的参数。导入作业参数是作用于整个导入作业的。规则与 `Broker Load` 一致。
+
+#### Spark资源参数
+
+Spark资源需要提前配置到 Doris系统中并且赋予用户USAGE_PRIV权限后才能使用 Spark load。
+
+当用户有临时性的需求,比如增加任务使用的资源而修改 Spark configs,可以在这里设置,设置仅对本次任务生效,并不影响 Doris 集群中已有的配置。
+
+```sql
+WITH RESOURCE 'spark0'
+(
+  "spark.driver.memory" = "1g",
+  "spark.executor.memory" = "3g"
+)
+```
+
+
+
+### 查看导入
+
+Spark load 导入方式同 Broker load 一样都是异步的,所以用户必须将创建导入的 Label 记录,并且在**查看导入命令中使用 Label 来查看导入结果**。查看导入命令在所有导入方式中是通用的,具体语法可执行 ```HELP SHOW LOAD``` 查看。
+
+示例:
+
+```
+mysql> show load order by createtime desc limit 1\G
+*************************** 1. row ***************************
+         JobId: 76391
+         Label: label1
+         State: FINISHED
+      Progress: ETL:100%; LOAD:100%
+          Type: SPARK
+       EtlInfo: unselected.rows=4; dpp.abnorm.ALL=15; dpp.norm.ALL=28133376
+      TaskInfo: cluster:cluster0; timeout(s):10800; max_filter_ratio:5.0E-5
+      ErrorMsg: N/A
+    CreateTime: 2019-07-27 11:46:42
+  EtlStartTime: 2019-07-27 11:46:44
+ EtlFinishTime: 2019-07-27 11:49:44
+ LoadStartTime: 2019-07-27 11:49:44
+LoadFinishTime: 2019-07-27 11:50:16
+           URL: http://1.1.1.1:8089/proxy/application_1586619723848_0035/
+    JobDetails: {"ScannedRows":28133395,"TaskNumber":1,"FileNumber":1,"FileSize":200000}
+```
+
+返回结果集中参数意义可以参考 Broker load。不同点如下:
+
++ State
+
+    导入任务当前所处的阶段。任务提交之后状态为 PENDING,提交 Spark ETL 之后状态变为 ETL,ETL 完成之后 FE 调度 BE 执行 push 操作状态变为 LOADING,push 完成并且版本生效后状态变为 FINISHED。
+    
+    导入任务的最终阶段有两个:CANCELLED 和 FINISHED,当 Load job 处于这两个阶段时导入完成。其中 CANCELLED 为导入失败,FINISHED 为导入成功。
+    
++ Progress
+
+    导入任务的进度描述。分为两种进度:ETL 和 LOAD,对应了导入流程的两个阶段 ETL 和 LOADING。
+    
+    LOAD 的进度范围为:0~100%。
+    
+    ```LOAD 进度 = 当前已完成所有replica导入的tablet个数 / 本次导入任务的总tablet个数 * 100%``` 
+    
+    **如果所有导入表均完成导入,此时 LOAD 的进度为 99%** 导入进入到最后生效阶段,整个导入完成后,LOAD 的进度才会改为 100%。
+    
+    导入进度并不是线性的。所以如果一段时间内进度没有变化,并不代表导入没有在执行。
+    
++ Type
+
+    导入任务的类型。Spark load 为 SPARK。    
+
++ CreateTime/EtlStartTime/EtlFinishTime/LoadStartTime/LoadFinishTime
+
+    这几个值分别代表导入创建的时间,ETL 阶段开始的时间,ETL 阶段完成的时间,LOADING 阶段开始的时间和整个导入任务完成的时间。
+
++ JobDetails
+
+    显示一些作业的详细运行状态,ETL 结束的时候更新。包括导入文件的个数、总大小(字节)、子任务个数、已处理的原始行数等。
+
+    ```{"ScannedRows":139264,"TaskNumber":1,"FileNumber":1,"FileSize":940754064}```

Review comment:
       Maybe we can add a rpc client in our DPP application, so that we can send some info back to the FE periodically.
   This is just an optimization, can be done later.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] wyb commented on a change in pull request #3418: [Spark load] Add spark etl cluster and cluster manager

Posted by GitBox <gi...@apache.org>.
wyb commented on a change in pull request #3418:
URL: https://github.com/apache/incubator-doris/pull/3418#discussion_r429167261



##########
File path: docs/zh-CN/administrator-guide/resource-management.md
##########
@@ -0,0 +1,125 @@
+---
+{
+    "title": "资源管理",
+    "language": "zh-CN"
+}
+---
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# 资源管理
+
+为了节省Doris集群内的计算、存储资源,Doris需要引入一些其他外部资源来完成相关的工作,如Spark/GPU用于查询,HDFS/S3用于外部存储,Spark/MapReduce用于ETL等,因此我们引入资源管理机制来管理Doris使用的这些外部资源。
+
+
+
+## 基本概念
+
+一个资源包含名字、类型等基本信息,名字为全局唯一,不同类型的资源包含不同的属性,具体参考各资源的介绍。
+
+资源的创建和删除只能由拥有 `admin` 权限的用户进行操作。一个资源隶属于整个Doris集群。拥有 `admin` 权限的用户可以将使用权限`usage_priv` 赋给普通用户。可参考`HELP GRANT`或者权限文档。
+
+
+
+## 具体操作
+
+资源管理主要有三个命令:`CREATE RESOURCE`,`DROP RESOURCE` 和 `SHOW RESOURCES`,分别为创建、删除和查看资源。这三个命令的具体语法可以通过MySQL客户端连接到 Doris 后,执行 `HELP cmd` 的方式查看帮助。
+
+1. CREATE RESOURCE
+
+   语法
+
+   ```sql
+   CREATE [EXTERNAL] RESOURCE "resource_name"                                  
+     PROPERTIES ("key"="value", ...); 
+   ```
+
+   在创建资源的命令中,用户必须提供以下信息:
+
+   * `resource_name` 为 Doris 中配置的资源的名字。
+   * `PROPERTIES` 是资源相关参数,如下:
+     * `type`:资源类型,必填,目前仅支持 spark。
+     * 其他参数见各资源介绍。
+
+2. DROP RESOURCE
+
+   该命令可以删除一个已存在的资源。具体操作见:`HELP DROP RESOURCE`
+
+3. SHOW RESOURCES
+
+   该命令可以查看用户有使用权限的资源。具体操作见:`HELP SHOW RESOURCES`
+
+
+
+## 支持的资源
+
+目前仅支持Spark资源,完成ETL工作。下面的示例都以Spark资源为例。
+
+### Spark
+
+#### 参数
+
+##### Spark 相关参数如下:
+
+`spark.master`: 必填,目前支持yarn,spark://host:port。
+
+`spark.submit.deployMode`: Spark 程序的部署模式,必填,支持 cluster,client 两种。
+
+`spark.hadoop.yarn.resourcemanager.address`: master为yarn时必填。
+
+`spark.hadoop.fs.defaultFS`: master为yarn时必填。
+
+其他参数为可选,参考http://spark.apache.org/docs/latest/configuration.html。
+
+
+
+##### 如果Spark用于ETL,还需要指定以下参数:
+
+`working_dir`: ETL 使用的目录。spark作为ETL资源使用时必填。例如:hdfs://host:port/tmp/doris。
+
+`broker`: broker 名字。spark作为ETL资源使用时必填。需要使用`ALTER SYSTEM ADD BROKER` 命令提前完成配置。 
+
+  * `broker.property_key`: broker读取ETL生成的中间文件时需要指定的认证信息等。
+
+
+
+#### 示例
+
+创建 yarn cluster 模式,名为 spark0 的 Spark 资源。
+
+```sql
+CREATE EXTERNAL RESOURCE "spark0"
+PROPERTIES
+(
+  "type" = "spark",
+  "spark.master" = "yarn",
+  "spark.submit.deployMode" = "cluster",
+  "spark.jars" = "xxx.jar,yyy.jar",
+  "spark.files" = "/tmp/aaa,/tmp/bbb",
+  "spark.executor.memory" = "1g",
+  "spark.yarn.queue" = "queue0",
+  "spark.hadoop.yarn.resourcemanager.address" = "127.0.0.1:9999",
+  "spark.hadoop.fs.defaultFS" = "hdfs://127.0.0.1:10000",
+  "working_dir" = "hdfs://127.0.0.1:10000/tmp/doris",

Review comment:
       In yarn cluster deploy mode, "spark.hadoop.fs.defaultFS" is used in spark etl job for storing hdfs://host:port/user/xxx/.sparkStaging/appid/__spark_libs__xxx.zip and hdfs://host:port/user/xxx/.sparkStaging/appid/__spark_conf__.zip files




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] imay commented on pull request #3418: [Spark load] Add spark etl cluster and cluster manager

Posted by GitBox <gi...@apache.org>.
imay commented on pull request #3418:
URL: https://github.com/apache/incubator-doris/pull/3418#issuecomment-621672774


   > ```sql
   > SHOW PROC "/load_etl_clusters"
   > ```
   
   why `load_etl_clusters`? seems `load_clusters` is OK.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] wyb commented on a change in pull request #3418: [Spark load] Add spark etl cluster and cluster manager

Posted by GitBox <gi...@apache.org>.
wyb commented on a change in pull request #3418:
URL: https://github.com/apache/incubator-doris/pull/3418#discussion_r429127459



##########
File path: fe/src/main/java/org/apache/doris/analysis/ResourcePattern.java
##########
@@ -0,0 +1,118 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+package org.apache.doris.analysis;
+
+import org.apache.doris.common.AnalysisException;
+import org.apache.doris.common.FeNameFormat;
+import org.apache.doris.common.io.Text;
+import org.apache.doris.common.io.Writable;
+import org.apache.doris.mysql.privilege.PaloAuth.PrivLevel;
+
+import com.google.common.base.Preconditions;
+import com.google.common.base.Strings;
+
+import java.io.DataInput;
+import java.io.DataOutput;
+import java.io.IOException;
+
+// only the following 2 formats are allowed
+// *
+// resource
+public class ResourcePattern implements Writable {
+    private String resourceName;
+    boolean isAnalyzed = false;

Review comment:
       removed




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] wyb commented on a change in pull request #3418: [Spark load] Add spark etl cluster and cluster manager

Posted by GitBox <gi...@apache.org>.
wyb commented on a change in pull request #3418:
URL: https://github.com/apache/incubator-doris/pull/3418#discussion_r429501802



##########
File path: docs/zh-CN/administrator-guide/resource-management.md
##########
@@ -0,0 +1,125 @@
+---
+{
+    "title": "资源管理",
+    "language": "zh-CN"
+}
+---
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# 资源管理
+
+为了节省Doris集群内的计算、存储资源,Doris需要引入一些其他外部资源来完成相关的工作,如Spark/GPU用于查询,HDFS/S3用于外部存储,Spark/MapReduce用于ETL等,因此我们引入资源管理机制来管理Doris使用的这些外部资源。
+
+
+
+## 基本概念
+
+一个资源包含名字、类型等基本信息,名字为全局唯一,不同类型的资源包含不同的属性,具体参考各资源的介绍。
+
+资源的创建和删除只能由拥有 `admin` 权限的用户进行操作。一个资源隶属于整个Doris集群。拥有 `admin` 权限的用户可以将使用权限`usage_priv` 赋给普通用户。可参考`HELP GRANT`或者权限文档。
+
+
+
+## 具体操作
+
+资源管理主要有三个命令:`CREATE RESOURCE`,`DROP RESOURCE` 和 `SHOW RESOURCES`,分别为创建、删除和查看资源。这三个命令的具体语法可以通过MySQL客户端连接到 Doris 后,执行 `HELP cmd` 的方式查看帮助。
+
+1. CREATE RESOURCE
+
+   语法
+
+   ```sql
+   CREATE [EXTERNAL] RESOURCE "resource_name"                                  
+     PROPERTIES ("key"="value", ...); 
+   ```
+
+   在创建资源的命令中,用户必须提供以下信息:
+
+   * `resource_name` 为 Doris 中配置的资源的名字。
+   * `PROPERTIES` 是资源相关参数,如下:
+     * `type`:资源类型,必填,目前仅支持 spark。
+     * 其他参数见各资源介绍。
+
+2. DROP RESOURCE
+
+   该命令可以删除一个已存在的资源。具体操作见:`HELP DROP RESOURCE`
+
+3. SHOW RESOURCES
+
+   该命令可以查看用户有使用权限的资源。具体操作见:`HELP SHOW RESOURCES`
+
+
+
+## 支持的资源
+
+目前仅支持Spark资源,完成ETL工作。下面的示例都以Spark资源为例。
+
+### Spark
+
+#### 参数
+
+##### Spark 相关参数如下:
+
+`spark.master`: 必填,目前支持yarn,spark://host:port。
+
+`spark.submit.deployMode`: Spark 程序的部署模式,必填,支持 cluster,client 两种。
+
+`spark.hadoop.yarn.resourcemanager.address`: master为yarn时必填。
+
+`spark.hadoop.fs.defaultFS`: master为yarn时必填。
+
+其他参数为可选,参考http://spark.apache.org/docs/latest/configuration.html。
+
+
+
+##### 如果Spark用于ETL,还需要指定以下参数:
+
+`working_dir`: ETL 使用的目录。spark作为ETL资源使用时必填。例如:hdfs://host:port/tmp/doris。
+
+`broker`: broker 名字。spark作为ETL资源使用时必填。需要使用`ALTER SYSTEM ADD BROKER` 命令提前完成配置。 
+
+  * `broker.property_key`: broker读取ETL生成的中间文件时需要指定的认证信息等。
+
+
+
+#### 示例
+
+创建 yarn cluster 模式,名为 spark0 的 Spark 资源。
+
+```sql
+CREATE EXTERNAL RESOURCE "spark0"
+PROPERTIES
+(
+  "type" = "spark",
+  "spark.master" = "yarn",
+  "spark.submit.deployMode" = "cluster",
+  "spark.jars" = "xxx.jar,yyy.jar",
+  "spark.files" = "/tmp/aaa,/tmp/bbb",
+  "spark.executor.memory" = "1g",
+  "spark.yarn.queue" = "queue0",
+  "spark.hadoop.yarn.resourcemanager.address" = "127.0.0.1:9999",
+  "spark.hadoop.fs.defaultFS" = "hdfs://127.0.0.1:10000",
+  "working_dir" = "hdfs://127.0.0.1:10000/tmp/doris",

Review comment:
       Not all users have only one cluster, so i think we can't load configurations from single `HADOOP_HOME` source. 
   Now users need to specify `defaultFS` only one time when creating a new spark resource.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] imay commented on a change in pull request #3418: [Spark load] Add spark etl cluster and cluster manager

Posted by GitBox <gi...@apache.org>.
imay commented on a change in pull request #3418:
URL: https://github.com/apache/incubator-doris/pull/3418#discussion_r429012389



##########
File path: fe/src/main/java/org/apache/doris/analysis/ResourcePattern.java
##########
@@ -0,0 +1,118 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+package org.apache.doris.analysis;
+
+import org.apache.doris.common.AnalysisException;
+import org.apache.doris.common.FeNameFormat;
+import org.apache.doris.common.io.Text;
+import org.apache.doris.common.io.Writable;
+import org.apache.doris.mysql.privilege.PaloAuth.PrivLevel;
+
+import com.google.common.base.Preconditions;
+import com.google.common.base.Strings;
+
+import java.io.DataInput;
+import java.io.DataOutput;
+import java.io.IOException;
+
+// only the following 2 formats are allowed
+// *
+// resource
+public class ResourcePattern implements Writable {
+    private String resourceName;
+    boolean isAnalyzed = false;
+
+    public static ResourcePattern ALL;
+    static {
+        ALL = new ResourcePattern("*");
+        try {
+            ALL.analyze();
+        } catch (AnalysisException e) {
+            // will not happen
+        }
+    }
+
+    private ResourcePattern() {
+    }
+
+    public ResourcePattern(String resourceName) {
+        this.resourceName = Strings.isNullOrEmpty(resourceName) ? "*" : resourceName;
+    }
+
+    public String getResourceName() {
+        Preconditions.checkState(isAnalyzed);
+        return resourceName;
+    }
+
+    public PrivLevel getPrivLevel() {
+        Preconditions.checkState(isAnalyzed);
+        if (resourceName.equals("*")) {
+            return PrivLevel.GLOBAL;
+        } else {
+            return PrivLevel.RESOURCE;
+        }
+    }
+
+    public void analyze() throws AnalysisException {
+        if (isAnalyzed) {
+            return;
+        }
+        if (!resourceName.equals("*")) {
+            FeNameFormat.checkResourceName(resourceName);
+        }
+        isAnalyzed = true;
+    }
+
+    public static ResourcePattern read(DataInput in) throws IOException {
+        ResourcePattern resourcePattern = new ResourcePattern();
+        resourcePattern.readFields(in);
+        return resourcePattern;
+    }
+
+    @Override
+    public boolean equals(Object obj) {
+        if (!(obj instanceof ResourcePattern)) {
+            return false;
+        }
+        ResourcePattern other = (ResourcePattern) obj;
+        return resourceName.equals(other.getResourceName());
+    }
+
+    @Override
+    public int hashCode() {
+        int result = 17;
+        result = 31 * result + resourceName.hashCode();
+        return result;
+    }
+
+    @Override
+    public String toString() {
+        return resourceName;
+    }
+
+    @Override
+    public void write(DataOutput out) throws IOException {
+        Preconditions.checkState(isAnalyzed);
+        Text.writeString(out, resourceName);

Review comment:
       Please serialize in a json format, you can refer to other class usage.

##########
File path: fe/src/main/java/org/apache/doris/mysql/privilege/PaloRole.java
##########
@@ -129,6 +163,16 @@ public void readFields(DataInput in) throws IOException {
             PrivBitSet privs = PrivBitSet.read(in);
             tblPatternToPrivs.put(tblPattern, privs);
         }
+        /*

Review comment:
       Better to add "TODO(wyb): spark-load" to find it easily

##########
File path: fe/src/main/java/org/apache/doris/analysis/ResourcePattern.java
##########
@@ -0,0 +1,118 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+package org.apache.doris.analysis;
+
+import org.apache.doris.common.AnalysisException;
+import org.apache.doris.common.FeNameFormat;
+import org.apache.doris.common.io.Text;
+import org.apache.doris.common.io.Writable;
+import org.apache.doris.mysql.privilege.PaloAuth.PrivLevel;
+
+import com.google.common.base.Preconditions;
+import com.google.common.base.Strings;
+
+import java.io.DataInput;
+import java.io.DataOutput;
+import java.io.IOException;
+
+// only the following 2 formats are allowed
+// *
+// resource
+public class ResourcePattern implements Writable {
+    private String resourceName;
+    boolean isAnalyzed = false;

Review comment:
       Is this isAnalyzed needed?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] morningman commented on a change in pull request #3418: [Spark load] Add spark etl cluster and cluster manager

Posted by GitBox <gi...@apache.org>.
morningman commented on a change in pull request #3418:
URL: https://github.com/apache/incubator-doris/pull/3418#discussion_r429083102



##########
File path: fe/src/main/java/org/apache/doris/mysql/privilege/PaloPrivilege.java
##########
@@ -25,7 +25,8 @@
     LOAD_PRIV("Load_priv", 4, "Privilege for loading data into tables"),
     ALTER_PRIV("Alter_priv", 5, "Privilege for alter database or table"),
     CREATE_PRIV("Create_priv", 6, "Privilege for createing database or table"),
-    DROP_PRIV("Drop_priv", 7, "Privilege for dropping database or table");
+    DROP_PRIV("Drop_priv", 7, "Privilege for dropping database or table"),
+    USAGE_PRIV("Usage_priv", 8, "Privilege for use resource");

Review comment:
       ok




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] morningman commented on a change in pull request #3418: [Spark load] Add spark etl cluster and cluster manager

Posted by GitBox <gi...@apache.org>.
morningman commented on a change in pull request #3418:
URL: https://github.com/apache/incubator-doris/pull/3418#discussion_r427356073



##########
File path: docs/zh-CN/sql-reference/sql-statements/Account Management/GRANT.md
##########
@@ -33,6 +33,8 @@ Syntax:
 
     GRANT privilege_list ON db_name[.tbl_name] TO user_identity [ROLE role_name]
 
+    GRANT privilege_list ON resource_name TO user_identity [ROLE role_name]

Review comment:
       ```suggestion
       GRANT privilege_list ON RESOURCE resource_name TO user_identity [ROLE role_name]
   ```

##########
File path: docs/zh-CN/administrator-guide/load-data/spark-load-manual.md
##########
@@ -0,0 +1,397 @@
+---
+{
+    "title": "Spark Load",
+    "language": "zh-CN"
+}
+---  
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Spark Load
+
+Spark load 通过 Spark 实现对导入数据的预处理,提高 Doris 大数据量的导入性能并且节省 Doris 集群的计算资源。主要用于初次迁移,大数据量导入 Doris 的场景。
+
+Spark load 是一种异步导入方式,用户需要通过 MySQL 协议创建 Spark 类型导入任务,并通过 `SHOW LOAD` 查看导入结果。
+
+
+
+## 适用场景
+
+* 源数据在 Spark 可以访问的存储系统中,如 HDFS。
+* 数据量在 几十 GB 到 TB 级别。
+
+
+
+## 名词解释
+
+1. Frontend(FE):Doris 系统的元数据和调度节点。在导入流程中主要负责导入任务的调度工作。
+2. Backend(BE):Doris 系统的计算和存储节点。在导入流程中主要负责数据写入及存储。
+3. Spark ETL:在导入流程中主要负责数据的 ETL 工作,包括全局字典构建(BITMAP类型)、分区、排序、聚合等。
+4. Broker:Broker 为一个独立的无状态进程。封装了文件系统接口,提供 Doris 读取远端存储系统中文件的能力。
+
+
+## 基本原理
+
+### 基本流程
+
+用户通过 MySQL 客户端提交 Spark 类型导入任务,FE记录元数据并返回用户提交成功。
+
+Spark load 任务的执行主要分为以下5个阶段。
+
+1. FE 调度提交 ETL 任务到 Spark 集群执行。
+2. Spark 集群执行 ETL 完成对导入数据的预处理。包括全局字典构建(BITMAP类型)、分区、排序、聚合等。
+3. ETL 任务完成后,FE 获取预处理过的每个分片的数据路径,并调度相关的 BE 执行 Push 任务。
+4. BE 通过 Broker 读取数据,转化为 Doris 底层存储格式。
+5. FE 调度生效版本,完成导入任务。
+
+```
+                 +
+                 | 0. User create spark load job
+            +----v----+
+            |   FE    |---------------------------------+
+            +----+----+                                 |
+                 | 3. FE send push tasks                |
+                 | 5. FE publish version                |
+    +------------+------------+                         |
+    |            |            |                         |
++---v---+    +---v---+    +---v---+                     |
+|  BE   |    |  BE   |    |  BE   |                     |1. FE submit Spark ETL job
++---^---+    +---^---+    +---^---+                     |
+    |4. BE push with broker   |                         |
++---+---+    +---+---+    +---+---+                     |
+|Broker |    |Broker |    |Broker |                     |
++---^---+    +---^---+    +---^---+                     |
+    |            |            |                         |
++---+------------+------------+---+ 2.ETL +-------------v---------------+
+|               HDFS              +------->       Spark cluster         |
+|                                 <-------+                             |
++---------------------------------+       +-----------------------------+
+
+```
+
+
+
+### 全局字典
+
+待补
+
+
+
+### 数据预处理(DPP)
+
+待补
+
+
+
+## 基本操作
+
+### 配置 ETL 集群
+
+Spark作为一种外部计算资源在Doris中用来完成ETL工作,未来可能还有其他的外部资源会加入到Doris中使用,如Spark/GPU用于查询,HDFS/S3用于外部存储,MapReduce用于ETL等,因此我们引入resource management来管理Doris使用的这些外部资源。
+
+提交 Spark 导入任务之前,需要配置执行 ETL 任务的 Spark 集群。
+
+语法:
+
+```sql
+-- create spark resource
+CREATE EXTERNAL RESOURCE resource_name
+PROPERTIES 
+(                 
+  type = spark,
+  spark_conf_key = spark_conf_value,
+  working_dir = path,
+  broker = broker_name,
+  broker.property_key = property_value
+)
+
+-- drop spark resource
+DROP RESOURCE resource_name
+
+-- show resources
+SHOW RESOURCES
+SHOW PROC "/resources"
+
+-- privileges
+GRANT USAGE_PRIV ON RESOURCE resource_name TO user_identity
+GRANT USAGE_PRIV ON RESOURCE resource_name TO ROLE role_name
+
+REVOKE USAGE_PRIV ON RESOURCE resource_name FROM user_identity
+REVOKE USAGE_PRIV ON RESOURCE resource_name FROM ROLE role_name
+```
+
+#### 创建资源
+
+`resource_name` 为 Doris 中配置的 Spark 资源的名字。
+
+`PROPERTIES` 是 Spark 资源相关参数,如下:
+
+- `type`:资源类型,必填,目前仅支持 spark。
+
+- Spark 相关参数如下:
+  - `spark.master`: 必填,目前支持yarn,spark://host:port。
+  - `spark.submit.deployMode`:  Spark 程序的部署模式,必填,支持 cluster,client 两种。
+  - `spark.hadoop.yarn.resourcemanager.address`: master为yarn时必填。
+  - `spark.hadoop.fs.defaultFS`: master为yarn时必填。
+  - 其他参数为可选,参考http://spark.apache.org/docs/latest/configuration.html 
+- `working_dir`: ETL 使用的目录。spark作为ETL资源使用时必填。例如:hdfs://host:port/tmp/doris。

Review comment:
       How about "spark.working_dir"?

##########
File path: fe/src/main/java/org/apache/doris/catalog/Resource.java
##########
@@ -0,0 +1,110 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+package org.apache.doris.catalog;
+
+import org.apache.doris.analysis.CreateResourceStmt;
+import org.apache.doris.common.DdlException;
+import org.apache.doris.common.io.Text;
+import org.apache.doris.common.io.Writable;
+import org.apache.doris.common.proc.BaseProcResult;
+import org.apache.doris.persist.gson.GsonUtils;
+
+import com.google.gson.annotations.SerializedName;
+
+import java.io.DataInput;
+import java.io.DataOutput;
+import java.io.IOException;
+import java.util.Map;
+
+public abstract class Resource implements Writable {
+    public enum ResourceType {
+        UNKNOWN,
+        SPARK;
+
+        public static ResourceType fromString(String resourceType) {

Review comment:
       Enum class has a method `valueOf(String)`, which is same as this `fromString()`

##########
File path: docs/zh-CN/sql-reference/sql-statements/Account Management/REVOKE.md
##########
@@ -43,6 +45,10 @@ under the License.
    
         REVOKE SELECT_PRIV ON db1.* FROM 'jack'@'192.%';
 
+    1. 撤销用户 jack 资源 spark_resource 的使用权限
+
+        REVOKE USAGE_RPIV ON 'spark_resource' FROM 'jack'@'192.%';

Review comment:
       ```suggestion
           REVOKE USAGE_RPIV ON RESOURCE 'spark_resource' FROM 'jack'@'192.%';
   ```

##########
File path: docs/zh-CN/administrator-guide/load-data/spark-load-manual.md
##########
@@ -0,0 +1,397 @@
+---
+{
+    "title": "Spark Load",
+    "language": "zh-CN"
+}
+---  
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Spark Load
+
+Spark load 通过 Spark 实现对导入数据的预处理,提高 Doris 大数据量的导入性能并且节省 Doris 集群的计算资源。主要用于初次迁移,大数据量导入 Doris 的场景。
+
+Spark load 是一种异步导入方式,用户需要通过 MySQL 协议创建 Spark 类型导入任务,并通过 `SHOW LOAD` 查看导入结果。
+
+
+
+## 适用场景
+
+* 源数据在 Spark 可以访问的存储系统中,如 HDFS。
+* 数据量在 几十 GB 到 TB 级别。
+
+
+
+## 名词解释
+
+1. Frontend(FE):Doris 系统的元数据和调度节点。在导入流程中主要负责导入任务的调度工作。
+2. Backend(BE):Doris 系统的计算和存储节点。在导入流程中主要负责数据写入及存储。
+3. Spark ETL:在导入流程中主要负责数据的 ETL 工作,包括全局字典构建(BITMAP类型)、分区、排序、聚合等。
+4. Broker:Broker 为一个独立的无状态进程。封装了文件系统接口,提供 Doris 读取远端存储系统中文件的能力。
+
+
+## 基本原理
+
+### 基本流程
+
+用户通过 MySQL 客户端提交 Spark 类型导入任务,FE记录元数据并返回用户提交成功。
+
+Spark load 任务的执行主要分为以下5个阶段。
+
+1. FE 调度提交 ETL 任务到 Spark 集群执行。
+2. Spark 集群执行 ETL 完成对导入数据的预处理。包括全局字典构建(BITMAP类型)、分区、排序、聚合等。
+3. ETL 任务完成后,FE 获取预处理过的每个分片的数据路径,并调度相关的 BE 执行 Push 任务。
+4. BE 通过 Broker 读取数据,转化为 Doris 底层存储格式。
+5. FE 调度生效版本,完成导入任务。
+
+```
+                 +
+                 | 0. User create spark load job
+            +----v----+
+            |   FE    |---------------------------------+
+            +----+----+                                 |
+                 | 3. FE send push tasks                |
+                 | 5. FE publish version                |
+    +------------+------------+                         |
+    |            |            |                         |
++---v---+    +---v---+    +---v---+                     |
+|  BE   |    |  BE   |    |  BE   |                     |1. FE submit Spark ETL job
++---^---+    +---^---+    +---^---+                     |
+    |4. BE push with broker   |                         |
++---+---+    +---+---+    +---+---+                     |
+|Broker |    |Broker |    |Broker |                     |
++---^---+    +---^---+    +---^---+                     |
+    |            |            |                         |
++---+------------+------------+---+ 2.ETL +-------------v---------------+
+|               HDFS              +------->       Spark cluster         |
+|                                 <-------+                             |
++---------------------------------+       +-----------------------------+
+
+```
+
+
+
+### 全局字典
+
+待补
+
+
+
+### 数据预处理(DPP)
+
+待补
+
+
+
+## 基本操作
+
+### 配置 ETL 集群
+
+Spark作为一种外部计算资源在Doris中用来完成ETL工作,未来可能还有其他的外部资源会加入到Doris中使用,如Spark/GPU用于查询,HDFS/S3用于外部存储,MapReduce用于ETL等,因此我们引入resource management来管理Doris使用的这些外部资源。
+
+提交 Spark 导入任务之前,需要配置执行 ETL 任务的 Spark 集群。
+
+语法:
+
+```sql
+-- create spark resource
+CREATE EXTERNAL RESOURCE resource_name
+PROPERTIES 
+(                 
+  type = spark,
+  spark_conf_key = spark_conf_value,
+  working_dir = path,
+  broker = broker_name,
+  broker.property_key = property_value
+)
+
+-- drop spark resource
+DROP RESOURCE resource_name
+
+-- show resources
+SHOW RESOURCES
+SHOW PROC "/resources"
+
+-- privileges
+GRANT USAGE_PRIV ON RESOURCE resource_name TO user_identity
+GRANT USAGE_PRIV ON RESOURCE resource_name TO ROLE role_name
+
+REVOKE USAGE_PRIV ON RESOURCE resource_name FROM user_identity
+REVOKE USAGE_PRIV ON RESOURCE resource_name FROM ROLE role_name
+```
+
+#### 创建资源
+
+`resource_name` 为 Doris 中配置的 Spark 资源的名字。
+
+`PROPERTIES` 是 Spark 资源相关参数,如下:
+
+- `type`:资源类型,必填,目前仅支持 spark。
+
+- Spark 相关参数如下:
+  - `spark.master`: 必填,目前支持yarn,spark://host:port。
+  - `spark.submit.deployMode`:  Spark 程序的部署模式,必填,支持 cluster,client 两种。
+  - `spark.hadoop.yarn.resourcemanager.address`: master为yarn时必填。
+  - `spark.hadoop.fs.defaultFS`: master为yarn时必填。
+  - 其他参数为可选,参考http://spark.apache.org/docs/latest/configuration.html 
+- `working_dir`: ETL 使用的目录。spark作为ETL资源使用时必填。例如:hdfs://host:port/tmp/doris。
+- `broker`: broker 名字。spark作为ETL资源使用时必填。需要使用`ALTER SYSTEM ADD BROKER` 命令提前完成配置。
+  - `broker.property_key`: broker读取ETL生成的中间文件时需要指定的认证信息等。
+
+示例:
+
+```sql
+-- yarn cluster 模式 
+CREATE EXTERNAL RESOURCE "spark0"
+PROPERTIES
+(
+  "type" = "spark",
+  "spark.master" = "yarn",
+  "spark.submit.deployMode" = "cluster",
+  "spark.jars" = "xxx.jar,yyy.jar",
+  "spark.files" = "/tmp/aaa,/tmp/bbb",
+  "spark.executor.memory" = "1g",
+  "spark.yarn.queue" = "queue0",
+  "spark.hadoop.yarn.resourcemanager.address" = "127.0.0.1:9999",
+  "spark.hadoop.fs.defaultFS" = "hdfs://127.0.0.1:10000",
+  "working_dir" = "hdfs://127.0.0.1:10000/tmp/doris",
+  "broker" = "broker0",
+  "broker.username" = "user0",
+  "broker.password" = "password0"
+);
+
+-- spark standalone client 模式
+CREATE EXTERNAL RESOURCE "spark1"
+PROPERTIES
+(
+  "type" = "spark", 
+  "spark.master" = "spark://127.0.0.1:7777",
+  "spark.submit.deployMode" = "client",
+  "working_dir" = "hdfs://127.0.0.1:10000/tmp/doris",
+  "broker" = "broker1"
+);
+```
+
+#### 查看资源
+
+普通账户只能看到自己有USAGE_PRIV使用权限的资源。
+
+root和admin账户可以看到所有的资源。
+
+#### 资源权限
+
+资源权限通过GRANT REVOKE来管理,目前仅支持USAGE_PRIV使用权限。
+
+可以将USAGE_PRIV权限赋予某个用户或者某个角色,角色的使用与之前一致。
+```sql
+-- 授予spark0资源的使用权限给用户user0
+GRANT USAGE_PRIV ON RESOURCE "spark0" TO "user0"@"%";
+-- 授予spark0资源的使用权限给角色role0
+GRANT USAGE_PRIV ON RESOURCE "spark0" TO ROLE "role0";
+-- 授予所有资源的使用权限给用户user0
+GRANT USAGE_PRIV ON RESOURCE * TO "user0"@"%";
+-- 授予所有资源的使用权限给角色role0
+GRANT USAGE_PRIV ON RESOURCE * TO ROLE "role0";
+-- 撤销用户user0的spark0资源使用权限
+REVOKE USAGE_PRIV ON RESOURCE "spark0" FROM "user0"@"%";
+```
+
+
+
+### 创建导入
+
+语法:
+
+```sql
+LOAD LABEL load_label 
+    (data_desc, ...)
+    WITH RESOURCE resource_name resource_properties
+    [PROPERTIES (key1=value1, ... )]
+
+* load_label:
+	db_name.label_name
+
+* data_desc:
+    DATA INFILE ('file_path', ...)
+    [NEGATIVE]
+    INTO TABLE tbl_name
+    [PARTITION (p1, p2)]
+    [COLUMNS TERMINATED BY separator ]
+    [(col1, ...)]
+    [SET (k1=f1(xx), k2=f2(xx))]
+    [WHERE predicate]
+
+* resource_properties: 
+    (key2=value2, ...)
+```
+示例:
+
+```sql
+LOAD LABEL db1.label1
+(
+    DATA INFILE("hdfs://abc.com:8888/user/palo/test/ml/file1")
+    INTO TABLE tbl1
+    COLUMNS TERMINATED BY ","
+    (tmp_c1,tmp_c2)
+    SET
+    (
+        id=tmp_c2,
+        name=tmp_c1
+    ),
+    DATA INFILE("hdfs://abc.com:8888/user/palo/test/ml/file2")
+    INTO TABLE tbl2
+    COLUMNS TERMINATED BY ","
+    (col1, col2)
+    where col1 > 1
+)
+WITH RESOURCE 'spark0'
+(
+    "spark.executor.memory" = "2g",
+    "spark.shuffle.compress" = "true"
+)
+PROPERTIES
+(
+    "timeout" = "3600"
+);
+
+```
+
+创建导入的详细语法执行 ```HELP SPARK LOAD``` 查看语法帮助。这里主要介绍 Spark load 的创建导入语法中参数意义和注意事项。
+
+#### Label
+
+导入任务的标识。每个导入任务,都有一个在单 database 内部唯一的 Label。具体规则与 `Broker Load` 一致。
+
+#### 数据描述类参数
+
+目前支持的数据源有CSV和hive table。其他规则与 `Broker Load` 一致。
+
+#### 导入作业参数
+
+导入作业参数主要指的是 Spark load 创建导入语句中的属于 ```opt_properties```部分的参数。导入作业参数是作用于整个导入作业的。规则与 `Broker Load` 一致。
+
+#### Spark资源参数
+
+Spark资源需要提前配置到 Doris系统中并且赋予用户USAGE_PRIV权限后才能使用 Spark load。
+
+当用户有临时性的需求,比如增加任务使用的资源而修改 Spark configs,可以在这里设置,设置仅对本次任务生效,并不影响 Doris 集群中已有的配置。
+
+```sql
+WITH RESOURCE 'spark0'
+(
+  "spark.driver.memory" = "1g",
+  "spark.executor.memory" = "3g"
+)
+```
+
+
+
+### 查看导入
+
+Spark load 导入方式同 Broker load 一样都是异步的,所以用户必须将创建导入的 Label 记录,并且在**查看导入命令中使用 Label 来查看导入结果**。查看导入命令在所有导入方式中是通用的,具体语法可执行 ```HELP SHOW LOAD``` 查看。
+
+示例:
+
+```
+mysql> show load order by createtime desc limit 1\G
+*************************** 1. row ***************************
+         JobId: 76391
+         Label: label1
+         State: FINISHED
+      Progress: ETL:100%; LOAD:100%
+          Type: SPARK
+       EtlInfo: unselected.rows=4; dpp.abnorm.ALL=15; dpp.norm.ALL=28133376
+      TaskInfo: cluster:cluster0; timeout(s):10800; max_filter_ratio:5.0E-5
+      ErrorMsg: N/A
+    CreateTime: 2019-07-27 11:46:42
+  EtlStartTime: 2019-07-27 11:46:44
+ EtlFinishTime: 2019-07-27 11:49:44
+ LoadStartTime: 2019-07-27 11:49:44
+LoadFinishTime: 2019-07-27 11:50:16
+           URL: http://1.1.1.1:8089/proxy/application_1586619723848_0035/
+    JobDetails: {"ScannedRows":28133395,"TaskNumber":1,"FileNumber":1,"FileSize":200000}
+```
+
+返回结果集中参数意义可以参考 Broker load。不同点如下:
+
++ State
+
+    导入任务当前所处的阶段。任务提交之后状态为 PENDING,提交 Spark ETL 之后状态变为 ETL,ETL 完成之后 FE 调度 BE 执行 push 操作状态变为 LOADING,push 完成并且版本生效后状态变为 FINISHED。
+    
+    导入任务的最终阶段有两个:CANCELLED 和 FINISHED,当 Load job 处于这两个阶段时导入完成。其中 CANCELLED 为导入失败,FINISHED 为导入成功。
+    
++ Progress
+
+    导入任务的进度描述。分为两种进度:ETL 和 LOAD,对应了导入流程的两个阶段 ETL 和 LOADING。
+    
+    LOAD 的进度范围为:0~100%。
+    
+    ```LOAD 进度 = 当前已完成所有replica导入的tablet个数 / 本次导入任务的总tablet个数 * 100%``` 
+    
+    **如果所有导入表均完成导入,此时 LOAD 的进度为 99%** 导入进入到最后生效阶段,整个导入完成后,LOAD 的进度才会改为 100%。
+    
+    导入进度并不是线性的。所以如果一段时间内进度没有变化,并不代表导入没有在执行。
+    
++ Type
+
+    导入任务的类型。Spark load 为 SPARK。    
+
++ CreateTime/EtlStartTime/EtlFinishTime/LoadStartTime/LoadFinishTime
+
+    这几个值分别代表导入创建的时间,ETL 阶段开始的时间,ETL 阶段完成的时间,LOADING 阶段开始的时间和整个导入任务完成的时间。
+
++ JobDetails
+
+    显示一些作业的详细运行状态,ETL 结束的时候更新。包括导入文件的个数、总大小(字节)、子任务个数、已处理的原始行数等。
+
+    ```{"ScannedRows":139264,"TaskNumber":1,"FileNumber":1,"FileSize":940754064}```
+
+### 取消导入
+
+当 Spark load 作业状态不为 CANCELLED 或 FINISHED 时,可以被用户手动取消。取消时需要指定待取消导入任务的 Label 。取消导入命令语法可执行 ```HELP CANCEL LOAD```查看。
+
+
+
+## 相关系统配置
+
+### FE 配置
+
+下面配置属于 Spark load 的系统级别配置,也就是作用于所有 Spark load 导入任务的配置。主要通过修改 ``` fe.conf```来调整配置值。
+
++ spark_load_default_timeout_second
+  
+    任务默认超时时间为259200秒(3天)。
+    
+    
+
+## 最佳实践
+
+### 应用场景
+
+使用 Spark load 最适合的场景就是原始数据在文件系统(HDFS)中,数据量在 几十 GB 到 TB 级别。小数据量还是建议使用 Stream load 或者 Broker load。
+
+
+
+## 常见问题
+
+* 使用Spark load时需要在FE机器设置SPARK_HOME及HADOOP_CONF_DIR环境变量。

Review comment:
       这个最好有具体说明

##########
File path: docs/zh-CN/sql-reference/sql-statements/Account Management/REVOKE.md
##########
@@ -30,6 +30,8 @@ under the License.
     REVOKE 命令用于撤销指定用户或角色指定的权限。
     Syntax:
         REVOKE privilege_list ON db_name[.tbl_name] FROM user_identity [ROLE role_name]
+
+        REVOKE privilege_list ON resource_name FROM user_identity [ROLE role_name]

Review comment:
       ```suggestion
           REVOKE privilege_list ON RESOURCE resource_name FROM user_identity [ROLE role_name]
   ```

##########
File path: docs/zh-CN/sql-reference/sql-statements/Account Management/GRANT.md
##########
@@ -76,6 +92,18 @@ user_identity:
 
         GRANT LOAD_PRIV ON db1.* TO ROLE 'my_role';
 
+    4. 授予所有资源的使用权限给用户
+
+        GRANT USAGE_PRIV ON * TO 'jack'@'%';
+
+    5. 授予指定资源的使用权限给用户
+
+        GRANT USAGE_PRIV ON 'spark_resource' TO 'jack'@'%';

Review comment:
       ```suggestion
           GRANT USAGE_PRIV ON RESOURCE 'spark_resource' TO 'jack'@'%';
   ```

##########
File path: docs/zh-CN/administrator-guide/load-data/spark-load-manual.md
##########
@@ -0,0 +1,397 @@
+---
+{
+    "title": "Spark Load",
+    "language": "zh-CN"
+}
+---  
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Spark Load
+
+Spark load 通过 Spark 实现对导入数据的预处理,提高 Doris 大数据量的导入性能并且节省 Doris 集群的计算资源。主要用于初次迁移,大数据量导入 Doris 的场景。
+
+Spark load 是一种异步导入方式,用户需要通过 MySQL 协议创建 Spark 类型导入任务,并通过 `SHOW LOAD` 查看导入结果。
+
+
+
+## 适用场景
+
+* 源数据在 Spark 可以访问的存储系统中,如 HDFS。
+* 数据量在 几十 GB 到 TB 级别。
+
+
+
+## 名词解释
+
+1. Frontend(FE):Doris 系统的元数据和调度节点。在导入流程中主要负责导入任务的调度工作。
+2. Backend(BE):Doris 系统的计算和存储节点。在导入流程中主要负责数据写入及存储。
+3. Spark ETL:在导入流程中主要负责数据的 ETL 工作,包括全局字典构建(BITMAP类型)、分区、排序、聚合等。
+4. Broker:Broker 为一个独立的无状态进程。封装了文件系统接口,提供 Doris 读取远端存储系统中文件的能力。
+
+
+## 基本原理
+
+### 基本流程
+
+用户通过 MySQL 客户端提交 Spark 类型导入任务,FE记录元数据并返回用户提交成功。
+
+Spark load 任务的执行主要分为以下5个阶段。
+
+1. FE 调度提交 ETL 任务到 Spark 集群执行。
+2. Spark 集群执行 ETL 完成对导入数据的预处理。包括全局字典构建(BITMAP类型)、分区、排序、聚合等。
+3. ETL 任务完成后,FE 获取预处理过的每个分片的数据路径,并调度相关的 BE 执行 Push 任务。
+4. BE 通过 Broker 读取数据,转化为 Doris 底层存储格式。
+5. FE 调度生效版本,完成导入任务。
+
+```
+                 +
+                 | 0. User create spark load job
+            +----v----+
+            |   FE    |---------------------------------+
+            +----+----+                                 |
+                 | 3. FE send push tasks                |
+                 | 5. FE publish version                |
+    +------------+------------+                         |
+    |            |            |                         |
++---v---+    +---v---+    +---v---+                     |
+|  BE   |    |  BE   |    |  BE   |                     |1. FE submit Spark ETL job
++---^---+    +---^---+    +---^---+                     |
+    |4. BE push with broker   |                         |
++---+---+    +---+---+    +---+---+                     |
+|Broker |    |Broker |    |Broker |                     |
++---^---+    +---^---+    +---^---+                     |
+    |            |            |                         |
++---+------------+------------+---+ 2.ETL +-------------v---------------+
+|               HDFS              +------->       Spark cluster         |
+|                                 <-------+                             |
++---------------------------------+       +-----------------------------+
+
+```
+
+
+
+### 全局字典
+
+待补
+
+
+
+### 数据预处理(DPP)
+
+待补
+
+
+
+## 基本操作
+
+### 配置 ETL 集群
+
+Spark作为一种外部计算资源在Doris中用来完成ETL工作,未来可能还有其他的外部资源会加入到Doris中使用,如Spark/GPU用于查询,HDFS/S3用于外部存储,MapReduce用于ETL等,因此我们引入resource management来管理Doris使用的这些外部资源。
+
+提交 Spark 导入任务之前,需要配置执行 ETL 任务的 Spark 集群。
+
+语法:
+
+```sql

Review comment:
       resource management 应该单独写一个文档

##########
File path: docs/zh-CN/administrator-guide/load-data/spark-load-manual.md
##########
@@ -0,0 +1,397 @@
+---
+{
+    "title": "Spark Load",
+    "language": "zh-CN"
+}
+---  
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Spark Load
+
+Spark load 通过 Spark 实现对导入数据的预处理,提高 Doris 大数据量的导入性能并且节省 Doris 集群的计算资源。主要用于初次迁移,大数据量导入 Doris 的场景。
+
+Spark load 是一种异步导入方式,用户需要通过 MySQL 协议创建 Spark 类型导入任务,并通过 `SHOW LOAD` 查看导入结果。
+
+
+
+## 适用场景
+
+* 源数据在 Spark 可以访问的存储系统中,如 HDFS。
+* 数据量在 几十 GB 到 TB 级别。
+
+
+
+## 名词解释
+
+1. Frontend(FE):Doris 系统的元数据和调度节点。在导入流程中主要负责导入任务的调度工作。
+2. Backend(BE):Doris 系统的计算和存储节点。在导入流程中主要负责数据写入及存储。
+3. Spark ETL:在导入流程中主要负责数据的 ETL 工作,包括全局字典构建(BITMAP类型)、分区、排序、聚合等。
+4. Broker:Broker 为一个独立的无状态进程。封装了文件系统接口,提供 Doris 读取远端存储系统中文件的能力。
+
+
+## 基本原理
+
+### 基本流程
+
+用户通过 MySQL 客户端提交 Spark 类型导入任务,FE记录元数据并返回用户提交成功。
+
+Spark load 任务的执行主要分为以下5个阶段。
+
+1. FE 调度提交 ETL 任务到 Spark 集群执行。
+2. Spark 集群执行 ETL 完成对导入数据的预处理。包括全局字典构建(BITMAP类型)、分区、排序、聚合等。
+3. ETL 任务完成后,FE 获取预处理过的每个分片的数据路径,并调度相关的 BE 执行 Push 任务。
+4. BE 通过 Broker 读取数据,转化为 Doris 底层存储格式。
+5. FE 调度生效版本,完成导入任务。
+
+```
+                 +
+                 | 0. User create spark load job
+            +----v----+
+            |   FE    |---------------------------------+
+            +----+----+                                 |
+                 | 3. FE send push tasks                |
+                 | 5. FE publish version                |
+    +------------+------------+                         |
+    |            |            |                         |
++---v---+    +---v---+    +---v---+                     |
+|  BE   |    |  BE   |    |  BE   |                     |1. FE submit Spark ETL job
++---^---+    +---^---+    +---^---+                     |
+    |4. BE push with broker   |                         |
++---+---+    +---+---+    +---+---+                     |
+|Broker |    |Broker |    |Broker |                     |
++---^---+    +---^---+    +---^---+                     |
+    |            |            |                         |
++---+------------+------------+---+ 2.ETL +-------------v---------------+
+|               HDFS              +------->       Spark cluster         |
+|                                 <-------+                             |
++---------------------------------+       +-----------------------------+
+
+```
+
+
+
+### 全局字典
+
+待补
+
+
+
+### 数据预处理(DPP)
+
+待补
+
+
+
+## 基本操作
+
+### 配置 ETL 集群
+
+Spark作为一种外部计算资源在Doris中用来完成ETL工作,未来可能还有其他的外部资源会加入到Doris中使用,如Spark/GPU用于查询,HDFS/S3用于外部存储,MapReduce用于ETL等,因此我们引入resource management来管理Doris使用的这些外部资源。
+
+提交 Spark 导入任务之前,需要配置执行 ETL 任务的 Spark 集群。
+
+语法:
+
+```sql
+-- create spark resource
+CREATE EXTERNAL RESOURCE resource_name
+PROPERTIES 
+(                 
+  type = spark,
+  spark_conf_key = spark_conf_value,
+  working_dir = path,
+  broker = broker_name,
+  broker.property_key = property_value
+)
+
+-- drop spark resource
+DROP RESOURCE resource_name
+
+-- show resources
+SHOW RESOURCES
+SHOW PROC "/resources"
+
+-- privileges
+GRANT USAGE_PRIV ON RESOURCE resource_name TO user_identity
+GRANT USAGE_PRIV ON RESOURCE resource_name TO ROLE role_name
+
+REVOKE USAGE_PRIV ON RESOURCE resource_name FROM user_identity
+REVOKE USAGE_PRIV ON RESOURCE resource_name FROM ROLE role_name
+```
+
+#### 创建资源
+
+`resource_name` 为 Doris 中配置的 Spark 资源的名字。
+
+`PROPERTIES` 是 Spark 资源相关参数,如下:
+
+- `type`:资源类型,必填,目前仅支持 spark。
+
+- Spark 相关参数如下:
+  - `spark.master`: 必填,目前支持yarn,spark://host:port。
+  - `spark.submit.deployMode`:  Spark 程序的部署模式,必填,支持 cluster,client 两种。
+  - `spark.hadoop.yarn.resourcemanager.address`: master为yarn时必填。
+  - `spark.hadoop.fs.defaultFS`: master为yarn时必填。
+  - 其他参数为可选,参考http://spark.apache.org/docs/latest/configuration.html 
+- `working_dir`: ETL 使用的目录。spark作为ETL资源使用时必填。例如:hdfs://host:port/tmp/doris。
+- `broker`: broker 名字。spark作为ETL资源使用时必填。需要使用`ALTER SYSTEM ADD BROKER` 命令提前完成配置。
+  - `broker.property_key`: broker读取ETL生成的中间文件时需要指定的认证信息等。
+
+示例:
+
+```sql
+-- yarn cluster 模式 
+CREATE EXTERNAL RESOURCE "spark0"
+PROPERTIES
+(
+  "type" = "spark",
+  "spark.master" = "yarn",
+  "spark.submit.deployMode" = "cluster",
+  "spark.jars" = "xxx.jar,yyy.jar",
+  "spark.files" = "/tmp/aaa,/tmp/bbb",
+  "spark.executor.memory" = "1g",
+  "spark.yarn.queue" = "queue0",
+  "spark.hadoop.yarn.resourcemanager.address" = "127.0.0.1:9999",
+  "spark.hadoop.fs.defaultFS" = "hdfs://127.0.0.1:10000",
+  "working_dir" = "hdfs://127.0.0.1:10000/tmp/doris",
+  "broker" = "broker0",
+  "broker.username" = "user0",
+  "broker.password" = "password0"
+);
+
+-- spark standalone client 模式
+CREATE EXTERNAL RESOURCE "spark1"
+PROPERTIES
+(
+  "type" = "spark", 
+  "spark.master" = "spark://127.0.0.1:7777",
+  "spark.submit.deployMode" = "client",
+  "working_dir" = "hdfs://127.0.0.1:10000/tmp/doris",
+  "broker" = "broker1"
+);
+```
+
+#### 查看资源
+
+普通账户只能看到自己有USAGE_PRIV使用权限的资源。
+
+root和admin账户可以看到所有的资源。
+
+#### 资源权限
+
+资源权限通过GRANT REVOKE来管理,目前仅支持USAGE_PRIV使用权限。
+
+可以将USAGE_PRIV权限赋予某个用户或者某个角色,角色的使用与之前一致。
+```sql
+-- 授予spark0资源的使用权限给用户user0
+GRANT USAGE_PRIV ON RESOURCE "spark0" TO "user0"@"%";
+-- 授予spark0资源的使用权限给角色role0
+GRANT USAGE_PRIV ON RESOURCE "spark0" TO ROLE "role0";
+-- 授予所有资源的使用权限给用户user0
+GRANT USAGE_PRIV ON RESOURCE * TO "user0"@"%";
+-- 授予所有资源的使用权限给角色role0
+GRANT USAGE_PRIV ON RESOURCE * TO ROLE "role0";
+-- 撤销用户user0的spark0资源使用权限
+REVOKE USAGE_PRIV ON RESOURCE "spark0" FROM "user0"@"%";
+```
+
+
+
+### 创建导入
+
+语法:
+
+```sql
+LOAD LABEL load_label 
+    (data_desc, ...)
+    WITH RESOURCE resource_name resource_properties
+    [PROPERTIES (key1=value1, ... )]
+
+* load_label:
+	db_name.label_name
+
+* data_desc:
+    DATA INFILE ('file_path', ...)
+    [NEGATIVE]
+    INTO TABLE tbl_name
+    [PARTITION (p1, p2)]
+    [COLUMNS TERMINATED BY separator ]
+    [(col1, ...)]
+    [SET (k1=f1(xx), k2=f2(xx))]
+    [WHERE predicate]
+
+* resource_properties: 
+    (key2=value2, ...)
+```
+示例:
+
+```sql
+LOAD LABEL db1.label1
+(
+    DATA INFILE("hdfs://abc.com:8888/user/palo/test/ml/file1")
+    INTO TABLE tbl1
+    COLUMNS TERMINATED BY ","
+    (tmp_c1,tmp_c2)
+    SET
+    (
+        id=tmp_c2,
+        name=tmp_c1
+    ),
+    DATA INFILE("hdfs://abc.com:8888/user/palo/test/ml/file2")
+    INTO TABLE tbl2
+    COLUMNS TERMINATED BY ","
+    (col1, col2)
+    where col1 > 1
+)
+WITH RESOURCE 'spark0'
+(
+    "spark.executor.memory" = "2g",
+    "spark.shuffle.compress" = "true"
+)
+PROPERTIES
+(
+    "timeout" = "3600"
+);
+
+```
+
+创建导入的详细语法执行 ```HELP SPARK LOAD``` 查看语法帮助。这里主要介绍 Spark load 的创建导入语法中参数意义和注意事项。
+
+#### Label
+
+导入任务的标识。每个导入任务,都有一个在单 database 内部唯一的 Label。具体规则与 `Broker Load` 一致。
+
+#### 数据描述类参数
+
+目前支持的数据源有CSV和hive table。其他规则与 `Broker Load` 一致。
+
+#### 导入作业参数
+
+导入作业参数主要指的是 Spark load 创建导入语句中的属于 ```opt_properties```部分的参数。导入作业参数是作用于整个导入作业的。规则与 `Broker Load` 一致。
+
+#### Spark资源参数
+
+Spark资源需要提前配置到 Doris系统中并且赋予用户USAGE_PRIV权限后才能使用 Spark load。
+
+当用户有临时性的需求,比如增加任务使用的资源而修改 Spark configs,可以在这里设置,设置仅对本次任务生效,并不影响 Doris 集群中已有的配置。
+
+```sql
+WITH RESOURCE 'spark0'
+(
+  "spark.driver.memory" = "1g",
+  "spark.executor.memory" = "3g"
+)
+```
+
+
+
+### 查看导入
+
+Spark load 导入方式同 Broker load 一样都是异步的,所以用户必须将创建导入的 Label 记录,并且在**查看导入命令中使用 Label 来查看导入结果**。查看导入命令在所有导入方式中是通用的,具体语法可执行 ```HELP SHOW LOAD``` 查看。
+
+示例:
+
+```
+mysql> show load order by createtime desc limit 1\G
+*************************** 1. row ***************************
+         JobId: 76391
+         Label: label1
+         State: FINISHED
+      Progress: ETL:100%; LOAD:100%
+          Type: SPARK
+       EtlInfo: unselected.rows=4; dpp.abnorm.ALL=15; dpp.norm.ALL=28133376
+      TaskInfo: cluster:cluster0; timeout(s):10800; max_filter_ratio:5.0E-5
+      ErrorMsg: N/A
+    CreateTime: 2019-07-27 11:46:42
+  EtlStartTime: 2019-07-27 11:46:44
+ EtlFinishTime: 2019-07-27 11:49:44
+ LoadStartTime: 2019-07-27 11:49:44
+LoadFinishTime: 2019-07-27 11:50:16
+           URL: http://1.1.1.1:8089/proxy/application_1586619723848_0035/
+    JobDetails: {"ScannedRows":28133395,"TaskNumber":1,"FileNumber":1,"FileSize":200000}
+```
+
+返回结果集中参数意义可以参考 Broker load。不同点如下:
+
++ State
+
+    导入任务当前所处的阶段。任务提交之后状态为 PENDING,提交 Spark ETL 之后状态变为 ETL,ETL 完成之后 FE 调度 BE 执行 push 操作状态变为 LOADING,push 完成并且版本生效后状态变为 FINISHED。
+    
+    导入任务的最终阶段有两个:CANCELLED 和 FINISHED,当 Load job 处于这两个阶段时导入完成。其中 CANCELLED 为导入失败,FINISHED 为导入成功。
+    
++ Progress
+
+    导入任务的进度描述。分为两种进度:ETL 和 LOAD,对应了导入流程的两个阶段 ETL 和 LOADING。
+    
+    LOAD 的进度范围为:0~100%。
+    
+    ```LOAD 进度 = 当前已完成所有replica导入的tablet个数 / 本次导入任务的总tablet个数 * 100%``` 
+    
+    **如果所有导入表均完成导入,此时 LOAD 的进度为 99%** 导入进入到最后生效阶段,整个导入完成后,LOAD 的进度才会改为 100%。
+    
+    导入进度并不是线性的。所以如果一段时间内进度没有变化,并不代表导入没有在执行。
+    
++ Type
+
+    导入任务的类型。Spark load 为 SPARK。    
+
++ CreateTime/EtlStartTime/EtlFinishTime/LoadStartTime/LoadFinishTime
+
+    这几个值分别代表导入创建的时间,ETL 阶段开始的时间,ETL 阶段完成的时间,LOADING 阶段开始的时间和整个导入任务完成的时间。
+
++ JobDetails
+
+    显示一些作业的详细运行状态,ETL 结束的时候更新。包括导入文件的个数、总大小(字节)、子任务个数、已处理的原始行数等。
+
+    ```{"ScannedRows":139264,"TaskNumber":1,"FileNumber":1,"FileSize":940754064}```

Review comment:
       Can this be updated in realtime?

##########
File path: docs/zh-CN/sql-reference/sql-statements/Account Management/GRANT.md
##########
@@ -76,6 +92,18 @@ user_identity:
 
         GRANT LOAD_PRIV ON db1.* TO ROLE 'my_role';
 
+    4. 授予所有资源的使用权限给用户
+
+        GRANT USAGE_PRIV ON * TO 'jack'@'%';
+
+    5. 授予指定资源的使用权限给用户
+
+        GRANT USAGE_PRIV ON 'spark_resource' TO 'jack'@'%';
+
+    6. 授予指定资源的使用权限给角色
+
+        GRANT USAGE_PRIV ON 'spark_resource' TO ROLE 'my_role';

Review comment:
       ```suggestion
           GRANT USAGE_PRIV ON RESOURCE 'spark_resource' TO ROLE 'my_role';
   ```

##########
File path: fe/src/main/java/org/apache/doris/catalog/ResourceMgr.java
##########
@@ -0,0 +1,188 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+package org.apache.doris.catalog;
+
+import org.apache.doris.analysis.CreateResourceStmt;
+import org.apache.doris.analysis.DropResourceStmt;
+import org.apache.doris.catalog.Resource.ResourceType;
+import org.apache.doris.common.DdlException;
+import org.apache.doris.common.proc.BaseProcResult;
+import org.apache.doris.common.proc.ProcNodeInterface;
+import org.apache.doris.common.proc.ProcResult;
+
+import com.google.common.collect.ImmutableList;
+import com.google.common.collect.Maps;
+import org.apache.doris.mysql.privilege.PrivPredicate;
+import org.apache.doris.qe.ConnectContext;
+import org.apache.logging.log4j.LogManager;
+import org.apache.logging.log4j.Logger;
+
+import java.util.Collection;
+import java.util.List;
+import java.util.Map;
+import java.util.concurrent.locks.ReentrantLock;
+
+/**
+ * Resource manager is responsible for managing external resources used by Doris.
+ * For example, Spark/MapReduce used for ETL, Spark/GPU used for queries, HDFS/S3 used for external storage.
+ * Now only support Spark.
+ */
+public class ResourceMgr {
+    private static final Logger LOG = LogManager.getLogger(ResourceMgr.class);
+
+    public static final ImmutableList<String> RESOURCE_PROC_NODE_TITLE_NAMES = new ImmutableList.Builder<String>()
+            .add("Name").add("ResourceType").add("Key").add("Value")
+            .build();
+
+    // { resourceName -> Resource}
+    private final Map<String, Resource> nameToResource = Maps.newHashMap();
+    private final ReentrantLock lock = new ReentrantLock();
+    private ResourceProcNode procNode = null;
+
+    public ResourceMgr() {
+    }
+
+    public void createResource(CreateResourceStmt stmt) throws DdlException {
+        lock.lock();
+        try {
+            if (stmt.getResourceType() != ResourceType.SPARK) {
+                throw new DdlException("Only support Spark resource.");
+            }
+
+            String resourceName = stmt.getResourceName();
+            if (nameToResource.containsKey(resourceName)) {
+                throw new DdlException("Resource(" + resourceName + ") already exist");
+            }
+
+            Resource resource = Resource.fromStmt(stmt);
+            nameToResource.put(resourceName, resource);
+            // log add
+            Catalog.getInstance().getEditLog().logCreateResource(resource);
+            LOG.info("create resource success. resource: {}", resource);
+        } finally {
+            lock.unlock();
+        }
+    }
+
+    public void replayCreateResource(Resource resource) {
+        lock.lock();
+        try {
+            nameToResource.put(resource.getName(), resource);
+        } finally {
+            lock.unlock();
+        }
+    }
+
+    public void dropResource(DropResourceStmt stmt) throws DdlException {
+        lock.lock();
+        try {
+            String name = stmt.getResourceName();
+            if (!nameToResource.containsKey(name)) {
+                throw new DdlException("Resource(" + name + ") does not exist");
+            }
+
+            nameToResource.remove(name);
+            // log drop
+            Catalog.getInstance().getEditLog().logDropResource(name);
+            LOG.info("drop resource success. resource name: {}", name);
+        } finally {
+            lock.unlock();
+        }
+    }
+
+    public void replayDropResource(String name) {
+        lock.lock();
+        try {
+            nameToResource.remove(name);
+        } finally {
+            lock.unlock();
+        }
+    }
+
+    public boolean containsResource(String name) {
+        lock.lock();
+        try {
+            return nameToResource.containsKey(name);
+        } finally {
+            lock.unlock();
+        }
+    }
+
+    public Resource getResource(String name) {
+        lock.lock();
+        try {
+            return nameToResource.get(name);
+        } finally {
+            lock.unlock();
+        }
+    }
+
+    // for catalog save image
+    public Collection<Resource> getResources() {
+        return nameToResource.values();
+    }
+
+    public List<List<String>> getResourcesInfo() {
+        lock.lock();
+        try {
+            if (procNode == null) {
+                procNode = new ResourceProcNode();

Review comment:
       I think this `procNode` can be created when constructing this class.

##########
File path: fe/src/main/java/org/apache/doris/catalog/Catalog.java
##########
@@ -2155,6 +2176,18 @@ public long saveLoadJobsV2(DataOutputStream out, long checksum) throws IOExcepti
         return checksum;
     }
 
+    public long saveResources(DataOutputStream dos, long checksum) throws IOException {
+        Collection<Resource> resources = resourceMgr.getResources();

Review comment:
       Why not just use `resourceMgr.write()`?

##########
File path: fe/src/main/java/org/apache/doris/catalog/ResourceMgr.java
##########
@@ -0,0 +1,188 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+package org.apache.doris.catalog;
+
+import org.apache.doris.analysis.CreateResourceStmt;
+import org.apache.doris.analysis.DropResourceStmt;
+import org.apache.doris.catalog.Resource.ResourceType;
+import org.apache.doris.common.DdlException;
+import org.apache.doris.common.proc.BaseProcResult;
+import org.apache.doris.common.proc.ProcNodeInterface;
+import org.apache.doris.common.proc.ProcResult;
+
+import com.google.common.collect.ImmutableList;
+import com.google.common.collect.Maps;
+import org.apache.doris.mysql.privilege.PrivPredicate;
+import org.apache.doris.qe.ConnectContext;
+import org.apache.logging.log4j.LogManager;
+import org.apache.logging.log4j.Logger;
+
+import java.util.Collection;
+import java.util.List;
+import java.util.Map;
+import java.util.concurrent.locks.ReentrantLock;
+
+/**
+ * Resource manager is responsible for managing external resources used by Doris.
+ * For example, Spark/MapReduce used for ETL, Spark/GPU used for queries, HDFS/S3 used for external storage.
+ * Now only support Spark.
+ */
+public class ResourceMgr {
+    private static final Logger LOG = LogManager.getLogger(ResourceMgr.class);
+
+    public static final ImmutableList<String> RESOURCE_PROC_NODE_TITLE_NAMES = new ImmutableList.Builder<String>()
+            .add("Name").add("ResourceType").add("Key").add("Value")
+            .build();
+
+    // { resourceName -> Resource}
+    private final Map<String, Resource> nameToResource = Maps.newHashMap();

Review comment:
       I think a concurrentMap is enough. And the lock is only used when creating the resource,
   to make "create resource" and "write edit log" atomic.

##########
File path: fe/src/main/java/org/apache/doris/mysql/privilege/PaloPrivilege.java
##########
@@ -25,7 +25,8 @@
     LOAD_PRIV("Load_priv", 4, "Privilege for loading data into tables"),
     ALTER_PRIV("Alter_priv", 5, "Privilege for alter database or table"),
     CREATE_PRIV("Create_priv", 6, "Privilege for createing database or table"),
-    DROP_PRIV("Drop_priv", 7, "Privilege for dropping database or table");
+    DROP_PRIV("Drop_priv", 7, "Privilege for dropping database or table"),
+    USAGE_PRIV("Usage_priv", 8, "Privilege for use resource");

Review comment:
       ```suggestion
       USAGE_PRIV("Usage_priv", 8, "Privilege for accessing resource");
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] imay commented on a change in pull request #3418: [Spark load] Add spark etl cluster and cluster manager

Posted by GitBox <gi...@apache.org>.
imay commented on a change in pull request #3418:
URL: https://github.com/apache/incubator-doris/pull/3418#discussion_r417824889



##########
File path: docs/zh-CN/administrator-guide/load-data/spark-load-manual.md
##########
@@ -0,0 +1,351 @@
+---                                                                                 
+{
+    "title": "Spark Load",

Review comment:
       Is "Spark Load" a good name? Maybe we will support Hadoop or Hive later, however they will share the same load framework.
   So we should give this feature a common name, and spark is only one of all methods.

##########
File path: docs/zh-CN/administrator-guide/load-data/spark-load-manual.md
##########
@@ -0,0 +1,351 @@
+---                                                                                 
+{
+    "title": "Spark Load",
+    "language": "zh-CN"
+}
+---  
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Spark Load
+
+Spark load 通过 Spark 实现对导入数据的预处理,提高 Doris 大数据量的导入性能并且节省 Doris 集群的计算资源。主要用于初次迁移,大数据量导入 Doris 的场景。
+
+Spark load 是一种异步导入方式,用户需要通过 MySQL 协议创建 Spark 类型导入任务,并通过 `SHOW LOAD` 查看导入结果。
+
+
+
+## 适用场景
+
+* 源数据在 Spark 可以访问的存储系统中,如 HDFS。
+* 数据量在 几十 GB 到 TB 级别。
+
+
+
+## 名词解释
+
+1. Frontend(FE):Doris 系统的元数据和调度节点。在导入流程中主要负责导入任务的调度工作。
+2. Backend(BE):Doris 系统的计算和存储节点。在导入流程中主要负责数据写入及存储。
+3. Spark ETL:在导入流程中主要负责数据的 ETL 工作,包括全局字典构建(BITMAP类型)、分区、排序、聚合等。
+4. Broker:Broker 为一个独立的无状态进程。封装了文件系统接口,提供 Doris 读取远端存储系统中文件的能力。
+
+
+## 基本原理
+
+### 基本流程
+
+用户通过 MySQL 客户端提交 Spark 类型导入任务,FE记录元数据并返回用户提交成功。
+
+Spark load 任务的执行主要分为以下5个阶段。
+
+1. FE 调度提交 ETL 任务到 Spark 集群执行。
+2. Spark 集群执行 ETL 完成对导入数据的预处理。包括全局字典构建(BITMAP类型)、分区、排序、聚合等。
+3. ETL 任务完成后,FE 获取预处理过的每个分片的数据路径,并调度相关的 BE 执行 Push 任务。
+4. BE 通过 Broker 读取数据,转化为 Doris 底层存储格式。
+5. FE 调度生效版本,完成导入任务。
+
+```
+                 +
+                 | 0. User create spark load job
+            +----v----+
+            |   FE    |---------------------------------+
+            +----+----+                                 |
+                 | 3. FE send push tasks                |
+                 | 5. FE publish version                |
+    +------------+------------+                         |
+    |            |            |                         |
++---v---+    +---v---+    +---v---+                     |
+|  BE   |    |  BE   |    |  BE   |                     |1. FE submit Spark ETL job
++---^---+    +---^---+    +---^---+                     |
+    |4. BE push with broker   |                         |
++---+---+    +---+---+    +---+---+                     |
+|Broker |    |Broker |    |Broker |                     |
++---^---+    +---^---+    +---^---+                     |
+    |            |            |                         |
++---+------------+------------+---+ 2.ETL +-------------v---------------+
+|               HDFS              +------->       Spark cluster         |
+|                                 <-------+                             |
++---------------------------------+       +-----------------------------+
+
+```
+
+
+
+### 全局字典
+
+待补
+
+
+
+### 数据预处理(DPP)
+
+待补
+
+
+
+## 基本操作
+
+### 配置 ETL 集群
+
+提交 Spark 导入任务之前,需要配置执行 ETL 任务的 Spark 集群。
+
+语法:
+
+```sql
+-- 添加 ETL 集群
+ALTER SYSTEM ADD LOAD CLUSTER cluster_name
+PROPERTIES("key1" = "value1", ...)
+
+-- 删除 ETL 集群
+ALTER SYSTEM DROP LOAD CLUSTER cluster_name
+
+-- 查看 ETL 集群
+SHOW LOAD CLUSTERS
+SHOW PROC "/load_etl_clusters"
+```
+
+`cluster_name` 为 Doris 中配置的 Spark 集群的名字。
+
+PROPERTIES 是 ETL 集群相关参数,如下:
+
+- `type`:集群类型,必填,目前仅支持 spark。
+
+- Spark ETL 集群相关参数如下:
+  - `master`:必填,目前支持yarn,spark://host:port。
+  - `deploy_mode`: 可选,默认为 cluster。支持 cluster,client 两种。
+  - `hdfs_etl_path`:ETL 使用的 HDFS 目录。必填。例如:hdfs://host:port/tmp/doris。
+  - `broker`:broker 名字。必填。需要使用`ALTER SYSTEM ADD BROKER` 命令提前完成配置。
+  - `yarn_configs`: HDFS YARN 参数,master 为 yarn 时必填。需要指定 yarn.resourcemanager.address 和 fs.defaultFS。不同 configs 之间使用`;`拼接。
+  - `spark_args`: Spark 任务提交时指定的参数,可选。具体可参考 spark-submit 命令,每个 arg  必须以`--`开头,不同 args 之间使用`;`拼接。例如--files=/file1,/file2;--jars=/a.jar,/b.jar。
+  - `spark_configs`: Spark 参数,可选。具体参数可参考http://spark.apache.org/docs/latest/configuration.html。不同 configs 之间使用`;`拼接。
+
+示例:
+
+```sql
+-- yarn cluster 模式 
+ALTER SYSTEM ADD LOAD CLUSTER "cluster0"
+PROPERTIES
+(
+"type" = "spark", 
+"master" = "yarn",
+"hdfs_etl_path" = "hdfs://1.1.1.1:801/tmp/doris",
+"broker" = "broker0",
+"yarn_configs" = "yarn.resourcemanager.address=1.1.1.1:800;fs.defaultFS=hdfs://1.1.1.1:801",
+"spark_args" = "--files=/file1,/file2;--jars=/a.jar,/b.jar",
+"spark_configs" = "spark.driver.memory=1g;spark.executor.memory=1g"

Review comment:
       Prefer `spark.args` `spark.configs`

##########
File path: docs/zh-CN/administrator-guide/load-data/spark-load-manual.md
##########
@@ -0,0 +1,351 @@
+---                                                                                 
+{
+    "title": "Spark Load",
+    "language": "zh-CN"
+}
+---  
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Spark Load
+
+Spark load 通过 Spark 实现对导入数据的预处理,提高 Doris 大数据量的导入性能并且节省 Doris 集群的计算资源。主要用于初次迁移,大数据量导入 Doris 的场景。
+
+Spark load 是一种异步导入方式,用户需要通过 MySQL 协议创建 Spark 类型导入任务,并通过 `SHOW LOAD` 查看导入结果。
+
+
+
+## 适用场景
+
+* 源数据在 Spark 可以访问的存储系统中,如 HDFS。
+* 数据量在 几十 GB 到 TB 级别。
+
+
+
+## 名词解释
+
+1. Frontend(FE):Doris 系统的元数据和调度节点。在导入流程中主要负责导入任务的调度工作。
+2. Backend(BE):Doris 系统的计算和存储节点。在导入流程中主要负责数据写入及存储。
+3. Spark ETL:在导入流程中主要负责数据的 ETL 工作,包括全局字典构建(BITMAP类型)、分区、排序、聚合等。
+4. Broker:Broker 为一个独立的无状态进程。封装了文件系统接口,提供 Doris 读取远端存储系统中文件的能力。
+
+
+## 基本原理
+
+### 基本流程
+
+用户通过 MySQL 客户端提交 Spark 类型导入任务,FE记录元数据并返回用户提交成功。
+
+Spark load 任务的执行主要分为以下5个阶段。
+
+1. FE 调度提交 ETL 任务到 Spark 集群执行。
+2. Spark 集群执行 ETL 完成对导入数据的预处理。包括全局字典构建(BITMAP类型)、分区、排序、聚合等。
+3. ETL 任务完成后,FE 获取预处理过的每个分片的数据路径,并调度相关的 BE 执行 Push 任务。
+4. BE 通过 Broker 读取数据,转化为 Doris 底层存储格式。
+5. FE 调度生效版本,完成导入任务。
+
+```
+                 +
+                 | 0. User create spark load job
+            +----v----+
+            |   FE    |---------------------------------+
+            +----+----+                                 |
+                 | 3. FE send push tasks                |
+                 | 5. FE publish version                |
+    +------------+------------+                         |
+    |            |            |                         |
++---v---+    +---v---+    +---v---+                     |
+|  BE   |    |  BE   |    |  BE   |                     |1. FE submit Spark ETL job
++---^---+    +---^---+    +---^---+                     |
+    |4. BE push with broker   |                         |
++---+---+    +---+---+    +---+---+                     |
+|Broker |    |Broker |    |Broker |                     |
++---^---+    +---^---+    +---^---+                     |
+    |            |            |                         |
++---+------------+------------+---+ 2.ETL +-------------v---------------+
+|               HDFS              +------->       Spark cluster         |
+|                                 <-------+                             |
++---------------------------------+       +-----------------------------+
+
+```
+
+
+
+### 全局字典
+
+待补
+
+
+
+### 数据预处理(DPP)
+
+待补
+
+
+
+## 基本操作
+
+### 配置 ETL 集群
+
+提交 Spark 导入任务之前,需要配置执行 ETL 任务的 Spark 集群。
+
+语法:
+
+```sql
+-- 添加 ETL 集群
+ALTER SYSTEM ADD LOAD CLUSTER cluster_name
+PROPERTIES("key1" = "value1", ...)
+
+-- 删除 ETL 集群
+ALTER SYSTEM DROP LOAD CLUSTER cluster_name
+
+-- 查看 ETL 集群
+SHOW LOAD CLUSTERS
+SHOW PROC "/load_etl_clusters"
+```
+
+`cluster_name` 为 Doris 中配置的 Spark 集群的名字。
+
+PROPERTIES 是 ETL 集群相关参数,如下:
+
+- `type`:集群类型,必填,目前仅支持 spark。
+
+- Spark ETL 集群相关参数如下:
+  - `master`:必填,目前支持yarn,spark://host:port。
+  - `deploy_mode`: 可选,默认为 cluster。支持 cluster,client 两种。
+  - `hdfs_etl_path`:ETL 使用的 HDFS 目录。必填。例如:hdfs://host:port/tmp/doris。
+  - `broker`:broker 名字。必填。需要使用`ALTER SYSTEM ADD BROKER` 命令提前完成配置。
+  - `yarn_configs`: HDFS YARN 参数,master 为 yarn 时必填。需要指定 yarn.resourcemanager.address 和 fs.defaultFS。不同 configs 之间使用`;`拼接。
+  - `spark_args`: Spark 任务提交时指定的参数,可选。具体可参考 spark-submit 命令,每个 arg  必须以`--`开头,不同 args 之间使用`;`拼接。例如--files=/file1,/file2;--jars=/a.jar,/b.jar。
+  - `spark_configs`: Spark 参数,可选。具体参数可参考http://spark.apache.org/docs/latest/configuration.html。不同 configs 之间使用`;`拼接。
+
+示例:
+
+```sql
+-- yarn cluster 模式 
+ALTER SYSTEM ADD LOAD CLUSTER "cluster0"
+PROPERTIES
+(
+"type" = "spark", 
+"master" = "yarn",
+"hdfs_etl_path" = "hdfs://1.1.1.1:801/tmp/doris",
+"broker" = "broker0",
+"yarn_configs" = "yarn.resourcemanager.address=1.1.1.1:800;fs.defaultFS=hdfs://1.1.1.1:801",

Review comment:
       "yarn.configs"

##########
File path: docs/zh-CN/administrator-guide/load-data/spark-load-manual.md
##########
@@ -0,0 +1,351 @@
+---                                                                                 
+{
+    "title": "Spark Load",
+    "language": "zh-CN"
+}
+---  
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Spark Load
+
+Spark load 通过 Spark 实现对导入数据的预处理,提高 Doris 大数据量的导入性能并且节省 Doris 集群的计算资源。主要用于初次迁移,大数据量导入 Doris 的场景。
+
+Spark load 是一种异步导入方式,用户需要通过 MySQL 协议创建 Spark 类型导入任务,并通过 `SHOW LOAD` 查看导入结果。
+
+
+
+## 适用场景
+
+* 源数据在 Spark 可以访问的存储系统中,如 HDFS。
+* 数据量在 几十 GB 到 TB 级别。
+
+
+
+## 名词解释
+
+1. Frontend(FE):Doris 系统的元数据和调度节点。在导入流程中主要负责导入任务的调度工作。
+2. Backend(BE):Doris 系统的计算和存储节点。在导入流程中主要负责数据写入及存储。
+3. Spark ETL:在导入流程中主要负责数据的 ETL 工作,包括全局字典构建(BITMAP类型)、分区、排序、聚合等。
+4. Broker:Broker 为一个独立的无状态进程。封装了文件系统接口,提供 Doris 读取远端存储系统中文件的能力。
+
+
+## 基本原理
+
+### 基本流程
+
+用户通过 MySQL 客户端提交 Spark 类型导入任务,FE记录元数据并返回用户提交成功。
+
+Spark load 任务的执行主要分为以下5个阶段。
+
+1. FE 调度提交 ETL 任务到 Spark 集群执行。
+2. Spark 集群执行 ETL 完成对导入数据的预处理。包括全局字典构建(BITMAP类型)、分区、排序、聚合等。
+3. ETL 任务完成后,FE 获取预处理过的每个分片的数据路径,并调度相关的 BE 执行 Push 任务。
+4. BE 通过 Broker 读取数据,转化为 Doris 底层存储格式。
+5. FE 调度生效版本,完成导入任务。
+
+```
+                 +
+                 | 0. User create spark load job
+            +----v----+
+            |   FE    |---------------------------------+
+            +----+----+                                 |
+                 | 3. FE send push tasks                |
+                 | 5. FE publish version                |
+    +------------+------------+                         |
+    |            |            |                         |
++---v---+    +---v---+    +---v---+                     |
+|  BE   |    |  BE   |    |  BE   |                     |1. FE submit Spark ETL job
++---^---+    +---^---+    +---^---+                     |
+    |4. BE push with broker   |                         |
++---+---+    +---+---+    +---+---+                     |
+|Broker |    |Broker |    |Broker |                     |
++---^---+    +---^---+    +---^---+                     |
+    |            |            |                         |
++---+------------+------------+---+ 2.ETL +-------------v---------------+
+|               HDFS              +------->       Spark cluster         |
+|                                 <-------+                             |
++---------------------------------+       +-----------------------------+
+
+```
+
+
+
+### 全局字典
+
+待补
+
+
+
+### 数据预处理(DPP)
+
+待补
+
+
+
+## 基本操作
+
+### 配置 ETL 集群
+
+提交 Spark 导入任务之前,需要配置执行 ETL 任务的 Spark 集群。
+
+语法:
+
+```sql
+-- 添加 ETL 集群
+ALTER SYSTEM ADD LOAD CLUSTER cluster_name
+PROPERTIES("key1" = "value1", ...)
+
+-- 删除 ETL 集群
+ALTER SYSTEM DROP LOAD CLUSTER cluster_name
+
+-- 查看 ETL 集群
+SHOW LOAD CLUSTERS
+SHOW PROC "/load_etl_clusters"
+```
+
+`cluster_name` 为 Doris 中配置的 Spark 集群的名字。
+
+PROPERTIES 是 ETL 集群相关参数,如下:
+
+- `type`:集群类型,必填,目前仅支持 spark。
+
+- Spark ETL 集群相关参数如下:
+  - `master`:必填,目前支持yarn,spark://host:port。
+  - `deploy_mode`: 可选,默认为 cluster。支持 cluster,client 两种。
+  - `hdfs_etl_path`:ETL 使用的 HDFS 目录。必填。例如:hdfs://host:port/tmp/doris。

Review comment:
       Better to explain for what this path is used. And should remove `hdfs_` prefix, because the file may locate in S3 or other external path.

##########
File path: docs/zh-CN/administrator-guide/load-data/spark-load-manual.md
##########
@@ -0,0 +1,351 @@
+---                                                                                 
+{
+    "title": "Spark Load",
+    "language": "zh-CN"
+}
+---  
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Spark Load
+
+Spark load 通过 Spark 实现对导入数据的预处理,提高 Doris 大数据量的导入性能并且节省 Doris 集群的计算资源。主要用于初次迁移,大数据量导入 Doris 的场景。
+
+Spark load 是一种异步导入方式,用户需要通过 MySQL 协议创建 Spark 类型导入任务,并通过 `SHOW LOAD` 查看导入结果。
+
+
+
+## 适用场景
+
+* 源数据在 Spark 可以访问的存储系统中,如 HDFS。
+* 数据量在 几十 GB 到 TB 级别。
+
+
+
+## 名词解释
+
+1. Frontend(FE):Doris 系统的元数据和调度节点。在导入流程中主要负责导入任务的调度工作。
+2. Backend(BE):Doris 系统的计算和存储节点。在导入流程中主要负责数据写入及存储。
+3. Spark ETL:在导入流程中主要负责数据的 ETL 工作,包括全局字典构建(BITMAP类型)、分区、排序、聚合等。
+4. Broker:Broker 为一个独立的无状态进程。封装了文件系统接口,提供 Doris 读取远端存储系统中文件的能力。
+
+
+## 基本原理
+
+### 基本流程
+
+用户通过 MySQL 客户端提交 Spark 类型导入任务,FE记录元数据并返回用户提交成功。
+
+Spark load 任务的执行主要分为以下5个阶段。
+
+1. FE 调度提交 ETL 任务到 Spark 集群执行。
+2. Spark 集群执行 ETL 完成对导入数据的预处理。包括全局字典构建(BITMAP类型)、分区、排序、聚合等。
+3. ETL 任务完成后,FE 获取预处理过的每个分片的数据路径,并调度相关的 BE 执行 Push 任务。
+4. BE 通过 Broker 读取数据,转化为 Doris 底层存储格式。
+5. FE 调度生效版本,完成导入任务。
+
+```
+                 +
+                 | 0. User create spark load job
+            +----v----+
+            |   FE    |---------------------------------+
+            +----+----+                                 |
+                 | 3. FE send push tasks                |
+                 | 5. FE publish version                |
+    +------------+------------+                         |
+    |            |            |                         |
++---v---+    +---v---+    +---v---+                     |
+|  BE   |    |  BE   |    |  BE   |                     |1. FE submit Spark ETL job
++---^---+    +---^---+    +---^---+                     |
+    |4. BE push with broker   |                         |
++---+---+    +---+---+    +---+---+                     |
+|Broker |    |Broker |    |Broker |                     |
++---^---+    +---^---+    +---^---+                     |
+    |            |            |                         |
++---+------------+------------+---+ 2.ETL +-------------v---------------+
+|               HDFS              +------->       Spark cluster         |
+|                                 <-------+                             |
++---------------------------------+       +-----------------------------+
+
+```
+
+
+
+### 全局字典
+
+待补
+
+
+
+### 数据预处理(DPP)
+
+待补
+
+
+
+## 基本操作
+
+### 配置 ETL 集群
+
+提交 Spark 导入任务之前,需要配置执行 ETL 任务的 Spark 集群。
+
+语法:
+
+```sql
+-- 添加 ETL 集群
+ALTER SYSTEM ADD LOAD CLUSTER cluster_name
+PROPERTIES("key1" = "value1", ...)
+
+-- 删除 ETL 集群
+ALTER SYSTEM DROP LOAD CLUSTER cluster_name
+
+-- 查看 ETL 集群
+SHOW LOAD CLUSTERS
+SHOW PROC "/load_etl_clusters"
+```
+
+`cluster_name` 为 Doris 中配置的 Spark 集群的名字。
+
+PROPERTIES 是 ETL 集群相关参数,如下:
+
+- `type`:集群类型,必填,目前仅支持 spark。
+
+- Spark ETL 集群相关参数如下:
+  - `master`:必填,目前支持yarn,spark://host:port。
+  - `deploy_mode`: 可选,默认为 cluster。支持 cluster,client 两种。

Review comment:
       Whose deploy_mode? If it is spark's deploy_mode, better to call it "spark.deploy_mode"

##########
File path: docs/zh-CN/administrator-guide/load-data/spark-load-manual.md
##########
@@ -0,0 +1,351 @@
+---                                                                                 
+{
+    "title": "Spark Load",
+    "language": "zh-CN"
+}
+---  
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Spark Load
+
+Spark load 通过 Spark 实现对导入数据的预处理,提高 Doris 大数据量的导入性能并且节省 Doris 集群的计算资源。主要用于初次迁移,大数据量导入 Doris 的场景。
+
+Spark load 是一种异步导入方式,用户需要通过 MySQL 协议创建 Spark 类型导入任务,并通过 `SHOW LOAD` 查看导入结果。
+
+
+
+## 适用场景
+
+* 源数据在 Spark 可以访问的存储系统中,如 HDFS。
+* 数据量在 几十 GB 到 TB 级别。
+
+
+
+## 名词解释
+
+1. Frontend(FE):Doris 系统的元数据和调度节点。在导入流程中主要负责导入任务的调度工作。
+2. Backend(BE):Doris 系统的计算和存储节点。在导入流程中主要负责数据写入及存储。
+3. Spark ETL:在导入流程中主要负责数据的 ETL 工作,包括全局字典构建(BITMAP类型)、分区、排序、聚合等。
+4. Broker:Broker 为一个独立的无状态进程。封装了文件系统接口,提供 Doris 读取远端存储系统中文件的能力。
+
+
+## 基本原理
+
+### 基本流程
+
+用户通过 MySQL 客户端提交 Spark 类型导入任务,FE记录元数据并返回用户提交成功。
+
+Spark load 任务的执行主要分为以下5个阶段。
+
+1. FE 调度提交 ETL 任务到 Spark 集群执行。
+2. Spark 集群执行 ETL 完成对导入数据的预处理。包括全局字典构建(BITMAP类型)、分区、排序、聚合等。
+3. ETL 任务完成后,FE 获取预处理过的每个分片的数据路径,并调度相关的 BE 执行 Push 任务。
+4. BE 通过 Broker 读取数据,转化为 Doris 底层存储格式。
+5. FE 调度生效版本,完成导入任务。
+
+```
+                 +
+                 | 0. User create spark load job
+            +----v----+
+            |   FE    |---------------------------------+
+            +----+----+                                 |
+                 | 3. FE send push tasks                |
+                 | 5. FE publish version                |
+    +------------+------------+                         |
+    |            |            |                         |
++---v---+    +---v---+    +---v---+                     |
+|  BE   |    |  BE   |    |  BE   |                     |1. FE submit Spark ETL job
++---^---+    +---^---+    +---^---+                     |
+    |4. BE push with broker   |                         |
++---+---+    +---+---+    +---+---+                     |
+|Broker |    |Broker |    |Broker |                     |
++---^---+    +---^---+    +---^---+                     |
+    |            |            |                         |
++---+------------+------------+---+ 2.ETL +-------------v---------------+
+|               HDFS              +------->       Spark cluster         |
+|                                 <-------+                             |
++---------------------------------+       +-----------------------------+
+
+```
+
+
+
+### 全局字典
+
+待补
+
+
+
+### 数据预处理(DPP)
+
+待补
+
+
+
+## 基本操作
+
+### 配置 ETL 集群
+
+提交 Spark 导入任务之前,需要配置执行 ETL 任务的 Spark 集群。
+
+语法:
+
+```sql
+-- 添加 ETL 集群
+ALTER SYSTEM ADD LOAD CLUSTER cluster_name
+PROPERTIES("key1" = "value1", ...)
+
+-- 删除 ETL 集群
+ALTER SYSTEM DROP LOAD CLUSTER cluster_name
+
+-- 查看 ETL 集群
+SHOW LOAD CLUSTERS
+SHOW PROC "/load_etl_clusters"
+```
+
+`cluster_name` 为 Doris 中配置的 Spark 集群的名字。
+
+PROPERTIES 是 ETL 集群相关参数,如下:
+
+- `type`:集群类型,必填,目前仅支持 spark。
+
+- Spark ETL 集群相关参数如下:
+  - `master`:必填,目前支持yarn,spark://host:port。
+  - `deploy_mode`: 可选,默认为 cluster。支持 cluster,client 两种。
+  - `hdfs_etl_path`:ETL 使用的 HDFS 目录。必填。例如:hdfs://host:port/tmp/doris。
+  - `broker`:broker 名字。必填。需要使用`ALTER SYSTEM ADD BROKER` 命令提前完成配置。
+  - `yarn_configs`: HDFS YARN 参数,master 为 yarn 时必填。需要指定 yarn.resourcemanager.address 和 fs.defaultFS。不同 configs 之间使用`;`拼接。
+  - `spark_args`: Spark 任务提交时指定的参数,可选。具体可参考 spark-submit 命令,每个 arg  必须以`--`开头,不同 args 之间使用`;`拼接。例如--files=/file1,/file2;--jars=/a.jar,/b.jar。
+  - `spark_configs`: Spark 参数,可选。具体参数可参考http://spark.apache.org/docs/latest/configuration.html。不同 configs 之间使用`;`拼接。
+
+示例:
+
+```sql
+-- yarn cluster 模式 
+ALTER SYSTEM ADD LOAD CLUSTER "cluster0"
+PROPERTIES
+(
+"type" = "spark", 
+"master" = "yarn",

Review comment:
       what is the master mean? It is not very clear.

##########
File path: docs/zh-CN/administrator-guide/load-data/spark-load-manual.md
##########
@@ -0,0 +1,351 @@
+---                                                                                 
+{
+    "title": "Spark Load",
+    "language": "zh-CN"
+}
+---  
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Spark Load
+
+Spark load 通过 Spark 实现对导入数据的预处理,提高 Doris 大数据量的导入性能并且节省 Doris 集群的计算资源。主要用于初次迁移,大数据量导入 Doris 的场景。
+
+Spark load 是一种异步导入方式,用户需要通过 MySQL 协议创建 Spark 类型导入任务,并通过 `SHOW LOAD` 查看导入结果。
+
+
+
+## 适用场景
+
+* 源数据在 Spark 可以访问的存储系统中,如 HDFS。
+* 数据量在 几十 GB 到 TB 级别。
+
+
+
+## 名词解释
+
+1. Frontend(FE):Doris 系统的元数据和调度节点。在导入流程中主要负责导入任务的调度工作。
+2. Backend(BE):Doris 系统的计算和存储节点。在导入流程中主要负责数据写入及存储。
+3. Spark ETL:在导入流程中主要负责数据的 ETL 工作,包括全局字典构建(BITMAP类型)、分区、排序、聚合等。
+4. Broker:Broker 为一个独立的无状态进程。封装了文件系统接口,提供 Doris 读取远端存储系统中文件的能力。
+
+
+## 基本原理
+
+### 基本流程
+
+用户通过 MySQL 客户端提交 Spark 类型导入任务,FE记录元数据并返回用户提交成功。
+
+Spark load 任务的执行主要分为以下5个阶段。
+
+1. FE 调度提交 ETL 任务到 Spark 集群执行。
+2. Spark 集群执行 ETL 完成对导入数据的预处理。包括全局字典构建(BITMAP类型)、分区、排序、聚合等。
+3. ETL 任务完成后,FE 获取预处理过的每个分片的数据路径,并调度相关的 BE 执行 Push 任务。
+4. BE 通过 Broker 读取数据,转化为 Doris 底层存储格式。
+5. FE 调度生效版本,完成导入任务。
+
+```
+                 +
+                 | 0. User create spark load job
+            +----v----+
+            |   FE    |---------------------------------+
+            +----+----+                                 |
+                 | 3. FE send push tasks                |
+                 | 5. FE publish version                |
+    +------------+------------+                         |
+    |            |            |                         |
++---v---+    +---v---+    +---v---+                     |
+|  BE   |    |  BE   |    |  BE   |                     |1. FE submit Spark ETL job
++---^---+    +---^---+    +---^---+                     |
+    |4. BE push with broker   |                         |
++---+---+    +---+---+    +---+---+                     |
+|Broker |    |Broker |    |Broker |                     |
++---^---+    +---^---+    +---^---+                     |
+    |            |            |                         |
++---+------------+------------+---+ 2.ETL +-------------v---------------+
+|               HDFS              +------->       Spark cluster         |
+|                                 <-------+                             |
++---------------------------------+       +-----------------------------+
+
+```
+
+
+
+### 全局字典
+
+待补
+
+
+
+### 数据预处理(DPP)
+
+待补
+
+
+
+## 基本操作
+
+### 配置 ETL 集群
+
+提交 Spark 导入任务之前,需要配置执行 ETL 任务的 Spark 集群。
+
+语法:
+
+```sql
+-- 添加 ETL 集群
+ALTER SYSTEM ADD LOAD CLUSTER cluster_name
+PROPERTIES("key1" = "value1", ...)
+
+-- 删除 ETL 集群
+ALTER SYSTEM DROP LOAD CLUSTER cluster_name
+
+-- 查看 ETL 集群
+SHOW LOAD CLUSTERS
+SHOW PROC "/load_etl_clusters"
+```
+
+`cluster_name` 为 Doris 中配置的 Spark 集群的名字。
+
+PROPERTIES 是 ETL 集群相关参数,如下:
+
+- `type`:集群类型,必填,目前仅支持 spark。
+
+- Spark ETL 集群相关参数如下:
+  - `master`:必填,目前支持yarn,spark://host:port。
+  - `deploy_mode`: 可选,默认为 cluster。支持 cluster,client 两种。
+  - `hdfs_etl_path`:ETL 使用的 HDFS 目录。必填。例如:hdfs://host:port/tmp/doris。
+  - `broker`:broker 名字。必填。需要使用`ALTER SYSTEM ADD BROKER` 命令提前完成配置。
+  - `yarn_configs`: HDFS YARN 参数,master 为 yarn 时必填。需要指定 yarn.resourcemanager.address 和 fs.defaultFS。不同 configs 之间使用`;`拼接。
+  - `spark_args`: Spark 任务提交时指定的参数,可选。具体可参考 spark-submit 命令,每个 arg  必须以`--`开头,不同 args 之间使用`;`拼接。例如--files=/file1,/file2;--jars=/a.jar,/b.jar。
+  - `spark_configs`: Spark 参数,可选。具体参数可参考http://spark.apache.org/docs/latest/configuration.html。不同 configs 之间使用`;`拼接。
+
+示例:
+
+```sql
+-- yarn cluster 模式 
+ALTER SYSTEM ADD LOAD CLUSTER "cluster0"
+PROPERTIES
+(
+"type" = "spark", 
+"master" = "yarn",
+"hdfs_etl_path" = "hdfs://1.1.1.1:801/tmp/doris",
+"broker" = "broker0",
+"yarn_configs" = "yarn.resourcemanager.address=1.1.1.1:800;fs.defaultFS=hdfs://1.1.1.1:801",
+"spark_args" = "--files=/file1,/file2;--jars=/a.jar,/b.jar",
+"spark_configs" = "spark.driver.memory=1g;spark.executor.memory=1g"
+);
+
+-- spark standalone client 模式
+ALTER SYSTEM ADD LOAD CLUSTER "cluster1"
+PROPERTIES
+(
+ "type" = "spark", 
+ "master" = "spark://1.1.1.1:802",
+ "deploy_mode" = "client",
+ "hdfs_etl_path" = "hdfs://1.1.1.1:801/tmp/doris",
+ "broker" = "broker1"
+);
+```
+
+
+
+### 创建导入
+
+语法:
+
+```sql
+LOAD LABEL load_label 
+    (data_desc, ...)
+    WITH CLUSTER cluster_name cluster_properties
+    [PROPERTIES (key1=value1, ... )]
+
+* load_label:
+	db_name.label_name
+
+* data_desc:
+    DATA INFILE ('file_path', ...)
+    [NEGATIVE]
+    INTO TABLE tbl_name
+    [PARTITION (p1, p2)]
+    [COLUMNS TERMINATED BY separator ]
+    [(col1, ...)]
+    [SET (k1=f1(xx), k2=f2(xx))]
+    [WHERE predicate]
+
+* cluster_properties: 
+    (key2=value2, ...)
+```
+示例:
+
+```sql
+LOAD LABEL db1.label1
+(
+    DATA INFILE("hdfs://abc.com:8888/user/palo/test/ml/file1")
+    INTO TABLE tbl1
+    COLUMNS TERMINATED BY ","
+    (tmp_c1,tmp_c2)
+    SET
+    (
+        id=tmp_c2,
+        name=tmp_c1
+    ),
+    DATA INFILE("hdfs://abc.com:8888/user/palo/test/ml/file2")
+    INTO TABLE tbl2
+    COLUMNS TERMINATED BY ","
+    (col1, col2)
+    where col1 > 1
+)
+WITH CLUSTER 'cluster0'
+(
+    "broker.username"="user",
+    "broker.password"="pass"
+)
+PROPERTIES
+(
+    "timeout" = "3600"
+);
+
+```
+
+创建导入的详细语法执行 ```HELP SPARK LOAD``` 查看语法帮助。这里主要介绍 Spark load 的创建导入语法中参数意义和注意事项。
+
+#### Label
+
+导入任务的标识。每个导入任务,都有一个在单 database 内部唯一的 Label。具体规则与 `Broker Load` 一致。
+
+#### 数据描述类参数
+
+目前支持的数据源有CSV和hive table。其他规则与 `Broker Load` 一致。
+
+#### 导入作业参数
+
+导入作业参数主要指的是 Spark load 创建导入语句中的属于 ```opt_properties```部分的参数。导入作业参数是作用于整个导入作业的。规则与 `Broker Load` 一致。
+
+#### Cluster 参数
+
+ETL cluster需要提前配置到 Doris系统中才能使用 Spark load。
+
+当用户有临时性的需求,比如增加任务使用的资源而修改 Spark configs,可以在这里设置,设置仅对本次任务生效,并不影响 Doris 集群中已有的配置。
+
+另外如果需要指定额外的 Broker 参数,则需要指定"broker.key" = "value"。具体参数请参阅 [Broker文档](../broker.md)。例如需要指定用户名密码,如下:
+
+```sql
+WITH CLUSTER 'cluster0'
+(
+    "spark_configs" = "spark.driver.memory=1g;spark.executor.memory=1g",

Review comment:
       spark.configs

##########
File path: docs/zh-CN/administrator-guide/load-data/spark-load-manual.md
##########
@@ -0,0 +1,351 @@
+---                                                                                 
+{
+    "title": "Spark Load",
+    "language": "zh-CN"
+}
+---  
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Spark Load
+
+Spark load 通过 Spark 实现对导入数据的预处理,提高 Doris 大数据量的导入性能并且节省 Doris 集群的计算资源。主要用于初次迁移,大数据量导入 Doris 的场景。
+
+Spark load 是一种异步导入方式,用户需要通过 MySQL 协议创建 Spark 类型导入任务,并通过 `SHOW LOAD` 查看导入结果。
+
+
+
+## 适用场景
+
+* 源数据在 Spark 可以访问的存储系统中,如 HDFS。
+* 数据量在 几十 GB 到 TB 级别。
+
+
+
+## 名词解释
+
+1. Frontend(FE):Doris 系统的元数据和调度节点。在导入流程中主要负责导入任务的调度工作。
+2. Backend(BE):Doris 系统的计算和存储节点。在导入流程中主要负责数据写入及存储。
+3. Spark ETL:在导入流程中主要负责数据的 ETL 工作,包括全局字典构建(BITMAP类型)、分区、排序、聚合等。
+4. Broker:Broker 为一个独立的无状态进程。封装了文件系统接口,提供 Doris 读取远端存储系统中文件的能力。
+
+
+## 基本原理
+
+### 基本流程
+
+用户通过 MySQL 客户端提交 Spark 类型导入任务,FE记录元数据并返回用户提交成功。
+
+Spark load 任务的执行主要分为以下5个阶段。
+
+1. FE 调度提交 ETL 任务到 Spark 集群执行。
+2. Spark 集群执行 ETL 完成对导入数据的预处理。包括全局字典构建(BITMAP类型)、分区、排序、聚合等。
+3. ETL 任务完成后,FE 获取预处理过的每个分片的数据路径,并调度相关的 BE 执行 Push 任务。
+4. BE 通过 Broker 读取数据,转化为 Doris 底层存储格式。
+5. FE 调度生效版本,完成导入任务。
+
+```
+                 +
+                 | 0. User create spark load job
+            +----v----+
+            |   FE    |---------------------------------+
+            +----+----+                                 |
+                 | 3. FE send push tasks                |
+                 | 5. FE publish version                |
+    +------------+------------+                         |
+    |            |            |                         |
++---v---+    +---v---+    +---v---+                     |
+|  BE   |    |  BE   |    |  BE   |                     |1. FE submit Spark ETL job
++---^---+    +---^---+    +---^---+                     |
+    |4. BE push with broker   |                         |
++---+---+    +---+---+    +---+---+                     |
+|Broker |    |Broker |    |Broker |                     |
++---^---+    +---^---+    +---^---+                     |
+    |            |            |                         |
++---+------------+------------+---+ 2.ETL +-------------v---------------+
+|               HDFS              +------->       Spark cluster         |
+|                                 <-------+                             |
++---------------------------------+       +-----------------------------+
+
+```
+
+
+
+### 全局字典
+
+待补
+
+
+
+### 数据预处理(DPP)
+
+待补
+
+
+
+## 基本操作
+
+### 配置 ETL 集群
+
+提交 Spark 导入任务之前,需要配置执行 ETL 任务的 Spark 集群。
+
+语法:
+
+```sql
+-- 添加 ETL 集群
+ALTER SYSTEM ADD LOAD CLUSTER cluster_name
+PROPERTIES("key1" = "value1", ...)
+
+-- 删除 ETL 集群
+ALTER SYSTEM DROP LOAD CLUSTER cluster_name
+
+-- 查看 ETL 集群
+SHOW LOAD CLUSTERS
+SHOW PROC "/load_etl_clusters"
+```
+
+`cluster_name` 为 Doris 中配置的 Spark 集群的名字。
+
+PROPERTIES 是 ETL 集群相关参数,如下:
+
+- `type`:集群类型,必填,目前仅支持 spark。
+
+- Spark ETL 集群相关参数如下:
+  - `master`:必填,目前支持yarn,spark://host:port。
+  - `deploy_mode`: 可选,默认为 cluster。支持 cluster,client 两种。
+  - `hdfs_etl_path`:ETL 使用的 HDFS 目录。必填。例如:hdfs://host:port/tmp/doris。
+  - `broker`:broker 名字。必填。需要使用`ALTER SYSTEM ADD BROKER` 命令提前完成配置。
+  - `yarn_configs`: HDFS YARN 参数,master 为 yarn 时必填。需要指定 yarn.resourcemanager.address 和 fs.defaultFS。不同 configs 之间使用`;`拼接。
+  - `spark_args`: Spark 任务提交时指定的参数,可选。具体可参考 spark-submit 命令,每个 arg  必须以`--`开头,不同 args 之间使用`;`拼接。例如--files=/file1,/file2;--jars=/a.jar,/b.jar。
+  - `spark_configs`: Spark 参数,可选。具体参数可参考http://spark.apache.org/docs/latest/configuration.html。不同 configs 之间使用`;`拼接。
+
+示例:
+
+```sql
+-- yarn cluster 模式 
+ALTER SYSTEM ADD LOAD CLUSTER "cluster0"
+PROPERTIES
+(
+"type" = "spark", 
+"master" = "yarn",
+"hdfs_etl_path" = "hdfs://1.1.1.1:801/tmp/doris",
+"broker" = "broker0",

Review comment:
       Is this broker is necessary to define a cluster? I think user can specify it in `Load` stmt. Keep consistent with current syntax "with broker"

##########
File path: docs/zh-CN/administrator-guide/load-data/spark-load-manual.md
##########
@@ -0,0 +1,351 @@
+---                                                                                 
+{
+    "title": "Spark Load",
+    "language": "zh-CN"
+}
+---  
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Spark Load
+
+Spark load 通过 Spark 实现对导入数据的预处理,提高 Doris 大数据量的导入性能并且节省 Doris 集群的计算资源。主要用于初次迁移,大数据量导入 Doris 的场景。
+
+Spark load 是一种异步导入方式,用户需要通过 MySQL 协议创建 Spark 类型导入任务,并通过 `SHOW LOAD` 查看导入结果。
+
+
+
+## 适用场景
+
+* 源数据在 Spark 可以访问的存储系统中,如 HDFS。
+* 数据量在 几十 GB 到 TB 级别。
+
+
+
+## 名词解释
+
+1. Frontend(FE):Doris 系统的元数据和调度节点。在导入流程中主要负责导入任务的调度工作。
+2. Backend(BE):Doris 系统的计算和存储节点。在导入流程中主要负责数据写入及存储。
+3. Spark ETL:在导入流程中主要负责数据的 ETL 工作,包括全局字典构建(BITMAP类型)、分区、排序、聚合等。
+4. Broker:Broker 为一个独立的无状态进程。封装了文件系统接口,提供 Doris 读取远端存储系统中文件的能力。
+
+
+## 基本原理
+
+### 基本流程
+
+用户通过 MySQL 客户端提交 Spark 类型导入任务,FE记录元数据并返回用户提交成功。
+
+Spark load 任务的执行主要分为以下5个阶段。
+
+1. FE 调度提交 ETL 任务到 Spark 集群执行。
+2. Spark 集群执行 ETL 完成对导入数据的预处理。包括全局字典构建(BITMAP类型)、分区、排序、聚合等。
+3. ETL 任务完成后,FE 获取预处理过的每个分片的数据路径,并调度相关的 BE 执行 Push 任务。
+4. BE 通过 Broker 读取数据,转化为 Doris 底层存储格式。
+5. FE 调度生效版本,完成导入任务。
+
+```
+                 +
+                 | 0. User create spark load job
+            +----v----+
+            |   FE    |---------------------------------+
+            +----+----+                                 |
+                 | 3. FE send push tasks                |
+                 | 5. FE publish version                |
+    +------------+------------+                         |
+    |            |            |                         |
++---v---+    +---v---+    +---v---+                     |
+|  BE   |    |  BE   |    |  BE   |                     |1. FE submit Spark ETL job
++---^---+    +---^---+    +---^---+                     |
+    |4. BE push with broker   |                         |
++---+---+    +---+---+    +---+---+                     |
+|Broker |    |Broker |    |Broker |                     |
++---^---+    +---^---+    +---^---+                     |
+    |            |            |                         |
++---+------------+------------+---+ 2.ETL +-------------v---------------+
+|               HDFS              +------->       Spark cluster         |
+|                                 <-------+                             |
++---------------------------------+       +-----------------------------+
+
+```
+
+
+
+### 全局字典
+
+待补
+
+
+
+### 数据预处理(DPP)
+
+待补
+
+
+
+## 基本操作
+
+### 配置 ETL 集群
+
+提交 Spark 导入任务之前,需要配置执行 ETL 任务的 Spark 集群。
+
+语法:
+
+```sql
+-- 添加 ETL 集群
+ALTER SYSTEM ADD LOAD CLUSTER cluster_name

Review comment:
       Is cluster is a good name?
   I think Doris will support multi cluster in the future. And some cluster will be used as load cluster. Then it will be a conflict between two clusters.
   So we better to choose another name.
   

##########
File path: docs/zh-CN/administrator-guide/load-data/spark-load-manual.md
##########
@@ -0,0 +1,351 @@
+---                                                                                 
+{
+    "title": "Spark Load",
+    "language": "zh-CN"
+}
+---  
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Spark Load
+
+Spark load 通过 Spark 实现对导入数据的预处理,提高 Doris 大数据量的导入性能并且节省 Doris 集群的计算资源。主要用于初次迁移,大数据量导入 Doris 的场景。
+
+Spark load 是一种异步导入方式,用户需要通过 MySQL 协议创建 Spark 类型导入任务,并通过 `SHOW LOAD` 查看导入结果。
+
+
+
+## 适用场景
+
+* 源数据在 Spark 可以访问的存储系统中,如 HDFS。
+* 数据量在 几十 GB 到 TB 级别。
+
+
+
+## 名词解释
+
+1. Frontend(FE):Doris 系统的元数据和调度节点。在导入流程中主要负责导入任务的调度工作。
+2. Backend(BE):Doris 系统的计算和存储节点。在导入流程中主要负责数据写入及存储。
+3. Spark ETL:在导入流程中主要负责数据的 ETL 工作,包括全局字典构建(BITMAP类型)、分区、排序、聚合等。
+4. Broker:Broker 为一个独立的无状态进程。封装了文件系统接口,提供 Doris 读取远端存储系统中文件的能力。
+
+
+## 基本原理
+
+### 基本流程
+
+用户通过 MySQL 客户端提交 Spark 类型导入任务,FE记录元数据并返回用户提交成功。
+
+Spark load 任务的执行主要分为以下5个阶段。
+
+1. FE 调度提交 ETL 任务到 Spark 集群执行。
+2. Spark 集群执行 ETL 完成对导入数据的预处理。包括全局字典构建(BITMAP类型)、分区、排序、聚合等。
+3. ETL 任务完成后,FE 获取预处理过的每个分片的数据路径,并调度相关的 BE 执行 Push 任务。
+4. BE 通过 Broker 读取数据,转化为 Doris 底层存储格式。
+5. FE 调度生效版本,完成导入任务。
+
+```
+                 +
+                 | 0. User create spark load job
+            +----v----+
+            |   FE    |---------------------------------+
+            +----+----+                                 |
+                 | 3. FE send push tasks                |
+                 | 5. FE publish version                |
+    +------------+------------+                         |
+    |            |            |                         |
++---v---+    +---v---+    +---v---+                     |
+|  BE   |    |  BE   |    |  BE   |                     |1. FE submit Spark ETL job
++---^---+    +---^---+    +---^---+                     |
+    |4. BE push with broker   |                         |
++---+---+    +---+---+    +---+---+                     |
+|Broker |    |Broker |    |Broker |                     |
++---^---+    +---^---+    +---^---+                     |
+    |            |            |                         |
++---+------------+------------+---+ 2.ETL +-------------v---------------+
+|               HDFS              +------->       Spark cluster         |
+|                                 <-------+                             |
++---------------------------------+       +-----------------------------+
+
+```
+
+
+
+### 全局字典
+
+待补
+
+
+
+### 数据预处理(DPP)
+
+待补
+
+
+
+## 基本操作
+
+### 配置 ETL 集群
+
+提交 Spark 导入任务之前,需要配置执行 ETL 任务的 Spark 集群。
+
+语法:
+
+```sql
+-- 添加 ETL 集群
+ALTER SYSTEM ADD LOAD CLUSTER cluster_name
+PROPERTIES("key1" = "value1", ...)
+
+-- 删除 ETL 集群
+ALTER SYSTEM DROP LOAD CLUSTER cluster_name
+
+-- 查看 ETL 集群
+SHOW LOAD CLUSTERS
+SHOW PROC "/load_etl_clusters"
+```
+
+`cluster_name` 为 Doris 中配置的 Spark 集群的名字。
+
+PROPERTIES 是 ETL 集群相关参数,如下:
+
+- `type`:集群类型,必填,目前仅支持 spark。
+
+- Spark ETL 集群相关参数如下:
+  - `master`:必填,目前支持yarn,spark://host:port。

Review comment:
       Should explain this option more clearly. And I don't know what this master stands for.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] kangkaisen commented on pull request #3418: [Spark load] Add spark etl cluster and cluster manager

Posted by GitBox <gi...@apache.org>.
kangkaisen commented on pull request #3418:
URL: https://github.com/apache/incubator-doris/pull/3418#issuecomment-621594189


   @wyb Hi, why comment the update load cluster code?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [incubator-doris] morningman commented on a change in pull request #3418: [Spark load] Add spark etl cluster and cluster manager

Posted by GitBox <gi...@apache.org>.
morningman commented on a change in pull request #3418:
URL: https://github.com/apache/incubator-doris/pull/3418#discussion_r429094156



##########
File path: fe/src/main/java/org/apache/doris/catalog/ResourceMgr.java
##########
@@ -0,0 +1,189 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+package org.apache.doris.catalog;
+
+import org.apache.doris.analysis.CreateResourceStmt;
+import org.apache.doris.analysis.DropResourceStmt;
+import org.apache.doris.catalog.Resource.ResourceType;
+import org.apache.doris.common.DdlException;
+import org.apache.doris.common.io.Writable;
+import org.apache.doris.common.proc.BaseProcResult;
+import org.apache.doris.common.proc.ProcNodeInterface;
+import org.apache.doris.common.proc.ProcResult;
+import org.apache.doris.mysql.privilege.PrivPredicate;
+import org.apache.doris.qe.ConnectContext;
+
+import com.google.common.collect.ImmutableList;
+import com.google.common.collect.Maps;
+import org.apache.logging.log4j.LogManager;
+import org.apache.logging.log4j.Logger;
+
+import java.io.DataInput;
+import java.io.DataOutput;
+import java.io.IOException;
+import java.util.List;
+import java.util.Map;
+import java.util.concurrent.locks.ReentrantLock;
+
+/**
+ * Resource manager is responsible for managing external resources used by Doris.
+ * For example, Spark/MapReduce used for ETL, Spark/GPU used for queries, HDFS/S3 used for external storage.
+ * Now only support Spark.
+ */
+public class ResourceMgr implements Writable {
+    private static final Logger LOG = LogManager.getLogger(ResourceMgr.class);
+
+    public static final ImmutableList<String> RESOURCE_PROC_NODE_TITLE_NAMES = new ImmutableList.Builder<String>()
+            .add("Name").add("ResourceType").add("Key").add("Value")
+            .build();
+
+    // { resourceName -> Resource}
+    private final Map<String, Resource> nameToResource = Maps.newHashMap();
+    private final ReentrantLock lock = new ReentrantLock();
+    private final ResourceProcNode procNode = new ResourceProcNode();
+
+    public ResourceMgr() {
+    }
+
+    public void createResource(CreateResourceStmt stmt) throws DdlException {
+        lock.lock();
+        try {
+            if (stmt.getResourceType() != ResourceType.SPARK) {
+                throw new DdlException("Only support Spark resource.");
+            }
+
+            String resourceName = stmt.getResourceName();
+            if (nameToResource.containsKey(resourceName)) {
+                throw new DdlException("Resource(" + resourceName + ") already exist");
+            }
+
+            Resource resource = Resource.fromStmt(stmt);
+            nameToResource.put(resourceName, resource);
+            // log add
+            Catalog.getInstance().getEditLog().logCreateResource(resource);
+            LOG.info("create resource success. resource: {}", resource);
+        } finally {
+            lock.unlock();
+        }
+    }
+
+    public void replayCreateResource(Resource resource) {
+        lock.lock();
+        try {
+            nameToResource.put(resource.getName(), resource);
+        } finally {
+            lock.unlock();
+        }
+    }
+
+    public void dropResource(DropResourceStmt stmt) throws DdlException {
+        lock.lock();
+        try {
+            String name = stmt.getResourceName();
+            if (!nameToResource.containsKey(name)) {
+                throw new DdlException("Resource(" + name + ") does not exist");
+            }
+
+            nameToResource.remove(name);
+            // log drop
+            Catalog.getInstance().getEditLog().logDropResource(name);
+            LOG.info("drop resource success. resource name: {}", name);
+        } finally {
+            lock.unlock();
+        }
+    }
+
+    public void replayDropResource(String name) {
+        lock.lock();
+        try {
+            nameToResource.remove(name);
+        } finally {
+            lock.unlock();
+        }
+    }
+
+    public boolean containsResource(String name) {
+        lock.lock();
+        try {
+            return nameToResource.containsKey(name);
+        } finally {
+            lock.unlock();
+        }
+    }
+
+    public Resource getResource(String name) {
+        lock.lock();
+        try {
+            return nameToResource.get(name);
+        } finally {
+            lock.unlock();
+        }
+    }
+
+    public int getResourceNum() {
+        return nameToResource.size();
+    }
+
+    public List<List<String>> getResourcesInfo() {
+        return procNode.fetchResult().getRows();
+    }
+
+    public ResourceProcNode getProcNode() {
+        return procNode;
+    }
+
+    @Override
+    public void write(DataOutput out) throws IOException {
+        out.writeInt(nameToResource.size());

Review comment:
       use Gson instead.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org