Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2020/02/21 01:59:58 UTC

[GitHub] [incubator-doris] kangpinghuang commented on a change in pull request #2939: spark etl design

kangpinghuang commented on a change in pull request #2939: spark etl design
URL: https://github.com/apache/incubator-doris/pull/2939#discussion_r382359959
 
 

 ##########
 File path: docs/documentation/cn/internal/spark_etl.md
 ##########
 @@ -0,0 +1,323 @@
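+# Doris Spark Load ETL Logic Design
+
+## Background
+
+To handle initial migrations, in which a large volume of data has to be moved into Doris, Doris introduces Spark load to speed up data import. In a Spark load, Spark carries out the ETL computation, partitioning, bucketing, and file format generation. The sections below describe the implementation design of each.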
+
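+## Glossary
+
+* FE: Frontend, the frontend node of Palo. It is mainly responsible for receiving client requests and returning responses, metadata and cluster management, and query plan generation.
+* BE: Backend, the backend node of Palo. It is mainly responsible for data storage and management and for query plan execution.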
+
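+## Design
+
+### Goals
+
+A Spark load must meet the following goals:
+
+1. Support multiple Spark deployment modes. The design must stay compatible with multiple deployment modes; the YARN cluster mode can be implemented first.
+2. Support data files in multiple formats, including CSV, Parquet, and ORC.
+3. Support all Doris data types, including the hll and bitmap types. The bitmap type must also support a global dictionary, so that exact deduplication of string values can be implemented.
+4. Support sorting and pre-aggregation.
+5. Support the partitioning and bucketing logic.
+6. Support generating data for both base tables and rollup tables.
+7. Support generating the Doris storage format.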
+
+### Implementation
+
+With reference to [pr-2865](https://github.com/apache/incubator-doris/pull/2856), the overall design will be implemented according to the following framework:
 
 Review comment:
   done
