Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2020/02/10 02:34:04 UTC

[GitHub] [incubator-doris] kangpinghuang commented on a change in pull request #2856: add spark load design

URL: https://github.com/apache/incubator-doris/pull/2856#discussion_r376847387
 
 

 ##########
 File path: docs/documentation/cn/internal/spark_load.md
 ##########
 @@ -0,0 +1,166 @@
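+# Design Document for Spark Load Support in Doris
+
+## Background
+
+Doris currently supports several load methods, including Broker load, routine load, stream load, and mini batch load.
+Spark load mainly targets initial migrations, where large volumes of data are moved into Doris, and is intended to speed up data loading.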
+
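+## Glossary
+
+* FE: Frontend, the frontend node of Palo. It is mainly responsible for receiving client requests and returning results, managing metadata and the cluster, and generating query plans.
+* BE: Backend, the backend node of Palo. It is mainly responsible for data storage and management and for executing query plans.
+* Tablet: a horizontal shard of a Palo table is called a tablet.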
+
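+## Design
+
+### Goals
+
+Doris's existing load methods do not handle batch loads of hundreds of GB or more well. Functionally, many configuration items have to be changed, and the load may still fail to complete. In terms of performance, loading is slow, and because reads and writes are not separated, it consumes considerable CPU and other resources. Users run into loads of this size when migrating, so a load feature based on a Spark cluster is needed. It uses the cluster's parallelism to perform the ETL computation during loading (sorting, aggregation, and so on), meets users' needs for large-volume loads, and reduces load time and migration cost.
+
+Spark load needs to take multiple Spark deployment modes into account, and the design must be compatible with them; the YARN cluster deployment mode can be implemented first. In addition, because user data comes in many formats, data files in csv, parquet, orc, and other formats need to be supported.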
+
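+### Implementation
+
+Before describing the design and implementation of Spark load, it is worth covering the existing load framework. For details on that framework, see "Doris Broker Load Implementation Analysis" (《Doris Broker导入实现解析》).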
 
 Review comment:
   OK, I will publish it as soon as possible.
