Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2020/03/01 03:00:07 UTC

[GitHub] [incubator-doris] morningman commented on a change in pull request #3013: add broker load internal doc

morningman commented on a change in pull request #3013: add broker load internal doc
URL: https://github.com/apache/incubator-doris/pull/3013#discussion_r386071914
 
 

 ##########
 File path: docs/documentation/cn/internal/broker_load.md
 ##########
 @@ -0,0 +1,1014 @@
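+# Doris Broker Load Implementation Analysis
+
+## Background
+
+Doris supports several load methods. Broker load is one of the most commonly used; it imports files from distributed storage systems (HDFS, BOS, etc.) into Doris. Broker load is suitable when:
+
+- The source data is in a distributed storage system that the Broker can access, such as HDFS.
+
+- The data volume is on the order of 20 GB.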
+
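+## Glossary
+
+* FE: Frontend, the frontend node of Palo. It is mainly responsible for receiving client requests and returning results, metadata and cluster management, query plan generation, and so on. For the Doris architecture diagram, see [Doris architecture introduction](http://doris.incubator.apache.org/)
+* BE: Backend, the backend node of Palo. It is mainly responsible for data storage and management, query plan execution, and so on.
+* Broker: see the [Broker documentation](http://doris.incubator.apache.org/documentation/cn/administrator-guide/broker.html)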
+
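+## Implementation Principles
+
+In a Broker load, the user only needs to provide the data for the base table, and Doris performs the following processing on the user's behalf:
+
+- Doris automatically generates the rollup table data from the base table data and loads it into the corresponding rollups
+- Supports negative load (only for SUM-type value columns in the aggregate model)
+- Extracts fields from the path
+- Evaluates functions, including strftime, now, hll_hash, md5, etc.
+- Guarantees the atomicity of the whole load process
+
+For the syntax and usage of Broker load, see the [Broker load documentation](http://doris.incubator.apache.org/documentation/cn/administrator-guide/load-data/broker-load-manual.html)
+
+### Load Flow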
+
+```
+                 +
+                 | 1. user create broker load
+                 v
+            +----+----+
+            |         |
+            |   FE    |
+            |         |
+            +----+----+
+                 |
+                 | 2. BE etl and load the data
+    +--------------------------+
+    |            |             |
++---v---+     +--v----+    +---v---+
+|       |     |       |    |       |
+|  BE   |     |  BE   |    |   BE  |
+|       |     |       |    |       |
++---^---+     +---^---+    +---^---+
+    |             |            |
+    |             |            | 3. pull data from broker
++---+---+     +---+---+    +---+---+
+|       |     |       |    |       |
+|Broker |     |Broker |    |Broker |
+|       |     |       |    |       |
++---^---+     +---^---+    +---^---+
+    |             |            | 
++----------------------------------+
+|       HDFS/BOS/AFS cluster       |
++----------------------------------+
+```
+
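+The overall load process is roughly as follows:
+
+- The user sends the request to FE, which performs syntax and semantic analysis and then generates a BrokerLoadJob
+- The BrokerLoadJob is scheduled by the LoadJob Scheduler, which generates a BrokerLoadPendingTask
+- The BrokerLoadPendingTask lists the source files to be loaded and builds the file list for each partition
+- A LoadLoadingTask is generated for each partition to perform the load
+- The LoadLoadingTask generates a distributed load execution plan; the BEs execute it by reading the source files, performing the ETL transformation, and writing to the corresponding tablets.
+
+The key steps are as follows:
+
+#### Processing in FE
+1. Syntax and semantic processing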
+
+```
+            User Query
+                 +
+                 | mysql protocol
+                 v
+         +-------+-------+
+         |               |
+         |   QeService   |
+         |               |
+         +-------+-------+
+                 |
+                 v
+         +-------+-------+
+         |               |
+         |  MysqlServer  |
+         |               |
+         +-------+-------+
+                 |
+                 v
+       +---------+---------+
+       |                   |
+       |  ConnectScheduler |
+       |                   |
+       +---------+---------+
+                 |
+                 v
+       +---------+---------+
+       |                   |
+       |  ConnectProcessor |
+       |                   |
+       +---------+---------+
+                 |
+                 v
+         +-------+-------+
+         |               |
+         | StmtExecutor  |
+         |               |
+         +-------+-------+
+```
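+The flow above is the path a query takes after it is sent to Doris, up to syntax and semantic analysis. In Doris, MysqlServer is a server that implements the MySQL protocol and receives users' MySQL query requests. After being scheduled by ConnectScheduler, a request is handled by ConnectProcessor, and StmtExecutor finally performs the syntax and semantic analysis.
+
+2. Load job execution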
+
+```
+         +-------+-------+
+         |    PENDING    |-----------------|
+         +-------+-------+                 |
+                 | BrokerLoadPendingTask   |
+                 v                         |
+         +-------+-------+                 |
+         |    LOADING    |-----------------|
+         +-------+-------+                 |
+                 | LoadLoadingTask         |
+                 v                         |
+         +-------+-------+                 |
+         |  COMMITTED    |-----------------|
+         +-------+-------+                 |
+                 |                         |
+                 v                         v
+         +-------+-------+         +-------+-------+
+         |   FINISHED    |         |   CANCELLED   |
+         +-------+-------+         +-------+-------+
+                 |                         ^
+                 |-------------------------|
+```
+
+A Broker load request issued by the user is eventually turned into a LoadStmt after syntax and semantic (语意) analysis in StmtExecutor; DdlExecutor then generates a BrokerLoadJob from the LoadStmt.
 
 Review comment:
   ```suggestion
   A Broker load request issued by the user is eventually turned into a LoadStmt after syntax and semantic (语义) analysis in StmtExecutor; DdlExecutor then generates a BrokerLoadJob from the LoadStmt.
   ```
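
For readers following this thread, the job state diagram in the hunk above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the state names come from the diagram, but `JobState`, `TRANSITIONS`, and `advance` are hypothetical names, not Doris classes, and treating FINISHED and CANCELLED as terminal states is my reading of the diagram rather than something the doc states.

```python
from enum import Enum


class JobState(Enum):
    """Load job states, named after the boxes in the diagram."""
    PENDING = "PENDING"
    LOADING = "LOADING"
    COMMITTED = "COMMITTED"
    FINISHED = "FINISHED"
    CANCELLED = "CANCELLED"


# Assumption: each non-terminal state either moves to the next state
# or is cancelled; FINISHED and CANCELLED are treated as terminal.
TRANSITIONS = {
    JobState.PENDING: {JobState.LOADING, JobState.CANCELLED},
    JobState.LOADING: {JobState.COMMITTED, JobState.CANCELLED},
    JobState.COMMITTED: {JobState.FINISHED, JobState.CANCELLED},
    JobState.FINISHED: set(),
    JobState.CANCELLED: set(),
}


def advance(current: JobState, target: JobState) -> JobState:
    """Return the target state if the transition is allowed, else raise."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target


if __name__ == "__main__":
    state = JobState.PENDING
    for nxt in (JobState.LOADING, JobState.COMMITTED, JobState.FINISHED):
        state = advance(state, nxt)
        print(state.name)
```

Keeping the allowed transitions in one table makes it easy to compare the sketch against the diagram edge by edge.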
