Posted to commits@dolphinscheduler.apache.org by GitBox <gi...@apache.org> on 2021/08/16 05:34:24 UTC
[GitHub] [dolphinscheduler] github-actions[bot] commented on issue #5992: [Feature][DataX] Enhance using experience of DataX in DS by integeration with TIS

github-actions[bot] commented on issue #5992:
URL: https://github.com/apache/dolphinscheduler/issues/5992#issuecomment-899226908


   **Describe the feature**
   As everyone knows, DataX has become one of the most popular tools for big data extraction. dolphinscheduler already supports some of the DataX RDBMS plug-ins ('MYSQL','POSTGRESQL','ORACLE','SQLSERVER','CLICKHOUSE'), but there are still shortcomings: for example, the DataX plug-ins are not fully covered, and only the stand-alone mode is supported.
   
   Making data extraction easy to use in dolphinscheduler is one of the directions I have been working on recently. The strength of dolphinscheduler lies in scheduling DAG task nodes, not in data extraction itself.
   
   So my solution is to separate the data extraction component from dolphinscheduler and let dolphinscheduler focus on DAG task scheduling. I have developed TIS, a data extraction middle-platform product based on DataX. Its advantages are that it covers most of the DataX plug-ins, borrows the microkernel architecture design of Jenkins, supports distributed task execution, and so on. Compositions like this are common in the real world. For example, in the Transformers movies, several small Autobots combine into one more powerful robot.
   
   In this feature request, I describe how TIS can be integrated with dolphinscheduler.
   
   **Describe the solution you'd like**
   
   1. First, create a DataX instance in TIS. This gives you a globally unique task name.
   
   2. Add a TIS task node to the dolphinscheduler workflow, and enter the TIS task name in the node's input form.
   
   3. Finish building the dolphinscheduler workflow and trigger execution. A new Task is built in DS based on the doc [https://dolphinscheduler.apache.org/zh-cn/development/plugin-development.html](https://dolphinscheduler.apache.org/zh-cn/development/plugin-development.html). In this Task, the following will be executed:
         1. Trigger the TIS DataX job by sending an HTTP request to TIS.
         2. Launch a WebSocket client to listen to the execution log generated by TIS.
         3. Poll TIS for the status of the job triggered in step 1. If the status becomes `success` or `failed`, terminate the polling process immediately.
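
The trigger-then-poll part of the Task could be sketched roughly as below. This is only an illustrative Python sketch, not the DS task plugin itself and not the TIS API: `run_tis_job`, the `trigger` and `fetch_status` callables, and the terminal status strings are all hypothetical stand-ins. A real Task would implement `trigger` and `fetch_status` as HTTP calls to TIS, and would run the WebSocket log listener alongside the loop, which is omitted here.

```python
import time
from typing import Callable, Optional

# Assumed terminal status strings; the values actually returned by TIS may differ.
TERMINAL_STATES = {"success", "failed"}


def run_tis_job(
    trigger: Callable[[str], str],
    fetch_status: Callable[[str], str],
    task_name: str,
    poll_interval: float = 5.0,
    timeout: Optional[float] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Trigger a job, poll until it reaches a terminal status, return True on success.

    trigger(task_name) -> job id       (hypothetical; e.g. an HTTP request to TIS)
    fetch_status(job_id) -> status     (hypothetical; e.g. an HTTP status query)
    """
    job_id = trigger(task_name)  # step 1: start the DataX job
    deadline = None if timeout is None else time.monotonic() + timeout

    while True:
        status = fetch_status(job_id)  # step 3: ask for the current status
        if status in TERMINAL_STATES:
            # Stop polling immediately once the job has finished.
            return status == "success"
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
        sleep(poll_interval)


if __name__ == "__main__":
    # In-memory stand-ins so the sketch runs without a TIS server.
    fake_statuses = iter(["running", "running", "success"])
    ok = run_tis_job(
        trigger=lambda name: f"{name}-1",
        fetch_status=lambda job_id: next(fake_statuses),
        task_name="example_task",
        poll_interval=0.0,
    )
    print("job succeeded:", ok)
```

Passing the two operations in as callables keeps the loop independent of how TIS is actually reached, and the boolean result is what the Task would report back to dolphinscheduler as its own success or failure.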
   
   <hr>
   
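   **Feature description**
   
   As everyone knows, DataX has become one of the most popular tools for the data extraction stage of big data processing. dolphinscheduler already supports some of the DataX RDBMS plug-ins ('MYSQL','POSTGRESQL', 'ORACLE', 'SQLSERVER', 'CLICKHOUSE'), but there are still some shortcomings: for example, the DataX plug-ins are not fully covered, and only the stand-alone mode is supported.
   
   Making data extraction pleasant to use in dolphinscheduler is one of the directions I have been working on recently. The strength of dolphinscheduler lies in scheduling DAG task nodes, not in data extraction.
   
   So my solution is to split the data extraction component out of dolphinscheduler and let dolphinscheduler focus on DAG task scheduling. I have developed TIS, a data extraction middle-platform product based on DataX. Its advantages are that it covers most of the DataX plug-ins, borrows the microkernel architecture design of Jenkins, supports distributed task execution, and so on. Compositions like this are common in the real world. For example, in the Transformers movies, several small Autobots combine into one powerful robot.
   
   In this feature request, I describe how TIS can be integrated with dolphinscheduler.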
   
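   **Implementation description**
   1. First, create a DataX instance in TIS. This gives you a globally unique task name.
   
   2. Add a TIS task node to the dolphinscheduler workflow, and enter the TIS task name in the node's input form.
   
   3. Finish building the dolphinscheduler workflow and trigger execution. A new Task is built in DS based on the doc [https://dolphinscheduler.apache.org/zh-cn/development/plugin-development.html](https://dolphinscheduler.apache.org/zh-cn/development/plugin-development.html). In this Task, the following will be executed:
          1. Trigger the TIS DataX job by sending an HTTP request to TIS.
          2. Launch a WebSocket client to listen to the execution log generated by TIS.
          3. Poll TIS for the status of the job triggered in step 1. If the status becomes "success" or "failed", terminate the polling process immediately and report the execution result to the dolphinscheduler system.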

