Posted to commits@dolphinscheduler.apache.org by ki...@apache.org on 2021/12/30 07:46:18 UTC

[dolphinscheduler-website] branch master updated: add eavy_info (#613)

This is an automated email from the ASF dual-hosted git repository.

kirs pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/dolphinscheduler-website.git


The following commit(s) were added to refs/heads/master by this push:
     new 7503b4b  add eavy_info (#613)
7503b4b is described below

commit 7503b4b5edc65ed5c31cf8976047ac23c0daf6f5
Author: lifeng <53...@users.noreply.github.com>
AuthorDate: Thu Dec 30 15:46:11 2021 +0800

    add eavy_info (#613)
    
    add eavy_info
---
 blog/en-us/Eavy_Info.md |  95 ++++++++++++++++++++++++++++++++++++++++++++
 blog/zh-cn/Eavy_Info.md | 102 ++++++++++++++++++++++++++++++++++++++++++++++++
 site_config/blog.js     |  14 +++++++
 site_config/home.jsx    |  28 ++++++-------
 4 files changed, 225 insertions(+), 14 deletions(-)

diff --git a/blog/en-us/Eavy_Info.md b/blog/en-us/Eavy_Info.md
new file mode 100644
index 0000000..685b8b4
--- /dev/null
+++ b/blog/en-us/Eavy_Info.md
@@ -0,0 +1,95 @@
+---
+title:  Eavy Info Builds Data Asset Management Platform Services Based on Apache DolphinScheduler to Construct Government Information Ecology
+keywords: Apache,DolphinScheduler,scheduler,big data,ETL,airflow,hadoop,orchestration, dataops,2.0.1
+description: The cloud computing and big data provider Eavy Info has run its business operations on Apache DolphinScheduler for more than a year
+---
+# Eavy Info Builds Data Asset Management Platform Services Based on Apache DolphinScheduler to Construct Government Information Ecology | Use Case
+
+<div align=center>
+<img src="https://imgpp.com/images/2021/12/29/1640759432737.md.png"/>
+</div>
+
+The cloud computing and big data provider Eavy Info has been running its business operations on Apache DolphinScheduler for more than a year.
+
+In support of its government informatization business, Shandong Eavy Info built the data service module of its self-developed Asset Data Management and Control Platform on Apache DolphinScheduler. How do they use Apache DolphinScheduler? Sun Hao, an R&D engineer at Eavy Info, shared their experience from business practice.
+
+## R&D Background
+
+Eavy Info mainly focuses on ToG (to-government) business, in which data collection and sharing account for a large share of the work. However, traditional ETL tools such as Kettle are not easy for front-line implementers to learn, and running them as standalone tools makes on-site project operation and maintenance harder. Therefore, building a platform that covers data acquisition (synchronization), data processing, and data management is particularly important.
+
+Out of this consideration, we developed a Data Asset Management Platform, whose core is a data service module based on Apache DolphinScheduler (referred to as DS below).
+
+Apache DolphinScheduler is a distributed, decentralized, easy-to-expand visual DAG scheduling system that supports multiple types of tasks including Shell, Python, Spark, Flink, etc., and has good scalability. Its overall structure is shown in the figure below:
+
+<div align=center>
+<img src="https://imgpp.com/images/2021/12/28/1.md.png"/>
+</div>
+
+This is a typical master-slave architecture with strong horizontal scalability. The scheduling engine is Quartz, and because DS itself is an open-source Java project built on Spring Boot, it is easy to integrate and use for developers familiar with Spring Boot.
+
+As a scheduling system, DS supports the following functions:
+
+**Scheduling mode:** The system supports timed scheduling based on cron expressions as well as manual scheduling. Supported command types include starting a workflow, starting execution from the current node, resuming a fault-tolerant workflow, resuming a paused process, starting execution from the failed node, complement, timing, rerun, pause, stop, and resuming joinable threads. Among them, restoring the fault-tolerant workflow and restoring the joinable threads are two command types that ar [...]
+
+**Timing schedule:** The system uses the Quartz distributed scheduler and supports visual generation of cron expressions.
+
+**Dependency:** The system supports not only simple dependencies between predecessor and successor nodes of the DAG but also task-dependent nodes, which enable custom task dependencies between processes.
+
+**Priority:** Supports priorities for process instances and task instances. If no priority is set, the default order is first-in-first-out.
+
+**Email alert:** Supports emailing SQL task query results, email alerts for process instance run results, and fault-tolerance alert notifications.
+
+**Failure strategy:** For tasks that run in parallel, two strategies are provided for handling a task failure. **Continue** means the other parallel tasks keep running regardless of the failure, and the process ends as failed only after they finish. **End** means that once a failed task is found, the running parallel tasks are killed and the process ends as failed.
+
+**Complement:** Backfills (complements) historical data over a date interval, in either parallel or serial mode.
+
+Based on Apache DolphinScheduler, we carry out the following practices.
+
+## Building A Data Synchronization Tool Based on DS
+
+In our business scenario, data synchronization needs come in many types, but the data volumes are not particularly large and the real-time requirements are low. So during architecture selection, we chose the combination of Datax and Apache DolphinScheduler and adapted it to our business. It is now integrated into various projects as a service product to provide offline synchronization services.
+
+<div align=center>
+<img src="https://imgpp.com/images/2021/12/29/1-1.md.png"/>
+</div>
+
+Synchronization tasks are divided into periodic tasks and one-time tasks. After the input and output sources are configured, a cron expression needs to be configured for periodic tasks, and then the save interface is called to send the synchronization task to the DS scheduling platform.
+
+<div align=center>
+<img src="https://imgpp.com/images/2021/12/29/2-1.md.png"/>
+</div>
+
+After weighing the options, we dropped the built-in DS UI front-end and reused the DS back-end interfaces for bringing tasks online, starting and stopping, deleting, and viewing logs.
+
+The design of the entire synchronization module aims to reuse the variety of input and output plugins of the Datax component and combine them with the optimizations of DS to implement offline synchronization tasks. This is a component diagram of our current synchronization module.
+
+
+<div align=center>
+<img src="https://imgpp.com/images/2021/12/29/9febb92b69077778c765d115427e3a48.md.png"/>
+</div>
+
+## Self Development Practices Based on DS
+
+Anyone familiar with Datax knows that it is essentially an ETL tool. Its Transform capability comes from a transformer module that supports Groovy syntax, and the utility classes used by the transformer can be further enriched in the Datax source code with functions such as replacement, regular-expression matching, filtering, desensitization, and statistics. Since tasks in Apache DolphinScheduler are implemented as DAGs, we wondered whether it would be possible to abstract [...]
+
+<div align=center>
+<img src="https://imgpp.com/images/2021/12/30/ffd0c839647bcce4c208ee0cf5b7622b.md.png"/>
+</div>
+
+Each component is treated as a module, and the dependencies between modules are handled by the dependency mechanism of DS. The data passed between components is stored at the front-end, which means that once an input component is introduced, the front-end performs most of the data transfer and logical judgments between components, since each component can be seen as a Datax input/output. Once all input parameters are set, the full set of final outputs is essentially determined. Th [...]
+
+PS: Because our business scenarios may involve cross-database queries (combined queries across different MySQL instances), our SQL component uses Presto to implement a unified SQL layer, so that combined retrieval still works even when business-related data sources live on instances with different IPs.
+
+## Other Attempts
+
+Anyone familiar with data governance knows that once governance is turned into a simple, repeatable process, it can produce a quality report. We write part of the governance records into ES, and then use the aggregation capabilities of ES to generate a quality report.
+
+<div align=center>
+<img src="https://imgpp.com/images/2021/12/29/4da40632c21dbea51d2951d98ee18f1b.md.png"/>
+</div>
+
+The above are some of the practices we have built on DS and middleware such as Datax, tailored to our own business needs.
+
+From EasyScheduler to the current Apache DolphinScheduler 2.0, we have mostly been spectators or followers, but today we shared our practical experience of building the data service module of our Data Asset Management and Control Platform on Apache DolphinScheduler. We have now supported the on-site operation of multiple project departments of the company on the Apache DolphinScheduler scheduling platform for more than a year. With the release of Apache DolphinScheduler 2.0, we have  [...]
\ No newline at end of file
diff --git a/blog/zh-cn/Eavy_Info.md b/blog/zh-cn/Eavy_Info.md
new file mode 100644
index 0000000..8c727a4
--- /dev/null
+++ b/blog/zh-cn/Eavy_Info.md
@@ -0,0 +1,102 @@
+# 亿云基于 DolphinScheduler 构建资产数据管理平台服务,助力政务信息化生态建设 | 最佳实践
+<div align=center>
+<img src="https://imgpp.com/images/2021/12/30/1639640547411.md.png"/>
+</div>
+作者| 孙浩
+
+✎ 编 者 按:基于 Apache Dolphinscheduler 调度平台,云计算和大数据提供商亿云信息已经服务公司多个项目部的地市现场平稳运行一年之久。
+结合政务信息化生态建设业务,亿云信息基于 DolphinScheduler 构建了资产数据管控平台的数据服务模块。他们是如何进行探索和优化的?亿云信息研发工程师 孙浩 进行了详细的用户实践交流分享。
+
+## 01 研发背景
+
+亿云主要的业务主要是 ToG 的业务,而业务前置的主要工作,在于数据的采集和共享,传统 ETL 工具,例如 kettle 等工具对于一线的实施人员的来说上手难度还是有的,再就是类似 kettle 的工具本身做为独立的部分,本身就增加了现场项目运维的使用难度。因此,如何实现一套数据采集(同步)—数据处理—数据管理的平台,就显得尤为重要。
+出于这样的考虑,我们开发了数据资产管理平台,而管理平台的核心就是我们基于DolphinSchduler(简称 DS)实现的数据服务模块。
+DolphinScheduler 是一个分布式去中心化,易扩展的可视化 DAG 调度系统,支持包括 Shell、Python、Spark、Flink 等多种类型的 Task 任务,并具有很好的扩展性。其整体架构如下图所示:
+
+<div align=center>
+<img src="https://imgpp.com/images/2021/12/28/1.md.png"/>
+</div>
+
+典型的 master-slave 架构,横向扩展能力强,调度引擎是 Quartz,本身作为 Spring Boot 的 java 开源项目,对于熟悉 Spring Boot 开发的人,集成使用更加的简单上手。
+
+DS 作为调度系统支持以下功能:
+
+调度方式:系统支持基于 cron 表达式的定时调度和手动调度。命令类型支持:启动工作流、从当前节点开始执行、恢复被容错的工作流、恢复暂停流程、从失败节点开始执行、补数、定时、重跑、暂停、停止、恢复等待线程。其中恢复被容错的工作流和恢复等待线程两种命令类型是由调度内部控制使用,外部无法调用。
+
+定时调度:系统采用 quartz 分布式调度器,并同时支持 cron 表达式可视化的生成。
+
+依赖:系统不单单支持 DAG 简单的前驱和后继节点之间的依赖,同时还提供任务依赖节点,支持流程间的自定义任务依赖。
+
+优先级 :支持流程实例和任务实例的优先级,如果流程实例和任务实例的优先级不设置,则默认是先进先出。
+
+邮件告警:支持 SQL任务 查询结果邮件发送,流程实例运行结果邮件告警及容错告警通知。
+
+失败策略:对于并行运行的任务,如果有任务失败,提供两种失败策略处理方式,继续是指不管并行运行任务的状态,直到流程失败结束。结束是指一旦发现失败任务,则同时Kill掉正在运行的并行任务,流程失败结束。
+
+补数:补历史数据,支持区间并行和串行两种补数方式。
+
+我们基于 Dolphinscheduler 与小伙伴一起进行如下的实践。
+
+## 02 基于DS构建数据同步工具
+
+回归业务本身,我们的业务场景,数据同步的业务需要是类型多,但数据量基本不会特别大,对实时要求并不高。所以在架构选型之初,我们就选择了 datax+ds 的组合,并进行对应业务的改造实现,现在作为服务产品融合在各个项目中,提供离线同步服务。
+
+<div align=center>
+<img src="https://imgpp.com/images/2021/12/30/1.md.png"/>
+</div>
+
+同步任务分为了周期任务和一次性任务,在配置完成输入输出源的配置任务之后,周期任务的话,需要配置 corn 表达式,然后调用保存接口,将同步任务发送给DS 的调度平台。
+<div align=center>
+<img src="https://imgpp.com/images/2021/12/30/2.md.png"/>
+</div>
+我们这里综合考虑放弃了之前 DS 的 UI 前端(第二部分在自助开发模块会给大家解释),复用 DS 后端的上线、启停、删除、日志查看等接口。
+
+<div align=center>
+<img src="https://imgpp.com/images/2021/12/30/4.md.png"/>
+</div>
+
+<div align=center>
+<img src="https://imgpp.com/images/2021/12/30/5.md.png"/>
+</div>
+
+整个同步模块的设计思路,就是重复利用 datax 组件的输入输出 plugin 多样性,配合 DS 的优化,来实现一个离线的同步任务,这个是当前我们的同步的一个组件图,实时同步这块不再赘述。
+
+<div align=center>
+<img src="https://imgpp.com/images/2021/12/30/9.md.png"/>
+</div>
+
+## 03 基于DS的自助开发实践
+
+熟悉 datax的人都知道它本质上是一个 ETL 工具,而其 Transform 的属性体现在,它提供了一个支持 grovy 语法的 transformer 模块,同时可以在 datax 源码中进一步丰富 transformer 中用到工具类,例如替换、正则匹配、筛选、脱敏、统计等功能。而 Dolphinscheduler 的任务,是可以用 DAG 图来实现,那么我们想到,是否存在一种可能,针对一张表或者几张表,把每个 datax 或者 SQL 抽象成一个数据治理的小模块,每个模块按照 DAG 图去设计,并且在上下游之间可以实现数据的传递,最好还是和 DS 一样的可以拖拽式的实现。于是,我们基于前期对 datax 与 ds 的使用,实现了一个自助开发的模块。
+
+<div align=center>
+<img src="https://imgpp.com/images/2021/12/30/6.md.png"/>
+</div>
+
+每个组件可能是一个模块,每个模块功能之间的依赖关系,我们利用 ds 的depend 来处理,而对应组件与组件传递数据,我们利用前端去存储,也就是我们在引入 input(输入组件)之后,让前端来进行大部分组件间的传递和逻辑判断,因为每个组件都可以看作一个 datax 的(输出/输出),所有参数在输入时,最终输出的全集基本就确定了,这也是我们放弃 DS 的 UI 前端的原因。之后,我们将这个 DAG 图组装成 DS 的定义的类型,同样交付给 ds 任务中心。
+
+PS:因为我们的业务场景可能存在跨数据库查询的情况(不同实例的 mysql 组合查询),我们的 SQL 组件底层使用 Presto 来实现一个统一 SQL 层,这样即使是不同 IP 实例下的数据源(业务上有关联意义),也可以通过 Presto 来支持组合检索。
+
+## 04 其他的一些简单尝试
+熟悉治理流程的人都知道,如果能够做到简单的治理流程化,那么必然可以产出一份质量报告。我们在自助开发的基础上进行优化,将部分治理的记录写入 ES 中,再利用 ES 的聚合能力来实现了一个质量报告。
+
+<div align=center>
+<img src="https://imgpp.com/images/2021/12/30/7.md.png"/>
+</div>
+
+<div align=center>
+<img src="https://imgpp.com/images/2021/12/30/8.md.png"/>
+</div>
+
+<div align=center>
+<img src="https://imgpp.com/images/2021/12/30/10.md.png"/>
+</div>
+
+以上便是我们使用 DS 结合 datax 等中间件,并结合业务背景所做的一些符合自身需求的实践。
+
+## 05 感谢
+
+从 EasyScheduler 到现在的 DolphinScheduler 2.0,我们更多时候还是作为旁观者,或者是追随者,而这次更多地是从这一年来使用 Dolphinscheduler 构建我们数据资产管控平台数据服务模块的实践来进行交流分享。当前基于Dolphinscheduler 调度平台,我们已经服务了公司多个项目部的地市现场运行。随着 DolphinScheduler 2.0 发版,我们也和 DolphinScheduler 一起在不断进步的社区环境中共同成长。
+
+## 06 欢迎更多实践分享
+如果你也是 Apache DolphinScheduler 的用户或实践者,欢迎投稿或联系社区分享你们的实践经验,开源共享,互帮互助!
diff --git a/site_config/blog.js b/site_config/blog.js
index e1b205d..9c9df9b 100644
--- a/site_config/blog.js
+++ b/site_config/blog.js
@@ -4,6 +4,13 @@ export default {
         postsTitle: 'All posts',
         list: [
             {
+                title: 'Eavy Info Builds Data Asset Management Platform Services Based on Apache DolphinScheduler to Construct Government Information Ecology',
+                author: 'Debra Chen',
+                dateStr: '2021-12-30',
+                desc: 'Use Case',
+                link: '/en-us/blog/Eavy_Info.html',
+            },
+            {
                 title: 'Apache DolphinScheduler 2.0.1 is here, and the highly anticipated one-click upgrade and plug-in finally come!',
                 author: 'Debra Chen',
                 dateStr: '2021-12-20',
@@ -83,6 +90,13 @@ export default {
         postsTitle: '所有文章',
         list: [
             {
+                title: '亿云基于 DolphinScheduler 构建资产数据管理平台服务,助力政务信息化生态建设',
+                author: 'Debra Chen',
+                dateStr: '2021-12-30',
+                desc: '最佳实践',
+                link: '/zh-cn/blog/Eavy_Info.html',
+            },
+            {
                 title: 'Apache DolphinScheduler 2.0.1 来了,备受期待的一键升级、插件化终于实现!',
                 author: 'Debra Chen',
                 dateStr: '2021-12-17',
diff --git a/site_config/home.jsx b/site_config/home.jsx
index 1c1173c..a3b4d15 100644
--- a/site_config/home.jsx
+++ b/site_config/home.jsx
@@ -55,6 +55,13 @@ export default {
       title: '事件 & 新闻',
       list: [
         {
+          img: 'https://imgpp.com/images/2021/12/29/1640759432737.md.png',
+          title: '亿云基于 DolphinScheduler 构建资产数据管理平台服务,助力政务信息化生态建设',
+          content: '基于 Apache Dolphinscheduler 调度平台,云计算和大数据提供商亿云...',
+          dateStr: '2021-12-30',
+          link: '/zh-cn/blog/Eavy_Info.html',
+        },
+        {
           img: 'https://imgpp.com/images/2021/12/17/1639647220322.md.png',
 
           title: 'Apache DolphinScheduler 2.0.1 来了,备受期待的一键升级、插件化终于实现!',
@@ -70,13 +77,6 @@ export default {
           dateStr: '2021-12-10',
           link: '/zh-cn/blog/YouZan-case-study.html',
         },
-        {
-          img: 'https://imgpp.com/images/2021/11/23/1637566412753.md.png',
-          title: '荔枝机器学习平台与大数据调度系统“双剑合璧”,打造未来数据处理新模式!',
-          content: '在线音频行业在中国仍是蓝海一片。根据 CIC 数据显示...',
-          dateStr: '2021-11-23',
-          link: '/zh-cn/blog/Lizhi-case-study.html',
-        },
       ],
     },
     ourusers: {
@@ -543,6 +543,13 @@ export default {
       title: 'Events & News',
       list: [
         {
+          img: 'https://imgpp.com/images/2021/12/29/1640759432737.md.png',
+          title: 'Eavy Info Builds Data Asset Management Platform Services Based on Apache DolphinScheduler to Construct Government Information Ecology',
+          content: 'Based on the Apache DolphinScheduler, the cloud computing and big data provider Eavy Info...',
+          dateStr: '2021-12-30',
+          link: '/en-us/blog/Eavy_Info.html',
+        },
+        {
           img: 'https://imgpp.com/images/2021/12/20/1639994965225.md.png',
           title: 'Apache DolphinScheduler 2.0.1 is here, and the highly anticipated one-click upgrade and plug-in finally come!',
           content: 'Good news! Apache DolphinScheduler 2.0.1 version is officially released today!...',
@@ -556,13 +563,6 @@ export default {
           dateStr: '2021-12-16',
           link: '/en-us/blog/YouZan-case-study.html',
         },
-        {
-          img: 'https://imgpp.com/images/2021/11/23/1637566412753.md.png',
-          title: 'A Formidable Combination of Lizhi Machine Learning Platform& DolphinScheduler Creates New Paradigm for Data Process in the Future!',
-          content: 'The online audio industry is a blue ocean market in China nowadays...',
-          dateStr: '2021-11-24',
-          link: '/en-us/blog/Lizhi-case-study.html',
-        },
       ],
     },
     userreview: {