Posted to commits@dolphinscheduler.apache.org by GitBox <gi...@apache.org> on 2021/01/06 09:34:31 UTC

[GitHub] [incubator-dolphinscheduler] lenboo opened a new issue #4355: [Feature][Master+API+Scheduler] Propose for master refactor and scheduler module

lenboo opened a new issue #4355:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/4355


   ## Background:
   
   refer: #4083
   
   At present, the master has the following problems:
   
   It polls in many places, which adds unnecessary latency.
   
   A distributed lock is taken when a command is fetched, which becomes a concurrency bottleneck.
   
   Too many threads (nProcessInstance * nTaskInstances) are used, which wastes system resources.
   
   Polling the database puts query pressure on it and makes it a bottleneck.
   
   Here is the proposal:
   
   ## 1. API
   
   - The API receives a run-workflow command or a pause/stop command and sends it to the designated scheduler/master. If sending fails, it retries three times; after three failures it throws a failure.
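
The retry rule above can be sketched roughly as follows. This is only an illustration: `send_with_retry`, `CommandSendError`, and the `send` callable are hypothetical names, and the sketch assumes "three times" means three attempts in total, which the proposal does not spell out.

```python
from typing import Callable


class CommandSendError(Exception):
    """Raised when a command could not be delivered after all attempts."""


def send_with_retry(send: Callable[[str], bool], command: str,
                    max_attempts: int = 3) -> int:
    """Try to deliver `command`; return the attempt number that succeeded."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            if send(command):          # hypothetical transport: True means delivered
                return attempt
        except ConnectionError as exc:  # a transport error counts as a failed attempt
            last_error = exc
    raise CommandSendError(
        f"{command!r} failed after {max_attempts} attempts"
    ) from last_error
```

A caller in the API layer would catch `CommandSendError` and report the failure to the user.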
   
   ## 2. Refactor the communication function
   
   - Synchronous: the sending thread sends a message and blocks -> the receiver receives and processes the message (stores it in the DB, writes data, etc.) -> the receiver returns a message to the sender -> the sender unblocks
   
   - Asynchronous: the sending thread sends a message and caches it -> the receiver receives and processes the message -> after processing, the receiver replies to the sender with a command -> the sender receives the command and removes the cached message
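
A minimal sketch of the asynchronous path described above, assuming each message carries an id that the receiver echoes back in its reply. `AsyncSender`, `transport`, and `on_ack` are hypothetical names, not part of the existing code base.

```python
import itertools
from typing import Callable, Dict, List


class AsyncSender:
    """Keeps every sent message in a local cache until the receiver replies."""

    def __init__(self, transport: Callable[[int, str], None]):
        self._transport = transport          # hypothetical: delivers (message_id, payload)
        self._ids = itertools.count(1)
        self.pending: Dict[int, str] = {}    # message_id -> payload awaiting a reply

    def send(self, payload: str) -> int:
        message_id = next(self._ids)
        self.pending[message_id] = payload   # cache the message, then hand it off
        self._transport(message_id, payload)
        return message_id                    # returns without waiting for the receiver

    def on_ack(self, message_id: int) -> None:
        # Called when the receiver's reply arrives; drop the cached copy.
        self.pending.pop(message_id, None)

    def unacknowledged(self) -> List[int]:
        return sorted(self.pending)
```

Whatever remains in `pending` is what the sender has not yet had confirmed; how long to wait and whether to resend is left open here, as it is in the proposal.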
   
   ## 3. Add the scheduler function
   
   - Implement HA
   
   - Implement the scheduler startup flow: scan the CMD table first, then start the CMD listener and cache the CMD data in a local queue
   
   - Implement the listener flow for receiving and processing CMDs (synchronous) / timing (asynchronous)
   
   - A thread processes the CMD cache queue and sends the CMDs to the master
   
   - Implement the CMD sending policy; it should support multiple policies at the same time and be easy to extend, for example by CMD priority or master load
   
   ![image](https://user-images.githubusercontent.com/29528966/103751033-a0286e80-5042-11eb-8ff6-4baf1b76d536.png)
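
One way to keep the sending policy pluggable is to treat "which master gets the next CMD" as a function that can be swapped out. The sketch below combines the two example policies named above (CMD priority and master load). `Cmd`, `lowest_load`, `dispatch`, and `load_per_cmd` are hypothetical, and the convention that a lower priority number is more urgent is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Cmd:
    id: int
    priority: int      # assumption: a lower number means more urgent


# A master policy picks a host from a {host: load} map.
MasterPolicy = Callable[[Dict[str, float]], str]


def lowest_load(loads: Dict[str, float]) -> str:
    """Example policy: choose the master with the smallest reported load."""
    return min(loads, key=loads.get)


def dispatch(cmds: List[Cmd], loads: Dict[str, float],
             pick_master: MasterPolicy = lowest_load,
             load_per_cmd: float = 1.0) -> Dict[int, str]:
    """Assign commands in priority order; returns {cmd id: master host}."""
    loads = dict(loads)                       # work on a copy of the load map
    assignment: Dict[int, str] = {}
    for cmd in sorted(cmds, key=lambda c: (c.priority, c.id)):
        host = pick_master(loads)
        assignment[cmd.id] = host
        loads[host] += load_per_cmd           # assumed cost of one command
    return assignment
```

A different strategy, such as round robin, would only need another function with the `MasterPolicy` signature.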
   
   
   ## 4. Changes to the fault-tolerance process:
   
   - Master fault tolerance
   
     - Fail over the workflow instances the failed master was responsible for: find the unfinished workflow instances, generate fault-tolerance CMDs, and send them to the active scheduler
   
     - Find the unfinished task instances that the master was responsible for and check them (is the task instance's worker alive, and is the worker start time < task start time)
   
     - Find the commands that have not finished processing and send them to the scheduler for reprocessing (clear the host and reassign a master)
   
   - Worker fault tolerance
   
     - When a worker goes down, each master fails over its own tasks and skips the tasks that do not belong to it.
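
The task-instance check in the master fault-tolerance step can be read as: a task is still safe only if its worker is alive and that worker started before the task did. The sketch below encodes that reading; it is an interpretation of the proposal, and `needs_failover` and the `live_workers` map are hypothetical.

```python
from datetime import datetime
from typing import Dict, Optional


def needs_failover(task_start: datetime, worker_host: str,
                   live_workers: Dict[str, datetime]) -> bool:
    """live_workers maps each currently registered worker host to its start time."""
    worker_start: Optional[datetime] = live_workers.get(worker_host)
    if worker_start is None:
        return True                      # the worker is no longer alive
    # A worker that started after the task began must have restarted since then,
    # so it no longer holds the task.
    return not (worker_start < task_start)
```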
   
   ## 5. Master execution process
   
   - Modify the master's workflow-processing thread pool so that it covers the path from obtaining the CMD, to generating the workflow instance, to submitting, starting, and finishing tasks
   
   * When the master submits a task and finds that the task is already running, it must notify the worker running the task so that the worker changes the host it reports to to the current master
   
   - Add master task-state listening: receive task/workflow states from the worker/API/master and save them to a local state queue.
   
   - Add a master thread pool for task-state processing; all task states of the same workflow must be processed sequentially
   
   * When the master receives a task, it must determine whether the DAG the task belongs to is handled by this master. If not, it must not process the task.
   
   - Add a polling thread for the cases that need the state of an external workflow/task (dependencies / sub-workflows).
   
   - Add a timeout monitoring queue: tasks/workflows that need monitoring are added to the queue, and a time wheel or thread watches for timeouts. When a timeout occurs, timeout handling is triggered.
   
   - The master's CMD listener thread marks each CMD and caches it in the local CMD queue.
   
   - CMDs that the master cannot finish processing can be returned to the scheduler for redistribution
   
   - The master proactively reports its resource usage to the scheduler.
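
The requirement that all task states of one workflow are processed sequentially, while different workflows may proceed in parallel, can be met by routing each workflow to a fixed single-threaded lane. This is a sketch of that one technique, not the proposed implementation; `StateEventPool` and its methods are hypothetical.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, List


class StateEventPool:
    """Routes every event of one workflow to the same single-threaded lane."""

    def __init__(self, lanes: int = 4):
        self._lanes: List[ThreadPoolExecutor] = [
            ThreadPoolExecutor(max_workers=1) for _ in range(lanes)
        ]

    def submit(self, workflow_id: int, handler: Callable[[], None]) -> Future:
        # The same workflow id always maps to the same lane.
        lane = self._lanes[workflow_id % len(self._lanes)]
        # A one-thread executor runs its tasks in submission order.
        return lane.submit(handler)

    def shutdown(self) -> None:
        for lane in self._lanes:
            lane.shutdown(wait=True)
```

The thread count is bounded by the number of lanes rather than by the number of workflow or task instances.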
   
   ![image](https://user-images.githubusercontent.com/29528966/103524201-1a78b780-4eb8-11eb-95bc-a7c0dbc5d2af.png)
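
For the timeout monitoring queue, the proposal mentions a time wheel or a thread. The sketch below swaps in a simpler structure, a min-heap ordered by deadline, to show the idea; a monitoring thread would call `poll` periodically. `TimeoutMonitor` is hypothetical, and each key is assumed to be watched at most once.

```python
import heapq
from typing import List, Set, Tuple


class TimeoutMonitor:
    """Min-heap of (deadline, key); the earliest deadline is always at the front."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, str]] = []
        self._cancelled: Set[str] = set()

    def watch(self, key: str, deadline: float) -> None:
        heapq.heappush(self._heap, (deadline, key))

    def cancel(self, key: str) -> None:
        # Finished tasks are skipped lazily when they reach the front of the heap.
        self._cancelled.add(key)

    def poll(self, now: float) -> List[str]:
        """Return the keys whose deadline has passed, in deadline order."""
        expired = []
        while self._heap and self._heap[0][0] <= now:
            _, key = heapq.heappop(self._heap)
            if key in self._cancelled:
                self._cancelled.discard(key)
            else:
                expired.append(key)
        return expired
```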
   
   ## 6. Timing
   
   1. Master timing: prevent one schedule from triggering multiple times. The same trigger may occur more than once while the master is sending it to the scheduler.
   
   - Add a unique index (definitionId + schedulerTime + dateTime) to the CMD table and the workflow instance table to prevent duplicates.
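
The effect of the unique index can be demonstrated with an in-memory SQLite table. The table layout, column names, and index name below are made up for the demonstration and are not the real DolphinScheduler schema; only the three indexed fields come from the proposal.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE cmd (
        id             INTEGER PRIMARY KEY,
        definition_id  INTEGER NOT NULL,
        scheduler_time TEXT    NOT NULL,
        date_time      TEXT    NOT NULL
    )
""")
# The unique index rejects a second row for the same trigger.
conn.execute("""
    CREATE UNIQUE INDEX uk_cmd_trigger
    ON cmd (definition_id, scheduler_time, date_time)
""")


def insert_cmd(definition_id: int, scheduler_time: str, date_time: str) -> bool:
    """Return True if the row was inserted, False if it was a duplicate trigger."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO cmd (definition_id, scheduler_time, date_time) "
                "VALUES (?, ?, ?)",
                (definition_id, scheduler_time, date_time),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

With this in place, a trigger that arrives twice is written once, and the second insert is reported as a duplicate instead of creating another workflow instance.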
   
   ==================================================================================================
   
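   ## 1. API
      
      - When the API receives a run-workflow command or a pause/stop command, it sends the command to the designated scheduler/master. On failure it retries three times; after three failures it throws a failure.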
    
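   ## 2. Refactor the communication function
   
   -  Synchronous: the sending thread sends a message (blocking) -> the receiver receives the message and processes it (stores it in the DB, writes data, etc.) -> the receiver returns a message to the sender -> the sender unblocks
       
   - Asynchronous: the sending thread sends a message -> the sending thread caches the message -> the receiver receives the message and processes it -> after processing, it replies to the sender with a command -> the sender receives the command and removes the cached message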
     
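   ## 3. Add the scheduler function
   
     - Implement HA
     - Implement the scheduler startup flow: scan the cmd table first, then start the cmd listener and cache the cmd data in a local queue
     - Listener flow for receiving and processing cmds (synchronous) / timing (asynchronous)
     - A thread processes the cmd cache queue and sends the cmds to the master
     - Implement the cmd sending policy; it can support multiple policies at the same time and is easy to extend, for example by cmd priority or master load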
   
   ![image](https://user-images.githubusercontent.com/29528966/103750880-622b4a80-5042-11eb-8523-5cddbccd9e09.png)
   
   
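   ## 4. Changes to the fault-tolerance process:
     - Master fault tolerance
       - Fail over the workflow instances the master was responsible for: find the unfinished workflow instances, generate fault-tolerance cmds, and send the cmds to the active scheduler.
       - Find the unfinished task instances the master was responsible for and check them (is the task instance's worker alive + worker start time < task start time)
       - Find the commands that have not finished processing and send them to the scheduler for reprocessing (clear the host and reassign a master)
     
     - Worker fault tolerance
       - When a worker goes down, each master fails over its own tasks and skips the tasks that do not belong to it.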
       
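   ## 5. Master execution process
     - Modify the master's workflow-processing thread pool: from fetching the cmd, to generating the workflow instance, through to task submission, start, and end
     
         * When the master submits a task and finds that the task is already running, it must notify the worker where the task is running and switch the host that the worker reports to to the current master
     
     - Add master task-state listening: receive task/workflow states from the worker/api/master and store the received task states in a local state queue.
     
     - Add a master thread pool for task-state processing; all task states of the same workflow can only be processed sequentially
     
        * When the master receives a task, it must determine whether the DAG the task belongs to is handled by this master; if not, it must not process it.
     
     - Add a polling thread that polls for the cases that need the state of an external workflow/task (dependencies / sub-workflows).
     
     - Add a timeout monitoring queue: tasks/workflows that need monitoring are added to the queue, and a time wheel/thread monitors for timeouts; when a timeout occurs, timeout handling is triggered.
   
     - The master's cmd listener thread marks each cmd and caches it in the local cmd queue.
     - Cmds that the master cannot finish processing can be fed back to the scheduler for redistribution
     - Proactive reporting of resource usage: the master reports its resource usage to the scheduler.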
   
   ![image](https://user-images.githubusercontent.com/29528966/103628867-8c193a00-4f7a-11eb-9762-5d4cb20aabaf.png)
   
   
   
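   ## 6. Timing
   1. Master timing: prevent one schedule from triggering multiple times. The same trigger may occur more than once while the master is sending it to the scheduler.
      - Add a unique index (definitionId+schedulerTime+dateTime) to the cmd table and the workflow instance table to prevent duplicates.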


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-dolphinscheduler] lenboo commented on issue #4355: [Feature][Master+API+Scheduler]master refactor and scheduler module

Posted by GitBox <gi...@apache.org>.
lenboo commented on issue #4355:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/4355#issuecomment-752986590


   the whole structure:
   
   ![image](https://user-images.githubusercontent.com/29528966/103415761-958f5480-4bbe-11eb-80f9-1cf68a1942d6.png)
   





[GitHub] [incubator-dolphinscheduler] lenboo removed a comment on issue #4355: [Feature][Master+API+Scheduler] Propose for master refactor and scheduler module

Posted by GitBox <gi...@apache.org>.
lenboo removed a comment on issue #4355:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/4355#issuecomment-753884820


   master process: 
   
   ![image](https://user-images.githubusercontent.com/29528966/103524201-1a78b780-4eb8-11eb-95bc-a7c0dbc5d2af.png)
   
   
   ![image](https://user-images.githubusercontent.com/29528966/103628867-8c193a00-4f7a-11eb-9762-5d4cb20aabaf.png)
   





[GitHub] [incubator-dolphinscheduler] lenboo commented on issue #4355: [Feature][Master+API+Scheduler]master refactor and scheduler module

Posted by GitBox <gi...@apache.org>.
lenboo commented on issue #4355:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/4355#issuecomment-753884209


   ![image](https://user-images.githubusercontent.com/29528966/103524097-ef8e6380-4eb7-11eb-808c-9373a94c2ec0.png)
   





[GitHub] [incubator-dolphinscheduler] lenboo edited a comment on issue #4355: [Feature][Master+API+Scheduler] Propose for master refactor and scheduler module

Posted by GitBox <gi...@apache.org>.
lenboo edited a comment on issue #4355:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/4355#issuecomment-753884820


   master process: 
   
   ![image](https://user-images.githubusercontent.com/29528966/103524201-1a78b780-4eb8-11eb-95bc-a7c0dbc5d2af.png)
   
   
   ![image](https://user-images.githubusercontent.com/29528966/103628867-8c193a00-4f7a-11eb-9762-5d4cb20aabaf.png)
   





[GitHub] [incubator-dolphinscheduler] lenboo removed a comment on issue #4355: [Feature][Master+API+Scheduler]master refactor and scheduler module

Posted by GitBox <gi...@apache.org>.
lenboo removed a comment on issue #4355:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/4355#issuecomment-753884209


   ![image](https://user-images.githubusercontent.com/29528966/103524097-ef8e6380-4eb7-11eb-808c-9373a94c2ec0.png)
   





[GitHub] [incubator-dolphinscheduler] lenboo closed issue #4355: [Feature][Master+API+Scheduler]master refactor and scheduler module

Posted by GitBox <gi...@apache.org>.
lenboo closed issue #4355:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/4355


   





[GitHub] [incubator-dolphinscheduler] lenboo closed issue #4355: [Feature][Master+API+Scheduler] Propose for master refactor and scheduler module

Posted by GitBox <gi...@apache.org>.
lenboo closed issue #4355:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/4355


   





[GitHub] [incubator-dolphinscheduler] lenboo commented on issue #4355: [Feature][Master+API+Scheduler]master refactor and scheduler module

Posted by GitBox <gi...@apache.org>.
lenboo commented on issue #4355:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/4355#issuecomment-753884820


   ![image](https://user-images.githubusercontent.com/29528966/103524201-1a78b780-4eb8-11eb-95bc-a7c0dbc5d2af.png)
   

