Posted to commits@dolphinscheduler.apache.org by GitBox <gi...@apache.org> on 2021/02/09 10:27:08 UTC

[GitHub] [incubator-dolphinscheduler] wjsshide opened a new issue #4754: Master fault tolerance

wjsshide opened a new issue #4754:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/4754


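   Branch: master
   Question:
   The worker sends ack and response messages to the master over RPC. When that master goes down, the worker picks another master and sends to it instead. On receipt, the master writes the message to eventQueue and also puts it in taskCache, but taskCache is only used in the loop that polls task status. The new master does not hold the task information that the previous master maintained. Did I miss part of the code, or is there currently no retry of process instances or task instances after a master goes down? Thanks.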
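
The situation described in the question can be illustrated with a minimal sketch. It is not DolphinScheduler code: the `Master` class, `send_with_failover`, and the attribute names `event_queue` and `task_cache` are hypothetical stand-ins for the eventQueue and taskCache mentioned above, and a plain method call stands in for the RPC.

```python
import queue


class Master:
    """Hypothetical stand-in for a master server (not the real class)."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.event_queue = queue.Queue()  # received events, in arrival order
        self.task_cache = {}              # task_id -> last reported status

    def receive(self, task_id, status):
        # A dead master cannot accept messages; model that as a connection error.
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        self.event_queue.put((task_id, status))
        self.task_cache[task_id] = status


def send_with_failover(masters, task_id, status):
    """Try each master in turn and return the name of the one that accepted."""
    for master in masters:
        try:
            master.receive(task_id, status)
            return master.name
        except ConnectionError:
            continue  # this master is unreachable, try the next one
    raise RuntimeError("no master available")


masters = [Master("master-1"), Master("master-2")]
first = send_with_failover(masters, task_id=7, status="RUNNING")
masters[0].alive = False  # master-1 goes down
second = send_with_failover(masters, task_id=7, status="SUCCESS")
```

In this sketch the second message lands on master-2, which never saw the earlier ack for task 7, so its cache holds only the final status. That is the gap the question asks about.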
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-dolphinscheduler] dailidong closed issue #4754: What is the fault tolerance process of master? (Master怎么进行容错)

Posted by GitBox <gi...@apache.org>.

dailidong closed issue #4754:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/4754


   



[GitHub] [incubator-dolphinscheduler] zhanguohao commented on issue #4754: Master fault tolerance

Posted by GitBox <gi...@apache.org>.

zhanguohao commented on issue #4754:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/4754#issuecomment-775925596


   When a Master goes down, the other online Masters receive the ZK Master node remove event and perform fault tolerance. First, they query all process instances on the dead Master that need fault tolerance, then generate fault-tolerance Commands and write them to the command table in the database.
   Next, a healthy Master that obtains the lock fetches a fault-tolerance Command from the database and starts executing it: initialization, DAG construction, and so on. It then finds all head nodes; if a task has already completed, it continues searching downstream until it reaches the task nodes of the process instance that are still running.
   The new Master thereby takes over the process instance and continues to monitor task status. Once the current task node finishes, it continues to dispatch tasks.
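
The flow described in this comment can be sketched roughly as follows. This is an illustrative model only, not the project's implementation: the in-memory `Database`, the command type string `"FAULT_TOLERANCE"`, the status names, and every function name are assumptions made for the example, and ZooKeeper events and lock acquisition are left out.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ProcessInstance:
    id: int
    host: str         # master currently responsible for this instance
    state: str        # "RUNNING", "SUCCESS", ...
    dag: dict         # task -> list of downstream tasks (every task is a key)
    task_state: dict  # task -> "SUCCESS" | "RUNNING" | "PENDING"


@dataclass
class Database:
    instances: dict = field(default_factory=dict)
    commands: list = field(default_factory=list)


def on_master_removed(db, dead_host):
    """Surviving master: write one recovery command per orphaned instance."""
    created = []
    for inst in db.instances.values():
        if inst.host == dead_host and inst.state == "RUNNING":
            cmd = {"type": "FAULT_TOLERANCE", "process_instance_id": inst.id}
            db.commands.append(cmd)
            created.append(cmd)
    return created


def find_frontier(dag, task_state):
    """Walk down from the head nodes, skipping finished tasks."""
    parents = {task: [] for task in dag}
    for task, children in dag.items():
        for child in children:
            parents[child].append(task)
    heads = [task for task in dag if not parents[task]]
    pending, seen, frontier = deque(heads), set(), []
    while pending:
        task = pending.popleft()
        if task in seen:
            continue
        seen.add(task)
        if task_state[task] == "SUCCESS":
            pending.extend(dag[task])  # finished: keep looking downstream
        elif all(task_state[p] == "SUCCESS" for p in parents[task]):
            frontier.append(task)      # unfinished and all upstream tasks done
    return frontier


def take_over(db, new_host):
    """Master holding the lock: consume one command and resume the instance."""
    if not db.commands:
        return None
    cmd = db.commands.pop(0)
    inst = db.instances[cmd["process_instance_id"]]
    inst.host = new_host
    return inst.id, find_frontier(inst.dag, inst.task_state)


def demo():
    db = Database()
    dag = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
    states = {"A": "SUCCESS", "B": "RUNNING", "C": "SUCCESS", "D": "PENDING"}
    db.instances[1] = ProcessInstance(1, "master-1", "RUNNING", dag, states)
    db.instances[2] = ProcessInstance(2, "master-2", "RUNNING", {"X": []}, {"X": "RUNNING"})
    on_master_removed(db, "master-1")
    return db, take_over(db, "master-2")
```

In `demo()`, only instance 1 belonged to the dead master, so a single command is written. The takeover walks past the finished tasks A and C and stops at B, the task that is still running; D is not reported because its upstream task B has not finished.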



[GitHub] [incubator-dolphinscheduler] zhanguohao edited a comment on issue #4754: Master fault tolerance

Posted by GitBox <gi...@apache.org>.

zhanguohao edited a comment on issue #4754:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/4754#issuecomment-775925596



[GitHub] [incubator-dolphinscheduler] zhanguohao edited a comment on issue #4754: What is the fault tolerance process of master? (Master怎么进行容错)

Posted by GitBox <gi...@apache.org>.

zhanguohao edited a comment on issue #4754:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/4754#issuecomment-775925596



[GitHub] [incubator-dolphinscheduler] dailidong commented on issue #4754: What is the fault tolerance process of master? (Master怎么进行容错)

Posted by GitBox <gi...@apache.org>.

dailidong commented on issue #4754:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/4754#issuecomment-778105303


   > When a Master goes down, the other online Masters receive the ZK Master node remove event and perform fault tolerance. First, they query all process instances on the dead Master that need fault tolerance, then generate fault-tolerance Commands and write them to the command table in the database.
   > Next, a healthy Master that obtains the lock fetches a fault-tolerance Command from the database and starts executing it: initialization, DAG construction, and so on. It then finds all head nodes; if a task has already completed, it continues searching downstream until it reaches the task nodes of the process instance that are still running.
   > The new Master thereby takes over the process instance and continues to monitor task status. Once the current task node finishes, it continues to dispatch tasks.
   
   Good question and answer.

