Posted to commits@dolphinscheduler.apache.org by GitBox <gi...@apache.org> on 2019/12/31 08:53:24 UTC

[GitHub] [incubator-dolphinscheduler] Technoboy- opened a new issue #1658: Refactor WorkerServer

Technoboy- opened a new issue #1658: Refactor WorkerServer
URL: https://github.com/apache/incubator-dolphinscheduler/issues/1658

# Background
WorkerServer executes tasks by scanning ZK and the DB. When WorkerServer starts, it tries to acquire the lock in ZK and then executes tasks by loading data from the DB. This is a poor fit for a distributed system, and the current implementation delays task execution.

# Suggestion
We want to refactor WorkerServer to use a TCP channel.

# General Implementation Idea
1. Use Netty as our TCP framework.
2. MasterServer keeps the current logic and, when it picks a task, directly sends it to the target worker using a RoundRobin policy.
3. WorkerServer starts up as a member of a predefined group and registers itself on a ZK node.
4. WorkerServer starts a TCP server that listens on a port for tasks to execute, instead of scanning ZK and the DB.
5. The execution result is sent back to the MasterServer node over the same channel.
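
As an illustration of step 2, here is a minimal sketch of round-robin worker selection. It is not taken from the DolphinScheduler code base: the class name `RoundRobinSelector`, the `select` method, and the sample addresses are hypothetical, and the worker list is assumed to have been read from the ZK registration node beforehand.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: picks the next worker address in rotation; thread-safe. */
class RoundRobinSelector {
    private final AtomicInteger counter = new AtomicInteger();

    /** Returns the next worker from the current list (assumed to come from ZK). */
    String select(List<String> workers) {
        if (workers == null || workers.isEmpty()) {
            throw new IllegalStateException("no registered worker available");
        }
        // floorMod keeps the index non-negative even after the counter overflows.
        int index = Math.floorMod(counter.getAndIncrement(), workers.size());
        return workers.get(index);
    }

    public static void main(String[] args) {
        RoundRobinSelector selector = new RoundRobinSelector();
        // Sample addresses are placeholders, not real hosts.
        List<String> workers = Arrays.asList("worker-1:1234", "worker-2:1234", "worker-3:1234");
        for (int i = 0; i < 4; i++) {
            System.out.println(selector.select(workers));
        }
    }
}
```

Because the list is passed in on every call, workers that join or leave the ZK node are picked up on the next selection. A plain rotation like this ignores worker load, which the proposal leaves for a later phase.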

# General Failover Idea
1. A task counts as acknowledged only once WorkerServer has received the task command and sent back the ack command.
2. If WorkerServer executes the task normally, it sends the result back over the same channel.
3. If WorkerServer dies after receiving a task, MasterServer pings WorkerServer once the execution timeout elapses to check whether it is alive. If the ping fails, MasterServer tries another worker node. In this case, the task may execute more than once.
4. If MasterServer dies after sending out a task, WorkerServer retries up to N times to rebuild the channel to the original MasterServer. If all retries fail, it chooses a new MasterServer to send the result to. The new MasterServer analyzes the task and decides the next step (stop or continue execution by instanceId/processId, or just update the status).
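
Point 4 can be sketched as follows, under stated assumptions. `ResultReporter`, `report`, and the `sender` callback are hypothetical names used only for illustration. The callback stands in for rebuilding the channel and writing the result, and returns `true` when the master accepted it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BiPredicate;

/** Hypothetical worker-side helper: return a result to the original master, else fail over. */
class ResultReporter {
    private final int maxRetries;
    // (masterAddress, result) -> true if the master accepted the result (assumed transport hook).
    private final BiPredicate<String, String> sender;

    ResultReporter(int maxRetries, BiPredicate<String, String> sender) {
        this.maxRetries = maxRetries;
        this.sender = sender;
    }

    /** Returns the address of the master that accepted the result, or null if none did. */
    String report(String originalMaster, List<String> otherMasters, String result) {
        // Try the original master up to N times first.
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            if (sender.test(originalMaster, result)) {
                return originalMaster;
            }
        }
        // All retries failed: hand the result to another master.
        for (String master : otherMasters) {
            if (sender.test(master, result)) {
                return master;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Simulated transport: only "master-2" is reachable.
        ResultReporter reporter = new ResultReporter(3, (master, result) -> master.equals("master-2"));
        System.out.println(reporter.report("master-1", Arrays.asList("master-2"), "task-7:SUCCESS"));
    }
}
```

A real implementation would probably wait between attempts and would have to decide what to do with the result when no master accepts it. Both are left out of this sketch.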

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

With regards,
Apache Git Services

[GitHub] [incubator-dolphinscheduler] Technoboy- edited a comment on issue #1658: Refactor WorkerServer

Posted by GitBox <gi...@apache.org>.

Technoboy- edited a comment on issue #1658: Refactor WorkerServer
URL: https://github.com/apache/incubator-dolphinscheduler/issues/1658#issuecomment-569891456
 
 
   If the suggestion is taken into consideration, I will lead the work with the help of @qiaozhanwei

[GitHub] [incubator-dolphinscheduler] dailidong commented on issue #1658: Refactor WorkerServer

Posted by GitBox <gi...@apache.org>.

dailidong commented on issue #1658: Refactor WorkerServer
URL: https://github.com/apache/incubator-dolphinscheduler/issues/1658#issuecomment-569932318
 
 
   > If the WorkerServer dies after receiving a task, MasterServer pings WorkerServer once the execution timeout elapses to check whether it is alive. If the ping fails, MasterServer tries another worker node. In this case, the task may execute more than once.
   
   I have a question about this: why not use ZooKeeper to monitor whether the worker is alive? The ping approach seems outdated.

[GitHub] [incubator-dolphinscheduler] dailidong commented on issue #1658: Refactor WorkerServer

Posted by GitBox <gi...@apache.org>.

dailidong commented on issue #1658: Refactor WorkerServer
URL: https://github.com/apache/incubator-dolphinscheduler/issues/1658#issuecomment-570129813
 
 
   What is a “predefined group”? And do you want to implement this predefined group in the first phase?

[GitHub] [incubator-dolphinscheduler] khadgarmage commented on issue #1658: Refactor WorkerServer

Posted by GitBox <gi...@apache.org>.

khadgarmage commented on issue #1658: Refactor WorkerServer
URL: https://github.com/apache/incubator-dolphinscheduler/issues/1658#issuecomment-569948575
 
 
   > > If the WorkerServer dies after receiving a task, MasterServer pings WorkerServer once the execution timeout elapses to check whether it is alive. If the ping fails, MasterServer tries another worker node. In this case, the task may execute more than once.
   > 
   > I have a question about this: why not use ZooKeeper to monitor whether the worker is alive? The ping approach seems outdated.
   
   Why not use ZooKeeper to monitor whether the worker is alive? +1
   

[GitHub] [incubator-dolphinscheduler] qiaozhanwei commented on issue #1658: Refactor WorkerServer

Posted by GitBox <gi...@apache.org>.

qiaozhanwei commented on issue #1658: Refactor WorkerServer
URL: https://github.com/apache/incubator-dolphinscheduler/issues/1658#issuecomment-585161125
 
 
   ![Copy of architecture diagram (1)](https://user-images.githubusercontent.com/23756105/74330866-6ff23480-4dcd-11ea-9846-8a770f867f2e.png)
   

[GitHub] [incubator-dolphinscheduler] dailidong commented on issue #1658: Refactor WorkerServer

Posted by GitBox <gi...@apache.org>.

dailidong commented on issue #1658: Refactor WorkerServer
URL: https://github.com/apache/incubator-dolphinscheduler/issues/1658#issuecomment-570057952
 
 
   > > zookeeper
   > 
   > I said: WorkerServer starts up listening on a port and registers itself in ZK, so when MasterServer schedules a task, it can get the worker list. But suppose MasterServer sends out a task with a 5s timeout and WorkerServer crashes after sending back the acknowledgement. In that case, we have to ping the WorkerServer after 5s rather than rely on ZK.
   > About ping: in most RPC frameworks, ping is a message type with an empty body in the self-defined binary protocol. It is not the OS ping command.
   
   I think ZK can do this better: if a worker server goes down, ZK can handle it immediately through a listener/watch. I assume you want to remove the dependency on ZK?

[GitHub] [incubator-dolphinscheduler] Technoboy- commented on issue #1658: Refactor WorkerServer

Posted by GitBox <gi...@apache.org>.

Technoboy- commented on issue #1658: Refactor WorkerServer
URL: https://github.com/apache/incubator-dolphinscheduler/issues/1658#issuecomment-570015840
 
 
   > Regarding the third failover point, my thinking is this. Before MasterServer assigns a task to WorkerServer, it first inserts the task into the DB to generate an id, then sends the task id to WorkerServer for execution.
   > 
   > Suppose WorkerServer dies or the network fails after the task is received. If MasterServer does not receive a task execution heartbeat from WorkerServer within a certain period, it treats the task execution as failed and sets the task status in the DB to failed.
   > 
   > After that, if the network recovers and MasterServer receives a heartbeat for a task that has already been marked as failed, it directly sends a task termination command to WorkerServer.
   > 
   > If the user has set a number of retries, the task is retried by MasterServer; if not, an alert is sent.
   
   Yes, very good. We should do what you said.
   A task goes through many different statuses in a scheduling system, so we have to insert it into the DB before scheduling.

[GitHub] [incubator-dolphinscheduler] Technoboy- commented on issue #1658: Refactor WorkerServer

Posted by GitBox <gi...@apache.org>.

Technoboy- commented on issue #1658: Refactor WorkerServer
URL: https://github.com/apache/incubator-dolphinscheduler/issues/1658#issuecomment-570015767
 
 
   > > > > MasterServer keeps the current logic and when it picks a task, directly sends it to target worker using RoundRobin policy.
   > 
   > Should we consider the case where the worker is out of memory or other system resources?
   
   Yes, but not in this phase. Once the TCP channel is in place, some more functions should be taken into consideration.

[GitHub] [incubator-dolphinscheduler] lenboo commented on issue #1658: Refactor WorkerServer

Posted by GitBox <gi...@apache.org>.

lenboo commented on issue #1658: Refactor WorkerServer
URL: https://github.com/apache/incubator-dolphinscheduler/issues/1658#issuecomment-569908922
 
 
   
   - MasterServer keeps the current logic and when it picks a task, directly sends it to target worker using RoundRobin policy.
   What does 'directly sends it to target worker using RoundRobin policy' mean?
   
   - WorkerServer will start up as predefined group and register itself to zk node.
   What is the ‘predefined group’?
   
   - WorkerServer will start a tcp server listening port for executing task instead of scanning ZK and DB.
   - Executing result will send back to the MasterServer node using the previous channel
   
   All of the above steps need a stable network. What should we do when the network is unstable?

[GitHub] [incubator-dolphinscheduler] dailidong commented on issue #1658: Refactor WorkerServer

Posted by GitBox <gi...@apache.org>.

dailidong commented on issue #1658: Refactor WorkerServer
URL: https://github.com/apache/incubator-dolphinscheduler/issues/1658#issuecomment-569931594
 
 
   > * MasterServer keeps the current logic and when it picks a task, directly sends it to target worker using RoundRobin policy.
   >   What does 'directly sends it to target worker using RoundRobin policy' mean?
   
   RoundRobin's Chinese name is "轮询"
   
   > * WorkerServer will start up as predefined group and register itself to zk node.
   >   What is the ‘predefined group’?
   
   I'm also curious, haha
   
   > All of the above steps need a stable network. What should we do when the network is unstable?
   
   Netty can easily handle this
   
   
   

[GitHub] [incubator-dolphinscheduler] dailidong commented on issue #1658: Refactor WorkerServer

Posted by GitBox <gi...@apache.org>.

dailidong commented on issue #1658: Refactor WorkerServer
URL: https://github.com/apache/incubator-dolphinscheduler/issues/1658#issuecomment-570215774
 
 
   +1

[GitHub] [incubator-dolphinscheduler] dailidong removed a comment on issue #1658: Refactor WorkerServer

Posted by GitBox <gi...@apache.org>.

dailidong removed a comment on issue #1658: Refactor WorkerServer
URL: https://github.com/apache/incubator-dolphinscheduler/issues/1658#issuecomment-570129813
 
 
   What is a “predefined group”? And do you want to implement this predefined group in the first phase?

[GitHub] [incubator-dolphinscheduler] Technoboy- closed issue #1658: Refactor WorkerServer

Posted by GitBox <gi...@apache.org>.

Technoboy- closed issue #1658: Refactor WorkerServer
URL: https://github.com/apache/incubator-dolphinscheduler/issues/1658
 
 
   

[GitHub] [incubator-dolphinscheduler] dailidong edited a comment on issue #1658: Refactor WorkerServer

Posted by GitBox <gi...@apache.org>.

dailidong edited a comment on issue #1658: Refactor WorkerServer
URL: https://github.com/apache/incubator-dolphinscheduler/issues/1658#issuecomment-569931594
 
 
   > MasterServer keeps the current logic and when it picks a task, directly sends it to target worker using RoundRobin policy.
   >   What does 'directly sends it to target worker using RoundRobin policy' mean?
   
   RoundRobin's Chinese name is "轮询"
   
   > WorkerServer will start up as predefined group and register itself to zk node.
   >   What is the ‘predefined group’?
   
   I'm also curious, haha
   
   > All of the above steps need a stable network. What should we do when the network is unstable?
   
   Netty can easily handle this
   
   
   

[GitHub] [incubator-dolphinscheduler] qiaozhanwei commented on issue #1658: Refactor WorkerServer

Posted by GitBox <gi...@apache.org>.

qiaozhanwei commented on issue #1658: Refactor WorkerServer
URL: https://github.com/apache/incubator-dolphinscheduler/issues/1658#issuecomment-570128773
 
 
   +1

[GitHub] [incubator-dolphinscheduler] khadgarmage commented on issue #1658: Refactor WorkerServer

Posted by GitBox <gi...@apache.org>.

khadgarmage commented on issue #1658: Refactor WorkerServer
URL: https://github.com/apache/incubator-dolphinscheduler/issues/1658#issuecomment-569949051
 
 
   > > > MasterServer keeps the current logic and when it picks a task, directly sends it to target worker using RoundRobin policy.
   
   Should we consider the case where the worker is out of memory or other system resources?
   
   

[GitHub] [incubator-dolphinscheduler] elonlo commented on issue #1658: Refactor WorkerServer

Posted by GitBox <gi...@apache.org>.

elonlo commented on issue #1658: Refactor WorkerServer
URL: https://github.com/apache/incubator-dolphinscheduler/issues/1658#issuecomment-569963277
 
 
   Regarding the third failover point, my thinking is this. Before MasterServer assigns a task to WorkerServer, it first inserts the task into the DB to generate an id, then sends the task id to WorkerServer for execution.
   
   Suppose WorkerServer dies or the network fails after the task is received. If MasterServer does not receive a task execution heartbeat from WorkerServer within a certain period, it treats the task execution as failed and sets the task status in the DB to failed.
   
   After that, if the network recovers and MasterServer receives a heartbeat for a task that has already been marked as failed, it directly sends a task termination command to WorkerServer.
   
   If the user has set a number of retries, the task is retried by MasterServer; if not, an alert is sent.
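
The heartbeat handling described above could look roughly like the sketch below. It only models the master-side state changes in memory. `TaskTracker` and its methods are hypothetical names, the DB update is replaced by a map, and the current time is passed in explicitly so the logic is easy to follow.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical master-side tracker for task heartbeats (in-memory stand-in for the DB). */
class TaskTracker {
    enum State { RUNNING, FAILED }
    enum Action { NONE, KILL }

    private final long timeoutMillis;
    private final Map<Integer, State> states = new ConcurrentHashMap<>();
    private final Map<Integer, Long> lastHeartbeat = new ConcurrentHashMap<>();

    TaskTracker(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Called once the task has an id and has been sent to a worker. */
    void dispatched(int taskId, long now) {
        states.put(taskId, State.RUNNING);
        lastHeartbeat.put(taskId, now);
    }

    /** Periodic check: a running task whose heartbeat is overdue is marked FAILED. */
    void expire(long now) {
        for (Map.Entry<Integer, Long> e : lastHeartbeat.entrySet()) {
            if (now - e.getValue() > timeoutMillis) {
                states.replace(e.getKey(), State.RUNNING, State.FAILED);
            }
        }
    }

    /** A heartbeat for a task already marked FAILED asks the caller to send a kill command. */
    Action onHeartbeat(int taskId, long now) {
        State state = states.get(taskId);
        if (state == State.FAILED) {
            return Action.KILL;
        }
        if (state == State.RUNNING) {
            lastHeartbeat.put(taskId, now);
        }
        return Action.NONE;
    }

    State stateOf(int taskId) {
        return states.get(taskId);
    }

    public static void main(String[] args) {
        TaskTracker tracker = new TaskTracker(5000);
        tracker.dispatched(1, 0);
        tracker.expire(6000);                          // no heartbeat for more than 5s
        System.out.println(tracker.stateOf(1));        // FAILED
        System.out.println(tracker.onHeartbeat(1, 6500)); // KILL
    }
}
```

Retrying the task or raising an alert after it is marked failed would sit on top of this and is not shown.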

[GitHub] [incubator-dolphinscheduler] Technoboy- commented on issue #1658: Refactor WorkerServer

Posted by GitBox <gi...@apache.org>.

Technoboy- commented on issue #1658: Refactor WorkerServer
URL: https://github.com/apache/incubator-dolphinscheduler/issues/1658#issuecomment-570015600
 
 
   > zookeeper
   
   I said: WorkerServer starts up listening on a port and registers itself in ZK, so when MasterServer schedules a task, it can get the worker list. But suppose MasterServer sends out a task with a 5s timeout and WorkerServer crashes after sending back the acknowledgement. In that case, we have to ping the WorkerServer after 5s rather than rely on ZK.
   About ping: in most RPC frameworks, ping is a message type with an empty body in the self-defined binary protocol. It is not the OS ping command.
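
To make the idea of a ping message concrete, here is a small sketch of such a frame. The layout (1-byte type, 4-byte body length, then the body), the type codes, and the `Frame` class are assumptions for illustration only, not the protocol that was eventually implemented.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical frame codec: 1-byte type, 4-byte big-endian body length, then the body. */
class Frame {
    static final byte TYPE_PING = 0;          // assumed type code
    static final byte TYPE_EXECUTE_TASK = 1;  // assumed type code

    static byte[] encode(byte type, byte[] body) {
        ByteBuffer buf = ByteBuffer.allocate(1 + 4 + body.length);
        buf.put(type).putInt(body.length).put(body);
        return buf.array();
    }

    /** A ping is just a header: the type byte plus a zero body length. */
    static byte[] ping() {
        return encode(TYPE_PING, new byte[0]);
    }

    static boolean isPing(byte[] frame) {
        if (frame.length != 5) {
            return false;
        }
        ByteBuffer buf = ByteBuffer.wrap(frame);
        return buf.get() == TYPE_PING && buf.getInt() == 0;
    }

    public static void main(String[] args) {
        byte[] task = encode(TYPE_EXECUTE_TASK, "42".getBytes(StandardCharsets.UTF_8));
        System.out.println(isPing(ping())); // true
        System.out.println(isPing(task));   // false
    }
}
```

With Netty, frames like this would normally be produced and parsed by encoder and decoder handlers in the channel pipeline. That part is omitted here to keep the example free of external dependencies.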

[GitHub] [incubator-dolphinscheduler] Technoboy- commented on issue #1658: Refactor WorkerServer

Posted by GitBox <gi...@apache.org>.

Technoboy- commented on issue #1658: Refactor WorkerServer
URL: https://github.com/apache/incubator-dolphinscheduler/issues/1658#issuecomment-569891456
 
 
   If the suggestion is taken into consideration, I will lead the work

[GitHub] [incubator-dolphinscheduler] dailidong commented on issue #1658: Refactor WorkerServer

Posted by GitBox <gi...@apache.org>.

dailidong commented on issue #1658:
URL: https://github.com/apache/incubator-dolphinscheduler/issues/1658#issuecomment-617204806


   I think this work has been completed, so I will close this issue.

