Posted to commits@dolphinscheduler.apache.org by GitBox <gi...@apache.org> on 2022/11/22 09:05:53 UTC

[GitHub] [dolphinscheduler] hzyangkai opened a new issue, #12968: [Improvement][Master] task running in external resource manager (e.g. yarn ) should keep running whenever master or worker crashes when doing failover

hzyangkai opened a new issue, #12968:
URL: https://github.com/apache/dolphinscheduler/issues/12968

   ### Search before asking
   
   - [X] I have searched the [issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and found no similar feature requirement.
   
   
   ### Description
   
   At present, regardless of task type, when a worker or master crashes the task is killed and restarted, which is unreasonable for tasks running on yarn.
   A more reasonable design would be:
   1. If a task runs in an external resource manager (e.g. yarn), ShellCommandExecutor should exit immediately after submitting the task and obtaining the appid. The worker then reports the appid to the master and, at the same time, starts monitoring the task status by appid.
   2. When a worker crashes, the master should send the same task, with its appid, to another worker, which then starts monitoring the same task.
   3. When the master crashes, the master should try to rebuild the channel with the worker.
   4. For tasks that run locally in the worker, keep the original logic.
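
The four rules above can be read as a small decision function. The following is only an illustrative sketch, not DolphinScheduler code: `TaskInstance`, `Crashed`, `failover_action` and the returned action names are hypothetical, and the fallback for an external task whose appid has not yet been reported is an assumption the issue does not spell out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Crashed(Enum):
    """Which node failed (hypothetical names, for illustration only)."""
    WORKER = "worker"
    MASTER = "master"


@dataclass
class TaskInstance:
    name: str
    runs_on_external_rm: bool      # True if the task was submitted to e.g. yarn
    app_id: Optional[str] = None   # appid reported by the worker after submission


def failover_action(task: TaskInstance, crashed: Crashed) -> str:
    """Return the action the proposal describes for one task."""
    if crashed is Crashed.MASTER:
        # Rule 3: the master rebuilds its channel with the worker.
        return "rebuild_channel"
    if task.runs_on_external_rm and task.app_id is not None:
        # Rule 2: another worker receives the same appid and resumes monitoring.
        return "reassign_monitor"
    # Rule 4: tasks running locally keep the original kill-and-restart logic.
    # Assumption: an external task with no appid yet is treated the same way.
    return "kill_and_restart"
```

For example, a Spark task with a known appid would map to `reassign_monitor` on a worker crash, while a local shell task would still map to `kill_and_restart`.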
   
   ### Are you willing to submit a PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@dolphinscheduler.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [dolphinscheduler] hzyangkai commented on issue #12968: [Improvement][Master] task running in external resource manager (e.g. yarn ) should keep running whenever master or worker crashes when doing failover

Posted by GitBox <gi...@apache.org>.
hzyangkai commented on issue #12968:
URL: https://github.com/apache/dolphinscheduler/issues/12968#issuecomment-1327155090

   Hi, @Radeity 
   
   Thanks for the reminder; shell tasks can indeed be both type 1 and type 2.
   
   As I mentioned in the current goal, when a worker crashes, only tasks of type 3 (such as SparkTask in cluster mode) will keep running, while other types of tasks will keep their original logic of being killed and restarted.
   
   In other words, I will ensure that the submission process exits immediately after submission only for tasks of type 3 (at present, I only plan to adjust Spark tasks in cluster mode). The submission logic of other task types remains unchanged, but they will not restart when only the master crashes.
   
   Thanks for your advice. I will think carefully to ensure that all types of yarn tasks and all modes can work.




[GitHub] [dolphinscheduler] Radeity commented on issue #12968: [Improvement][Master] task running in external resource manager (e.g. yarn ) should keep running whenever master or worker crashes when doing failover

Posted by GitBox <gi...@apache.org>.
Radeity commented on issue #12968:
URL: https://github.com/apache/dolphinscheduler/issues/12968#issuecomment-1326498204

   Hi @hzyangkai, thanks for your detailed design!
   
   First, I strongly agree with you that all types of tasks should run as type 3, which would make DS behave more like a scheduler that does not need to care how a task runs.
   
   In my view, this is an ideal fault-tolerance strategy and better than the current approach. However, I have a question: how can you exit the submit process immediately after fetching the appId, and how do you separate `submitApplication` from `monitorApplication`?
   
   In addition, the current way to fetch the appId is to parse it from the log file when execution finishes, or via a `GET_APP_ID_REQUEST` request. Although I provided an AOP-based way to collect the appId (https://github.com/apache/dolphinscheduler/pull/12197), it is not suitable for all yarn tasks, such as yarn jobs submitted on a remote host in client mode (e.g. via Beeline), which you also have to consider. If you can separate the two steps mentioned above, you can collect the appId more efficiently than the current way.
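
As a rough illustration of the log-parsing approach mentioned above, the sketch below pulls yarn application ids out of log text with a regular expression. It relies only on the standard yarn id format (`application_<clusterTimestamp>_<sequence>`); the function name `extract_app_ids` is hypothetical, and the pattern DolphinScheduler itself uses may differ.

```python
import re
from typing import List

# yarn application ids look like application_<clusterTimestamp>_<sequence>.
APP_ID_PATTERN = re.compile(r"application_\d+_\d+")


def extract_app_ids(log_text: str) -> List[str]:
    """Return the distinct application ids found in log_text, in first-seen order."""
    seen: List[str] = []
    for app_id in APP_ID_PATTERN.findall(log_text):
        if app_id not in seen:
            seen.append(app_id)
    return seen
```

Because the function works on any text, it could be applied to the log as soon as the submit process prints the id, rather than only after the task finishes.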
   




[GitHub] [dolphinscheduler] Radeity commented on issue #12968: [Improvement][Master] task running in external resource manager (e.g. yarn ) should keep running whenever master or worker crashes when doing failover

Posted by GitBox <gi...@apache.org>.
Radeity commented on issue #12968:
URL: https://github.com/apache/dolphinscheduler/issues/12968#issuecomment-1323617005

   Can you give a detailed design?




[GitHub] [dolphinscheduler] liaoyt commented on issue #12968: [Feature][Master] task running in external resource manager (e.g. yarn ) should keep running whenever master or worker crashes when doing failover

Posted by "liaoyt (via GitHub)" <gi...@apache.org>.
liaoyt commented on issue #12968:
URL: https://github.com/apache/dolphinscheduler/issues/12968#issuecomment-1490231659

   Is this PR still in progress?




[GitHub] [dolphinscheduler] SbloodyS commented on issue #12968: [Improvement][Master] task running in external resource manager (e.g. yarn ) should keep running whenever master or worker crashes when doing failover

Posted by GitBox <gi...@apache.org>.
SbloodyS commented on issue #12968:
URL: https://github.com/apache/dolphinscheduler/issues/12968#issuecomment-1326025426

   > #9664 was a long time ago. There seems to be nothing to block this issue from 3.1.0. It seems to be an important feature. Can this issue be raised to a higher priority? I would like to submit a PR.
   
   You are very welcome to submit a PR.




[GitHub] [dolphinscheduler] hzyangkai commented on issue #12968: [Improvement][Master] task running in external resource manager (e.g. yarn ) should keep running whenever master or worker crashes when doing failover

Posted by GitBox <gi...@apache.org>.
hzyangkai commented on issue #12968:
URL: https://github.com/apache/dolphinscheduler/issues/12968#issuecomment-1326996778

   Hi @Radeity, thanks for your review.
   
   The separation of submitApplication and monitorApplication is handled by the submit script of the computing engine. A computing engine such as flink or spark prints the appid to the log immediately after submitting a task to yarn, and also provides parameters that control whether the submitting process exits or blocks until the task is finished. For example, in spark we can use "spark-submit --master yarn --deploy-mode cluster --conf spark.yarn.submit.waitAppCompletion=false". After the task is submitted to yarn, the submitting process immediately prints the appid to the log and exits automatically; it does not wait for the task to end. DS can then start a monitorApplication thread (or reuse the task-execute-thread) that polls the status of the app in yarn based on the appid.
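
A minimal sketch of this split, under stated assumptions: `build_submit_command` assembles the spark-submit flags quoted above, and `monitor_application` polls until yarn reports a terminal state. Both function names, the `fetch_state` callback and the `max_polls` limit are hypothetical. A real `fetch_state` might query the ResourceManager REST API or run `yarn application -status`, and would also need to check the final status, since a FINISHED application can still have failed.

```python
import time
from typing import Callable, List, Optional

# Terminal yarn application states; anything else means the app is still in flight.
TERMINAL_STATES = {"FINISHED", "FAILED", "KILLED"}


def build_submit_command(app_jar: str) -> List[str]:
    """Build a spark-submit command that exits right after submission."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--conf", "spark.yarn.submit.waitAppCompletion=false",
        app_jar,
    ]


def monitor_application(
    app_id: str,
    fetch_state: Callable[[str], str],
    poll_interval_s: float = 5.0,
    max_polls: Optional[int] = None,
) -> str:
    """Poll fetch_state(app_id) until a terminal state or max_polls is reached.

    Returns the last state observed.
    """
    polls = 0
    while True:
        state = fetch_state(app_id)
        polls += 1
        if state in TERMINAL_STATES:
            return state
        if max_polls is not None and polls >= max_polls:
            return state
        time.sleep(poll_interval_s)
```

Since the monitor depends only on the appid and a state lookup, any worker that is handed the appid could start the same loop, which is what makes the reassignment after a worker crash possible.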
   
   For tasks submitted in spark client mode, such as the "spark-submit --master yarn --deploy-mode client" or "spark-sql --master yarn" scripts, we cannot separate the submission process from the monitoring process, because the end of the client process means the end of the entire application. A wrapper is required to submit the sql task in cluster mode.
   
   For tasks submitted using beeline, taking spark as an example, the task is usually submitted to a thrift server, which then runs the job in a shared yarn application. The appid is usually shared by many jobs. We need to focus on how to submit the job in detached mode and get the jobid; the thrift server should probably report it to a client like beeline, which can then query the job status by jobid. As far as I know, beeline should be able to submit sql in detached mode but may not print the jobid, which needs more thought in the future.




[GitHub] [dolphinscheduler] Radeity commented on issue #12968: [Improvement][Master] task running in external resource manager (e.g. yarn ) should keep running whenever master or worker crashes when doing failover

Posted by GitBox <gi...@apache.org>.
Radeity commented on issue #12968:
URL: https://github.com/apache/dolphinscheduler/issues/12968#issuecomment-1327096832

   Hi, @hzyangkai 
   
   You have to make sure that all types of yarn tasks and all modes can work. BTW, your classification of type 1 is not quite right: a shell task can also submit a yarn job, simply by writing the submit command in the shell script, so the yarn job is submitted by a shell task instead of a spark task.
   
   This seems hard to finish in one step, so please think it through carefully! Looking forward to your PR!




[GitHub] [dolphinscheduler] hzyangkai commented on issue #12968: [Improvement][Master] task running in external resource manager (e.g. yarn ) should keep running whenever master or worker crashes when doing failover

Posted by GitBox <gi...@apache.org>.
hzyangkai commented on issue #12968:
URL: https://github.com/apache/dolphinscheduler/issues/12968#issuecomment-1324954480

   Thank you for your reply. Please see the attachment for the detailed design.
   [DolphinScheduler's detailed design of failover.pdf](https://github.com/apache/dolphinscheduler/files/10075229/DolphinScheduler.s.detailed.design.of.failover.pdf)
   




[GitHub] [dolphinscheduler] SbloodyS commented on issue #12968: [Improvement][Master] task running in external resource manager (e.g. yarn ) should keep running whenever master or worker crashes when doing failover

Posted by GitBox <gi...@apache.org>.
SbloodyS commented on issue #12968:
URL: https://github.com/apache/dolphinscheduler/issues/12968#issuecomment-1325902613

   Duplicate of #9664.




[GitHub] [dolphinscheduler] github-actions[bot] commented on issue #12968: [Improvement][Master] task running in external resource manager (e.g. yarn ) should keep running whenever master or worker crashes when doing failover

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on issue #12968:
URL: https://github.com/apache/dolphinscheduler/issues/12968#issuecomment-1323332656

   Thank you for your feedback. We have received your issue; please wait patiently for a reply.
   * So that we can understand your request as soon as possible, please provide detailed information, the version, or pictures.
   * If you haven't received a reply for a long time, you can [join our slack](https://s.apache.org/dolphinscheduler-slack) and send your question to the channel `#troubleshooting`.




[GitHub] [dolphinscheduler] hzyangkai commented on issue #12968: [Improvement][Master] task running in external resource manager (e.g. yarn ) should keep running whenever master or worker crashes when doing failover

Posted by GitBox <gi...@apache.org>.
hzyangkai commented on issue #12968:
URL: https://github.com/apache/dolphinscheduler/issues/12968#issuecomment-1326024749

   #9664 was a long time ago. There seems to be nothing to block this issue from 3.1.0. It seems to be an important feature. Can this issue be raised to a higher priority? I would like to submit a PR.
   
   

