Posted to dev@spark.apache.org by angers zhu <an...@gmail.com> on 2021/08/31 14:54:50 UTC
Discussion of current yarn-client mode problems
Hi devs,
In the current yarn-client mode, we have several problems:
1. When the AM loses its connection to the driver, it finishes the
application with a final status of SUCCESS. YarnClientSchedulerBackend.MonitorThread
then receives an application report with that SUCCESS final status and calls
sc.stop(). The SparkContext stops and the program exits with exit code 0.
Scheduler systems generally use the exit code to judge whether an
application succeeded, so neither the scheduler system nor the user can
tell that the job actually failed.
2. Even when YarnClientSchedulerBackend.MonitorThread receives a yarn
report with a FAILED or KILLED final status, it just calls sc.stop(), so
the program still exits with code 0. When a user kills the wrong
application, the real owner of the killed application still sees an
incorrect SUCCESS status for their job.
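To illustrate why this matters to scheduler systems, here is a minimal sketch of the failure mode described above. The run_driver function is a hypothetical stand-in for a yarn-client driver whose AM has failed but which still exits normally through sc.stop(); the check below is how a typical scheduler wrapper judges the job:

```shell
#!/bin/sh
# Hypothetical stand-in for a yarn-client driver: the AM reports FAILED,
# but the driver JVM still exits with code 0 after sc.stop().
run_driver() {
    echo "AM reported final status: FAILED" >&2
    exit 0   # driver exits 0 regardless of the yarn final status
}

# A scheduler system typically judges the job by the exit code alone:
( run_driver ) 2>/dev/null
if [ $? -eq 0 ]; then
    echo "scheduler sees: SUCCESS"   # wrong conclusion
else
    echo "scheduler sees: FAILURE"
fi
```

Because the subshell exits 0, the wrapper reports SUCCESS even though the simulated AM failed, which is exactly the confusion described above.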
There is some historical discussion of these two problems in SPARK-3627
<https://issues.apache.org/jira/browse/SPARK-3627> and SPARK-1516
<https://issues.apache.org/jira/browse/SPARK-1516>, but those conclusions
came from a very early discussion. Spark is now widely used by various
companies, and many Spark-related job scheduling systems have been
developed accordingly. These problems confuse users and make it hard for
them to manage their jobs.
I hope to get more feedback from the developers, or to learn whether there
is a good way to avoid these problems.
Below is my related PR about these two problems:
https://github.com/apache/spark/pull/33780
Best regards
Angers