Posted to dev@spark.apache.org by angers zhu <an...@gmail.com> on 2021/08/31 14:54:50 UTC

Discussion of current yarn-client mode problems

Hi devs,

In the current yarn-client mode, we have several problems:

   1. When the AM loses its connection with the driver, it just finishes
   the application with a final status of SUCCESS. Then
   YarnClientSchedulerBackend.MonitorThread receives an application report
   with that SUCCESS final status and calls sc.stop(). The SparkContext is
   stopped and the program exits with exit code 0. Scheduler systems
   generally use the exit code to judge whether an application succeeded,
   so neither the scheduler system nor the user knows the job has failed.
   2. Even when YarnClientSchedulerBackend.MonitorThread receives a YARN
   report with a FAILED or KILLED final status, it just calls sc.stop(),
   and the program still exits with code 0. So if a user kills the wrong
   application, the real owner of the killed application still sees an
   incorrect SUCCESS status for their job.
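To make the point concrete, here is a hedged sketch (not Spark's actual code) of how a monitor thread could translate the YARN final status into a driver exit code instead of always exiting with 0; the helper exitCodeFor and the status strings are illustrative only:

```scala
// Sketch: map a YARN final application status to a process exit code,
// so FAILED/KILLED (or a lost AM connection reported as UNDEFINED) no
// longer surface to the scheduler system as a successful exit code 0.
// exitCodeFor is a hypothetical helper, not part of Spark's API.
object ExitCodeSketch {
  def exitCodeFor(finalStatus: String): Int = finalStatus match {
    case "SUCCEEDED" => 0 // the application really finished successfully
    case "KILLED"    => 1 // e.g. someone ran `yarn application -kill`
    case "FAILED"    => 1 // the AM reported a failure
    case _           => 1 // UNDEFINED / connection lost: be conservative
  }

  def main(args: Array[String]): Unit = {
    // A scheduler system could then trust the process exit code directly.
    sys.exit(exitCodeFor("FAILED"))
  }
}
```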

There is some historical discussion of these two problems in SPARK-3627
<https://issues.apache.org/jira/browse/SPARK-3627> and SPARK-1516
<https://issues.apache.org/jira/browse/SPARK-1516>, but those conclusions
came from a very early discussion. Spark is now widely used by many
companies, and a lot of Spark-related job scheduling systems have been
developed accordingly. These problems confuse users and make their jobs
hard to manage.

I hope to get more feedback from the developers, or to hear whether there
is a good way to avoid these problems.

Below is my related PR about these two problems:
https://github.com/apache/spark/pull/33780

Best regards
Angers