Posted to issues@spark.apache.org by "Shaoquan Zhang (JIRA)" <ji...@apache.org> on 2018/01/04 12:11:00 UTC

[jira] [Created] (SPARK-22958) Spark is stuck when the only one executor fails to register with driver

Shaoquan Zhang created SPARK-22958:
--------------------------------------

             Summary: Spark is stuck when the only one executor fails to register with driver
                 Key: SPARK-22958
                 URL: https://issues.apache.org/jira/browse/SPARK-22958
             Project: Spark
          Issue Type: Bug
          Components: YARN
    Affects Versions: 2.1.0
            Reporter: Shaoquan Zhang


We encountered the following scenario. We ran a very simple job in yarn cluster mode. The job needs only one executor to complete. While running, the job got stuck and never finished.

After checking the job log, we found an issue in Spark. When an executor fails to register with the driver, YarnAllocator is not aware of the failure. As a result, the variable numExecutorsRunning maintained by YarnAllocator does not reflect the real number of running executors. When this variable is then used to allocate resources for the running job, the allocator works from an incorrect count. In our job, this incorrect count left the job stuck forever.
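
The mismatch described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Spark's actual code: ToyAllocator, missing_executors, launch_executor, on_container_completed and simulate are hypothetical names, and the idea that the counter is incremented at launch time (before registration) is an assumption made for illustration. Only the counter name mirrors numExecutorsRunning from the report.

```python
class ToyAllocator:
    """Toy stand-in for a resource allocator; NOT Spark's YarnAllocator."""

    def __init__(self, target_executors):
        self.target_executors = target_executors
        # Mirrors the counter named in the report (numExecutorsRunning).
        self.num_executors_running = 0

    def missing_executors(self):
        # How many more executors the allocator believes it must request.
        return max(self.target_executors - self.num_executors_running, 0)

    def launch_executor(self):
        # Assumption: the counter is bumped at launch, before registration.
        self.num_executors_running += 1

    def on_container_completed(self):
        # Only an event the allocator actually observes lowers the counter.
        self.num_executors_running -= 1


def simulate(registration_succeeds, allocator_notified):
    """Return (registered executors, executors the allocator would still request)."""
    allocator = ToyAllocator(target_executors=1)
    registered = 0
    if allocator.missing_executors() > 0:
        allocator.launch_executor()
        if registration_succeeds:
            registered += 1
        elif allocator_notified:
            allocator.on_container_completed()
    return registered, allocator.missing_executors()


if __name__ == "__main__":
    print("healthy run:        ", simulate(True, False))
    print("silent failure:     ", simulate(False, False))
    print("failure is reported:", simulate(False, True))
```

In the silent-failure case the model ends with no registered executor while the allocator still believes its target is met, so it requests nothing further. That is the stuck state described in this report. If the allocator is told about the failure, it sees one missing executor and could request a replacement.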

More details are as follows. The following figure shows how



