Posted to issues@spark.apache.org by "hantiantian (JIRA)" <ji...@apache.org> on 2019/01/03 09:06:00 UTC

[jira] [Created] (SPARK-26524) If the application directory fails to be created under SPARK_WORKER_DIR on some worker nodes (for example, a bad disk or a disk with no free capacity), the application's executors will be allocated indefinitely.

hantiantian created SPARK-26524:
-----------------------------------

             Summary: If the application directory fails to be created under SPARK_WORKER_DIR on some worker nodes (for example, a bad disk or a disk with no free capacity), the application's executors will be allocated indefinitely.
                 Key: SPARK-26524
                 URL: https://issues.apache.org/jira/browse/SPARK-26524
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 2.4.0
            Reporter: hantiantian


When the Spark workers are started, the worker directories are created successfully. By the time the application is submitted, the disks that the worker directories on worker121 and worker122 are mounted on have become damaged.

When a worker launches an executor, it creates a working directory and a temporary directory for it. If that creation fails, the executor launch fails. Because the application directory cannot be created under SPARK_WORKER_DIR on worker121 and worker122, the application's executors are allocated indefinitely.
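
For context, the failure happens on the worker side roughly as sketched below. This is a simplified, hypothetical sketch, not the actual code in org.apache.spark.deploy.worker.Worker: when launching an executor, the worker first creates a per-executor directory under SPARK_WORKER_DIR, and if that fails the launch is aborted and the master sees the executor as FAILED.

import java.io.{File, IOException}

// Simplified sketch of the worker-side launch path (method and parameter names
// are illustrative only).
def launchExecutor(workDir: File, appId: String, execId: Int): Unit = {
  // Per-executor working directory under SPARK_WORKER_DIR/<appId>/<execId>
  val executorDir = new File(workDir, appId + "/" + execId)
  if (!executorDir.exists() && !executorDir.mkdirs()) {
    // On a bad or full disk this throws; the launch is aborted, the master
    // marks the executor FAILED and immediately schedules a replacement.
    throw new IOException("Failed to create directory " + executorDir)
  }
  // ... create local/temp dirs, build the command, start the executor process ...
}

With worker121 and worker122 in that state, the master log keeps cycling like this: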

2019-01-03 15:50:00,525 INFO org.apache.spark.deploy.master.Master: Removing executor app-20190103154954-0000/5762 because it is FAILED
2019-01-03 15:50:00,525 INFO org.apache.spark.deploy.master.Master: Launching executor app-20190103154954-0000/5765 on worker worker-20190103154858-worker121-37199
2019-01-03 15:50:00,526 INFO org.apache.spark.deploy.master.Master: Removing executor app-20190103154954-0000/5764 because it is FAILED
2019-01-03 15:50:00,526 INFO org.apache.spark.deploy.master.Master: Launching executor app-20190103154954-0000/5766 on worker worker-20190103154920-worker122-41273
2019-01-03 15:50:00,527 INFO org.apache.spark.deploy.master.Master: Removing executor app-20190103154954-0000/5766 because it is FAILED
2019-01-03 15:50:00,527 INFO org.apache.spark.deploy.master.Master: Launching executor app-20190103154954-0000/5767 on worker worker-20190103154920-worker122-41273
2019-01-03 15:50:00,532 INFO org.apache.spark.deploy.master.Master: Removing executor app-20190103154954-0000/5765 because it is FAILED
2019-01-03 15:50:00,532 INFO org.apache.spark.deploy.master.Master: Launching executor app-20190103154954-0000/5768 on worker worker-20190103154858-worker121-37199

...

I examined the code and found that Spark does have some handling for executor allocation failures. However, it only covers the case where the application has no executor that was ever successfully launched: the application is removed only when none of its executors is RUNNING, so as long as one executor is running somewhere, the failed launches keep being retried.

if (!normalExit
    && appInfo.incrementRetryCount() >= MAX_EXECUTOR_RETRIES
    && MAX_EXECUTOR_RETRIES >= 0) { // < 0 disables this application-killing path
  val execs = appInfo.executors.values
  if (!execs.exists(_.state == ExecutorState.RUNNING)) {
    logError(s"Application ${appInfo.desc.name} with ID ${appInfo.id} failed " +
      s"${appInfo.retryCount} times; removing it")
    removeApplication(appInfo, ApplicationState.FAILED)
  }
}

When scheduling, the Master judges whether a worker is usable only by the worker's state and free resources; it does not consider whether previous executor launches on that worker have failed.

// Filter out workers that don't have enough resources to launch an executor
val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
  .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
    worker.coresFree >= coresPerExecutor)
  .sortBy(_.coresFree).reverse
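
One possible direction for this improvement (a hypothetical sketch only; launchFailuresByWorker and maxLaunchFailuresPerWorker are invented names, not existing Spark fields) would be for the master to also count executor launch failures per worker and skip workers that keep failing:

// Hypothetical sketch: additionally exclude workers whose executor launches
// keep failing (e.g., because SPARK_WORKER_DIR is on a bad or full disk).
val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
  .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
    worker.coresFree >= coresPerExecutor)
  .filter(worker => launchFailuresByWorker.getOrElse(worker.id, 0) < maxLaunchFailuresPerWorker)
  .sortBy(_.coresFree).reverse

The failure count could be incremented when an executor fails before ever reaching RUNNING, and reset once a launch on that worker succeeds, so that a worker with a transient problem is not excluded forever.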

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org