Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2019/01/03 09:59:00 UTC

[jira] [Assigned] (SPARK-26524) If the application directory fails to be created under SPARK_WORKER_DIR on some worker nodes (for example, a bad disk or a disk with no capacity), application executors will be allocated indefinitely.

     [ https://issues.apache.org/jira/browse/SPARK-26524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-26524:
------------------------------------

    Assignee: Apache Spark

> If the application directory fails to be created under SPARK_WORKER_DIR on some worker nodes (for example, a bad disk or a disk with no capacity), application executors will be allocated indefinitely.
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-26524
>                 URL: https://issues.apache.org/jira/browse/SPARK-26524
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: hantiantian
>            Assignee: Apache Spark
>            Priority: Major
>
> When the Spark workers start, the worker directories are created successfully. By the time the application is submitted, the disks backing the SPARK_WORKER_DIR on worker121 and worker122 have become damaged.
> When a worker launches an executor, it creates a working directory and a temporary directory for it; if that creation fails, the executor launch fails. Because the application directory cannot be created under SPARK_WORKER_DIR on worker121 and worker122, executors for the application are allocated indefinitely (see the sketch after the log excerpt below):
> 2019-01-03 15:50:00,525 INFO org.apache.spark.deploy.master.Master: Removing executor app-20190103154954-0000/5762 because it is FAILED
> 2019-01-03 15:50:00,525 INFO org.apache.spark.deploy.master.Master: Launching executor app-20190103154954-0000/5765 on worker worker-20190103154858-worker121-37199
> 2019-01-03 15:50:00,526 INFO org.apache.spark.deploy.master.Master: Removing executor app-20190103154954-0000/5764 because it is FAILED
> 2019-01-03 15:50:00,526 INFO org.apache.spark.deploy.master.Master: Launching executor app-20190103154954-0000/5766 on worker worker-20190103154920-worker122-41273
> 2019-01-03 15:50:00,527 INFO org.apache.spark.deploy.master.Master: Removing executor app-20190103154954-0000/5766 because it is FAILED
> 2019-01-03 15:50:00,527 INFO org.apache.spark.deploy.master.Master: Launching executor app-20190103154954-0000/5767 on worker worker-20190103154920-worker122-41273
> 2019-01-03 15:50:00,532 INFO org.apache.spark.deploy.master.Master: Removing executor app-20190103154954-0000/5765 because it is FAILED
> 2019-01-03 15:50:00,532 INFO org.apache.spark.deploy.master.Master: Launching executor app-20190103154954-0000/5768 on worker worker-20190103154858-worker121-37199
> ...
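> The failure itself happens on the worker side: before launching an executor, the worker creates that executor's working directory under SPARK_WORKER_DIR. A minimal sketch of that step (simplified, not the exact Worker code; workDir, appId and execId stand in for the worker's real fields):
>
> import java.io.{File, IOException}
>
> // Sketch: the directory creation a worker performs when launching an executor.
> // workDir, appId and execId are placeholders for the worker's real fields.
> val executorDir = new File(workDir, appId + "/" + execId)
> if (!executorDir.exists() && !executorDir.mkdirs()) {
>   // On a bad or full disk mkdirs() returns false, the launch fails, the
>   // executor is reported FAILED, and the master immediately retries it.
>   throw new IOException("Failed to create directory " + executorDir)
> }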
> Looking at the code, I found that Spark does have some handling for executor launch failures in the Master. However, it only removes the application when none of its executors is RUNNING; as long as at least one executor is running on a healthy worker, the retry count never triggers removal, and the failing workers are retried forever:
> if (!normalExit
>     && appInfo.incrementRetryCount() >= MAX_EXECUTOR_RETRIES
>     && MAX_EXECUTOR_RETRIES >= 0) { // < 0 disables this application-killing path
>   val execs = appInfo.executors.values
>   if (!execs.exists(_.state == ExecutorState.RUNNING)) {
>     logError(s"Application ${appInfo.desc.name} with ID ${appInfo.id} failed " +
>       s"${appInfo.retryCount} times; removing it")
>     removeApplication(appInfo, ApplicationState.FAILED)
>   }
> }
> The Master only judges whether a worker is usable from the worker's free resources; it does not take into account that executor launches on that worker keep failing:
> // Filter out workers that don't have enough resources to launch an executor
> val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
>   .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
>     worker.coresFree >= coresPerExecutor)
>   .sortBy(_.coresFree).reverse
>  
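> A possible direction for this improvement (a sketch only, under the assumption of a hypothetical per-application, per-worker launch-failure counter; launchFailures and MAX_WORKER_LAUNCH_FAILURES do not exist in Spark today) would be to also exclude workers on which this application's executors repeatedly fail to launch:
>
> // Hypothetical extension of the filter above: additionally skip workers on
> // which this application's executor launches have repeatedly failed.
> // launchFailures(appId, workerId) and MAX_WORKER_LAUNCH_FAILURES are assumed
> // helpers, not existing Spark APIs.
> val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
>   .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
>     worker.coresFree >= coresPerExecutor)
>   .filter(worker => launchFailures(app.id, worker.id) < MAX_WORKER_LAUNCH_FAILURES)
>   .sortBy(_.coresFree).reverse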



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org