Posted to issues@spark.apache.org by "SuYan (JIRA)" <ji...@apache.org> on 2015/12/18 08:50:46 UTC

[jira] [Created] (SPARK-12419) Executor lost with fetchFailed = false should not be allowed to re-register with BlockManagerMaster again?

SuYan created SPARK-12419:
-----------------------------

             Summary: Executor lost with fetchFailed = false should not be allowed to re-register with BlockManagerMaster again?
                 Key: SPARK-12419
                 URL: https://issues.apache.org/jira/browse/SPARK-12419
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.5.2
            Reporter: SuYan
            Priority: Minor


On YARN, I observed a container reported as completed by YarnAllocator (YARN killed the container proactively because of a disk error), and the corresponding executor was removed
from BlockManagerMaster.
But about 1 second later, because YARN had not actually killed the executor yet, it re-registered with BlockManagerMaster... which looks unreasonable.

I checked the code:
fetchFailed = true: it is reasonable to allow the executor to re-register.
fetchFailed = false: this covers heartbeat expiry (which calls sc.killExecutor), CoarseGrainedSchedulerBackend.RemoveExecutor() (which assumes the executor will never come back), and the Mesos executor-lost event (I am not familiar with the Mesos executor-lost event; will it allow the executor to come back again?). If every fetchFailed = false executor loss is treated as "will not come back", we may be able to prevent such executors from re-registering with BlockManagerMaster.
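To make the proposal concrete, here is a rough driver-side sketch. This is not the actual BlockManagerMasterEndpoint code; the BlockManagerRegistry class, the permanentlyRemoved set, and the simplified BlockManagerId below are made up for illustration. The idea is to remember executors removed for fetchFailed = false reasons and reject a later registration from the same executor id:

// Hypothetical sketch only, not Spark internals.
import scala.collection.mutable

// Simplified stand-in for Spark's BlockManagerId.
case class BlockManagerId(executorId: String, host: String, port: Int)

class BlockManagerRegistry {
  private val blockManagers = mutable.HashMap.empty[String, BlockManagerId]
  // Executors removed because the scheduler decided they will never come back
  // (heartbeat expiry, RemoveExecutor, ...), i.e. the fetchFailed = false cases.
  private val permanentlyRemoved = mutable.HashSet.empty[String]

  def removeExecutor(executorId: String, fetchFailed: Boolean): Unit = {
    blockManagers.remove(executorId)
    if (!fetchFailed) {
      // Only a fetch failure leaves the door open for the executor to return.
      permanentlyRemoved += executorId
    }
  }

  def register(id: BlockManagerId): Boolean = {
    if (permanentlyRemoved.contains(id.executorId)) {
      // Late registration from an executor the driver already gave up on
      // (e.g. a YARN container reported lost but not yet killed): refuse it.
      false
    } else {
      blockManagers(id.executorId) = id
      true
    }
  }
}

With something like this, the late registration at 10:25:01,066 from executor 84 in the logs below would be refused instead of silently re-creating the block manager entry.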

Also, this might be a YARN-side improvement: shouldn't a container reported in completedContainers be guaranteed to be really dead?

Here are the logs:

2015-12-14,10:25:00,647 INFO org.apache.spark.deploy.yarn.YarnAllocator: Completed container container_1435709042873_31294_01_208639 (state: COMPLETE, exit status: -100)
2015-12-14,10:25:00,647 INFO org.apache.spark.deploy.yarn.YarnAllocator: Container marked as failed: container_1435709042873_31294_01_208639. Exit status: -100. Diagnostics: Container released on a *lost* node


2015-12-14,10:25:00,667 ERROR org.apache.spark.scheduler.cluster.YarnClusterScheduler: Lost executor 84 on XX.XX.XX.109.bj: Yarn deallocated the executor 84 (container container_1435709042873_31294_01_208639)
2015-12-14,10:25:00,667 INFO org.apache.spark.scheduler.TaskSetManager: Re-queueing tasks for 84 from TaskSet 5.0
2015-12-14,10:25:00,670 INFO org.apache.spark.scheduler.ShuffleMapStage: ShuffleMapStage 5 is now unavailable on executor 21 (1926/2600, false)
2015-12-14,10:25:00,674 INFO org.apache.spark.scheduler.DAGScheduler: Resubmitted ShuffleMapTask(5, 504), so marking it as still running
2015-12-14,10:25:00,675 INFO org.apache.spark.scheduler.DAGScheduler: Resubmitted ShuffleMapTask(5, 773), so marking it as still running

2015-12-14,10:25:00,676 INFO org.apache.spark.scheduler.DAGScheduler: Executor lost: 84 (epoch 13)
2015-12-14,10:25:00,676 INFO org.apache.spark.storage.BlockManagerMasterEndpoint: Trying to remove executor 84 from BlockManagerMaster.
2015-12-14,10:25:00,677 INFO org.apache.spark.storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(84, XX.XX.XX.109.bj, 44528)
2015-12-14,10:25:00,677 INFO org.apache.spark.storage.BlockManagerMaster: Removed 84 successfully in removeExecutor

2015-12-14,10:25:01,066 INFO org.apache.spark.storage.BlockManagerMasterEndpoint: Registering block manager XX.XX.XX.109.bj:44528 with 706.7 MB RAM, BlockManagerId(84, XX.XX.XX.109.bj, 44528)

2015-12-14,10:25:01,584 INFO org.apache.spark.storage.BlockManagerInfo: Added rdd_20_2278 in memory on XX.XX.XX.109.bj:44528 (size:





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org