Posted to issues@spark.apache.org by "Abhishek Choudhary (JIRA)" <ji...@apache.org> on 2015/04/22 12:45:58 UTC
[jira] [Created] (SPARK-7054) Spark jobs hang for ~15 mins when a node goes down

Abhishek Choudhary created SPARK-7054:
-----------------------------------------

             Summary: Spark jobs hang for ~15 mins when a node goes down
                 Key: SPARK-7054
                 URL: https://issues.apache.org/jira/browse/SPARK-7054
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.2.1
         Environment: CentOS 6, Java 8
            Reporter: Abhishek Choudhary
            Priority: Blocker


In a four-node cluster (on VMs) with 2 NameNodes, 2 DataNodes and 10 executors (YARN 2.4), Spark jobs are running in yarn-client mode. When a running VM is shut down, the Spark jobs hang for ~15 minutes.

After ~45-50 seconds the driver was informed of the lost block managers.
From the logs:

2015-04-22 09:50:30,000 [sparkDriver-akka.actor.default-dispatcher-15] WARN  org.apache.spark.storage.BlockManagerMasterActor - Removing BlockManager BlockManagerId(9, ACUME-DN2, 40898) with no recent heart beats: 59674ms exceeds 45000ms
2015-04-22 09:50:30,000 [sparkDriver-akka.actor.default-dispatcher-15] WARN  org.apache.spark.storage.BlockManagerMasterActor - Removing BlockManager BlockManagerId(5, ACUME-DN2, 37947) with no recent heart beats: 60044ms exceeds 45000ms
2015-04-22 09:50:30,001 [sparkDriver-akka.actor.default-dispatcher-15] WARN  org.apache.spark.storage.BlockManagerMasterActor - Removing BlockManager BlockManagerId(3, ACUME-DN2, 49808) with no recent heart beats: 54637ms exceeds 45000ms
2015-04-22 09:50:30,001 [sparkDriver-akka.actor.default-dispatcher-15] WARN  org.apache.spark.storage.BlockManagerMasterActor - Removing BlockManager BlockManagerId(1, ACUME-DN2, 44090) with no recent heart beats: 59049ms exceeds 45000ms
2015-04-22 09:50:30,001 [sparkDriver-akka.actor.default-dispatcher-15] WARN  org.apache.spark.storage.BlockManagerMasterActor - Removing BlockManager BlockManagerId(7, ACUME-DN2, 47267) with no recent heart beats: 56879ms exceeds 45000ms
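For anyone triaging similar logs, the warnings above can be broken into fields with a short script. This is only a minimal sketch in Python: the regular expression and the helper name `parse_removal` are illustrative assumptions based on the message format quoted here, not part of Spark.

```python
import re

# Pattern matching the "Removing BlockManager" warnings quoted above.
# (Assumption: the message layout is exactly as shown in this report.)
REMOVAL = re.compile(
    r"Removing BlockManager BlockManagerId\((\d+), ([^,]+), (\d+)\) "
    r"with no recent heart beats: (\d+)ms exceeds (\d+)ms"
)

def parse_removal(line):
    """Return the fields of one removal warning as a dict, or None if no match."""
    m = REMOVAL.search(line)
    if m is None:
        return None
    executor_id, host, port, silent_ms, limit_ms = m.groups()
    return {
        "executor_id": int(executor_id),
        "host": host,
        "port": int(port),
        "silent_ms": int(silent_ms),   # time since the last heartbeat
        "limit_ms": int(limit_ms),     # threshold reported in the log
        "over_ms": int(silent_ms) - int(limit_ms),
    }

# First warning from the report, split across lines for readability.
sample = ("2015-04-22 09:50:30,000 [sparkDriver-akka.actor.default-dispatcher-15] WARN  "
          "org.apache.spark.storage.BlockManagerMasterActor - Removing BlockManager "
          "BlockManagerId(9, ACUME-DN2, 40898) with no recent heart beats: "
          "59674ms exceeds 45000ms")
info = parse_removal(sample)
print(info["host"], info["executor_id"], info["over_ms"])
```

Running this over all five warnings shows that every removed block manager was on host ACUME-DN2.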


After ~15 minutes the Spark driver received the executor lost event and rescheduled the failed tasks.

From the logs:

2015-04-22 10:05:04,965 [sparkDriver-akka.actor.default-dispatcher-19] ERROR org.apache.spark.scheduler.cluster.YarnClientClusterScheduler - Lost executor 1 on ACUME-DN2: remote Akka client disassociated

During these 15 minutes, all jobs with tasks on executors running on the shut-down VM were stuck.
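The length of the stall can be read directly from the two log timestamps for executor 1. The following is a rough sketch in Python; the helper name `log_time` is made up for illustration, and the timestamp format is assumed to be the one shown in the log lines above.

```python
from datetime import datetime

# Timestamp layout used in the driver log lines above
# (comma before the millisecond part).
FMT = "%Y-%m-%d %H:%M:%S,%f"

def log_time(stamp):
    """Parse a driver-log timestamp into a datetime."""
    return datetime.strptime(stamp, FMT)

# Block manager for executor 1 removed vs. executor 1 reported lost.
removed = log_time("2015-04-22 09:50:30,001")
lost = log_time("2015-04-22 10:05:04,965")

gap = lost - removed
minutes, seconds = divmod(gap.total_seconds(), 60)
print(f"gap: {int(minutes)} min {seconds:.3f} s")
```

The gap between the block manager removal and the executor lost event comes out a little under 15 minutes, which matches the observed hang.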


