Posted to issues@spark.apache.org by "Subroto Sanyal (JIRA)" <ji...@apache.org> on 2016/06/14 05:09:57 UTC

[jira] [Created] (SPARK-15937) Spark declares a succeeding job to be failed in yarn-cluster mode if the job takes a very short time (~< 10 seconds) to finish

Subroto Sanyal created SPARK-15937:
--------------------------------------

             Summary: Spark declares a succeeding job to be failed in yarn-cluster mode if the job takes a very short time (~< 10 seconds) to finish
                 Key: SPARK-15937
                 URL: https://issues.apache.org/jira/browse/SPARK-15937
             Project: Spark
          Issue Type: Bug
    Affects Versions: 1.6.1
            Reporter: Subroto Sanyal


h5. Problem:
A Spark job fails in yarn-cluster mode if it finishes in less than 10 seconds. The job execution itself is successful, but the Spark framework declares it failed.
{noformat}
16/06/13 10:50:29 INFO yarn.ApplicationMaster: Registered signal handlers for [TERM, HUP, INT]
16/06/13 10:50:30 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/06/13 10:50:31 INFO yarn.ApplicationMaster: ApplicationAttemptId: appattempt_1465791692084_0078_000001
16/06/13 10:50:32 INFO spark.SecurityManager: Changing view acls to: subroto
16/06/13 10:50:32 INFO spark.SecurityManager: Changing modify acls to: subroto
16/06/13 10:50:32 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(subroto); users with modify permissions: Set(subroto)
16/06/13 10:50:32 INFO yarn.ApplicationMaster: Starting the user application in a separate Thread
16/06/13 10:50:32 INFO yarn.ApplicationMaster: Waiting for spark context initialization
16/06/13 10:50:32 INFO yarn.ApplicationMaster: Waiting for spark context initialization ... 
16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Initializing plugin registry on cluster...
16/06/13 10:50:33 INFO util.DefaultTimeZone: Loading default time zone of US/Eastern
16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Setting system property das.big-decimal.precision=32
16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Setting system property das.default-timezone=US/Eastern
16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Setting system property das.security.conductor.properties.keysLocation=etc/securePropertiesKeys
16/06/13 10:50:33 INFO util.DefaultTimeZone: Changing default time zone of from US/Eastern to US/Eastern
16/06/13 10:50:34 INFO job.PluginRegistryImpl: --- JVM Information ---
16/06/13 10:50:34 INFO job.PluginRegistryImpl: JVM: Java HotSpot(TM) 64-Bit Server VM, 1.7 (Oracle Corporation)
16/06/13 10:50:34 INFO job.PluginRegistryImpl: JVM arguments: -Xmx1024m -Djava.io.tmpdir=/mnt/hadoop/yarn/usercache/subroto/appcache/application_1465791692084_0078/container_1465791692084_0078_01_000001/tmp -Dspark.yarn.app.container.log.dir=/var/log/hadoop/yarn/application_1465791692084_0078/container_1465791692084_0078_01_000001 -XX:MaxPermSize=256m 
16/06/13 10:50:34 INFO job.PluginRegistryImpl: Log4j:  'file:/mnt/hadoop/yarn/usercache/subroto/filecache/103/__spark_conf__6826322497897602970.zip/log4j.properties' (default classpath)
16/06/13 10:50:34 INFO job.PluginRegistryImpl: Max memory : 910.5 MB
16/06/13 10:50:34 INFO job.PluginRegistryImpl: Free memory: 831.8 MB, before Plugin Registry start-up: 847.5 MB
16/06/13 10:50:34 INFO job.PluginRegistryImpl: ---------------------
16/06/13 10:50:34 INFO graphv2.ClusterTaskRuntime: Initializing cluster task configuration...
16/06/13 10:50:34 INFO util.LoggingUtil: Setting root logger level for hadoop task to DEBUG: 
16/06/13 10:50:35 INFO cluster.JobProcessor: Processing JobInput{_jobName=Import job (76): BookorderHS2ImportJob_SparkCluster#import(Identity)}
16/06/13 10:50:35 DEBUG security.UserGroupInformation: hadoop login
16/06/13 10:50:35 INFO cluster.JobProcessor: Writing job output to hdfs://ip-10-195-43-46.eu-west-1.compute.internal:8020/user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827.
16/06/13 10:50:35 DEBUG hdfs.DFSClient: /user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827: masked=rw-r--r--
16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto sending #2
16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got value #2
16/06/13 10:50:35 DEBUG ipc.ProtobufRpcEngine: Call: create took 2ms
16/06/13 10:50:35 DEBUG hdfs.DFSClient: computePacketChunkSize: src=/user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827, chunkSize=516, chunksPerPacket=127, packetSize=65532
16/06/13 10:50:35 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for [DFSClient_NONMAPREDUCE_1004172348_1] with renew id 1 started
16/06/13 10:50:35 DEBUG hdfs.DFSClient: DFSClient writeChunk allocating new packet seqno=0, src=/user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827, packetSize=65532, chunksPerPacket=127, bytesCurBlock=0
16/06/13 10:50:35 DEBUG hdfs.DFSClient: Queued packet 0
16/06/13 10:50:35 DEBUG hdfs.DFSClient: Queued packet 1
16/06/13 10:50:35 DEBUG hdfs.DFSClient: Allocating new block
16/06/13 10:50:35 DEBUG hdfs.DFSClient: Waiting for ack for: 1
16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto sending #3
16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got value #3
16/06/13 10:50:35 DEBUG ipc.ProtobufRpcEngine: Call: addBlock took 1ms
16/06/13 10:50:35 DEBUG hdfs.DFSClient: pipeline = 10.126.43.144:1004
16/06/13 10:50:35 DEBUG hdfs.DFSClient: pipeline = 10.195.43.46:1004
16/06/13 10:50:35 DEBUG hdfs.DFSClient: Connecting to datanode 10.126.43.144:1004
16/06/13 10:50:35 DEBUG hdfs.DFSClient: Send buf size 131071
16/06/13 10:50:35 DEBUG hdfs.DFSClient: DataStreamer block BP-88063502-10.195.43.46-1465791630789:blk_1073743637_2813 sending packet packet seqno:0 offsetInBlock:0 lastPacketInBlock:false lastByteOffsetInBlock: 330
16/06/13 10:50:35 DEBUG hdfs.DFSClient: DFSClient seqno: 0 status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 510884
16/06/13 10:50:35 DEBUG hdfs.DFSClient: DataStreamer block BP-88063502-10.195.43.46-1465791630789:blk_1073743637_2813 sending packet packet seqno:1 offsetInBlock:330 lastPacketInBlock:true lastByteOffsetInBlock: 330
16/06/13 10:50:35 DEBUG hdfs.DFSClient: DFSClient seqno: 1 status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 1449877
16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto sending #4
16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got value #4
16/06/13 10:50:35 DEBUG ipc.ProtobufRpcEngine: Call: complete took 1ms
16/06/13 10:50:35 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
16/06/13 10:50:35 DEBUG yarn.ApplicationMaster: Done running users class
16/06/13 10:50:42 ERROR yarn.ApplicationMaster: SparkContext did not initialize after waiting for 500000 ms. Please check earlier log output for errors. Failing the application.
16/06/13 10:50:42 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED
16/06/13 10:50:42 INFO yarn.ApplicationMaster: Deleting staging directory .sparkStaging/application_1465791692084_0078
16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto sending #5
16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got value #5
16/06/13 10:50:42 DEBUG ipc.ProtobufRpcEngine: Call: delete took 2ms
16/06/13 10:50:42 INFO util.ShutdownHookManager: Shutdown hook called
16/06/13 10:50:42 DEBUG ipc.Client: Stopping client
16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto: closed
16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto: stopped, remaining connections 0
{noformat}
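
For illustration, a trivial user class of the following shape (a hypothetical sketch; the class name and jar are ours, not taken from the job in the log above) finishes well inside 10 seconds and should be enough to hit this behaviour:
{code}
// Hypothetical minimal user class (names are ours, not from the report).
// It does trivial work and returns almost immediately, well inside the
// ApplicationMaster's 10-second polling interval.
object QuickJob {
  def main(args: Array[String]): Unit = {
    println("quick job done")
  }
}
{code}
Submitted with, for example, {{spark-submit --master yarn --deploy-mode cluster --class QuickJob quick-job.jar}}, the application completes successfully but is reported as failed.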

h5. Root Cause
The logic in _org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkContextInitialized()_ that decides whether the SparkContext is available has a race condition: it does not account for the user class finishing within the method's 10-second polling interval.
In the log snippet above (the ApplicationMaster log), the user class finishes executing and the application is marked finished with final app status _SUCCEEDED_ and exitCode _0_, but immediately afterwards an error claims the SparkContext did not initialize after 500000 ms (500 seconds) and the application is failed, even though the job actually takes a little less than 5 seconds to complete.
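
The pattern is easy to see in isolation. The sketch below simulates the 1.6.x wait logic (a paraphrase under our reading of the code, not the literal ApplicationMaster source; {{sparkContextRef}} and {{finished}} are stand-ins modeled on its internal state): the wait loop exits as soon as the user-class thread marks the application finished, but the null check that follows cannot tell "timed out" apart from "user class completed before publishing a SparkContext", so the misleading error is logged and the caller fails the application.
{code}
import java.util.concurrent.atomic.AtomicReference

// Self-contained simulation of the wait pattern (paraphrased, not the
// literal ApplicationMaster source; SparkContext is stubbed as AnyRef).
object SparkContextWaitRace {
  private val sparkContextRef = new AtomicReference[AnyRef](null)
  @volatile private var finished = false

  private def waitForSparkContextInitialized(totalWaitTime: Long): AnyRef =
    sparkContextRef.synchronized {
      val deadline = System.currentTimeMillis() + totalWaitTime
      // Poll in 10-second slices; the loop also exits early once `finished`
      // is set by the user-class thread.
      while (sparkContextRef.get() == null &&
          System.currentTimeMillis() < deadline && !finished) {
        sparkContextRef.wait(10000L)
      }
      val sc = sparkContextRef.get()
      if (sc == null) {
        // Race: this branch is reached not only on a real timeout but also
        // when the user class already completed successfully, after which
        // the caller fails the application.
        println(s"SparkContext did not initialize after waiting for $totalWaitTime ms.")
      }
      sc
    }

  def main(args: Array[String]): Unit = {
    // Simulate a user class that finishes in ~2 seconds without ever
    // publishing a SparkContext.
    new Thread(new Runnable {
      override def run(): Unit = {
        Thread.sleep(2000L)
        sparkContextRef.synchronized {
          finished = true              // user class done, status SUCCEEDED
          sparkContextRef.notifyAll()
        }
      }
    }).start()
    // Returns null and logs the misleading error, although the "job" succeeded.
    assert(waitForSparkContextInitialized(500000L) == null)
  }
}
{code}
Note that in the log above the configured wait time ({{spark.yarn.am.waitTime}}) appears to be 500000 ms, yet the application is failed after roughly 7 seconds, as soon as the user class finishes.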


