Posted to issues@spark.apache.org by "Subroto Sanyal (JIRA)" <ji...@apache.org> on 2016/06/14 07:10:40 UTC

[jira] [Updated] (SPARK-15937) Spark declares a succeeding job to be failed in yarn-cluster mode if the job takes very small time (~ < 10 seconds) to finish

     [ https://issues.apache.org/jira/browse/SPARK-15937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Subroto Sanyal updated SPARK-15937:
-----------------------------------
    Priority: Critical  (was: Major)

> Spark declares a succeeding job to be failed in yarn-cluster mode if the job takes very small time (~ < 10 seconds) to finish
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-15937
>                 URL: https://issues.apache.org/jira/browse/SPARK-15937
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.6.1
>            Reporter: Subroto Sanyal
>            Priority: Critical
>
> h5. Problem:
> A Spark job fails in yarn-cluster mode if it takes less than about 10 seconds to finish. The job execution itself is successful, but the Spark framework declares it failed.
> {noformat}
> 16/06/13 10:50:29 INFO yarn.ApplicationMaster: Registered signal handlers for [TERM, HUP, INT]
> 16/06/13 10:50:30 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 16/06/13 10:50:31 INFO yarn.ApplicationMaster: ApplicationAttemptId: appattempt_1465791692084_0078_000001
> 16/06/13 10:50:32 INFO spark.SecurityManager: Changing view acls to: subroto
> 16/06/13 10:50:32 INFO spark.SecurityManager: Changing modify acls to: subroto
> 16/06/13 10:50:32 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(subroto); users with modify permissions: Set(subroto)
> 16/06/13 10:50:32 INFO yarn.ApplicationMaster: Starting the user application in a separate Thread
> 16/06/13 10:50:32 INFO yarn.ApplicationMaster: Waiting for spark context initialization
> 16/06/13 10:50:32 INFO yarn.ApplicationMaster: Waiting for spark context initialization ... 
> 16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Initializing plugin registry on cluster...
> 16/06/13 10:50:33 INFO util.DefaultTimeZone: Loading default time zone of US/Eastern
> 16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Setting system property das.big-decimal.precision=32
> 16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Setting system property das.default-timezone=US/Eastern
> 16/06/13 10:50:33 INFO graphv2.ClusterTaskRuntime: Setting system property das.security.conductor.properties.keysLocation=etc/securePropertiesKeys
> 16/06/13 10:50:33 INFO util.DefaultTimeZone: Changing default time zone of from US/Eastern to US/Eastern
> 16/06/13 10:50:34 INFO job.PluginRegistryImpl: --- JVM Information ---
> 16/06/13 10:50:34 INFO job.PluginRegistryImpl: JVM: Java HotSpot(TM) 64-Bit Server VM, 1.7 (Oracle Corporation)
> 16/06/13 10:50:34 INFO job.PluginRegistryImpl: JVM arguments: -Xmx1024m -Djava.io.tmpdir=/mnt/hadoop/yarn/usercache/subroto/appcache/application_1465791692084_0078/container_1465791692084_0078_01_000001/tmp -Dspark.yarn.app.container.log.dir=/var/log/hadoop/yarn/application_1465791692084_0078/container_1465791692084_0078_01_000001 -XX:MaxPermSize=256m 
> 16/06/13 10:50:34 INFO job.PluginRegistryImpl: Log4j:  'file:/mnt/hadoop/yarn/usercache/subroto/filecache/103/__spark_conf__6826322497897602970.zip/log4j.properties' (default classpath)
> 16/06/13 10:50:34 INFO job.PluginRegistryImpl: Max memory : 910.5 MB
> 16/06/13 10:50:34 INFO job.PluginRegistryImpl: Free memory: 831.8 MB, before Plugin Registry start-up: 847.5 MB
> 16/06/13 10:50:34 INFO job.PluginRegistryImpl: ---------------------
> 16/06/13 10:50:34 INFO graphv2.ClusterTaskRuntime: Initializing cluster task configuration...
> 16/06/13 10:50:34 INFO util.LoggingUtil: Setting root logger level for hadoop task to DEBUG: 
> 16/06/13 10:50:35 INFO cluster.JobProcessor: Processing JobInput{_jobName=Import job (76): BookorderHS2ImportJob_SparkCluster#import(Identity)}
> 16/06/13 10:50:35 DEBUG security.UserGroupInformation: hadoop login
> 16/06/13 10:50:35 INFO cluster.JobProcessor: Writing job output to hdfs://ip-10-195-43-46.eu-west-1.compute.internal:8020/user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827.
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: /user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827: masked=rw-r--r--
> 16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto sending #2
> 16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got value #2
> 16/06/13 10:50:35 DEBUG ipc.ProtobufRpcEngine: Call: create took 2ms
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: computePacketChunkSize: src=/user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827, chunkSize=516, chunksPerPacket=127, packetSize=65532
> 16/06/13 10:50:35 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for [DFSClient_NONMAPREDUCE_1004172348_1] with renew id 1 started
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: DFSClient writeChunk allocating new packet seqno=0, src=/user/subroto/dap1/temp/Output-19017846-059d-4bf1-a95d-1063fe6c1827, packetSize=65532, chunksPerPacket=127, bytesCurBlock=0
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: Queued packet 0
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: Queued packet 1
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: Allocating new block
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: Waiting for ack for: 1
> 16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto sending #3
> 16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got value #3
> 16/06/13 10:50:35 DEBUG ipc.ProtobufRpcEngine: Call: addBlock took 1ms
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: pipeline = 10.126.43.144:1004
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: pipeline = 10.195.43.46:1004
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: Connecting to datanode 10.126.43.144:1004
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: Send buf size 131071
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: DataStreamer block BP-88063502-10.195.43.46-1465791630789:blk_1073743637_2813 sending packet packet seqno:0 offsetInBlock:0 lastPacketInBlock:false lastByteOffsetInBlock: 330
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: DFSClient seqno: 0 status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 510884
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: DataStreamer block BP-88063502-10.195.43.46-1465791630789:blk_1073743637_2813 sending packet packet seqno:1 offsetInBlock:330 lastPacketInBlock:true lastByteOffsetInBlock: 330
> 16/06/13 10:50:35 DEBUG hdfs.DFSClient: DFSClient seqno: 1 status: SUCCESS status: SUCCESS downstreamAckTimeNanos: 1449877
> 16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto sending #4
> 16/06/13 10:50:35 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got value #4
> 16/06/13 10:50:35 DEBUG ipc.ProtobufRpcEngine: Call: complete took 1ms
> 16/06/13 10:50:35 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
> 16/06/13 10:50:35 DEBUG yarn.ApplicationMaster: Done running users class
> 16/06/13 10:50:42 ERROR yarn.ApplicationMaster: SparkContext did not initialize after waiting for 500000 ms. Please check earlier log output for errors. Failing the application.
> 16/06/13 10:50:42 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED
> 16/06/13 10:50:42 INFO yarn.ApplicationMaster: Deleting staging directory .sparkStaging/application_1465791692084_0078
> 16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto sending #5
> 16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto got value #5
> 16/06/13 10:50:42 DEBUG ipc.ProtobufRpcEngine: Call: delete took 2ms
> 16/06/13 10:50:42 INFO util.ShutdownHookManager: Shutdown hook called
> 16/06/13 10:50:42 DEBUG ipc.Client: Stopping client
> 16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto: closed
> 16/06/13 10:50:42 DEBUG ipc.Client: IPC Client (841703792) connection to ip-10-195-43-46.eu-west-1.compute.internal/10.195.43.46:8020 from subroto: stopped, remaining connections 0
> {noformat}
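> For illustration only (this is not the reporter's actual code), a made-up user class of the same shape as the one in the log above: its main() does a small amount of HDFS work and returns within a couple of seconds, without the ApplicationMaster ever observing a SparkContext. Class name, path handling and submit options are all assumptions.
> {noformat}
> // Illustrative user class: finishes its work well inside the ~10 second window
> // described below, and never creates a SparkContext that the AM could observe.
> package example
>
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
>
> object QuickJob {
>   def main(args: Array[String]): Unit = {
>     val fs  = FileSystem.get(new Configuration())
>     val out = fs.create(new Path(args(0)))   // e.g. an HDFS temp output path
>     out.writeBytes("done\n")                 // trivial work, a couple of seconds at most
>     out.close()
>   }
> }
> {noformat}
> Submitted in yarn-cluster mode (e.g. spark-submit --master yarn --deploy-mode cluster --class example.QuickJob quick-job.jar hdfs:///tmp/out), such a job completes before the ApplicationMaster's wait for SparkContext initialization ever succeeds.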
> h5. Root Cause
> There is a race condition in _org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkContextInitialized()_: the logic that decides whether the SparkContext is available is not sufficient, because it does not account for the user class finishing within the 10-second wait interval.
> From the above log snippet (Application Master log) we can see that the user class finishes executing and the final app status is marked as _SUCCEEDED_ with exitCode _0_, but immediately afterwards there is an error stating that the SparkContext wasn't initialised after 500 seconds (the job here actually takes a little less than 5 seconds to complete).
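> A simplified sketch of the kind of wait loop involved shows the race. This is not the actual Spark source; field names, the timeout value and the _finished_ flag below are illustrative assumptions:
> {noformat}
> // Simplified sketch, NOT the actual Spark source.
> // sparkContextRef would be set (and the monitor notified) by the user-class
> // thread when it creates a SparkContext; "finished" is flipped to true once
> // the user-class thread returns.
> private val sparkContextRef =
>   new java.util.concurrent.atomic.AtomicReference[SparkContext](null)
>
> private def waitForSparkContextInitialized(): SparkContext = sparkContextRef.synchronized {
>   val deadline = System.currentTimeMillis() + totalWaitTimeMs   // e.g. 500000 ms, as in the log
>   while (sparkContextRef.get() == null &&
>          System.currentTimeMillis() < deadline &&
>          !finished) {
>     // Waits in ~10 second slices. If the user class starts, does all of its work
>     // and returns inside one slice without a SparkContext ever being registered,
>     // the loop wakes up, sees "finished" set and the reference still null, and
>     // gives up, even though the user class itself completed successfully.
>     sparkContextRef.wait(10000L)
>   }
>   sparkContextRef.get()   // null here makes the AM log the error and fail the application
> }
> {noformat}
> Read this against the log above: the user class is done at 10:50:35, and one 10-second wait slice later (10:50:42) the loop reports that the SparkContext did not initialize, because the check cannot distinguish "user class died before creating a SparkContext" from "user class already finished successfully".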



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
