Posted to issues@spark.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/12/10 22:51:00 UTC

[jira] [Commented] (SPARK-25869) Spark on YARN: the original diagnostics is missing when job failed maxAppAttempts times

    [ https://issues.apache.org/jira/browse/SPARK-25869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16715709#comment-16715709 ] 

ASF GitHub Bot commented on SPARK-25869:
----------------------------------------

vanzin commented on a change in pull request #22876: [SPARK-25869] [YARN] the original diagnostics is missing when job failed ma…
URL: https://github.com/apache/spark/pull/22876#discussion_r240412098
 
 

 ##########
 File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala
 ##########
 @@ -293,6 +293,9 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments) extends
         }
 
         if (!unregistered) {
+          logInfo("Waiting for " + sparkConf.get("spark.yarn.report.interval", "1000").toInt +"ms to unregister am," +
 
 Review comment:
   This should also be a config constant. Instead of sleeping, it might be better to join `userClassThread` or `reporterThread`, since they may exit more quickly than the configured wait.
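   A minimal sketch of what that suggestion could look like, assuming a hypothetical config entry (the key spark.yarn.am.finalReportWaitTime and its default below are illustrative, not the PR's actual change) and reusing the AM's existing `reporterThread` handle:

{code}
import java.util.concurrent.TimeUnit

import org.apache.spark.internal.config.ConfigBuilder

// Hypothetical entry for org/apache/spark/deploy/yarn/config.scala, alongside the
// other spark.yarn.* constants (key name and default chosen for illustration only).
private[spark] val AM_FINAL_REPORT_WAIT_TIME =
  ConfigBuilder("spark.yarn.am.finalReportWaitTime")
    .doc("How long the AM waits before unregistering, so that the final status " +
      "and diagnostics can be reported.")
    .timeConf(TimeUnit.MILLISECONDS)
    .createWithDefaultString("1s")

// In ApplicationMaster: join the reporter thread, bounded by the configured wait,
// instead of an unconditional sleep, so the AM proceeds as soon as the thread exits.
if (!unregistered) {
  val waitMs = sparkConf.get(AM_FINAL_REPORT_WAIT_TIME)
  logInfo(s"Waiting up to $waitMs ms before unregistering the AM")
  Option(reporterThread).foreach(_.join(waitMs))
  // unregister(...) as before
}
{code}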

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Spark on YARN: the original diagnostics is missing when job failed maxAppAttempts times
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-25869
>                 URL: https://issues.apache.org/jira/browse/SPARK-25869
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 2.1.1
>            Reporter: Yeliang Cang
>            Priority: Major
>
> When running Spark on YARN, I submit the job using the command below:
> {code}
>  spark-submit  --class org.apache.spark.examples.SparkPi     --master yarn     --deploy-mode cluster     --driver-memory 127m  --driver-cores 1   --executor-memory 2048m     --executor-cores 1    --num-executors 10  --queue root.mr --conf spark.testing.reservedMemory=1048576 --conf spark.yarn.executor.memoryOverhead=50 --conf spark.yarn.driver.memoryOverhead=50 /opt/ZDH/parcels/lib/spark/examples/jars/spark-examples* 10000
> {code}
> Apparently, the driver memory is not enough, but this cannot be seen in the Spark client log:
> {code}
> 2018-10-29 19:28:34,658 INFO org.apache.spark.deploy.yarn.Client: Application report for application_1540536615315_0013 (state: ACCEPTED)
> 2018-10-29 19:28:35,660 INFO org.apache.spark.deploy.yarn.Client: Application report for application_1540536615315_0013 (state: RUNNING)
> 2018-10-29 19:28:35,660 INFO org.apache.spark.deploy.yarn.Client:
>  client token: N/A
>  diagnostics: N/A
>  ApplicationMaster host: 10.43.183.143
>  ApplicationMaster RPC port: 0
>  queue: root.mr
>  start time: 1540812501560
>  final status: UNDEFINED
>  tracking URL: http://zdh141:8088/proxy/application_1540536615315_0013/
>  user: mr
> 2018-10-29 19:28:36,663 INFO org.apache.spark.deploy.yarn.Client: Application report for application_1540536615315_0013 (state: FINISHED)
> 2018-10-29 19:28:36,663 INFO org.apache.spark.deploy.yarn.Client:
>  client token: N/A
>  diagnostics: Shutdown hook called before final status was reported.
>  ApplicationMaster host: 10.43.183.143
>  ApplicationMaster RPC port: 0
>  queue: root.mr
>  start time: 1540812501560
>  final status: FAILED
>  tracking URL: http://zdh141:8088/proxy/application_1540536615315_0013/
>  user: mr
> Exception in thread "main" org.apache.spark.SparkException: Application application_1540536615315_0013 finished with failed status
>  at org.apache.spark.deploy.yarn.Client.run(Client.scala:1137)
>  at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1183)
>  at org.apache.spark.deploy.yarn.Client.main(Client.scala)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:775)
>  at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>  at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
>  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 2018-10-29 19:28:36,694 INFO org.apache.spark.util.ShutdownHookManager: Shutdown hook called
> 2018-10-29 19:28:36,695 INFO org.apache.spark.util.ShutdownHookManager: Deleting directory /tmp/spark-96077be5-0dfa-496d-a6a0-96e83393a8d9
> {code}
>  
>  
> Solution: after applying the patch, the Spark client log shows:
> {code}
> 2018-10-29 19:27:32,962 INFO org.apache.spark.deploy.yarn.Client: Application report for application_1540536615315_0012 (state: RUNNING)
> 2018-10-29 19:27:32,962 INFO org.apache.spark.deploy.yarn.Client:
>  client token: N/A
>  diagnostics: N/A
>  ApplicationMaster host: 10.43.183.143
>  ApplicationMaster RPC port: 0
>  queue: root.mr
>  start time: 1540812436656
>  final status: UNDEFINED
>  tracking URL: http://zdh141:8088/proxy/application_1540536615315_0012/
>  user: mr
> 2018-10-29 19:27:33,964 INFO org.apache.spark.deploy.yarn.Client: Application report for application_1540536615315_0012 (state: FAILED)
> 2018-10-29 19:27:33,964 INFO org.apache.spark.deploy.yarn.Client:
>  client token: N/A
>  diagnostics: Application application_1540536615315_0012 failed 2 times due to AM Container for appattempt_1540536615315_0012_000002 exited with exitCode: -104
> For more detailed output, check application tracking page:http://zdh141:8088/cluster/app/application_1540536615315_0012Then, click on links to logs of each attempt.
> Diagnostics: virtual memory used. Killing container.
> Dump of the process-tree for container_e53_1540536615315_0012_02_000001 :
>  |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
>  |- 1532 1528 1528 1528 (java) 1209 174 3472551936 65185 /usr/java/jdk/bin/java -server -Xmx127m -Djava.io.tmpdir=/data3/zdh/yarn/local/usercache/mr/appcache/application_1540536615315_0012/container_e53_1540536615315_0012_02_000001/tmp -Xss32M -XX:MetaspaceSize=128M -XX:MaxMetaspaceSize=512M -Dspark.yarn.app.container.log.dir=/data1/zdh/yarn/logs/userlogs/application_1540536615315_0012/container_e53_1540536615315_0012_02_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class org.apache.spark.examples.SparkPi --jar file:/opt/ZDH/parcels/lib/spark/examples/jars/spark-examples_2.11-2.2.1-zdh8.5.1.jar --arg 10000 --properties-file /data3/zdh/yarn/local/usercache/mr/appcache/application_1540536615315_0012/container_e53_1540536615315_0012_02_000001/__spark_conf__/__spark_conf__.properties
>  |- 1528 1526 1528 1528 (bash) 0 0 108642304 309 /bin/bash -c LD_LIBRARY_PATH=/opt/ZDH/parcels/lib/hadoop/lib/native: /usr/java/jdk/bin/java -server -Xmx127m -Djava.io.tmpdir=/data3/zdh/yarn/local/usercache/mr/appcache/application_1540536615315_0012/container_e53_1540536615315_0012_02_000001/tmp '-Xss32M' '-XX:MetaspaceSize=128M' '-XX:MaxMetaspaceSize=512M' -Dspark.yarn.app.container.log.dir=/data1/zdh/yarn/logs/userlogs/application_1540536615315_0012/container_e53_1540536615315_0012_02_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class 'org.apache.spark.examples.SparkPi' --jar file:/opt/ZDH/parcels/lib/spark/examples/jars/spark-examples_2.11-2.2.1-zdh8.5.1.jar --arg '10000' --properties-file /data3/zdh/yarn/local/usercache/mr/appcache/application_1540536615315_0012/container_e53_1540536615315_0012_02_000001/__spark_conf__/__spark_conf__.properties 1> /data1/zdh/yarn/logs/userlogs/application_1540536615315_0012/container_e53_1540536615315_0012_02_000001/stdout 2> /data1/zdh/yarn/logs/userlogs/application_1540536615315_0012/container_e53_1540536615315_0012_02_000001/stderr
> Container killed on request. Exit code is 143
> Container exited with a non-zero exit code 143
> PmemUsageMBsMaxMBs is: 255.0 MB
> Failing this attempt. Failing the application.
>  ApplicationMaster host: N/A
>  ApplicationMaster RPC port: -1
>  queue: root.mr
>  start time: 1540812436656
>  final status: FAILED
>  tracking URL: http://zdh141:8088/cluster/app/application_1540536615315_0012
>  user: mr
> 2018-10-29 19:27:34,542 INFO org.apache.spark.deploy.yarn.Client: Deleted staging directory hdfs://nameservice/user/mr/.sparkStaging/application_1540536615315_0012
> Exception in thread "main" org.apache.spark.SparkException: Application application_1540536615315_0012 finished with failed status
>  at org.apache.spark.deploy.yarn.Client.run(Client.scala:1137)
>  at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1183)
>  at org.apache.spark.deploy.yarn.Client.main(Client.scala)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:775)
>  at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>  at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
>  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 2018-10-29 19:27:34,548 INFO org.apache.spark.util.ShutdownHookManager: Shutdown hook called
> 2018-10-29 19:27:34,549 INFO org.apache.spark.util.ShutdownHookManager: Deleting directory /tmp/spark-ce35f2ad-ec1f-4173-9441-163e2482ed61
> {code}
> Now we can see the true reason for the job failure from the client!
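> For context, here is a minimal standalone sketch (not Spark's own Client.scala; it assumes a Hadoop client recent enough to provide ApplicationId.fromString) of where the "diagnostics:" text above comes from. The client simply prints whatever the final YARN application report carries, so the fix only needs to ensure the original failure text survives until that report is produced:
> {code}
> import org.apache.hadoop.yarn.api.records.ApplicationId
> import org.apache.hadoop.yarn.client.api.YarnClient
> import org.apache.hadoop.yarn.conf.YarnConfiguration
>
> // Fetch the final application report for an app id such as
> // application_1540536615315_0012 and print its status and diagnostics.
> object PrintDiagnostics {
>   def main(args: Array[String]): Unit = {
>     val appId = ApplicationId.fromString(args(0))
>     val yarnClient = YarnClient.createYarnClient()
>     yarnClient.init(new YarnConfiguration())
>     yarnClient.start()
>     try {
>       val report = yarnClient.getApplicationReport(appId)
>       println(s"final status: ${report.getFinalApplicationStatus}")
>       println(s"diagnostics: ${report.getDiagnostics}")
>     } finally {
>       yarnClient.stop()
>     }
>   }
> }
> {code}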



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org