Posted to issues@hive.apache.org by "Sahil Takiar (JIRA)" <ji...@apache.org> on 2018/03/08 01:07:00 UTC
[jira] [Commented] (HIVE-18869) Improve SparkTask OOM Error Parsing Logic
[ https://issues.apache.org/jira/browse/HIVE-18869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16390546#comment-16390546 ]
Sahil Takiar commented on HIVE-18869:
-------------------------------------
When testing this, the following is the error Spark throws when an executor dies due to OOM issues:
{code}
ERROR : Job failed with org.apache.spark.SparkException: Job aborted due to stage failure: Task 452 in stage 18.0 failed 4 times, most recent failure: Lost task 452.3 in stage 18.0 (TID 1532, vc0540.halxg.cloudera.com, executor 65): ExecutorLostFailure (executor 65 exited caused by one of the running tasks) Reason: Container marked as failed: container_1520386678430_0109_01_000068 on host: vc0540.halxg.cloudera.com. Exit status: 143. Diagnostics: [2018-03-07 14:36:02.029]Container killed on request. Exit code is 143
java.util.concurrent.ExecutionException: Exception thrown by job
at org.apache.spark.JavaFutureActionWrapper.getImpl(FutureAction.scala:272)
at org.apache.spark.JavaFutureActionWrapper.get(FutureAction.scala:277)
at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:364)
at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:325)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 452 in stage 18.0 failed 4 times, most recent failure: Lost task 452.3 in stage 18.0 (TID 1532, vc0540.halxg.cloudera.com, executor 65): ExecutorLostFailure (executor 65 exited caused by one of the running tasks) Reason: Container marked as failed: container_1520386678430_0109_01_000068 on host: vc0540.halxg.cloudera.com. Exit status: 143. Diagnostics: [2018-03-07 14:36:02.029]Container killed on request. Exit code is 143
[2018-03-07 14:36:02.029]Container exited with a non-zero exit code 143.
[2018-03-07 14:36:02.029]Killed by external signal
{code}
Might be good to differentiate {{ExecutorLost}} failures too.
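For illustration, here is a minimal sketch of how the parsing could bucket such messages into driver-OOM, task-OOM, and executor-lost categories. This is not the actual {{SparkTask#isOOMError}} code; the class name, enum, and the assumption that a bare {{java.lang.OutOfMemoryError}} indicates a driver-side OOM are all hypothetical simplifications:

```java
/**
 * Hypothetical sketch: classify a Spark failure message by scanning for
 * known marker strings. Real logic would inspect the full stack trace.
 */
public class OomErrorClassifier {

  public enum ErrorType { DRIVER_OOM, TASK_OOM, EXECUTOR_LOST, UNKNOWN }

  public static ErrorType classify(String msg) {
    if (msg == null) {
      return ErrorType.UNKNOWN;
    }
    // Task/container-side OOM markers (checked first, since they can
    // appear inside an ExecutorLostFailure reason string).
    if (msg.contains("Container killed by YARN for exceeding memory limits")
        || msg.contains("GC overhead limit exceeded")) {
      return ErrorType.TASK_OOM;
    }
    // Simplifying assumption: an OutOfMemoryError without a container
    // marker came from the driver-side stack trace.
    if (msg.contains("java.lang.OutOfMemoryError")) {
      return ErrorType.DRIVER_OOM;
    }
    // Executor lost for some other reason (e.g. exit status 143 above).
    if (msg.contains("ExecutorLostFailure")) {
      return ErrorType.EXECUTOR_LOST;
    }
    return ErrorType.UNKNOWN;
  }
}
```

With this ordering, the exit-status-143 trace above would classify as EXECUTOR_LOST rather than TASK_OOM, which is the distinction suggested here.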
> Improve SparkTask OOM Error Parsing Logic
> -----------------------------------------
>
> Key: HIVE-18869
> URL: https://issues.apache.org/jira/browse/HIVE-18869
> Project: Hive
> Issue Type: Sub-task
> Components: Spark
> Reporter: Sahil Takiar
> Priority: Major
>
> The method {{SparkTask#isOOMError}} parses a stack-trace to check if it is due to an OOM error. A few improvements could be made:
> * Differentiate between driver OOM and task OOM
> * The string {{Container killed by YARN for exceeding memory limits}} is printed if a container exceeds its memory limits, but Spark tasks can OOM for other reasons, such as {{GC overhead limit exceeded}}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)