Posted to issues@hive.apache.org by "Sahil Takiar (JIRA)" <ji...@apache.org> on 2018/03/08 01:07:00 UTC

[jira] [Commented] (HIVE-18869) Improve SparkTask OOM Error Parsing Logic

    [ https://issues.apache.org/jira/browse/HIVE-18869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16390546#comment-16390546 ] 

Sahil Takiar commented on HIVE-18869:
-------------------------------------

When testing this, the following is the error Spark throws when an executor dies due to an OOM issue:

{code}
ERROR : Job failed with org.apache.spark.SparkException: Job aborted due to stage failure: Task 452 in stage 18.0 failed 4 times, most recent failure: Lost task 452.3 in stage 18.0 (TID 1532, vc0540.halxg.cloudera.com, executor 65): ExecutorLostFailure (executor 65 exited caused by one of the running tasks) Reason: Container marked as failed: container_1520386678430_0109_01_000068 on host: vc0540.halxg.cloudera.com. Exit status: 143. Diagnostics: [2018-03-07 14:36:02.029]Container killed on request. Exit code is 143
java.util.concurrent.ExecutionException: Exception thrown by job
        at org.apache.spark.JavaFutureActionWrapper.getImpl(FutureAction.scala:272)
        at org.apache.spark.JavaFutureActionWrapper.get(FutureAction.scala:277)
        at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:364)
        at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:325)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 452 in stage 18.0 failed 4 times, most recent failure: Lost task 452.3 in stage 18.0 (TID 1532, vc0540.halxg.cloudera.com, executor 65): ExecutorLostFailure (executor 65 exited caused by one of the running tasks) Reason: Container marked as failed: container_1520386678430_0109_01_000068 on host: vc0540.halxg.cloudera.com. Exit status: 143. Diagnostics: [2018-03-07 14:36:02.029]Container killed on request. Exit code is 143
[2018-03-07 14:36:02.029]Container exited with a non-zero exit code 143.
[2018-03-07 14:36:02.029]Killed by external signal
{code}

It might be good to differentiate {{ExecutorLost}} failures too.
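
A minimal sketch of what that differentiation could look like, assuming plain substring matching over the message above; the class and method names are hypothetical and this is not part of any actual patch:

{code}
// Hypothetical helper, not the actual SparkTask code: classifies the failure
// message shown above by matching substrings that appear in it.
public final class SparkFailureParser {

  /**
   * Returns true if the stack trace indicates an executor was lost,
   * e.g. because YARN killed its container (as in the log above).
   */
  static boolean isExecutorLostFailure(String stackTrace) {
    if (stackTrace == null) {
      return false;
    }
    return stackTrace.contains("ExecutorLostFailure")
        || stackTrace.contains("Container marked as failed")
        || stackTrace.contains("Killed by external signal");
  }

  /**
   * Exit status 143 means the container received SIGTERM (128 + 15),
   * which is how YARN terminates a container it decides to kill.
   */
  static boolean isContainerKilled(String stackTrace) {
    return stackTrace != null && stackTrace.contains("Exit status: 143");
  }
}
{code}

Note that exit status 143 only says the container was SIGTERM'd; on its own it does not prove the kill was memory related, so it should probably be treated as a hint rather than a definitive OOM signal.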

> Improve SparkTask OOM Error Parsing Logic
> -----------------------------------------
>
>                 Key: HIVE-18869
>                 URL: https://issues.apache.org/jira/browse/HIVE-18869
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Sahil Takiar
>            Priority: Major
>
> The method {{SparkTask#isOOMError}} parses a stack trace to check whether the failure was caused by an OOM error. A few improvements could be made:
> * Differentiate between driver OOM and task OOM
> * The string {{Container killed by YARN for exceeding memory limits}} is printed when a container exceeds its memory limits, but Spark tasks can OOM for other reasons as well, such as {{GC overhead limit exceeded}} (see the sketch after this list)
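
A minimal sketch of both points, assuming simple substring matching; {{OOMType}} and {{classifyOOM}} are hypothetical names introduced here for illustration and are not the actual {{SparkTask#isOOMError}} implementation:

{code}
// Hypothetical classifier sketch: report where the OOM happened instead of
// returning a single boolean, as suggested in the description above.
public final class OOMClassifier {

  enum OOMType { NONE, TASK_OOM, DRIVER_OOM }

  static OOMType classifyOOM(String stackTrace) {
    if (stackTrace == null) {
      return OOMType.NONE;
    }
    // Task/executor side: YARN killed the container for exceeding its memory
    // limits, or a task failed with a GC overhead limit error. Check these
    // first, since "GC overhead limit exceeded" is itself reported as a
    // java.lang.OutOfMemoryError and would otherwise match the driver check.
    if (stackTrace.contains("Container killed by YARN for exceeding memory limits")
        || stackTrace.contains("GC overhead limit exceeded")) {
      return OOMType.TASK_OOM;
    }
    // Driver side: an OutOfMemoryError surfacing in the driver's own stack.
    if (stackTrace.contains("java.lang.OutOfMemoryError")) {
      return OOMType.DRIVER_OOM;
    }
    return OOMType.NONE;
  }
}
{code}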



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)