Posted to issues@spark.apache.org by "Imran Rashid (JIRA)" <ji...@apache.org> on 2015/12/01 00:18:11 UTC

[jira] [Commented] (SPARK-11801) Notify driver when OOM is thrown before executor JVM is killed

    [ https://issues.apache.org/jira/browse/SPARK-11801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032699#comment-15032699 ] 

Imran Rashid commented on SPARK-11801:
--------------------------------------

Hi [~vsr],

Thanks for reporting and working on this.  This is a pretty tricky issue, so I'd like to have a thorough discussion of it here.  First, I think it would help to clarify exactly what we're trying to accomplish:

1) After an OOM, there are no "spurious" failure messages (e.g., about directories not existing).  All error messages should clearly indicate they came from an OOM.
2) It's clear that there was an OOM from looking at any of (a) the UI, (b) the driver logs, or (c) the executor logs.  In fact, I'd much rather have clear error messages on the driver -- cleaning up the executor is a bonus, since the user is going to look at the driver first.
3) All tasks that fail because of the OOM clearly indicate that they hit an OOM.  That is, if you have 16 tasks running concurrently, only one of them actually gets the OOM, but really they all fail because of it.  I'm not sure it even helps to distinguish the original task that gets the OOM from the other tasks that get killed later, but I don't have a strong opinion on that.

One thing which seems unusual here is that when you run under yarn, you automatically get {{-XX:OnOutOfMemoryError='kill %p'}} added to the executor args, but it is *not* added with any of the other cluster managers.  So the first question is to understand whether that is really intentional -- why isn't it included for all cluster managers?  Do real users of the standalone cluster manager just always add those args themselves?  And if that distinction really is intentional, then we need to make sure this approach works with any cluster manager.  (e.g., it should be tested on at least yarn and standalone.)
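
For reference, here's roughly what a standalone user would have to do themselves today -- just an illustration of the workaround (the property is real, but the quoting of the value may need tweaking depending on how the worker launches the executor); nothing adds this for you outside of yarn:

{code}
import org.apache.spark.SparkConf

// Illustrative only: on the standalone cluster manager nothing adds this flag
// automatically, so the application has to ask for it explicitly.
val conf = new SparkConf()
  .setAppName("oom-kill-example")
  .set("spark.executor.extraJavaOptions", "-XX:OnOutOfMemoryError='kill %p'")
{code}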

It seems that {{-XX:OnOutOfMemoryError='kill %p'}} was introduced for yarn in the initial version -- I'm not sure whether it was a conscious decision to have it differ between the cluster managers.
https://github.com/apache/spark/commit/d90d2af1036e909f81cf77c85bfe589993c4f9f3

It's worth noting that in all environments, there is already *some* handling for the OOM, via the uncaught exception handler, which is [added to the main executor threads | https://github.com/apache/spark/blob/de64b65f7cf2ac58c1abc310ba547637fdbb8557/core/src/main/scala/org/apache/spark/executor/Executor.scala#L76] and even [invoked sometimes when the exception is caught | https://github.com/apache/spark/blob/de64b65f7cf2ac58c1abc310ba547637fdbb8557/core/src/main/scala/org/apache/spark/executor/Executor.scala#L317].  However, I assume that relying on {{-XX:OnOutOfMemoryError='kill %p'}} is still a better idea, since the OOM could occur in some thread where we haven't installed the uncaught exception handler, and it also seems safer to rely on the JVM to do this itself.  But the downside is that right now, when running under yarn, the {{kill %p}} sometimes triggers the shutdown hooks before even the first OOM failure gets sent back to the driver.
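
To make that concrete, here's a minimal sketch of the kind of handler I have in mind -- the names ({{reportFailureToDriver}}) and the exit codes are placeholders rather than the actual Spark APIs, it's just to show the shape of catching the OOM and getting a message out before the JVM dies:

{code}
// Minimal sketch of an executor-side uncaught exception handler that tries to
// surface an OOM before the JVM goes away.  `reportFailureToDriver` and the
// exit codes below are placeholders, not real Spark APIs.
object OomAwareExceptionHandler extends Thread.UncaughtExceptionHandler {
  override def uncaughtException(thread: Thread, exception: Throwable): Unit = {
    try {
      exception match {
        case oom: OutOfMemoryError =>
          // Best effort: tell the driver this executor hit an OOM, then exit
          // with a dedicated code so the cluster manager can also tell it
          // apart from a generic crash.
          reportFailureToDriver(s"OutOfMemoryError in thread ${thread.getName}", oom)
          System.exit(52)
        case other =>
          reportFailureToDriver(s"Uncaught exception in thread ${thread.getName}", other)
          System.exit(50)
      }
    } catch {
      // If reporting itself fails (likely when we are truly out of memory),
      // halt immediately rather than risk hanging inside the handler.
      case _: Throwable => Runtime.getRuntime.halt(52)
    }
  }

  // Placeholder for sending an ExceptionFailure / status update to the driver.
  private def reportFailureToDriver(msg: String, t: Throwable): Unit = ()
}
{code}

Threads we create would register it via {{Thread.setDefaultUncaughtExceptionHandler(OomAwareExceptionHandler)}}, but as noted above that only covers threads we control, which is why the {{kill %p}} flag is still useful as a backstop.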

Trying to ping some folks who might have an idea on why the cluster managers differ (and to confirm my reading of the code): [~andrewor14] [~tgraves] [~mridulm@yahoo-inc.com] [~vanzin]

Keeping all of that in mind, in general I'm in favor of this approach.  I think it's impossible to guarantee that we get perfect messages in all cases, but we can make a best effort to improve the error handling in most cases.

> Notify driver when OOM is thrown before executor JVM is killed 
> ---------------------------------------------------------------
>
>                 Key: SPARK-11801
>                 URL: https://issues.apache.org/jira/browse/SPARK-11801
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.5.1
>            Reporter: Srinivasa Reddy Vundela
>            Priority: Minor
>
> Here is some background for the issue.
> Customer got an OOM exception in one of the tasks and the executor got killed with kill %p. It is unclear from the driver logs/Spark UI why the task or executor is lost. Customer has to look into the executor logs to see that OOM is the cause of the task/executor loss. 
> It would be helpful if the driver logs/Spark UI showed the reason for task failures by making sure that the task updates the driver with the OOM. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org