You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by andrewor14 <gi...@git.apache.org> on 2015/12/10 03:49:45 UTC

[GitHub] spark pull request: [SPARK-12155] Fix executor OOM in unified memo...

GitHub user andrewor14 opened a pull request:

    https://github.com/apache/spark/pull/10240

    [SPARK-12155] Fix executor OOM in unified memory management

    **Problem.** In unified memory management, acquiring execution memory may lead to eviction of storage memory. However, the space freed from evicting cached blocks is distributed among all active tasks. Thus, an incorrect upper bound on the execution memory per task can cause the acquisition to fail, leading to OOM's and premature spills.
    
    **Example.** Suppose total memory is 1000B, cached blocks occupy 900B, `spark.memory.storageFraction` is 0.4, and there are two active tasks. In this case, the cap on task execution memory is 100B / 2 = 50B. If task A tries to acquire 200B, it will evict 100B of storage but can only acquire 50B because of the incorrect cap. For another example, see this [regression test](https://github.com/andrewor14/spark/blob/fix-oom/core/src/test/scala/org/apache/spark/memory/UnifiedMemoryManagerSuite.scala#L233).
    
    **Solution.** Fix the cap on task execution memory. It should take into account the space that could have been freed by storage in addition to the current amount of memory available to execution. In the example above, the correct cap would have been 600B / 2 = 300B.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/andrewor14/spark fix-oom

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10240.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10240
    
----
commit 68337754b8541d8ac497c19ae531a79a91904708
Author: Andrew Or <an...@databricks.com>
Date:   2015-12-09T23:45:48Z

    Pass in callbacks (gross)

commit 35392f5e5e8c152e8cf516ffb4f7c8def0df8361
Author: Andrew Or <an...@databricks.com>
Date:   2015-12-10T00:02:18Z

    Notify all on task completion

commit cd0c680161e9c6f8044401ec1c2c3a4e83d4b6a1
Author: Andrew Or <an...@databricks.com>
Date:   2015-12-10T02:10:13Z

    Rename silly method names + add detailed comments

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10240#issuecomment-163651939
  
    **[Test build #2195 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2195/consoleFull)** for PR 10240 at commit [`d8be669`](https://github.com/apache/spark/commit/d8be66911d2abf3da46a25a54a7d80fd1eeebdfa).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10240#issuecomment-163645141
  
    **[Test build #2195 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2195/consoleFull)** for PR 10240 at commit [`d8be669`](https://github.com/apache/spark/commit/d8be66911d2abf3da46a25a54a7d80fd1eeebdfa).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10240#issuecomment-163766092
  
    **[Test build #2198 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2198/consoleFull)** for PR 10240 at commit [`d8be669`](https://github.com/apache/spark/commit/d8be66911d2abf3da46a25a54a7d80fd1eeebdfa).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:\n  * `        logInfo(s\"HBase class not found $e\")`\n  * `        logDebug(\"HBase class not found\", e)`\n


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10240#issuecomment-163486198
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47474/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10240#issuecomment-163741726
  
    **[Test build #47539 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47539/consoleFull)** for PR 10240 at commit [`2929640`](https://github.com/apache/spark/commit/292964002cbbf695b621afd205269b0cb0198c83).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10240#issuecomment-163771456
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47539/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/10240#issuecomment-163470293
  
    @JoshRosen @davies


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10240#issuecomment-163731262
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47528/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10240#issuecomment-163470368
  
    **[Test build #47474 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47474/consoleFull)** for PR 10240 at commit [`cd0c680`](https://github.com/apache/spark/commit/cd0c680161e9c6f8044401ec1c2c3a4e83d4b6a1).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10240#issuecomment-163486148
  
    **[Test build #47474 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47474/consoleFull)** for PR 10240 at commit [`cd0c680`](https://github.com/apache/spark/commit/cd0c680161e9c6f8044401ec1c2c3a4e83d4b6a1).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:\n  * `        logInfo(s\"HBase class not found $e\")`\n  * `        logDebug(\"HBase class not found\", e)`\n


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10240#issuecomment-163771850
  
    **[Test build #2197 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2197/consoleFull)** for PR 10240 at commit [`d8be669`](https://github.com/apache/spark/commit/d8be66911d2abf3da46a25a54a7d80fd1eeebdfa).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10240#issuecomment-163486196
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/10240#issuecomment-163701442
  
    ok, retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10240#issuecomment-163741900
  
    **[Test build #2196 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2196/consoleFull)** for PR 10240 at commit [`d8be669`](https://github.com/apache/spark/commit/d8be66911d2abf3da46a25a54a7d80fd1eeebdfa).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10240#issuecomment-163704082
  
    **[Test build #47528 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47528/consoleFull)** for PR 10240 at commit [`d8be669`](https://github.com/apache/spark/commit/d8be66911d2abf3da46a25a54a7d80fd1eeebdfa).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10240#issuecomment-163742209
  
    **[Test build #2198 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2198/consoleFull)** for PR 10240 at commit [`d8be669`](https://github.com/apache/spark/commit/d8be66911d2abf3da46a25a54a7d80fd1eeebdfa).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10240#issuecomment-163775281
  
    **[Test build #2196 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2196/consoleFull)** for PR 10240 at commit [`d8be669`](https://github.com/apache/spark/commit/d8be66911d2abf3da46a25a54a7d80fd1eeebdfa).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10240#issuecomment-163534534
  
    **[Test build #47491 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47491/consoleFull)** for PR 10240 at commit [`d8be669`](https://github.com/apache/spark/commit/d8be66911d2abf3da46a25a54a7d80fd1eeebdfa).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10240#discussion_r47267569
  
    --- Diff: core/src/main/scala/org/apache/spark/memory/ExecutionMemoryPool.scala ---
    @@ -91,23 +108,34 @@ private[memory] class ExecutionMemoryPool(
           val numActiveTasks = memoryForTask.keys.size
           val curMem = memoryForTask(taskAttemptId)
     
    -      // How much we can grant this task; don't let it grow to more than 1 / numActiveTasks;
    -      // don't let it be negative
    -      val maxToGrant =
    -        math.min(numBytes, math.max(0, (poolSize / numActiveTasks) - curMem))
    +      // In every iteration of this loop, we should first try to reclaim any borrowed execution
    +      // space from storage. This is necessary because of the potential race condition where new
    +      // storage blocks may steal the free execution memory that this task was waiting for.
    +      maybeGrowPool(numBytes - memoryFree)
    +
    +      // Maximum size the pool would have after potentially growing the pool.
    +      // This is used to compute the upper bound of how much memory each task can occupy. This
    +      // must take into account potential free memory as well as the amount this pool currently
    +      // occupies. Otherwise, we may run into SPARK-12155 where, in unified memory management,
    +      // we did not take into account space that could have been freed by evicting cached blocks.
    +      val maxPoolSize = computeMaxPoolSize()
    +      val maxMemoryPerTask = maxPoolSize / numActiveTasks
    +      val minMemoryPerTask = poolSize / (2 * numActiveTasks)
    +
    +      // How much we can grant this task; keep its share within 0 <= X <= 1 / numActiveTasks
    +      val maxToGrant = math.min(numBytes, math.max(0, maxMemoryPerTask - curMem))
           // Only give it as much memory as is free, which might be none if it reached 1 / numTasks
           val toGrant = math.min(maxToGrant, memoryFree)
     
    -      if (curMem < poolSize / (2 * numActiveTasks)) {
    +      if (curMem < minMemoryPerTask) {
    --- End diff --
    
    The current code is hard to understand, I can prove that it's the same with mine one.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10240#discussion_r47192018
  
    --- Diff: core/src/main/scala/org/apache/spark/memory/ExecutionMemoryPool.scala ---
    @@ -91,23 +108,34 @@ private[memory] class ExecutionMemoryPool(
           val numActiveTasks = memoryForTask.keys.size
           val curMem = memoryForTask(taskAttemptId)
     
    -      // How much we can grant this task; don't let it grow to more than 1 / numActiveTasks;
    -      // don't let it be negative
    -      val maxToGrant =
    -        math.min(numBytes, math.max(0, (poolSize / numActiveTasks) - curMem))
    +      // In every iteration of this loop, we should first try to reclaim any borrowed execution
    +      // space from storage. This is necessary because of the potential race condition where new
    +      // storage blocks may steal the free execution memory that this task was waiting for.
    +      maybeGrowPool(numBytes - memoryFree)
    +
    +      // Maximum size the pool would have after potentially growing the pool.
    +      // This is used to compute the upper bound of how much memory each task can occupy. This
    +      // must take into account potential free memory as well as the amount this pool currently
    +      // occupies. Otherwise, we may run into SPARK-12155 where, in unified memory management,
    +      // we did not take into account space that could have been freed by evicting cached blocks.
    +      val maxPoolSize = computeMaxPoolSize()
    +      val maxMemoryPerTask = maxPoolSize / numActiveTasks
    +      val minMemoryPerTask = poolSize / (2 * numActiveTasks)
    +
    +      // How much we can grant this task; keep its share within 0 <= X <= 1 / numActiveTasks
    +      val maxToGrant = math.min(numBytes, math.max(0, maxMemoryPerTask - curMem))
           // Only give it as much memory as is free, which might be none if it reached 1 / numTasks
           val toGrant = math.min(maxToGrant, memoryFree)
     
    -      if (curMem < poolSize / (2 * numActiveTasks)) {
    +      if (curMem < minMemoryPerTask) {
    --- End diff --
    
    I'd like to change this block as this:
    ```
    if (toGrant < numBytes && curMem + toGrant < minMemoryPerTask) {
      lock.wait()
    } else {
      memoryForTask(taskAttemptId) += toGrant
      return toGrant
    }
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10240#issuecomment-163741847
  
    **[Test build #2197 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2197/consoleFull)** for PR 10240 at commit [`d8be669`](https://github.com/apache/spark/commit/d8be66911d2abf3da46a25a54a7d80fd1eeebdfa).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/10240#issuecomment-163738746
  
    @davies please look at the final changes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/10240#issuecomment-163738626
  
    retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/10240#issuecomment-163742441
  
    LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/10240#issuecomment-163701557
  
    Last commit actually passed tests last night.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10240#discussion_r47201562
  
    --- Diff: core/src/main/scala/org/apache/spark/memory/ExecutionMemoryPool.scala ---
    @@ -91,23 +108,34 @@ private[memory] class ExecutionMemoryPool(
           val numActiveTasks = memoryForTask.keys.size
           val curMem = memoryForTask(taskAttemptId)
     
    -      // How much we can grant this task; don't let it grow to more than 1 / numActiveTasks;
    -      // don't let it be negative
    -      val maxToGrant =
    -        math.min(numBytes, math.max(0, (poolSize / numActiveTasks) - curMem))
    +      // In every iteration of this loop, we should first try to reclaim any borrowed execution
    +      // space from storage. This is necessary because of the potential race condition where new
    +      // storage blocks may steal the free execution memory that this task was waiting for.
    +      maybeGrowPool(numBytes - memoryFree)
    +
    +      // Maximum size the pool would have after potentially growing the pool.
    +      // This is used to compute the upper bound of how much memory each task can occupy. This
    +      // must take into account potential free memory as well as the amount this pool currently
    +      // occupies. Otherwise, we may run into SPARK-12155 where, in unified memory management,
    +      // we did not take into account space that could have been freed by evicting cached blocks.
    +      val maxPoolSize = computeMaxPoolSize()
    +      val maxMemoryPerTask = maxPoolSize / numActiveTasks
    +      val minMemoryPerTask = poolSize / (2 * numActiveTasks)
    +
    +      // How much we can grant this task; keep its share within 0 <= X <= 1 / numActiveTasks
    +      val maxToGrant = math.min(numBytes, math.max(0, maxMemoryPerTask - curMem))
           // Only give it as much memory as is free, which might be none if it reached 1 / numTasks
           val toGrant = math.min(maxToGrant, memoryFree)
     
    -      if (curMem < poolSize / (2 * numActiveTasks)) {
    +      if (curMem < minMemoryPerTask) {
    --- End diff --
    
    If not, the old code is not correct.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10240#issuecomment-163534647
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10240#discussion_r47268739
  
    --- Diff: core/src/main/scala/org/apache/spark/memory/ExecutionMemoryPool.scala ---
    @@ -91,23 +108,34 @@ private[memory] class ExecutionMemoryPool(
           val numActiveTasks = memoryForTask.keys.size
           val curMem = memoryForTask(taskAttemptId)
     
    -      // How much we can grant this task; don't let it grow to more than 1 / numActiveTasks;
    -      // don't let it be negative
    -      val maxToGrant =
    -        math.min(numBytes, math.max(0, (poolSize / numActiveTasks) - curMem))
    +      // In every iteration of this loop, we should first try to reclaim any borrowed execution
    +      // space from storage. This is necessary because of the potential race condition where new
    +      // storage blocks may steal the free execution memory that this task was waiting for.
    +      maybeGrowPool(numBytes - memoryFree)
    +
    +      // Maximum size the pool would have after potentially growing the pool.
    +      // This is used to compute the upper bound of how much memory each task can occupy. This
    +      // must take into account potential free memory as well as the amount this pool currently
    +      // occupies. Otherwise, we may run into SPARK-12155 where, in unified memory management,
    +      // we did not take into account space that could have been freed by evicting cached blocks.
    +      val maxPoolSize = computeMaxPoolSize()
    +      val maxMemoryPerTask = maxPoolSize / numActiveTasks
    +      val minMemoryPerTask = poolSize / (2 * numActiveTasks)
    +
    +      // How much we can grant this task; keep its share within 0 <= X <= 1 / numActiveTasks
    +      val maxToGrant = math.min(numBytes, math.max(0, maxMemoryPerTask - curMem))
           // Only give it as much memory as is free, which might be none if it reached 1 / numTasks
           val toGrant = math.min(maxToGrant, memoryFree)
     
    -      if (curMem < poolSize / (2 * numActiveTasks)) {
    +      if (curMem < minMemoryPerTask) {
    --- End diff --
    
    yeah, I agree, though it's something we can always fix separately so we don't block the release. Let's defer the judgment to @JoshRosen.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10240#discussion_r47191802
  
    --- Diff: core/src/main/scala/org/apache/spark/memory/ExecutionMemoryPool.scala ---
    @@ -91,23 +108,34 @@ private[memory] class ExecutionMemoryPool(
           val numActiveTasks = memoryForTask.keys.size
           val curMem = memoryForTask(taskAttemptId)
     
    -      // How much we can grant this task; don't let it grow to more than 1 / numActiveTasks;
    -      // don't let it be negative
    -      val maxToGrant =
    -        math.min(numBytes, math.max(0, (poolSize / numActiveTasks) - curMem))
    +      // In every iteration of this loop, we should first try to reclaim any borrowed execution
    +      // space from storage. This is necessary because of the potential race condition where new
    +      // storage blocks may steal the free execution memory that this task was waiting for.
    +      maybeGrowPool(numBytes - memoryFree)
    +
    +      // Maximum size the pool would have after potentially growing the pool.
    +      // This is used to compute the upper bound of how much memory each task can occupy. This
    +      // must take into account potential free memory as well as the amount this pool currently
    +      // occupies. Otherwise, we may run into SPARK-12155 where, in unified memory management,
    +      // we did not take into account space that could have been freed by evicting cached blocks.
    +      val maxPoolSize = computeMaxPoolSize()
    +      val maxMemoryPerTask = maxPoolSize / numActiveTasks
    +      val minMemoryPerTask = poolSize / (2 * numActiveTasks)
    +
    +      // How much we can grant this task; keep its share within 0 <= X <= 1 / numActiveTasks
    +      val maxToGrant = math.min(numBytes, math.max(0, maxMemoryPerTask - curMem))
           // Only give it as much memory as is free, which might be none if it reached 1 / numTasks
           val toGrant = math.min(maxToGrant, memoryFree)
     
    -      if (curMem < poolSize / (2 * numActiveTasks)) {
    +      if (curMem < minMemoryPerTask) {
             // We want to let each task get at least 1 / (2 * numActiveTasks) before blocking;
             // if we can't give it this much now, wait for other tasks to free up memory
             // (this happens if older tasks allocated lots of memory before N grew)
    -        if (memoryFree >= math.min(maxToGrant, poolSize / (2 * numActiveTasks) - curMem)) {
    +        if (memoryFree >= math.min(maxToGrant, poolSize / minMemoryPerTask)) {
    --- End diff --
    
    `poolSize / minMemoryPerTask` should be `minMemoryPerTask - curMem`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10240#issuecomment-163731070
  
    **[Test build #47528 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47528/consoleFull)** for PR 10240 at commit [`d8be669`](https://github.com/apache/spark/commit/d8be66911d2abf3da46a25a54a7d80fd1eeebdfa).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10240#issuecomment-163534648
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47491/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10240#discussion_r47281042
  
    --- Diff: core/src/main/scala/org/apache/spark/memory/ExecutionMemoryPool.scala ---
    @@ -70,11 +70,28 @@ private[memory] class ExecutionMemoryPool(
        * active tasks) before it is forced to spill. This can happen if the number of tasks increase
        * but an older task had a lot of memory already.
        *
    +   * @param numBytes number of bytes to acquire
    +   * @param taskAttemptId the task attempt acquiring memory
    +   * @param maybeGrowPool a callback that potentially grows the size of this pool. It takes in
    +   *                      one parameter (Long) that represents the desired amount of memory by
    +   *                      which this pool should be expanded.
    +   * @param computeMaxPoolSize a callback that returns the maximum allowable size of this pool
    --- End diff --
    
    If you take into account the memory that can be freed then isn't this a fixed value?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10240#discussion_r47282031
  
    --- Diff: core/src/main/scala/org/apache/spark/memory/ExecutionMemoryPool.scala ---
    @@ -70,11 +70,28 @@ private[memory] class ExecutionMemoryPool(
        * active tasks) before it is forced to spill. This can happen if the number of tasks increase
        * but an older task had a lot of memory already.
        *
    +   * @param numBytes number of bytes to acquire
    +   * @param taskAttemptId the task attempt acquiring memory
    +   * @param maybeGrowPool a callback that potentially grows the size of this pool. It takes in
    +   *                      one parameter (Long) that represents the desired amount of memory by
    +   *                      which this pool should be expanded.
    +   * @param computeMaxPoolSize a callback that returns the maximum allowable size of this pool
    --- End diff --
    
    no, because if storage memory used is below a certain mark (default 0.5 of max memory) then it cannot be evicted. In this case the max pool size depends on how much unevictable storage memory there is.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10240#issuecomment-163731260
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by nongli <gi...@git.apache.org>.
Github user nongli commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10240#discussion_r47281672
  
    --- Diff: core/src/main/scala/org/apache/spark/memory/ExecutionMemoryPool.scala ---
    @@ -70,11 +70,28 @@ private[memory] class ExecutionMemoryPool(
        * active tasks) before it is forced to spill. This can happen if the number of tasks increase
        * but an older task had a lot of memory already.
        *
    +   * @param numBytes number of bytes to acquire
    +   * @param taskAttemptId the task attempt acquiring memory
    +   * @param maybeGrowPool a callback that potentially grows the size of this pool. It takes in
    +   *                      one parameter (Long) that represents the desired amount of memory by
    +   *                      which this pool should be expanded.
    +   * @param computeMaxPoolSize a callback that returns the maximum allowable size of this pool
    --- End diff --
    
    Nvm, I see now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10240#discussion_r47281669
  
    --- Diff: core/src/main/scala/org/apache/spark/memory/ExecutionMemoryPool.scala ---
    @@ -70,11 +70,28 @@ private[memory] class ExecutionMemoryPool(
        * active tasks) before it is forced to spill. This can happen if the number of tasks increase
        * but an older task had a lot of memory already.
        *
    +   * @param numBytes number of bytes to acquire
    +   * @param taskAttemptId the task attempt acquiring memory
    +   * @param maybeGrowPool a callback that potentially grows the size of this pool. It takes in
    +   *                      one parameter (Long) that represents the desired amount of memory by
    +   *                      which this pool should be expanded.
    +   * @param computeMaxPoolSize a callback that returns the maximum allowable size of this pool
    --- End diff --
    
    The actual used memory by storage could be changed, so it's not a fixed value


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10240#issuecomment-163771449
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10240#discussion_r47193976
  
    --- Diff: core/src/main/scala/org/apache/spark/memory/ExecutionMemoryPool.scala ---
    @@ -91,23 +108,34 @@ private[memory] class ExecutionMemoryPool(
           val numActiveTasks = memoryForTask.keys.size
           val curMem = memoryForTask(taskAttemptId)
     
    -      // How much we can grant this task; don't let it grow to more than 1 / numActiveTasks;
    -      // don't let it be negative
    -      val maxToGrant =
    -        math.min(numBytes, math.max(0, (poolSize / numActiveTasks) - curMem))
    +      // In every iteration of this loop, we should first try to reclaim any borrowed execution
    +      // space from storage. This is necessary because of the potential race condition where new
    +      // storage blocks may steal the free execution memory that this task was waiting for.
    +      maybeGrowPool(numBytes - memoryFree)
    +
    +      // Maximum size the pool would have after potentially growing the pool.
    +      // This is used to compute the upper bound of how much memory each task can occupy. This
    +      // must take into account potential free memory as well as the amount this pool currently
    +      // occupies. Otherwise, we may run into SPARK-12155 where, in unified memory management,
    +      // we did not take into account space that could have been freed by evicting cached blocks.
    +      val maxPoolSize = computeMaxPoolSize()
    +      val maxMemoryPerTask = maxPoolSize / numActiveTasks
    +      val minMemoryPerTask = poolSize / (2 * numActiveTasks)
    +
    +      // How much we can grant this task; keep its share within 0 <= X <= 1 / numActiveTasks
    +      val maxToGrant = math.min(numBytes, math.max(0, maxMemoryPerTask - curMem))
           // Only give it as much memory as is free, which might be none if it reached 1 / numTasks
           val toGrant = math.min(maxToGrant, memoryFree)
     
    -      if (curMem < poolSize / (2 * numActiveTasks)) {
    +      if (curMem < minMemoryPerTask) {
             // We want to let each task get at least 1 / (2 * numActiveTasks) before blocking;
             // if we can't give it this much now, wait for other tasks to free up memory
             // (this happens if older tasks allocated lots of memory before N grew)
    -        if (memoryFree >= math.min(maxToGrant, poolSize / (2 * numActiveTasks) - curMem)) {
    +        if (memoryFree >= math.min(maxToGrant, poolSize / minMemoryPerTask)) {
    --- End diff --
    
    oops good catch!!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/10240


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10240#issuecomment-163771007
  
    **[Test build #47539 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47539/consoleFull)** for PR 10240 at commit [`2929640`](https://github.com/apache/spark/commit/292964002cbbf695b621afd205269b0cb0198c83).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10240#discussion_r47199260
  
    --- Diff: core/src/main/scala/org/apache/spark/memory/ExecutionMemoryPool.scala ---
    @@ -91,23 +108,34 @@ private[memory] class ExecutionMemoryPool(
           val numActiveTasks = memoryForTask.keys.size
           val curMem = memoryForTask(taskAttemptId)
     
    -      // How much we can grant this task; don't let it grow to more than 1 / numActiveTasks;
    -      // don't let it be negative
    -      val maxToGrant =
    -        math.min(numBytes, math.max(0, (poolSize / numActiveTasks) - curMem))
    +      // In every iteration of this loop, we should first try to reclaim any borrowed execution
    +      // space from storage. This is necessary because of the potential race condition where new
    +      // storage blocks may steal the free execution memory that this task was waiting for.
    +      maybeGrowPool(numBytes - memoryFree)
    +
    +      // Maximum size the pool would have after potentially growing the pool.
    +      // This is used to compute the upper bound of how much memory each task can occupy. This
    +      // must take into account potential free memory as well as the amount this pool currently
    +      // occupies. Otherwise, we may run into SPARK-12155 where, in unified memory management,
    +      // we did not take into account space that could have been freed by evicting cached blocks.
    +      val maxPoolSize = computeMaxPoolSize()
    +      val maxMemoryPerTask = maxPoolSize / numActiveTasks
    +      val minMemoryPerTask = poolSize / (2 * numActiveTasks)
    +
    +      // How much we can grant this task; keep its share within 0 <= X <= 1 / numActiveTasks
    +      val maxToGrant = math.min(numBytes, math.max(0, maxMemoryPerTask - curMem))
           // Only give it as much memory as is free, which might be none if it reached 1 / numTasks
           val toGrant = math.min(maxToGrant, memoryFree)
     
    -      if (curMem < poolSize / (2 * numActiveTasks)) {
    +      if (curMem < minMemoryPerTask) {
    --- End diff --
    
    that looks fine, but is it completely equivalent to the old code? It's not 100% clear to me.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10240#discussion_r47275498
  
    --- Diff: core/src/main/scala/org/apache/spark/memory/ExecutionMemoryPool.scala ---
    @@ -91,23 +108,34 @@ private[memory] class ExecutionMemoryPool(
           val numActiveTasks = memoryForTask.keys.size
           val curMem = memoryForTask(taskAttemptId)
     
    -      // How much we can grant this task; don't let it grow to more than 1 / numActiveTasks;
    -      // don't let it be negative
    -      val maxToGrant =
    -        math.min(numBytes, math.max(0, (poolSize / numActiveTasks) - curMem))
    +      // In every iteration of this loop, we should first try to reclaim any borrowed execution
    +      // space from storage. This is necessary because of the potential race condition where new
    +      // storage blocks may steal the free execution memory that this task was waiting for.
    +      maybeGrowPool(numBytes - memoryFree)
    +
    +      // Maximum size the pool would have after potentially growing the pool.
    +      // This is used to compute the upper bound of how much memory each task can occupy. This
    +      // must take into account potential free memory as well as the amount this pool currently
    +      // occupies. Otherwise, we may run into SPARK-12155 where, in unified memory management,
    +      // we did not take into account space that could have been freed by evicting cached blocks.
    +      val maxPoolSize = computeMaxPoolSize()
    +      val maxMemoryPerTask = maxPoolSize / numActiveTasks
    +      val minMemoryPerTask = poolSize / (2 * numActiveTasks)
    +
    +      // How much we can grant this task; keep its share within 0 <= X <= 1 / numActiveTasks
    +      val maxToGrant = math.min(numBytes, math.max(0, maxMemoryPerTask - curMem))
           // Only give it as much memory as is free, which might be none if it reached 1 / numTasks
           val toGrant = math.min(maxToGrant, memoryFree)
     
    -      if (curMem < poolSize / (2 * numActiveTasks)) {
    +      if (curMem < minMemoryPerTask) {
    --- End diff --
    
    I was able to prove this myself. I summarized my thoughts in this gist: https://gist.github.com/andrewor14/aea58796dd25d2ec9f20
    
    That said, I would still prefer to do this separately since this PR is already passing tests. :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/10240#issuecomment-163785673
  
    Thanks, merging into master 1.6.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12155] [SPARK-12253] Fix executor OOM i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10240#issuecomment-163519195
  
    **[Test build #47491 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47491/consoleFull)** for PR 10240 at commit [`d8be669`](https://github.com/apache/spark/commit/d8be66911d2abf3da46a25a54a7d80fd1eeebdfa).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org