Posted to reviews@spark.apache.org by mingyukim <gi...@git.apache.org> on 2015/02/06 07:06:28 UTC

[GitHub] spark pull request: [SPARK-4808] Configurable spillable memory thr...

GitHub user mingyukim opened a pull request:

    https://github.com/apache/spark/pull/4420

    [SPARK-4808] Configurable spillable memory threshold + sampling rate

    In the common case, Spillable's heuristic of checking for memory stress
    on every 32nd item after 1000 items have been read is good enough.
    Generally, we do not want to run the spilling checks until later in the
    job; checking for disk-spilling too early can impose an unacceptable
    performance penalty in trivial cases.
    
    However, there are non-trivial cases, particularly when each serialized
    object is large, where checking too late for the need to spill would let
    memory overflow. Suppose every item is 1.5 MB in size and the heap is
    1000 MB. If we only start trying to spill the in-memory contents to disk
    after 1000 items have been read, we would already have accumulated
    1500 MB of data in memory and overflowed the heap.
    
    Patch #3656 attempted to circumvent this by checking the need to spill
    on every single item read, but that would cause unacceptable performance
    in the general case. Still, jobs like the one above should not be forced
    to refactor their data into smaller items. It therefore makes sense to
    make the memory-spilling thresholds configurable.
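
    As a rough sketch of the gate in question and the knobs this patch
    would expose (names are illustrative and only loosely modeled on
    Spillable; this is not the actual Spark code):

    ```scala
    // Sketch: the spill-check gate with the two knobs this PR proposes
    // making configurable. The defaults mirror the hard-coded 1000 / 32
    // values described above.
    class SpillCheckGate(trackMemoryThreshold: Long = 1000L,
                         checkFrequency: Long = 32L) {
      private var elementsRead = 0L

      /** Returns true when the (expensive) memory-stress check should run. */
      def shouldCheck(): Boolean = {
        elementsRead += 1
        elementsRead > trackMemoryThreshold && elementsRead % checkFrequency == 0
      }
    }
    ```

    With the defaults, a stream of fewer than 1000 items never triggers a
    check, which is exactly the failure mode described above; lowering
    `trackMemoryThreshold` would let large-item jobs opt in to earlier checks.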

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mccheah/spark memory-spill-configurable

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4420.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4420
    
----
commit 84afd105a0f27ef92c7034cbea97b74ea286232c
Author: mcheah <mc...@palantir.com>
Date:   2015-02-05T15:02:11Z

    [SPARK-4808] Configurable spillable memory threshold + sampling rate
    
    In the common case, Spillable's heuristic of checking for memory stress
    on every 32nd item after 1000 items have been read is good enough.
    Generally, we do not want to run the spilling checks until later in the
    job; checking for disk-spilling too early can impose an unacceptable
    performance penalty in trivial cases.
    
    However, there are non-trivial cases, particularly when each serialized
    object is large, where checking too late for the need to spill would let
    memory overflow. Suppose every item is 1.5 MB in size and the heap is
    1000 MB. If we only start trying to spill the in-memory contents to disk
    after 1000 items have been read, we would already have accumulated
    1500 MB of data in memory and overflowed the heap.
    
    Patch #3656 attempted to circumvent this by checking the need to spill
    on every single item read, but that would cause unacceptable performance
    in the general case. Still, jobs like the one above should not be forced
    to refactor their data into smaller items. It therefore makes sense to
    make the memory-spilling thresholds configurable.

----


[GitHub] spark pull request: [SPARK-4808] Configurable spillable memory thr...

Posted by mingyukim <gi...@git.apache.org>.
Github user mingyukim commented on the pull request:

    https://github.com/apache/spark/pull/4420#issuecomment-74972296
  
    Thanks for the response. To be clear, I understand the hesitation about exposing knobs. My proposal was to throttle the frequency of spills by how much memory is acquired from the shuffle memory manager at a time (e.g. if you ask for 100MB at a time, you won't produce spill files smaller than 100MB), but I understand that this would also need some tuning depending on the executor heap size.
    
    That said, I'm testing whether your simpler proposal of effectively setting `trackMemoryThreshold=0` will fix the particular workflow we have. If it fixes our problem and @andrewor14 says this is good to go, I'm fine with this as a fix for now.


[GitHub] spark pull request: [SPARK-4808] Configurable spillable memory thr...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/4420#issuecomment-73293800
  
    Thanks @mingyukim. On second thought, I'm not sure we should make these configurable, since doing so pushes the responsibility onto users to experiment and find the optimal values for these very advanced tuning parameters themselves. I think we should explore other alternatives, such as using the memory size as an additional heuristic, instead of introducing highly advanced configs that we could never remove later for backward-compatibility reasons.


[GitHub] spark pull request: [SPARK-4808] Removing minimum number of elemen...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4420#issuecomment-75178062
  
      [Test build #27759 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27759/consoleFull) for PR 4420 at commit [`6e2509f`](https://github.com/apache/spark/commit/6e2509f060c1acb6706e575b66fb953dba9a424c).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-4808] Removing minimum number of elemen...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4420#issuecomment-75178549
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27760/
    Test PASSed.


[GitHub] spark pull request: [SPARK-4808] Configurable spillable memory thr...

Posted by mingyukim <gi...@git.apache.org>.
Github user mingyukim commented on the pull request:

    https://github.com/apache/spark/pull/4420#issuecomment-73187536
  
    This follows up on @andrewor14's comments on #3656. It makes the threshold and frequency configurable rather than removing them entirely. Please let me know if I should add documentation for these configurations as well!


[GitHub] spark pull request: [SPARK-4808] Removing minimum number of elemen...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/4420#issuecomment-75178396
  
    Ok, I merged this into master, 1.3, and 1.2. Thanks, guys.


[GitHub] spark pull request: [SPARK-4808] Removing minimum number of elemen...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4420#issuecomment-75178070
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27759/
    Test PASSed.


[GitHub] spark pull request: [SPARK-4808] Removing minimum number of elemen...

Posted by mingyukim <gi...@git.apache.org>.
Github user mingyukim closed the pull request at:

    https://github.com/apache/spark/pull/4420


[GitHub] spark pull request: [SPARK-4808] Configurable spillable memory thr...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/4420#issuecomment-73492519
  
    ok to test


[GitHub] spark pull request: [SPARK-4808] Removing minimum number of elemen...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4420#issuecomment-75170813
  
      [Test build #27760 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27760/consoleFull) for PR 4420 at commit [`6e2509f`](https://github.com/apache/spark/commit/6e2509f060c1acb6706e575b66fb953dba9a424c).
     * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-4808] Configurable spillable memory thr...

Posted by mingyukim <gi...@git.apache.org>.
Github user mingyukim commented on the pull request:

    https://github.com/apache/spark/pull/4420#issuecomment-75139195
  
    Great. Checking every N bytes is very similar to what I was suggesting. We'd be happy to do that now. What would you use for the value N? Would you make it configurable?


[GitHub] spark pull request: [SPARK-4808] Configurable spillable memory thr...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/4420#issuecomment-75162077
  
    I would not make it configurable, for the same reason. I don't have a great sense of what the value should be off the top of my head. It would be good to deduce it from empirical data collected from real workloads with varying requirements (a few large items, many large items, mostly small items, very few small items, etc.). By the way, if you intend to do that, please also change the description and the title of the PR.


[GitHub] spark pull request: [SPARK-4808] Configurable spillable memory thr...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/4420#issuecomment-73566542
  
    @mingyukim we can't just remove the check, because this is a hot code path. Otherwise we would enter the synchronized block inside `tryToAcquire` for every single element (not partition), and that's really expensive. I think a threshold on something like `currentMemory - myMemoryThreshold` might make sense as a simple fix.
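
    A minimal sketch of that simple fix, with an assumed overshoot limit
    (the constant and the object name are hypothetical, not from the Spark
    source):

    ```scala
    object OvershootGate {
      // Spill once estimated usage exceeds the granted threshold by more
      // than a fixed overshoot, independent of how many elements were read.
      val maxOvershootBytes: Long = 64L * 1024 * 1024  // assumed value

      def shouldSpill(currentMemory: Long, myMemoryThreshold: Long): Boolean =
        currentMemory - myMemoryThreshold > maxOvershootBytes
    }
    ```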


[GitHub] spark pull request: [SPARK-4808] Configurable spillable memory thr...

Posted by mingyukim <gi...@git.apache.org>.
Github user mingyukim commented on the pull request:

    https://github.com/apache/spark/pull/4420#issuecomment-74828351
  
    @pwendell, I mentioned this above, but at a high level, wouldn't it be better to control the frequency of spills by how much memory you acquire from the shuffle memory manager at a time than by how often you check whether a spill is needed?
    
    We can test out your proposal on the specific case we have if that's the less risky option.


[GitHub] spark pull request: [SPARK-4808] Configurable spillable memory thr...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4420#issuecomment-73503883
  
      [Test build #27095 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27095/consoleFull) for PR 4420 at commit [`84afd10`](https://github.com/apache/spark/commit/84afd105a0f27ef92c7034cbea97b74ea286232c).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-4808] Removing minimum number of elemen...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4420#issuecomment-75170335
  
      [Test build #27759 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27759/consoleFull) for PR 4420 at commit [`6e2509f`](https://github.com/apache/spark/commit/6e2509f060c1acb6706e575b66fb953dba9a424c).
     * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-4808] Removing minimum number of elemen...

Posted by mingyukim <gi...@git.apache.org>.
Github user mingyukim commented on the pull request:

    https://github.com/apache/spark/pull/4420#issuecomment-75658190
  
    That's fine with us. I'm not entirely familiar with the branching model here, but does that mean this will be merged into branch-1.3 after 1.3.0 is released, and branch-1.3 will later be merged into master?


[GitHub] spark pull request: [SPARK-4808] Removing minimum number of elemen...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4420#issuecomment-75178548
  
      [Test build #27760 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27760/consoleFull) for PR 4420 at commit [`6e2509f`](https://github.com/apache/spark/commit/6e2509f060c1acb6706e575b66fb953dba9a424c).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-4808] Configurable spillable memory thr...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4420#issuecomment-73503891
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27095/
    Test PASSed.


[GitHub] spark pull request: [SPARK-4808] Configurable spillable memory thr...

Posted by mingyukim <gi...@git.apache.org>.
Github user mingyukim commented on the pull request:

    https://github.com/apache/spark/pull/4420#issuecomment-73581628
  
    To be clear, I don't mean we should check for spilling on every element. I mean to check only when `currentMemory > myMemoryThreshold`. When that happens, you have two options: 1) don't try to spill, and wait until enough elements have accumulated since the last spill, which is what it does now, or 2) try to spill in order to avoid potential OOMs. Since spilling exists precisely to keep executors from OOMing, I believe 2 is the right way.
    
    I agree with you that spills shouldn't happen too frequently. However, the number of elements is the wrong proxy for controlling the frequency of spills, because you never know how large each element will be. The number of bytes is a better proxy: have the Spillable data structure accumulate at least N bytes' worth of elements (as opposed to at least M elements) before spilling again. You can achieve this by acquiring at least N bytes from shuffleMemoryManager at a time.
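
    A rough sketch of that acquisition strategy, assuming a memory manager
    reduced to the `tryToAcquire(numBytes)` call mentioned above; the
    `minAcquireBytes` knob and the class names are hypothetical:

    ```scala
    trait MemoryManagerLike {
      def tryToAcquire(numBytes: Long): Long  // returns the amount granted
    }

    class ByteThrottledSpillable(memoryManager: MemoryManagerLike,
                                 minAcquireBytes: Long = 64L * 1024 * 1024) {
      private var myMemoryThreshold = 0L

      /** Called after each insert with the collection's estimated size. */
      def maybeSpill(currentMemory: Long, doSpill: () => Unit): Unit = {
        if (currentMemory > myMemoryThreshold) {
          // Ask for at least minAcquireBytes, so a spill triggered by a
          // failed grant always covers at least that many bytes.
          val request =
            math.max(currentMemory - myMemoryThreshold, minAcquireBytes)
          myMemoryThreshold += memoryManager.tryToAcquire(request)
          if (currentMemory > myMemoryThreshold) {
            doSpill()              // couldn't get enough memory: spill now
            myMemoryThreshold = 0L // releasing the grant is elided here
          }
        }
      }
    }
    ```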
    
    What do you think?


[GitHub] spark pull request: [SPARK-4808] Removing minimum number of elemen...

Posted by mingyukim <gi...@git.apache.org>.
GitHub user mingyukim reopened a pull request:

    https://github.com/apache/spark/pull/4420

    [SPARK-4808] Removing minimum number of elements read before spill check

    In the common case, Spillable's heuristic of checking for memory stress
    on every 32nd item after 1000 items have been read is good enough.
    Generally, we do not want to run the spilling checks until later in the
    job; checking for disk-spilling too early can impose an unacceptable
    performance penalty in trivial cases.
    
    However, there are non-trivial cases, particularly when each serialized
    object is large, where checking too late for the need to spill would let
    memory overflow. Suppose every item is 1.5 MB in size and the heap is
    1000 MB. If we only start trying to spill the in-memory contents to disk
    after 1000 items have been read, we would already have accumulated
    1500 MB of data in memory and overflowed the heap.
    
    Patch #3656 attempted to circumvent this by checking the need to spill
    on every single item read, but that would cause unacceptable performance
    in the general case. Still, jobs like the one above should not be forced
    to refactor their data into smaller items. It therefore makes sense to
    make the memory-spilling thresholds configurable.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mccheah/spark memory-spill-configurable

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4420.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4420
    
----
commit 6e2509f060c1acb6706e575b66fb953dba9a424c
Author: mcheah <mc...@palantir.com>
Date:   2015-02-20T00:43:26Z

    [SPARK-4808] Removing minimum number of elements read before spill check
    
    We found that if we only start checking for spilling after reading 1000
    elements in a Spillable, we would run out of memory if there are fewer
    than 1000 elements but each of those elements are very large.
    
    There is no real need to only check for spilling after reading 1000
    things. It is still necessary, however, to mitigate the cost of entering
    a synchronized block. It turns out in practice however checking for
    spilling only on every 32 items is sufficient, without needing the
    minimum elements threshold.
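
    A condensed sketch of the change this commit describes (the conditions
    are reconstructed from the description above, not copied from the diff):

    ```scala
    object SpillCheckConditions {
      // Before: no spill checks at all until 1000 elements have been read.
      def shouldCheckBefore(elementsRead: Long): Boolean =
        elementsRead > 1000 && elementsRead % 32 == 0

      // After: keep only the every-32nd-element spacing, so a handful of
      // very large elements can still trigger a spill early in the stream.
      def shouldCheckAfter(elementsRead: Long): Boolean =
        elementsRead % 32 == 0
    }
    ```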

----


[GitHub] spark pull request: [SPARK-4808] Configurable spillable memory thr...

Posted by mccheah <gi...@git.apache.org>.
Github user mccheah commented on the pull request:

    https://github.com/apache/spark/pull/4420#issuecomment-74989369
  
    It sounds to me like checking every N bytes is the most robust approach, however. We can implement this immediately if that's the right fix. I think this is what @mingyukim suggested. How does that sound, @mingyukim @pwendell @andrewor14?


[GitHub] spark pull request: [SPARK-4808] Configurable spillable memory thr...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/4420#issuecomment-74987043
  
    @mingyukim @mccheah
    Just to clarify: the reason we have the every-N-elements check is not to avoid spilling on every element (unlikely), but to avoid entering the synchronized block on every element. The latter is expensive if we have multiple threads trying to unroll blocks in memory.
    
    I think the original motivation for `trackMemoryThreshold` was to optimize the case of very small partitions. In such cases we currently don't even try to enter the synchronized block at all, the potential cost of which could be a non-trivial fraction of the relatively short aggregation time. However, this is all conjecture, and I don't believe this optimization is crucial for such workloads, so I don't see a particular reason to keep the threshold if removing it means we can handle large items.
    
    However, simply removing the threshold won't solve the general case. There could be a burst of large items in the middle of the stream, in which case we would still buffer too many (32) of them in memory. One potential fix is to check every N bytes as opposed to every X elements. This is cheap because we already have an estimate of the map size before and after each element is inserted.
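
    A minimal sketch of that byte-based trigger, assuming the collection
    reports an estimated size after each insert (the interval constant is a
    placeholder, not a value proposed in this thread):

    ```scala
    // Runs the expensive spill check only after the estimated size has
    // grown by at least checkIntervalBytes since the last check.
    class ByteTriggeredSpillCheck(checkIntervalBytes: Long = 1L * 1024 * 1024) {
      private var sizeAtLastCheck = 0L

      def shouldCheck(estimatedSizeAfterInsert: Long): Boolean = {
        if (estimatedSizeAfterInsert - sizeAtLastCheck >= checkIntervalBytes) {
          sizeAtLastCheck = estimatedSizeAfterInsert
          true
        } else {
          false
        }
      }
    }
    ```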


[GitHub] spark pull request: [SPARK-4808] Configurable spillable memory thr...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4420#issuecomment-73187302
  
    Can one of the admins verify this patch?


[GitHub] spark pull request: [SPARK-4808] Removing minimum number of elemen...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/4420#issuecomment-75817377
  
    Actually, I discussed this further with @pwendell offline. We should avoid changing performance between 1.3.0 and future 1.3.x releases in such a subtle way, as it may cause regressions in a few cases. Therefore it's best to just merge this change in 1.4.0 and revert it in all the older branches. I will do so and update the JIRA accordingly.


[GitHub] spark pull request: [SPARK-4808] Configurable spillable memory thr...

Posted by mingyukim <gi...@git.apache.org>.
Github user mingyukim commented on the pull request:

    https://github.com/apache/spark/pull/4420#issuecomment-73303667
  
    Can you elaborate on the "memory size as an additional heuristic" idea? This is consistently causing OOMs in one of our workflows, which is exactly the situation spilling to disk is supposed to handle. I'm happy to work on it on my end if you have suggestions.
    
    A few ideas off the top of my head:
    - Put a threshold on the `currentMemory - myMemoryThreshold` value, so it tries to spill once the difference gets big enough.
    - In fact, why not remove the entire threshold check, as originally suggested in #3656? You can tweak how often spills happen by setting a minimum on the amount of memory you request from ShuffleMemoryManager at a time. Then you're guaranteed that the spill files are not too small. You still get too many files? Well, that's unavoidable: your shuffle is really big, so you'd have to spill a lot; otherwise your JVM will OOM. Basically, I don't think trackMemoryThreshold and trackMemoryFrequency are the right way to control spill frequency or spill file size, since they can lead to OOMs when each element is large.


[GitHub] spark pull request: [SPARK-4808] Configurable spillable memory thr...

Posted by mccheah <gi...@git.apache.org>.
Github user mccheah commented on the pull request:

    https://github.com/apache/spark/pull/4420#issuecomment-75169280
  
    Okay, we need this soon on our side, so we will do the short-term fix of removing the minimum-elements memory threshold and not make anything configurable. We will open another Spark ticket to do this more robustly by counting bytes, and push another pull request for that as a more permanent solution.


[GitHub] spark pull request: [SPARK-4808] Configurable spillable memory thr...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/4420#issuecomment-74987312
  
    In case my previous message was too long, here's the TL;DR:
    
    If you remove the `trackMemoryThreshold` here, I will merge this patch. However, for now we need to keep some notion of spacing out the checks (i.e. every 32 items) for performance reasons. In a separate patch, it may be worthwhile to check every N bytes instead.


[GitHub] spark pull request: [SPARK-4808] Configurable spillable memory thr...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4420#issuecomment-73493008
  
      [Test build #27095 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27095/consoleFull) for PR 4420 at commit [`84afd10`](https://github.com/apache/spark/commit/84afd105a0f27ef92c7034cbea97b74ea286232c).
     * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-4808] Configurable spillable memory thr...

Posted by mccheah <gi...@git.apache.org>.
Github user mccheah commented on the pull request:

    https://github.com/apache/spark/pull/4420#issuecomment-74827293
  
    If every single object is large, though, then even after we've spilled on the 32nd object, there could still be an OOM before we check for spilling again, right? I don't know that there's a completely OOM-proof solution short of checking for spilling on every single element.


[GitHub] spark pull request: [SPARK-4808] Removing minimum number of elemen...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/4420


[GitHub] spark pull request: [SPARK-4808] Removing minimum number of elemen...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/4420#issuecomment-75447262
  
    Hi all, I will revert this in branch-1.3 for now, because it was technically merged after the first RC, and people may have already run performance tests against that. Per @pwendell, it's safer to backport this later than to potentially introduce a regression. I will mark this as such on the JIRA.


[GitHub] spark pull request: [SPARK-4808] Configurable spillable memory thr...

Posted by mccheah <gi...@git.apache.org>.
Github user mccheah commented on the pull request:

    https://github.com/apache/spark/pull/4420#issuecomment-74558171
  
    @andrewor14 what do you think about the comments from @mingyukim ?


[GitHub] spark pull request: [SPARK-4808] Removing minimum number of elemen...

Posted by mingyukim <gi...@git.apache.org>.
Github user mingyukim commented on the pull request:

    https://github.com/apache/spark/pull/4420#issuecomment-75173229
  
    Filed https://issues.apache.org/jira/browse/SPARK-5915 for the follow-up.
    
    Thanks, Patrick and Andrew, for the discussions!


[GitHub] spark pull request: [SPARK-4808] Removing minimum number of elemen...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/4420#issuecomment-75171084
  
    Ok, LGTM. I will merge this once tests pass. Could you file another JIRA for the byte-based tracking?


[GitHub] spark pull request: [SPARK-4808] Removing minimum number of elemen...

Posted by mccheah <gi...@git.apache.org>.
Github user mccheah commented on the pull request:

    https://github.com/apache/spark/pull/4420#issuecomment-75170336
  
    Jenkins, test this please.


[GitHub] spark pull request: [SPARK-4808] Configurable spillable memory thr...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/4420#issuecomment-74750269
  
    I wonder if we could improve these heuristics instead of adding user flags (for many users it might be hard to figure out how to set these). The heuristic of skipping the first 1000 entries assumes that the entries are small - I actually don't see what we gain by skipping those entries, since we still perform the sampling during this time (the sampling is the only really expensive part of these heuristics). @andrewor14, do you remember why this is there?
    
    Also, maybe we could check every element for the first 32 elements, then fall back to checking every 32nd element. This would handle the case where there are some extremely large objects.
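
    A one-line sketch of that schedule (purely illustrative):

    ```scala
    object FrontLoadedSchedule {
      // Check every element until 32 have been read, then every 32nd one.
      def shouldCheckSpill(elementsRead: Long): Boolean =
        elementsRead <= 32 || elementsRead % 32 == 0
    }
    ```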


[GitHub] spark pull request: [SPARK-4808] Configurable spillable memory thr...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/4420#issuecomment-74946838
  
    @mccheah @mingyukim yeah, there isn't an OOM-proof solution at all, because these are all heuristics. Even checking every element is not OOM-proof, since memory estimation is itself a heuristic that involves sampling. My only concern with exposing knobs here is that users will expect us to support them going forward, even though we may want to refactor this in the future in a way where those knobs no longer make sense. It's reasonable that users would consider it a regression if their tuning of those knobs stopped working.
    
    So if possible, it would be good to adjust our heuristics to meet a wider range of use cases, and then, if we keep hearing about more issues, we can expose knobs. We can't make them meet every possible use case, since they are heuristics, but in this case I was wondering if we could make a strict improvement to the heuristics. @andrewor14, can you comment on whether this is indeed a strict improvement?
    
    One of the main benefits of the new DataFrame API is that we will be able to have precise control over memory usage in a way that can avoid OOMs entirely. But for the current Spark API we are using this more ad-hoc memory estimation along with some heuristics.
    
    I'm not 100% against exposing knobs either, but I'd be interested to see whether some simple improvements fix your use case.


