Posted to reviews@spark.apache.org by kayousterhout <gi...@git.apache.org> on 2015/04/07 23:17:38 UTC

[GitHub] spark pull request: [SPARK-3376] Add in-memory shuffle option.

GitHub user kayousterhout opened a pull request:

    https://github.com/apache/spark/pull/5403

    [SPARK-3376] Add in-memory shuffle option.

    This commit adds a new ShuffleManager that stores all shuffle data
    in-memory using the block manager.
    
    With this change, in-memory shuffle data is cleaned up in the
    same way as disk shuffle data: using the metadata cleaner. It
    would probably be better to clean up in-memory shuffle data more
    aggressively, in order to avoid running out of memory.
    
    One idea proposed by @pwendell was to publish this as a Spark
    package rather than adding it directly to Spark. The concern is that
    this option may be confusing to naive users, because it can result in
    Spark's memory filling up in unexpected ways.  If it is decided that
    this should actually be added to Spark, I can update the documentation
    to show the in-memory shuffle as one of the possible shuffle managers.
    
    cc @shivaram 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kayousterhout/spark-1 SPARK-3376

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5403.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5403
    
----
commit 293fd69965be4fd5003c19e9cf85cf0c84bf7d54
Author: Kay Ousterhout <ka...@gmail.com>
Date:   2014-10-15T00:45:55Z

    [SPARK-3376] Add in-memory shuffle option.
    
    With this change, in-memory shuffle data is cleaned up in the
    same way as disk shuffle data: using the metadata cleaner. It
    would probably be better to clean up in-memory shuffle data more
    aggressively, in order to avoid running out of memory.

commit 57fb068ea28f9754dddd1961712ad04d9759b8a6
Author: Kay Ousterhout <ka...@gmail.com>
Date:   2015-04-07T21:14:30Z

    Added missing newlines

commit 00bdcd6f1cd6035108ed4c27ac5242ed70a3a286
Author: Kay Ousterhout <ka...@gmail.com>
Date:   2015-04-07T21:17:10Z

    Fixed header

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3376] Add in-memory shuffle option.

Posted by kayousterhout <gi...@git.apache.org>.

Github user kayousterhout commented on the pull request:

    https://github.com/apache/spark/pull/5403#issuecomment-90776436
  
    One other consideration: if you point spark.local.dir at a RAM disk, you can't persist other RDDs to disk, which you might want to do even when shuffle data is in memory (although I suppose this might be unlikely?).


[GitHub] spark pull request: [SPARK-3376] Add in-memory shuffle option.

Posted by pwendell <gi...@git.apache.org>.

Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/5403#issuecomment-96269096
  
    Actually I lied - in the codebase we do have some flags we use only for performance analysis. One is "spark.shuffle.sync", which forces writes to sync to disk much more aggressively. It has no use to end users other than for performance analysis of the shuffle, though it is much more hidden and low-level than this proposal.
    
    https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L650


[GitHub] spark pull request: [SPARK-3376] Add in-memory shuffle option.

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5403#issuecomment-90736942
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29813/


[GitHub] spark pull request: [SPARK-3376] Add in-memory shuffle option.

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5403#issuecomment-90736535
  
    [Test build #29813 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29813/consoleFull) for PR 5403 at commit [`00bdcd6`](https://github.com/apache/spark/commit/00bdcd6f1cd6035108ed4c27ac5242ed70a3a286).


[GitHub] spark pull request: [SPARK-3376] Add in-memory shuffle option.

Posted by mikeringenburg <gi...@git.apache.org>.

Github user mikeringenburg commented on the pull request:

    https://github.com/apache/spark/pull/5403#issuecomment-95684732
  
    I agree with Shivaram's point that removing all filesystem dependencies is a valuable point in the design space.  For deployments without local filesystems, but lots of RAM, having this option available (even if only as an experimental mode) would be valuable.  As mentioned above, using a RAM disk for this can be tricky.


[GitHub] spark pull request: [SPARK-3376] Add in-memory shuffle option.

Posted by uncleGen <gi...@git.apache.org>.

Github user uncleGen commented on the pull request:

    https://github.com/apache/spark/pull/5403#issuecomment-91403180
  
    Definitely, sort-based shuffle and hash-based shuffle are better for general use. However, IMHO, in-memory shuffle may have value as a performance optimization. In my own modest tests, in-memory shuffle performed better. What is more, we just need to provide enough memory, without the overhead of tuning parameters for sort-based or hash-based shuffle. I would be happy to see this feature provided. Maybe we could provide it as an experimental feature.


[GitHub] spark pull request: [SPARK-3376] Add in-memory shuffle option.

Posted by pwendell <gi...@git.apache.org>.

Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/5403#issuecomment-94588226
  
    Since it's a pretty simple implementation, I'd be fine if it were merged in. But I think we should say clearly that it can be useful for benchmarking, etc., but isn't meant to be used in a production setting since it's not robust to OOM. /cc @rxin for his thoughts also.


[GitHub] spark pull request: [SPARK-3376] Add in-memory shuffle option.

Posted by jerryshao <gi...@git.apache.org>.

Github user jerryshao commented on the pull request:

    https://github.com/apache/spark/pull/5403#issuecomment-90775140
  
    Wouldn't it be easier to set spark.local.dir to a RAM disk and store the shuffle files there? From my understanding, that is simpler and has less GC overhead :smile: .
    
    Besides, I think the shuffle framework could have two layers of abstraction:
    1. On the implementation side, there are sort-based shuffle and hash-based shuffle.
    2. On the storage side, we could choose a file-based mapping, which maps each bucket to one file or to one part of a file; or, as you implemented, a memory-based mapping, which maps each bucket to a memory block; or anything else.
    
    So basically these two layers could be combined in different ways: sort-based in-memory or sort-based on-disk shuffle, hash-based in-memory or hash-based on-disk shuffle... I think the implementation here is a hash-based in-memory shuffle, so what do you think if we want a sort-based in-memory shuffle?
    
    Just my thoughts, thanks a lot.
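
To make the two layers in the comment above concrete, here is a minimal Python sketch that enumerates the possible combinations. The strings are illustrative labels taken from this discussion; they are assumptions for the sketch, not Spark class names or configuration values.

```python
from itertools import product

# Layer 1: how map output is organized (labels from the discussion).
ALGORITHMS = ("hash-based", "sort-based")
# Layer 2: where each bucket is stored (labels from the discussion).
STORAGE = ("on-disk", "in-memory")


def shuffle_combinations():
    """Return every (algorithm, storage) pairing as a readable label."""
    return [f"{algo} {store} shuffle" for algo, store in product(ALGORITHMS, STORAGE)]


if __name__ == "__main__":
    for label in shuffle_combinations():
        print(label)
```

Under this framing, the pull request would correspond to the "hash-based in-memory shuffle" entry, and the question raised above is whether the "sort-based in-memory shuffle" entry is worth having.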

[GitHub] spark pull request: [SPARK-3376] Add in-memory shuffle option.

Posted by pwendell <gi...@git.apache.org>.

Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/5403#issuecomment-96267909
  
    My understanding was this wouldn't be an "experimental" feature in terms of how we've defined that in the past (i.e. it's not on a path to being something we'd expect people to ever use in production). It would just be an internal flag we could set when doing measurement work. We've never had such a feature before in the code base, so it's a bit of a question about whether we want to do that in general.
    
    I'm neutral to slightly positive on that idea. I think we'd just have it be undocumented and print a large warning that it's only for benchmarking work. I don't think this would add much more burden if the shuffle interface changes because it is ultimately much simpler than either of the existing two shuffles. I don't think this can possibly exist outside of the codebase because it uses the block storage APIs.
    
    On the other hand, I can see this percolating around on mailing lists, etc. as a way to "speed up your spark job". So there is an element of wanting to protect users from themselves and not have this in the codebase in a way that's easily accessible.

[GitHub] spark pull request: [SPARK-3376] Add in-memory shuffle option.

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/5403#issuecomment-96965047
  
    Maybe let's discuss this after the 1.4.0 deadline? I just don't see much benefit to users, other than for running benchmarks or academic studies. Over time I'm pretty sure this code path will fall behind and not be as optimized, which will then be bad for academic studies. So at best this will be good for benchmarks for a short time (maybe 1 or 2 releases).



[GitHub] spark pull request: [SPARK-3376] Add in-memory shuffle option.

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5403#issuecomment-90736940
  
    [Test build #29813 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29813/consoleFull) for PR 5403 at commit [`00bdcd6`](https://github.com/apache/spark/commit/00bdcd6f1cd6035108ed4c27ac5242ed70a3a286).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


[GitHub] spark pull request: [SPARK-3376] Add in-memory shuffle option.

Posted by mikeringenburg <gi...@git.apache.org>.

Github user mikeringenburg commented on the pull request:

    https://github.com/apache/spark/pull/5403#issuecomment-93057465
  
    We'd be very interested in trying this out - we've been pointing spark.local.dir to a RAM disk when we need better shuffle performance, but have run into a number of issues with that model.  We'd love to try out this in-memory shuffle to see if it helps.


[GitHub] spark pull request: [SPARK-3376] Add in-memory shuffle option.

Posted by shivaram <gi...@git.apache.org>.

Github user shivaram commented on the pull request:

    https://github.com/apache/spark/pull/5403#issuecomment-94905121
  
    My two cents: I think the main reason to merge this is that one doesn't need to maintain patches out of tree that'll become outdated when the interface changes :) Also I think this patch completely gets rid of all filesystem overheads (like file open, close, flush etc.), which is a pretty useful design point. We've seen filesystem types being a big issue in the past, and thus it'll be useful not just for research benchmarks but also for spark-perf, with questions like 'how much time does a small shuffle using files take compared to this?'


[GitHub] spark pull request: [SPARK-3376] Add in-memory shuffle option.

Posted by kayousterhout <gi...@git.apache.org>.

Github user kayousterhout commented on the pull request:

    https://github.com/apache/spark/pull/5403#issuecomment-97144931
  
    It seems like there are two separate issues here:
    
    (1) Should Spark ever have an in-memory shuffle?  Personally I think we should, partially because it's useful for benchmarking and partially because there are some environments (as @mikeringenburg  pointed out) where it makes more sense to store shuffle data in-memory (for performance reasons or cluster provisioning reasons etc.).  However, @rxin, it sounds like you're pretty strongly against this for maintainability reasons; if you're going to block all attempts at doing this, we should just close SPARK-3376 as "Will not fix".
    
    (2) If yes to the above question, should we add *this particular* in-memory shuffle?  To list a few reasons why we might not want this implementation:
    - In its current form, this implementation doesn't clean up in-memory shuffle files any more aggressively than normal shuffle files are cleaned up (so the shuffle data won't be deleted until the associated RDD goes out of scope).  In-memory shuffle data should really be cleaned up more aggressively, because unlike when we store shuffle data on-disk, there's a high cost of keeping the data around.
    - In addition to doing better cleanup of shuffle data, we likely would want to store shuffle data as a separate storage level (or with some kind of tag) so we can more cleanly fail when shuffle data becomes too large (i.e., explicitly fail with an "out of memory for shuffle" kind of exception, rather than a generic OOM).
    Parts of these issues are small and could just be fixed as part of this PR, while others are more substantial.  @sryza and @pwendell, it would help if you two could describe what you'd like to see in an ideal version of this feature, to understand whether they're things that can just be fixed as part of this PR.
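
As a rough illustration of the two points above (eager cleanup and a dedicated failure mode), here is a minimal Python sketch of a toy in-memory store with a separate byte budget for shuffle blocks. It is not the code in this PR and does not use Spark's block manager; `InMemoryShuffleStore`, `ShuffleMemoryExceededError`, and the budget parameter are hypothetical names invented for the example.

```python
class ShuffleMemoryExceededError(MemoryError):
    """Hypothetical dedicated error, raised instead of a generic OOM."""


class InMemoryShuffleStore:
    """Toy stand-in for a block store that accounts for shuffle blocks separately."""

    def __init__(self, max_shuffle_bytes):
        self.max_shuffle_bytes = max_shuffle_bytes
        self.used_bytes = 0
        self._blocks = {}  # (shuffle_id, map_id, reduce_id) -> bytes

    def put(self, shuffle_id, map_id, reduce_id, data):
        key = (shuffle_id, map_id, reduce_id)
        # Overwriting an existing block releases its old bytes first.
        new_used = self.used_bytes - len(self._blocks.get(key, b"")) + len(data)
        if new_used > self.max_shuffle_bytes:
            raise ShuffleMemoryExceededError(
                f"out of memory for shuffle {shuffle_id}: need {new_used} bytes, "
                f"budget is {self.max_shuffle_bytes}"
            )
        self._blocks[key] = data
        self.used_bytes = new_used

    def get(self, shuffle_id, map_id, reduce_id):
        return self._blocks[(shuffle_id, map_id, reduce_id)]

    def remove_shuffle(self, shuffle_id):
        """Eagerly drop every block of one shuffle; returns the bytes freed."""
        doomed = [key for key in self._blocks if key[0] == shuffle_id]
        freed = sum(len(self._blocks.pop(key)) for key in doomed)
        self.used_bytes -= freed
        return freed
```

In this sketch a caller would invoke `remove_shuffle` as soon as the shuffle's output is no longer needed, rather than waiting for a periodic cleaner; deciding when that moment is in a real scheduler is the hard part and is not modeled here.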


[GitHub] spark pull request: [SPARK-3376] Add in-memory shuffle option.

Posted by sryza <gi...@git.apache.org>.

Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/5403#issuecomment-97151475
  
    My opinion is that the main criteria for including this are:
    * Is there an intention for and clear path to a state where this could be used in production? I think it's likely that this means adding spill functionality so we never OOM.
    * Is there somebody interested in actively maintaining this? There's already a pretty severe dearth of knowledge about shuffle workings. If we want to scale its complexity, we need to scale the number of contributors that understand it.
    



[GitHub] spark pull request: [SPARK-3376] Add in-memory shuffle option.

Posted by kayousterhout <gi...@git.apache.org>.

Github user kayousterhout commented on the pull request:

    https://github.com/apache/spark/pull/5403#issuecomment-90776257
  
    @jerryshao I've found it can be tricky to configure a ram disk to be the correct size for this, making something like this easier.  However, if folks have generally found that to be a suitable solution, I'm happy to just close this patch.
    
    Re: the two layers of abstraction, I don't think there's any reason to do a sort-based shuffle in-memory.  The point of the sort-based shuffle is to improve Spark's use of disk by storing just one file for each map task, rather than opening <# reduce tasks> files for each map task (which makes some file systems like ext3 struggle, and also leads to much seek-ier disk use).  As long as data gets stored in memory, I can't think of any reason why using the sort-based shuffle would improve performance (and there is some - likely small - performance cost of sorting all of the data).  Are there other reasons you can think of that you'd want to use an in-memory version of the sort-based shuffle?
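
The file-count argument above can be checked with a little arithmetic. The following Python sketch counts shuffle files under the simplified model described in the comment (one file per map task for sort-based shuffle, one file per reduce task per map task for hash-based shuffle). The job size is a made-up number, and any auxiliary files a real implementation writes are ignored.

```python
def hash_shuffle_files(num_map_tasks, num_reduce_tasks):
    """Hash-based shuffle: each map task opens one file per reduce task."""
    return num_map_tasks * num_reduce_tasks


def sort_shuffle_files(num_map_tasks):
    """Sort-based shuffle: each map task stores just one file."""
    return num_map_tasks


if __name__ == "__main__":
    maps, reduces = 1000, 1000  # hypothetical job size
    print("hash-based files:", hash_shuffle_files(maps, reduces))  # 1000000
    print("sort-based files:", sort_shuffle_files(maps))           # 1000
```

When the data lives in memory there are no files to open or seek, which is why this difference stops mattering for an in-memory shuffle.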


[GitHub] spark pull request: [SPARK-3376] Add in-memory shuffle option.

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/5403#issuecomment-94689893
  
    I'm not sure how realistic this can be outside benchmarking mode. Also we will likely work on some code to substantially speed up shuffle, and as a result maybe this won't be as necessary? My worry with checking this in is that it will make it slightly harder to change the interfaces in the future (just more files to muck around).


[GitHub] spark pull request: [SPARK-3376] Add in-memory shuffle option.

Posted by sryza <gi...@git.apache.org>.

Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/5403#issuecomment-95686597
  
    I've seen a lot of experimental features go into Hadoop and then slowly wilt from lack of use and maintenance.
    
    My two cents are that we should only include an experimental feature when we have a concrete plan for how it can become production-ready, including a developer with the intention of carrying it out and maintaining it.


[GitHub] spark pull request: [SPARK-3376] Add in-memory shuffle option.

Posted by pwendell <gi...@git.apache.org>.

Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/5403#issuecomment-96436773
  
    By the way - if we did end up deciding to include this, I do feel that:
    
    1. We should not mark this as solving SPARK-3376 (the goal there was to build a production thing to speed up real workloads).
    2. We should name it something like "org.apache.spark.shuffle.measurement.BenchmarkShuffleManager" or something where it is really clear that it's for internal measurements.


[GitHub] spark pull request: [SPARK-3376] Add in-memory shuffle option.

Posted by kayousterhout <gi...@git.apache.org>.

Github user kayousterhout commented on the pull request:

    https://github.com/apache/spark/pull/5403#issuecomment-90768295
  
    One idea here is to store the shuffle data at the MEMORY_AND_DISK storage level, so that things degrade more gracefully when memory becomes contended.
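
The degrade-gracefully idea above can be sketched as a store that keeps blocks in memory while they fit and writes them to disk otherwise. This is a toy Python model, not Spark's block manager; `MemoryThenDiskStore`, its byte budget, and the spill directory are hypothetical, and the sketch assumes each block id is written only once.

```python
import os
import tempfile


class MemoryThenDiskStore:
    """Toy block store: keep blocks in memory, fall back to disk when full."""

    def __init__(self, max_memory_bytes, spill_dir):
        self.max_memory_bytes = max_memory_bytes
        self.spill_dir = spill_dir
        self.memory_used = 0
        self._in_memory = {}  # block_id -> bytes
        self._on_disk = {}    # block_id -> file path

    def put(self, block_id, data):
        """Store a block and report where it landed: 'memory' or 'disk'."""
        if self.memory_used + len(data) <= self.max_memory_bytes:
            self._in_memory[block_id] = data
            self.memory_used += len(data)
            return "memory"
        # Memory is contended: write the block to a file instead of failing.
        path = os.path.join(self.spill_dir, block_id + ".block")
        with open(path, "wb") as f:
            f.write(data)
        self._on_disk[block_id] = path
        return "disk"

    def get(self, block_id):
        if block_id in self._in_memory:
            return self._in_memory[block_id]
        with open(self._on_disk[block_id], "rb") as f:
            return f.read()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as spill_dir:
        store = MemoryThenDiskStore(max_memory_bytes=8, spill_dir=spill_dir)
        print(store.put("shuffle_0_0_0", b"12345"))  # memory
        print(store.put("shuffle_0_0_1", b"67890"))  # disk
        print(store.get("shuffle_0_0_1"))            # b'67890'
```

The trade-off is that such a store reintroduces a filesystem dependency, which matters for the deployments without local disks mentioned elsewhere in this thread.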


[GitHub] spark pull request: [SPARK-3376] Add in-memory shuffle option.

Posted by jerryshao <gi...@git.apache.org>.

Github user jerryshao commented on the pull request:

    https://github.com/apache/spark/pull/5403#issuecomment-90781460
  
    Thanks a lot for your reply. Just a rough thought: if full sort-based shuffle (with a sort shuffle reader) is enabled, as [SPARK-2926](https://issues.apache.org/jira/browse/SPARK-2926) describes, I think sort-based shuffle would still outperform hash-based shuffle, even in memory, in cases that require sorting by key (such as sort-merge join). But for now, as you said, hash-based shuffle is better than sort-based shuffle with the current implementation.
    
    Also, if this patch focuses on benchmarking, we need to tune things so that nothing spills to disk. In the current implementation some files are still spilled to disk (for example by ExternalAppendOnlyMap). So it depends on the goal: if we target benchmarking, it would be better for all the data to be in memory, in which case using a RAM disk is equivalent to this solution but will probably perform better (because of GC).
    
    Just my immediate thoughts; I have no concrete reason to argue this point. Sorry for any misunderstanding :smiley: .


[GitHub] spark pull request: [SPARK-3376] Add in-memory shuffle option.

Posted by pwendell <gi...@git.apache.org>.

Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/5403#issuecomment-90736418
  
    Yeah so my feeling on this one is I'm sure it's really useful for benchmarks where you can size things so that data is in memory, but I'd be really hesitant to expose this to the average Spark user. For instance, you could have some increase in the input size of your data and then suddenly, for no reason, your production job now fails with an out-of-memory exception. That seems like it could easily cause bad user experience for people so I just wondered if this could be maintained as a third party package.


[GitHub] spark pull request: [SPARK-3376] Add in-memory shuffle option.

Posted by jegonzal <gi...@git.apache.org>.

Github user jegonzal commented on the pull request:

    https://github.com/apache/spark/pull/5403#issuecomment-97975070
  
    This PR could have important performance implications for algorithms in GraphX and MLlib (e.g., ALS) which introduce relatively lightweight shuffle stages at each iteration. 


[GitHub] spark pull request: [SPARK-3376] Add in-memory shuffle option.

Posted by mikeringenburg <gi...@git.apache.org>.

Github user mikeringenburg commented on the pull request:

    https://github.com/apache/spark/pull/5403#issuecomment-97247933
  
    In response to @sryza's second question, I'm very interested in this PR, and would be happy to volunteer to help out wherever it's needed, including maintenance if necessary.

