You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by ehnalis <gi...@git.apache.org> on 2015/10/05 15:40:26 UTC

[GitHub] spark pull request: [SPARK-10797] RDD's coalesce should not write ...

GitHub user ehnalis opened a pull request:

    https://github.com/apache/spark/pull/8979

    [SPARK-10797] RDD's coalesce should not write out the temporary key

    I think we have the following options to solve this problem:
    
    1. [General, but too complex] Create a new ShuffledRDD, add support to shuffle system to read values instead of key-value pairs.
    2. [General, too complex] Create a new CoalescedRDD, that bypasses ShuffledRDD and reads from shuffle-layer directly - still need to add support to read values.
    3. [Minimal impact on code] Add option for shuffle-layer to write out values only and cast the key-value iterator at read-site to Iterator[T].
    
    I've implemented and I'm using the 3rd option. You will experience speed-up based on the size of current payload (values).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ehnalis/spark coalesce

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/8979.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #8979
    
----
commit 6fc5cf4bcef31c0c3d28b9ef011ca5de9f29c873
Author: ehnalis <zo...@gmail.com>
Date:   2015-10-05T13:14:34Z

    Fixed tests.

commit 178a04ebaef9f7b0a03c87589caf58ead93baf30
Author: ehnalis <zo...@gmail.com>
Date:   2015-10-05T13:27:09Z

    [SPARK-10797] RDD's coalesce should not write out the temporary key

commit 2b5f77b5333b93bb62f311bed37694e7f3ab0fd1
Author: ehnalis <zo...@gmail.com>
Date:   2015-10-05T13:29:42Z

    [SPARK-10797] RDD's coalesce should not write out the temporary key

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10797] RDD's coalesce should not write ...

Posted by ehnalis <gi...@git.apache.org>.
Github user ehnalis commented on the pull request:

    https://github.com/apache/spark/pull/8979#issuecomment-149847601
  
    I've looked into your implementation of `ShuffledRowRDD`. This was something that I was thinking of in the first place. Why I stuck with the simpler case is that we only use this upon repartitioning, and it would mean a not-so-beauty code-duplication in some sense. What do you think?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10797] RDD's coalesce should not write ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8979#issuecomment-145531065
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10797] RDD's coalesce should not write ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8979#issuecomment-146115601
  
      [Test build #1850 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1850/console) for   PR 8979 at commit [`2b5f77b`](https://github.com/apache/spark/commit/2b5f77b5333b93bb62f311bed37694e7f3ab0fd1).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `abstract class ShuffleHandle(val shuffleId: Int) extends Serializable `



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10797] RDD's coalesce should not write ...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/8979#issuecomment-164599666
  
    @ehnalis have you done some benchmarking on how much time this actually saves us?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10797] RDD's coalesce should not write ...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/8979#issuecomment-149662417
  
    Actually, let me think about this a bit more. The approach here might be easier, but I'll need to spend a bit more time to think through this in the context of some other planned shuffle refactorings.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10797] RDD's coalesce should not write ...

Posted by ehnalis <gi...@git.apache.org>.
Github user ehnalis commented on the pull request:

    https://github.com/apache/spark/pull/8979#issuecomment-162234560
  
    @JoshRosen, how's your refactoring? Should I rebase to PR again?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #8979: [SPARK-10797] RDD's coalesce should not write out ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/8979


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #8979: [SPARK-10797] RDD's coalesce should not write out the tem...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the issue:

    https://github.com/apache/spark/pull/8979
  
    @zzvara this is a pretty cool idea -- we probably want to generalize this a little bit more in Spark 2.1 or Spark 2.2.
    
    I'm going through a list of pull requests to cut them down since the sheer number is breaking some of the tooling we have. Due to lack of activity on this pull request, I'm going to push a commit to close it. Let's keep discussing it in the jira ticket and revisit.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10797] RDD's coalesce should not write ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8979#issuecomment-146114546
  
      [Test build #1850 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1850/consoleFull) for   PR 8979 at commit [`2b5f77b`](https://github.com/apache/spark/commit/2b5f77b5333b93bb62f311bed37694e7f3ab0fd1).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10797] RDD's coalesce should not write ...

Posted by 3ourroom <gi...@git.apache.org>.
Github user 3ourroom commented on the pull request:

    https://github.com/apache/spark/pull/8979#issuecomment-164773360
  
    
    NAVER - http://www.naver.com/
    --------------------------------------------
    
    3ourroom@naver.com 님께 보내신 메일 <Re: [spark] [SPARK-10797] RDD's coalesce should not write out the temporary key (#8979)> 이 다음과 같은 이유로 전송 실패했습니다.
    
    --------------------------------------------
    
    받는 사람이 회원님의 메일을 수신차단 하였습니다. 
    
    
    --------------------------------------------



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10797] RDD's coalesce should not write ...

Posted by ehnalis <gi...@git.apache.org>.
Github user ehnalis commented on the pull request:

    https://github.com/apache/spark/pull/8979#issuecomment-146110813
  
    Can I have a test build here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10797] RDD's coalesce should not write ...

Posted by ehnalis <gi...@git.apache.org>.
Github user ehnalis commented on the pull request:

    https://github.com/apache/spark/pull/8979#issuecomment-164769323
  
    On a 4 node YARN cluster I've measured a 38% shuffle-size reduction when the payload was an (Int, String) pair (JavaSerializer), where the average length of String was 5.1 (English word). I did not benchmark time.
    
    Practically you save an `Int` on each record.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10797] RDD's coalesce should not write ...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/8979#issuecomment-149660588
  
    Hey @ehnalis, I like the core idea behind this patch. However, I'm a bit concerned about how many files it needs to touch and the level of complexity here.
    
    I think that your approach 1, of creating a new ShuffledRDD variant to do value-only shuffle, actually won't be too much work. I did something similar, but optimized only for Spark SQL's `UnsafeRow` format, in #7456. Could you take a look at the implementation of `ShuffledRowRDD` in that patch to assess whether you could do something similar (but more generic) here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10797] RDD's coalesce should not write ...

Posted by ehnalis <gi...@git.apache.org>.
Github user ehnalis commented on the pull request:

    https://github.com/apache/spark/pull/8979#issuecomment-149674387
  
    It does touch a few files, but the level of complexity it introduces is minimal I think. It might be convenient for now, to apply this patch and reconsider a step-by-step shuffle-refactoring. The sad thing is that other approaches would touch `ExternalSorter` aswell as other classes heavily. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10797] RDD's coalesce should not write ...

Posted by ehnalis <gi...@git.apache.org>.
Github user ehnalis commented on the pull request:

    https://github.com/apache/spark/pull/8979#issuecomment-146121481
  
    Can I have a test build here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-10797] RDD's coalesce should not write ...

Posted by ehnalis <gi...@git.apache.org>.
Github user ehnalis commented on the pull request:

    https://github.com/apache/spark/pull/8979#issuecomment-149674890
  
    I'm going to look upon your previous PR tomorrow and will think through a more generic way of solving this, but I think it would mean a huge hit on the shuffle-codebase.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org