Posted to reviews@spark.apache.org by rxin <gi...@git.apache.org> on 2014/07/17 01:34:56 UTC

[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

GitHub user rxin opened a pull request:

    https://github.com/apache/spark/pull/1450

    [SPARK-2534] Avoid pulling in the entire RDD in groupByKey.

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rxin/spark agg-closure

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1450.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1450
    
----
commit 73b2783fef785941fc966ad32f2fd987b12447ae
Author: Reynold Xin <rx...@apache.org>
Date:   2014-07-16T23:34:34Z

    [SPARK-2534] Avoid pulling in the entire RDD in groupByKey.

----



[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/1450#issuecomment-49350307
  
    Merged in master.



[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/1450



[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

Posted by aarondav <gi...@git.apache.org>.
Github user aarondav commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1450#discussion_r15040306
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -214,7 +214,7 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
           throw new SparkException("reduceByKeyLocally() does not support array keys")
         }
     
    -    def reducePartition(iter: Iterator[(K, V)]): Iterator[JHashMap[K, V]] = {
    +    val reducePartition = (iter: Iterator[(K, V)]) => {
    --- End diff --
    
    I have to push back on the loss of the return type here, since I don't think it's obvious. I know it's kind of a pain to add the whole type specification, though... what would you think about putting a `: Iterator[JHashMap[K, V]]` after the final bracket?
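
The suggestion above keeps the function literal but spells out its result type explicitly. The standalone Scala sketch below illustrates the idea with hypothetical names (it is not the Spark source; reduceLocally, func, K, and V are stand-ins for reduceByKeyLocally's real pieces): the type ascription after the closing brace documents that the lambda yields an Iterator[JHashMap[K, V]] without turning it back into a def.

    import java.util.{HashMap => JHashMap}

    object TypeAscriptionSketch {
      // Simplified stand-in for reduceByKeyLocally's inner lambda; not Spark code.
      def reduceLocally[K, V](iter: Iterator[(K, V)], func: (V, V) => V): JHashMap[K, V] = {
        // The function stays a val (so it does not drag in the enclosing
        // instance), while the ascription after the closing brace keeps the
        // return type visible to readers.
        val reducePartition = (it: Iterator[(K, V)]) => {
          val map = new JHashMap[K, V]
          it.foreach { case (k, v) =>
            val old = map.get(k)
            map.put(k, if (old == null) v else func(old, v))
          }
          Iterator(map)
        }: Iterator[JHashMap[K, V]]

        reducePartition(iter).next()
      }

      def main(args: Array[String]): Unit = {
        // Expected output: {a=4, b=2} (HashMap ordering may vary).
        println(reduceLocally(Iterator("a" -> 1, "b" -> 2, "a" -> 3), (x: Int, y: Int) => x + y))
      }
    }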



[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/1450#issuecomment-49251632
  
    Pushed a new version.



[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/1450#issuecomment-49261722
  
    Eh ---- the binary checker is really failing me. Is there a way to disable the binary checker for inner functions? @pwendell



[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1450#issuecomment-49251845
  
    QA tests have started for PR 1450. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16762/consoleFull



[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1450#discussion_r15038311
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -361,11 +361,11 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
         // groupByKey shouldn't use map side combine because map side combine does not
         // reduce the amount of data shuffled and requires all map side data be inserted
         // into a hash table, leading to more objects in the old gen.
    -    def createCombiner(v: V) = ArrayBuffer(v)
    --- End diff --
    
    We should change all of them actually. I will update the PR.
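
For readers following along, the shape of the change described here is roughly the one below. This is an illustrative sketch rather than the actual diff: the combiner helpers used by groupByKey become vals holding function literals instead of inner defs, so the objects eventually shipped inside task closures are plain function values rather than eta-expanded methods bound to the enclosing PairRDDFunctions (and, through it, the whole RDD).

    import scala.collection.mutable.ArrayBuffer

    object GroupByKeyCombinersSketch {
      // Illustrative only; combineByKey itself is omitted. V stands in for the
      // RDD's value type.
      def combiners[V](): (V => ArrayBuffer[V],
                           (ArrayBuffer[V], V) => ArrayBuffer[V],
                           (ArrayBuffer[V], ArrayBuffer[V]) => ArrayBuffer[V]) = {
        // Before (problematic): def createCombiner(v: V) = ArrayBuffer(v)
        // After: plain function literals that capture nothing from the outside.
        val createCombiner = (v: V) => ArrayBuffer(v)
        val mergeValue = (buf: ArrayBuffer[V], v: V) => buf += v
        val mergeCombiners = (c1: ArrayBuffer[V], c2: ArrayBuffer[V]) => c1 ++ c2
        (createCombiner, mergeValue, mergeCombiners)
      }

      def main(args: Array[String]): Unit = {
        val (create, mergeV, mergeC) = combiners[Int]()
        val left = mergeV(create(1), 2)   // ArrayBuffer(1, 2)
        val right = create(3)             // ArrayBuffer(3)
        println(mergeC(left, right))      // ArrayBuffer(1, 2, 3)
      }
    }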



[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1450#issuecomment-49261240
  
    QA results for PR 1450:
    - This patch FAILED unit tests.
    - This patch merges cleanly
    - This patch adds no public classes

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16765/consoleFull



[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/1450#issuecomment-49256304
  
    Jenkins, retest this please.
    
    Flume streaming suite failed. I don't think it is relevant. 



[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/1450#issuecomment-49269144
  
    I created a JIRA to deal with this and did some initial exploration, but I think I'll need to wait for Prashant to actually do it:
    
    https://issues.apache.org/jira/browse/SPARK-2549
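
Until SPARK-2549 lands, the usual manual escape hatch in MiMa is an explicit problem filter. The sketch below only illustrates that mechanism, assuming the standard MiMa ProblemFilters API; the fully qualified method name is a made-up example of the synthetic name the compiler gives a lifted inner def, not an entry from Spark's real exclude list.

    import com.typesafe.tools.mima.core._

    object InnerFunctionExcludesSketch {
      // Hypothetical filter: the method name below is illustrative. Inner defs
      // are lifted to synthetic methods (e.g. createCombiner$1), which MiMa then
      // reports as missing once they are removed or rewritten as vals.
      val excludes: Seq[ProblemFilter] = Seq(
        ProblemFilters.exclude[MissingMethodProblem](
          "org.apache.spark.rdd.PairRDDFunctions.createCombiner$1")
      )
    }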



[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

Posted by aarondav <gi...@git.apache.org>.
Github user aarondav commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1450#discussion_r15038170
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -361,11 +361,11 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
         // groupByKey shouldn't use map side combine because map side combine does not
         // reduce the amount of data shuffled and requires all map side data be inserted
         // into a hash table, leading to more objects in the old gen.
    -    def createCombiner(v: V) = ArrayBuffer(v)
    --- End diff --
    
    There appear to be ~6 other functions of this type (defs that may be passed into closures). Could these also be problematic?
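
The concern is that an inner def, when passed where a function value is expected, is eta-expanded into a closure holding a reference to the object that owns the method, so serializing that closure also serializes the enclosing object. The standalone sketch below (hypothetical names, not Spark code) makes that visible by comparing serialized sizes; exact numbers depend on the Scala version, but the def-based closure should carry the roughly 1 MB payload while the val-based function should not.

    import java.io.{ByteArrayOutputStream, ObjectOutputStream}
    import scala.collection.mutable.ArrayBuffer

    object ClosureCaptureSketch {
      // Java-serialize an object and report how many bytes it took.
      def serializedSize(obj: AnyRef): Int = {
        val bytes = new ByteArrayOutputStream()
        val out = new ObjectOutputStream(bytes)
        out.writeObject(obj)
        out.close()
        bytes.size()
      }

      // Stand-in for PairRDDFunctions; `payload` plays the role of the wrapped RDD.
      class Enclosing(val payload: Array[Byte]) extends Serializable {
        // Eta-expanding an inner def yields a function that calls back into
        // this instance, so the instance (and its payload) rides along.
        def viaDef: Int => ArrayBuffer[Int] = {
          def createCombiner(v: Int) = ArrayBuffer(v)
          createCombiner _
        }
        // A function literal that touches nothing on the instance serializes alone.
        def viaVal: Int => ArrayBuffer[Int] = {
          val createCombiner = (v: Int) => ArrayBuffer(v)
          createCombiner
        }
      }

      def main(args: Array[String]): Unit = {
        val enclosing = new Enclosing(new Array[Byte](1 << 20)) // ~1 MB payload
        println(s"def-based closure:  ${serializedSize(enclosing.viaDef)} bytes")
        println(s"val-based function: ${serializedSize(enclosing.viaVal)} bytes")
      }
    }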



[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1450#issuecomment-49266828
  
    QA tests have started for PR 1450. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16770/consoleFull



[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1450#discussion_r15040336
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -214,7 +214,7 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
           throw new SparkException("reduceByKeyLocally() does not support array keys")
         }
     
    -    def reducePartition(iter: Iterator[(K, V)]): Iterator[JHashMap[K, V]] = {
    +    val reducePartition = (iter: Iterator[(K, V)]) => {
    --- End diff --
    
    That makes sense.



[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1450#issuecomment-49256041
  
    QA results for PR 1450:
    - This patch FAILED unit tests.
    - This patch merges cleanly
    - This patch adds no public classes

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16762/consoleFull



[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1450#issuecomment-49276870
  
    QA results for PR 1450:
    - This patch PASSES unit tests.
    - This patch merges cleanly
    - This patch adds no public classes

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16770/consoleFull



[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/1450#issuecomment-49242499
  
    Jenkins, why are you so slow ....



[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1450#discussion_r15043414
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -214,7 +214,7 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
           throw new SparkException("reduceByKeyLocally() does not support array keys")
         }
     
    -    def reducePartition(iter: Iterator[(K, V)]): Iterator[JHashMap[K, V]] = {
    +    val reducePartition = (iter: Iterator[(K, V)]) => {
    --- End diff --
    
    This is fixed.



[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1450#issuecomment-49256830
  
    QA tests have started for PR 1450. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16765/consoleFull



[GitHub] spark pull request: [SPARK-2534] Avoid pulling in the entire RDD i...

Posted by aarondav <gi...@git.apache.org>.
Github user aarondav commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1450#discussion_r15040327
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -214,7 +214,7 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
           throw new SparkException("reduceByKeyLocally() does not support array keys")
         }
     
    -    def reducePartition(iter: Iterator[(K, V)]): Iterator[JHashMap[K, V]] = {
    +    val reducePartition = (iter: Iterator[(K, V)]) => {
    --- End diff --
    
    And when I said non-obvious, I meant just from looking at the function name and input arguments. Here it is actually straightforward to infer from the remaining lines, but in other situations it is less so.

