You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by hlin09 <gi...@git.apache.org> on 2015/04/18 05:35:40 UTC

[GitHub] spark pull request: [SPARK-6991] [SparkR] Adds support for zipPart...

GitHub user hlin09 opened a pull request:

    https://github.com/apache/spark/pull/5568

    [SPARK-6991] [SparkR] Adds support for zipPartitions.

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/hlin09/spark zipPartitions

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5568.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5568
    
----
commit 27655d31d9ecfa848dcc40baa7a328acc30f7b22
Author: hlin09 <hl...@gmail.com>
Date:   2015-04-18T03:22:28Z

    Adds support for zipPartitions.

commit ec56d2ff5b2fd0b274b4f632a70b0e8cc5be4a71
Author: hlin09 <hl...@gmail.com>
Date:   2015-04-18T03:34:32Z

    Fix test.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6991] [SparkR] Adds support for zipPart...

Posted by concretevitamin <gi...@git.apache.org>.
Github user concretevitamin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5568#discussion_r29169055
  
    --- Diff: R/pkg/R/RDD.R ---
    @@ -1529,3 +1529,50 @@ setMethod("zipRDD",
                 
                 PipelinedRDD(zippedRDD, partitionFunc)
               })
    +
    +#' Zip an RDD's partitions with one (or more) RDD(s).
    +#'
    +#' Return a new RDD by applying a function to the zipped partitions. 
    +#' Assumes that all the RDDs have the same number of partitions*, but does *not* 
    +#' require them to have the same number of elements in each partition.
    +#' 
    +#' @param x An RDD to be zipped.
    +#' @param other Another RDD to be zipped.
    +#' @return An RDD zipped from the two RDDs.
    +#' @examples
    +#'\dontrun{
    +#' sc <- sparkR.init()
    +#' rdd1 <- parallelize(sc, 1:2, 2L)  # 1, 2
    +#' rdd2 <- parallelize(sc, 1:4, 2L)  # 1:2, 3:4
    +#' rdd3 <- parallelize(sc, 1:6, 2L)  # 1:3, 4:6
    +#' collect(zipPartitions(rdd1, rdd2, rdd3, 
    +#'                       func = function(x, y, z) { list(list(x, y, z))} ))
    +#' # list(list(1, c(1,2), c(1,2,3)), list(2, c(3,4), c(4,5,6)))
    +#'}
    +#' @rdname zipRDD
    +#' @aliases zipPartitions,RDD
    +setMethod("zipPartitions",
    +          "RDD",
    +          function(..., func) {
    +            rrdds <- list(...)
    +            if (length(rrdds) == 1) {
    +              return(rrdds[[1]])
    +            }
    +            nPart <- sapply(rrdds, numPartitions)
    +            if (length(unique(nPart)) != 1) {
    +              stop("Can only zipPartitions RDDs which have the same number of partitions.")
    +            }
    +            
    +            rrdds <- lapply(rrdds, function(rdd) {
    +              mapPartitionsWithIndex(rdd, function(split, part) {
    --- End diff --
    
    We've recently changed all usage of `split` to `partIndex` or something like `idx` is fine too.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6991] [SparkR] Adds support for zipPart...

Posted by hlin09 <gi...@git.apache.org>.
Github user hlin09 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5568#discussion_r29173379
  
    --- Diff: R/pkg/R/RDD.R ---
    @@ -1529,3 +1529,50 @@ setMethod("zipRDD",
                 
                 PipelinedRDD(zippedRDD, partitionFunc)
               })
    +
    +#' Zip an RDD's partitions with one (or more) RDD(s).
    +#'
    +#' Return a new RDD by applying a function to the zipped partitions. 
    +#' Assumes that all the RDDs have the same number of partitions*, but does *not* 
    +#' require them to have the same number of elements in each partition.
    +#' 
    +#' @param x An RDD to be zipped.
    +#' @param other Another RDD to be zipped.
    +#' @return An RDD zipped from the two RDDs.
    +#' @examples
    +#'\dontrun{
    +#' sc <- sparkR.init()
    +#' rdd1 <- parallelize(sc, 1:2, 2L)  # 1, 2
    +#' rdd2 <- parallelize(sc, 1:4, 2L)  # 1:2, 3:4
    +#' rdd3 <- parallelize(sc, 1:6, 2L)  # 1:3, 4:6
    +#' collect(zipPartitions(rdd1, rdd2, rdd3, 
    +#'                       func = function(x, y, z) { list(list(x, y, z))} ))
    +#' # list(list(1, c(1,2), c(1,2,3)), list(2, c(3,4), c(4,5,6)))
    +#'}
    +#' @rdname zipRDD
    +#' @aliases zipPartitions,RDD
    +setMethod("zipPartitions",
    +          "RDD",
    +          function(..., func) {
    +            rrdds <- list(...)
    +            if (length(rrdds) == 1) {
    +              return(rrdds[[1]])
    +            }
    +            nPart <- sapply(rrdds, numPartitions)
    +            if (length(unique(nPart)) != 1) {
    +              stop("Can only zipPartitions RDDs which have the same number of partitions.")
    +            }
    +            
    +            rrdds <- lapply(rrdds, function(rdd) {
    +              mapPartitionsWithIndex(rdd, function(split, part) {
    --- End diff --
    
    Done.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6991] [SparkR] Adds support for zipPart...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5568#issuecomment-94130021
  
      [Test build #30512 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30512/consoleFull) for   PR 5568 at commit [`ec56d2f`](https://github.com/apache/spark/commit/ec56d2ff5b2fd0b274b4f632a70b0e8cc5be4a71).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6991] [SparkR] Adds support for zipPart...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/5568


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6991] [SparkR] Adds support for zipPart...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5568#issuecomment-94122399
  
      [Test build #30512 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30512/consoleFull) for   PR 5568 at commit [`ec56d2f`](https://github.com/apache/spark/commit/ec56d2ff5b2fd0b274b4f632a70b0e8cc5be4a71).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6991] [SparkR] Adds support for zipPart...

Posted by concretevitamin <gi...@git.apache.org>.
Github user concretevitamin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5568#discussion_r29169169
  
    --- Diff: R/pkg/R/RDD.R ---
    @@ -1529,3 +1529,50 @@ setMethod("zipRDD",
                 
                 PipelinedRDD(zippedRDD, partitionFunc)
               })
    +
    +#' Zip an RDD's partitions with one (or more) RDD(s).
    +#'
    +#' Return a new RDD by applying a function to the zipped partitions. 
    +#' Assumes that all the RDDs have the same number of partitions*, but does *not* 
    +#' require them to have the same number of elements in each partition.
    +#' 
    +#' @param x An RDD to be zipped.
    +#' @param other Another RDD to be zipped.
    --- End diff --
    
    The doc here needs to be updated: (1) var-arg arity (2) document `func`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6991] [SparkR] Adds support for zipPart...

Posted by concretevitamin <gi...@git.apache.org>.
Github user concretevitamin commented on the pull request:

    https://github.com/apache/spark/pull/5568#issuecomment-94624336
  
    Thanks @hlin09 - one high-level question: do you have an intended use case in mind for arbitrary / large arity functions? In Spark's Scala API up to 4 RDDs are supported, which has perf. benefits (avoid an union + shuffle).  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6991] [SparkR] Adds support for zipPart...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on the pull request:

    https://github.com/apache/spark/pull/5568#issuecomment-96830292
  
    Hmm this is weird - somehow the test results for this PR were never posted to github. https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31005/consoleFull reports a json parsing error
    
    cc @shaneknapp 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6991] [SparkR] Adds support for zipPart...

Posted by hlin09 <gi...@git.apache.org>.
Github user hlin09 commented on the pull request:

    https://github.com/apache/spark/pull/5568#issuecomment-94856698
  
    I do have some user cases that would best to have more that 4 arity zip function. Another concern is that the current communication pattern between Spark and R does not allows tuples of more than RDDs. From perf point of view, "union + shuffle" might be better than recursively zip 2 of them. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6991] [SparkR] Adds support for zipPart...

Posted by hlin09 <gi...@git.apache.org>.
Github user hlin09 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5568#discussion_r29173393
  
    --- Diff: R/pkg/R/RDD.R ---
    @@ -1529,3 +1529,50 @@ setMethod("zipRDD",
                 
                 PipelinedRDD(zippedRDD, partitionFunc)
               })
    +
    +#' Zip an RDD's partitions with one (or more) RDD(s).
    +#'
    +#' Return a new RDD by applying a function to the zipped partitions. 
    +#' Assumes that all the RDDs have the same number of partitions*, but does *not* 
    +#' require them to have the same number of elements in each partition.
    +#' 
    +#' @param x An RDD to be zipped.
    +#' @param other Another RDD to be zipped.
    --- End diff --
    
    Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6991] [SparkR] Adds support for zipPart...

Posted by concretevitamin <gi...@git.apache.org>.
Github user concretevitamin commented on the pull request:

    https://github.com/apache/spark/pull/5568#issuecomment-96814334
  
    LGTM. /cc @shivaram 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6991] [SparkR] Adds support for zipPart...

Posted by concretevitamin <gi...@git.apache.org>.
Github user concretevitamin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5568#discussion_r29168961
  
    --- Diff: R/pkg/R/RDD.R ---
    @@ -1529,3 +1529,50 @@ setMethod("zipRDD",
                 
                 PipelinedRDD(zippedRDD, partitionFunc)
               })
    +
    +#' Zip an RDD's partitions with one (or more) RDD(s).
    +#'
    +#' Return a new RDD by applying a function to the zipped partitions. 
    +#' Assumes that all the RDDs have the same number of partitions*, but does *not* 
    --- End diff --
    
    nit: unmatched *


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6991] [SparkR] Adds support for zipPart...

Posted by hlin09 <gi...@git.apache.org>.
Github user hlin09 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5568#discussion_r29173368
  
    --- Diff: R/pkg/R/RDD.R ---
    @@ -1529,3 +1529,50 @@ setMethod("zipRDD",
                 
                 PipelinedRDD(zippedRDD, partitionFunc)
               })
    +
    +#' Zip an RDD's partitions with one (or more) RDD(s).
    +#'
    +#' Return a new RDD by applying a function to the zipped partitions. 
    +#' Assumes that all the RDDs have the same number of partitions*, but does *not* 
    --- End diff --
    
    Thanks. Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6991] [SparkR] Adds support for zipPart...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5568#issuecomment-94130024
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30512/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6991] [SparkR] Adds support for zipPart...

Posted by concretevitamin <gi...@git.apache.org>.
Github user concretevitamin commented on the pull request:

    https://github.com/apache/spark/pull/5568#issuecomment-96751034
  
    @hlin09 - cool, I don't have a strong opinion towards either approach for now. Did a pass and left some minor style comments. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6991] [SparkR] Adds support for zipPart...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5568#issuecomment-96767565
  
      [Test build #31005 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31005/consoleFull) for   PR 5568 at commit [`12c08a5`](https://github.com/apache/spark/commit/12c08a5401794e3cd0a1d9e801885c034799c93d).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6991] [SparkR] Adds support for zipPart...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on the pull request:

    https://github.com/apache/spark/pull/5568#issuecomment-96834850
  
    I also re-ran the R tests locally to just verify -- Merging this. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-6991] [SparkR] Adds support for zipPart...

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on the pull request:

    https://github.com/apache/spark/pull/5568#issuecomment-96816065
  
    Thanks @concretevitamin for the review. Will merge this after Jenkins passes


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org