Posted to reviews@spark.apache.org by douglaz <gi...@git.apache.org> on 2014/05/18 05:08:31 UTC

[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

GitHub user douglaz opened a pull request:

    https://github.com/apache/spark/pull/813

    SPARK-1868: Users should be allowed to cogroup at least 4 RDDs

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/douglaz/spark more_cogroups

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/813.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #813
    
----
commit d46e98ed36ff921bfd98570144b588f2dea4a73b
Author: Allan Douglas R. de Oliveira <al...@gmail.com>
Date:   2014-05-17T23:27:57Z

    Allow the cogroup of 4 RDDs

commit a7e6e5a2808b69eb565ce88af4f2d14ef4e43d9c
Author: Allan Douglas R. de Oliveira <al...@gmail.com>
Date:   2014-05-17T23:46:59Z

    Fixed scala style issues

commit 5db9caa858367a73a733357ed611124ab6e4afed
Author: Allan Douglas R. de Oliveira <al...@gmail.com>
Date:   2014-05-17T23:54:20Z

    Fixed spacing

commit 7680860195c8beed7e5bd834f68d68db91f51408
Author: Allan Douglas R. de Oliveira <al...@gmail.com>
Date:   2014-05-18T00:36:15Z

    Added java cogroup 4

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by douglaz <gi...@git.apache.org>.
Github user douglaz commented on the pull request:

    https://github.com/apache/spark/pull/813#issuecomment-45177716
  
    Jenkins, test this please.



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/813#issuecomment-45183557
  
    Merged build started. 



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by douglaz <gi...@git.apache.org>.
Github user douglaz commented on the pull request:

    https://github.com/apache/spark/pull/813#issuecomment-43579830
  
    Yes, the user can instantiate the RDD and yes this is inconvenient. An interface to do this would be no less inconvenient if it has the same drawbacks (that you need to explicitly convert back the resulting sequences to the original type). 
    
    Limiting the user to 3 cogroups is pretty much like limiting tuples to 3 elements. You may have technical reasons for that limit, but it isn't reasonable for practical purposes. You can't just say: if you need a tuple with more than 3 elements, use lists instead.
    
    For tuples the current limit is 22, which is "enough for everyone". For cogroups the limit should be lower, but certainly above 3.



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/813#issuecomment-46708057
  
    Looks good - thanks for this. I'm going to merge it.



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/813#issuecomment-46403384
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/813#discussion_r13901329
  
    --- Diff: python/pyspark/rdd.py ---
    @@ -1324,11 +1324,11 @@ def mapValues(self, f):
             return self.map(map_values_fn, preservesPartitioning=True)
     
         # TODO: support varargs cogroup of several RDDs.
    --- End diff --
    
    This TODO gets removed now, doesn't it?



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/813#issuecomment-46400741
  
    Jenkins, test this please.



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/813#issuecomment-46400995
  
    Merged build started. 



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/813#issuecomment-44369458
  
    Hey @douglaz thanks for giving the explanation. This makes a lot of sense... the issue is about compile time type checking because the varargs drops the value type (didn't realize). This will need to exist somewhere, I think it could be something to merge into Spark core or maybe could exist in user libraries. Let me ask around the committers a bit and try to get a consensus.



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/813#issuecomment-46144575
  
    Build started. 



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/813#issuecomment-46343152
  
    @douglaz if you up-merge this with master I think the tests should pass fine (currently it's not merging cleanly). I'd like to get this merged soon if possible, so let me know! Thanks



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/813#issuecomment-45681947
  
     Merged build triggered. 



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/813#issuecomment-46660388
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15953/



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/813#issuecomment-46654461
  
     Merged build triggered. 



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on the pull request:

    https://github.com/apache/spark/pull/813#issuecomment-43663175
  
    To throw another wrench into the Union analogy, there is also the little-used SparkContext#union, which has signatures for both Seq[RDD[T]] and varargs RDD[T].



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/813#issuecomment-46148949
  
    Build finished. 



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/813#issuecomment-45686848
  
    Merged build finished. 



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by douglaz <gi...@git.apache.org>.
Github user douglaz commented on the pull request:

    https://github.com/apache/spark/pull/813#issuecomment-46386598
  
    @pwendell, merged with latest master.



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/813#issuecomment-45686850
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15655/



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/813#issuecomment-46144451
  
    Jenkins, retest this please.



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/813#issuecomment-43465440
  
    Thanks for submitting this. Instead of allowing 4 (and maybe 5), users can certainly use the cogroup RDD's constructor to construct cogroups of arbitrary RDDs. If that is inconvenient, perhaps we should think about a cogroup interface that either takes varargs, or just a sequence/list of RDDs?



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by douglaz <gi...@git.apache.org>.
Github user douglaz commented on the pull request:

    https://github.com/apache/spark/pull/813#issuecomment-43659430
  
    It isn't just about lines of code; it is about polluting the code with `asInstanceOf` casts, and about the runtime errors that these casts and wrong pattern matches on sequences can cause.
    
    Compare this almost-real-code using `cogroup`:
    
    ```scala
    val userHistories = parsedViews.cogroup(parsedBuyOrders, parsedShoppingCarts, parsedSentMails, partitioner=context.partitioner)
      .map(values => {
        val (key, events) = values
        val (groupedViews, groupedBuyOrders, groupedShoppingCarts, groupedSentMails) = events
    
        val sentMailsProducts = groupedSentMails.flatMap(_.products)
    
        val validViews = groupedViews.filter(v => !sentMailsProducts.contains(v.productId))
    
        key -> UserHistory(validViews, groupedBuyOrders, groupedShoppingCarts, groupedSentMails)
      })
    ```
    
    With this using `CoGroupedRDD`:
    
    ```scala
    // Perhaps there is some mistake here, a RDD may be missing
    val userHistories = new CoGroupedRDD(Seq(parsedViews, parsedBuyOrders, parsedShoppingCarts, parsedSentMails), part=context.partitioner)
      .map(values => {
        val (key, events) = values
      
        // Or the match is wrong here
        val Seq(_groupedViews, _groupedBuyOrders, _groupedShoppingCarts, _groupedSentMails) = events
      
        // Or here we are casting with the wrong type. We'll find out at runtime
        val groupedViews = _groupedViews.asInstanceOf[Seq[UHView]]
        val groupedBuyOrders = _groupedBuyOrders.asInstanceOf[Seq[UHBuyOrder]]
        val groupedShoppingCarts = _groupedShoppingCarts.asInstanceOf[Seq[UHShoppingCartLog]]
        val groupedSentMails = _groupedSentMails.asInstanceOf[Seq[UHSentMail]]
      
        val sentMailsProducts = groupedSentMails.flatMap(_.products)
      
        val validViews = groupedViews.filter(v => !sentMailsProducts.contains(v.productId))
      
        key -> UserHistory(validViews, groupedBuyOrders, groupedShoppingCarts, groupedSentMails)
      })
    ```
    
    The second example is clearly more verbose and error-prone.
    
    Comparing `cogroup` with union misses the point:
    - `cogroup` may be called using different types and it keeps them thanks to the tuple signature. With union there is just one type. 
    - `rdd1.union(rdd2).union(rdd3)` works very well and is transparent to the user of the resulting RDD, while `rdd1.cogroup(rdd2).cogroup(rdd3)` will be very different from `rdd1.cogroup(rdd2, rdd3)`. The composition works fine for `union` but for `cogroup` we start to get `Seq[Seq[` and of course we may have performance implications.
    
    A fairer comparison would be with `join`, because it also keeps different types and the composition creates tuples of tuples. But in this case I find it very easy and safe to unpack such tuples. It isn't ideal, but it is better than `cogroup` in the same situation. Of course, I wouldn't oppose creating an interface for joins with more elements.
    
    But I agree that we should really discuss this. If such operations won't get in main Spark, then external libraries (using implicits) will be created to handle such cases. I think it would be better if Spark could handle such cases without letting the user deal with boilerplate or resorting to external libraries.
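
Editorial illustration of the composition point in the comment above (not code from the pull request): a minimal sketch in pure Python, assuming plain lists of (key, value) pairs in place of RDDs. The helper `cogroup` and the sample data are hypothetical and are not Spark's API. It shows how chaining two binary cogroups may nest the earlier result inside the later one, while a single n-ary call keeps one flat group per input.

```python
from collections import defaultdict


def cogroup(*datasets):
    """Group several lists of (key, value) pairs by key.

    Hypothetical stand-in for an n-ary cogroup on plain lists.
    Returns {key: (values_from_input_0, values_from_input_1, ...)}.
    """
    # One fresh tuple of empty lists per key, one list per input dataset.
    grouped = defaultdict(lambda: tuple([] for _ in datasets))
    for index, dataset in enumerate(datasets):
        for key, value in dataset:
            grouped[key][index].append(value)
    return dict(grouped)


views = [("u1", "v1"), ("u1", "v2"), ("u2", "v3")]
orders = [("u1", "o1")]
carts = [("u2", "c1")]

# A single n-ary call: one flat list of values per input.
flat = cogroup(views, orders, carts)

# Chained binary calls: the first result becomes the *value* of the second,
# so each key ends up with groups nested inside groups.
nested = cogroup(list(cogroup(views, orders).items()), carts)
```

In this sketch `flat["u1"]` is a 3-tuple of lists, while `nested["u1"]` is a 2-tuple whose first element wraps the whole earlier result, so the consumer has to unwrap an extra layer for every chained call.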



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/813#issuecomment-45186708
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15473/



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by douglaz <gi...@git.apache.org>.
Github user douglaz commented on the pull request:

    https://github.com/apache/spark/pull/813#issuecomment-46126730
  
    The tests should pass now.



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/813#issuecomment-45183519
  
    Jenkins, test this please.



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/813#discussion_r13901345
  
    --- Diff: python/pyspark/join.py ---
    @@ -79,15 +79,15 @@ def dispatch(seq):
         return _do_python_join(rdd, other, numPartitions, dispatch)
     
     
    -def python_cogroup(rdd, other, numPartitions):
    -    vs = rdd.map(lambda (k, v): (k, (1, v)))
    -    ws = other.map(lambda (k, v): (k, (2, v)))
    +def python_cogroup(rdds, numPartitions):
    --- End diff --
    
    ah I see - I guess this is an internal API (?) (sorry not super familiar with this part of the code).
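
Editorial illustration (not the code merged in this pull request): the removed lines in the diff above tag each value with the number of the RDD it came from. A minimal sketch of how that tagging idea might generalize from two inputs to a list of inputs, assuming plain Python lists of (key, value) pairs in place of RDDs. The name `tag_and_group` is hypothetical.

```python
def tag_and_group(datasets):
    """Cogroup a list of (key, value) lists by tagging values with their source index."""
    # Step 1: tag every value with the position of the dataset it came from,
    # similar in spirit to mapping (k, v) to (k, (i, v)) for the i-th input.
    tagged = [
        (key, (index, value))
        for index, dataset in enumerate(datasets)
        for key, value in dataset
    ]
    # Step 2: group by key and route each value into the bucket of its source.
    result = {}
    for key, (index, value) in tagged:
        buckets = result.setdefault(key, tuple([] for _ in datasets))
        buckets[index].append(value)
    return result


# Three inputs; the input must be a list (it is iterated more than once).
grouped = tag_and_group([
    [("a", 1), ("b", 4)],
    [("a", 2)],
    [("a", 3)],
])
```

Because the tag is just the position in the input list, the same two steps work for any number of inputs, which is what taking `rdds` as a single list argument would allow.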



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/813#issuecomment-46403385
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15866/



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by douglaz <gi...@git.apache.org>.
Github user douglaz commented on the pull request:

    https://github.com/apache/spark/pull/813#issuecomment-46615287
  
    @pwendell, @mateiz, check if everything is fine now.



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/813#issuecomment-45186706
  
    Merged build finished. 



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/813#discussion_r13901312
  
    --- Diff: python/pyspark/join.py ---
    @@ -79,15 +79,15 @@ def dispatch(seq):
         return _do_python_join(rdd, other, numPartitions, dispatch)
     
     
    -def python_cogroup(rdd, other, numPartitions):
    -    vs = rdd.map(lambda (k, v): (k, (1, v)))
    -    ws = other.map(lambda (k, v): (k, (2, v)))
    +def python_cogroup(rdds, numPartitions):
    --- End diff --
    
    Will this break compatibility for users who were building against the previous API?



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/813#issuecomment-46400984
  
     Merged build triggered. 



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/813#issuecomment-46148950
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15811/



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/813#issuecomment-44480565
  
    I'd be okay adding this, but it can be a bit of a slippery slope because people may then want it for joins, etc as well. But maybe we can just limit it to cogroup right now.
    
    Regarding the pull request though, we should add this API to Python as well. Can you look into what that will take?



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/813#issuecomment-46660385
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/813#issuecomment-45183553
  
     Merged build triggered. 



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/813#issuecomment-45681520
  
    Jenkins, retest this please.



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by douglaz <gi...@git.apache.org>.
Github user douglaz commented on the pull request:

    https://github.com/apache/spark/pull/813#issuecomment-45178088
  
    I'm having no luck running the python tests on my machine. I'll try again later.



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by douglaz <gi...@git.apache.org>.
Github user douglaz commented on the pull request:

    https://github.com/apache/spark/pull/813#issuecomment-44554444
  
    I'll take a look at the python interface soon.



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/813#issuecomment-46654310
  
    Jenkins, test this please.



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/813#issuecomment-45681955
  
    Merged build started. 



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/813



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/813#issuecomment-46654474
  
    Merged build started. 



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/813#discussion_r13901535
  
    --- Diff: python/pyspark/join.py ---
    @@ -79,15 +79,15 @@ def dispatch(seq):
         return _do_python_join(rdd, other, numPartitions, dispatch)
     
     
    -def python_cogroup(rdd, other, numPartitions):
    -    vs = rdd.map(lambda (k, v): (k, (1, v)))
    -    ws = other.map(lambda (k, v): (k, (2, v)))
    +def python_cogroup(rdds, numPartitions):
    --- End diff --
    
    Okay, I looked yet again: this entire file is not exposed in, e.g., the docs, so I guess this isn't public.
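
The diff above changes `python_cogroup` from taking two RDDs to taking a list of them. The tagging idea behind it (pair every value with the index of the RDD it came from, then split the values back out per key) can be sketched in plain Python without Spark. This is only an illustration under stated assumptions: `cogroup_local` and the sample data are hypothetical names that do not appear in the pull request, the inputs are ordinary lists of `(key, value)` pairs rather than RDDs, and Python 3 syntax is used instead of the Python 2 tuple-unpacking lambdas shown in the diff.

```python
from collections import defaultdict


def cogroup_local(datasets):
    """Cogroup any number of (key, value) sequences in plain Python.

    Hypothetical helper for illustration only; it is not part of PySpark.
    Returns a dict mapping each key to a tuple of lists, one list per input.
    """
    n = len(datasets)

    # Tag every value with the index of the dataset it came from,
    # mirroring the (k, (1, v)) / (k, (2, v)) tagging in the old code.
    tagged = [(k, (i, v)) for i, ds in enumerate(datasets) for (k, v) in ds]

    # For each key, keep one bucket per input dataset.
    grouped = defaultdict(lambda: tuple([] for _ in range(n)))
    for k, (i, v) in tagged:
        grouped[k][i].append(v)
    return dict(grouped)


# Four small inputs, matching the "at least 4 RDDs" case in the PR title.
a = [("x", 1), ("y", 2)]
b = [("x", 3)]
c = [("y", 4), ("x", 5)]
d = []

result = cogroup_local([a, b, c, d])
```

Because the number of buckets comes from `len(datasets)`, the same function handles two, three, or four inputs without a separate definition for each arity.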



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/813#issuecomment-46401560
  
    Hey @douglaz, thanks for updating this. One thing missing here is tests in each of the languages -- please add them so that this code will be tested later.



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/813#issuecomment-43430263
  
    Can one of the admins verify this patch?



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/813#issuecomment-46144569
  
    Build triggered.



[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/813#issuecomment-43581215
  
    Isn't it possible to just do `new CoGroupedRDD(Seq(rdd1, rdd2, rdd3... rddn))`? That seems like the same number of lines of code as `rdd1.cogroup(rdd2, rdd3...rddn)`.
    
    We have many functions like this, including `union` - I'm not sure we want to create many definitions of each of these functions.

