Posted to reviews@spark.apache.org by douglaz <gi...@git.apache.org> on 2014/05/18 05:08:31 UTC
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
GitHub user douglaz opened a pull request:
https://github.com/apache/spark/pull/813
SPARK-1868: Users should be allowed to cogroup at least 4 RDDs
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/douglaz/spark more_cogroups
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/813.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #813
----
commit d46e98ed36ff921bfd98570144b588f2dea4a73b
Author: Allan Douglas R. de Oliveira <al...@gmail.com>
Date: 2014-05-17T23:27:57Z
Allow the cogroup of 4 RDDs
commit a7e6e5a2808b69eb565ce88af4f2d14ef4e43d9c
Author: Allan Douglas R. de Oliveira <al...@gmail.com>
Date: 2014-05-17T23:46:59Z
Fixed scala style issues
commit 5db9caa858367a73a733357ed611124ab6e4afed
Author: Allan Douglas R. de Oliveira <al...@gmail.com>
Date: 2014-05-17T23:54:20Z
Fixed spacing
commit 7680860195c8beed7e5bd834f68d68db91f51408
Author: Allan Douglas R. de Oliveira <al...@gmail.com>
Date: 2014-05-18T00:36:15Z
Added java cogroup 4
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by douglaz <gi...@git.apache.org>.
Github user douglaz commented on the pull request:
https://github.com/apache/spark/pull/813#issuecomment-45177716
Jenkins, test this please.
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/813#issuecomment-45183557
Merged build started.
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by douglaz <gi...@git.apache.org>.
Github user douglaz commented on the pull request:
https://github.com/apache/spark/pull/813#issuecomment-43579830
Yes, the user can instantiate the RDD, and yes, this is inconvenient. An interface to do this would be no less inconvenient if it had the same drawback (that you need to explicitly convert the resulting sequences back to their original types).
Limiting the user to 3 cogroups is pretty much like limiting tuples to 3 elements. You may have technical reasons for that limit, but it isn't reasonable for practical purposes. You can't just say: if you need a tuple with more than 3 elements, use lists instead.
For tuples the current limit is 22, which is "enough for everyone". For cogroups the limit should be lower, but certainly above 3.
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/813#issuecomment-46708057
Looks good - thanks for this. I'm going to merge it.
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/813#issuecomment-46403384
Merged build finished. All automated tests passed.
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:
https://github.com/apache/spark/pull/813#discussion_r13901329
--- Diff: python/pyspark/rdd.py ---
@@ -1324,11 +1324,11 @@ def mapValues(self, f):
return self.map(map_values_fn, preservesPartitioning=True)
# TODO: support varargs cogroup of several RDDs.
--- End diff --
This TODO gets removed now, doesn't it?
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/813#issuecomment-46400741
Jenkins, test this please.
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/813#issuecomment-46400995
Merged build started.
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/813#issuecomment-44369458
Hey @douglaz, thanks for the explanation. This makes a lot of sense... the issue is about compile-time type checking, because the varargs version drops the value type (I didn't realize that). This will need to exist somewhere; I think it could be merged into Spark core, or maybe it could live in user libraries. Let me ask around the committers a bit and try to get a consensus.
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/813#issuecomment-46144575
Build started.
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/813#issuecomment-46343152
@douglaz if you up-merge this with master I think the tests should pass fine (currently it's not merging cleanly). I'd like to get this merged soon if possible, so let me know! Thanks
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/813#issuecomment-45681947
Merged build triggered.
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/813#issuecomment-46660388
All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15953/
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/813#issuecomment-46654461
Merged build triggered.
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on the pull request:
https://github.com/apache/spark/pull/813#issuecomment-43663175
To throw another wrench into the union analogy, there is also the little-used SparkContext#union, which has signatures for both Seq[RDD[T]] and varargs RDD[T].
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/813#issuecomment-46148949
Build finished.
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/813#issuecomment-45686848
Merged build finished.
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by douglaz <gi...@git.apache.org>.
Github user douglaz commented on the pull request:
https://github.com/apache/spark/pull/813#issuecomment-46386598
@pwendell, merged with latest master.
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/813#issuecomment-45686850
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15655/
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/813#issuecomment-46144451
Jenkins, retest this please.
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:
https://github.com/apache/spark/pull/813#issuecomment-43465440
Thanks for submitting this. Instead of allowing 4 (and maybe 5), users can certainly use the cogroup RDD's constructor to construct cogroups of arbitrary RDDs. If that is inconvenient, perhaps we should think about a cogroup interface that either takes varargs, or just a sequence/list of RDDs?
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by douglaz <gi...@git.apache.org>.
Github user douglaz commented on the pull request:
https://github.com/apache/spark/pull/813#issuecomment-43659430
It isn't just about lines of code; it is about polluting the code with `asInstanceOf`, the runtime errors that follow from it, and wrong pattern matching on sequences.
Compare this almost-real-code using `cogroup`:
```scala
val userHistories = parsedViews.cogroup(parsedBuyOrders, parsedShoppingCarts, parsedSentMails, partitioner=context.partitioner)
  .map(values => {
    val (key, events) = values
    val (groupedViews, groupedBuyOrders, groupedShoppingCarts, groupedSentMails) = events
    val sentMailsProducts = groupedSentMails.flatMap(_.products)
    val validViews = groupedViews.filter(v => !sentMailsProducts.contains(v.productId))
    key -> UserHistory(validViews, groupedBuyOrders, groupedShoppingCarts, groupedSentMails)
  })
```
With this using `CoGroupedRDD`:
```scala
// Perhaps there is some mistake here, an RDD may be missing
val userHistories = new CoGroupedRDD(Seq(parsedViews, parsedBuyOrders, parsedShoppingCarts, parsedSentMails), part=context.partitioner)
  .map(values => {
    val (key, events) = values
    // Or the match is wrong here
    val Seq(_groupedViews, _groupedBuyOrders, _groupedShoppingCarts, _groupedSentMails) = events
    // Or here we are casting with the wrong type. We'll find out at runtime
    val groupedViews = _groupedViews.asInstanceOf[Seq[UHView]]
    val groupedBuyOrders = _groupedBuyOrders.asInstanceOf[Seq[UHBuyOrder]]
    val groupedShoppingCarts = _groupedShoppingCarts.asInstanceOf[Seq[UHShoppingCartLog]]
    val groupedSentMails = _groupedSentMails.asInstanceOf[Seq[UHSentMail]]
    val sentMailsProducts = groupedSentMails.flatMap(_.products)
    val validViews = groupedViews.filter(v => !sentMailsProducts.contains(v.productId))
    key -> UserHistory(validViews, groupedBuyOrders, groupedShoppingCarts, groupedSentMails)
  })
```
The second example is clearly more verbose and error-prone.
Comparing `cogroup` with union misses the point:
- `cogroup` may be called using different types and it keeps them thanks to the tuple signature. With union there is just one type.
- `rdd1.union(rdd2).union(rdd3)` works very well and is transparent to the user of the resulting RDD, while `rdd1.cogroup(rdd2).cogroup(rdd3)` will be very different from `rdd1.cogroup(rdd2, rdd3)`. The composition works fine for `union` but for `cogroup` we start to get `Seq[Seq[` and of course we may have performance implications.
A fairer comparison would be with `join`, because it also keeps different types and its composition creates tuples of tuples. But in that case I find it very easy and safe to unpack such tuples. It isn't ideal, but it is better than `cogroup` in the same situation. Of course, I wouldn't oppose creating an interface for joins with more elements.
But I agree that we should really discuss this. If such operations won't get in main Spark, then external libraries (using implicits) will be created to handle such cases. I think it would be better if Spark could handle such cases without letting the user deal with boilerplate or resorting to external libraries.
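The difference between chained and multi-argument `cogroup` described above can be illustrated without Spark. The following is only a rough sketch using plain Python lists of `(key, value)` pairs in place of RDDs; the `cogroup` helper and the sample data are hypothetical and are not part of this pull request or of the Spark API.

```python
from collections import defaultdict


def cogroup(*datasets):
    """Group several lists of (key, value) pairs by key.

    Hypothetical stand-in for an N-way cogroup. Returns a dict mapping
    each key to a tuple holding one list of values per input dataset.
    """
    grouped = defaultdict(lambda: tuple([] for _ in datasets))
    for index, pairs in enumerate(datasets):
        for key, value in pairs:
            grouped[key][index].append(value)
    return dict(grouped)


views = [("u1", "v1"), ("u1", "v2"), ("u2", "v3")]
orders = [("u1", "o1")]
carts = [("u2", "c1")]

# One multi-argument call: each key maps to a flat tuple of value lists.
flat = cogroup(views, orders, carts)

# Chaining two-way calls: the first result becomes a *value* of the
# second, so every key now carries a nested structure to unpack.
nested = cogroup(list(cogroup(views, orders).items()), carts)
```

In this sketch `flat["u1"]` is a tuple of three lists, while `nested["u1"]` wraps the whole two-way result inside another list, which is the plain-Python analogue of the nested sequences mentioned above.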
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/813#issuecomment-45186708
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15473/
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by douglaz <gi...@git.apache.org>.
Github user douglaz commented on the pull request:
https://github.com/apache/spark/pull/813#issuecomment-46126730
The tests should pass now.
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/813#issuecomment-45183519
Jenkins, test this please.
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:
https://github.com/apache/spark/pull/813#discussion_r13901345
--- Diff: python/pyspark/join.py ---
@@ -79,15 +79,15 @@ def dispatch(seq):
return _do_python_join(rdd, other, numPartitions, dispatch)
-def python_cogroup(rdd, other, numPartitions):
- vs = rdd.map(lambda (k, v): (k, (1, v)))
- ws = other.map(lambda (k, v): (k, (2, v)))
+def python_cogroup(rdds, numPartitions):
--- End diff --
ah I see - I guess this is an internal API (?) (sorry not super familiar with this part of the code).
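For readers unfamiliar with the tagging idea visible in the removed lines of the diff (each value is paired with a number identifying its source), here is a rough sketch of how that idea might extend to a list of inputs. It uses plain Python lists rather than RDDs, and the function name `cogroup_by_tagging` and its body are assumptions made for illustration; the diff excerpt shows only the new `python_cogroup(rdds, numPartitions)` signature, not its implementation. Tags here are 0-based, whereas the old code used 1 and 2.

```python
from collections import defaultdict


def cogroup_by_tagging(datasets):
    """Cogroup a list of (key, value) lists by tagging each value
    with the index of the dataset it came from (hypothetical sketch).
    """
    # Step 1: tag every value with its source index, similar in spirit
    # to rdd.map(lambda (k, v): (k, (1, v))) in the removed lines.
    tagged = [
        (key, (index, value))
        for index, pairs in enumerate(datasets)
        for key, value in pairs
    ]

    # Step 2: collect the tagged values per key. In Spark this would be
    # some distributed grouping step; here it is a plain dict.
    by_key = defaultdict(list)
    for key, tagged_value in tagged:
        by_key[key].append(tagged_value)

    # Step 3: dispatch each tagged value into the bucket for its source.
    result = {}
    for key, tagged_values in by_key.items():
        buckets = tuple([] for _ in datasets)
        for index, value in tagged_values:
            buckets[index].append(value)
        result[key] = buckets
    return result


example = cogroup_by_tagging([
    [("a", 1), ("b", 2)],
    [("a", 3)],
    [("b", 4), ("b", 5)],
])
```

Because the number of buckets is taken from the length of the input list, the same dispatch step works for two, three, or more inputs.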
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/813#issuecomment-46403385
All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15866/
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by douglaz <gi...@git.apache.org>.
Github user douglaz commented on the pull request:
https://github.com/apache/spark/pull/813#issuecomment-46615287
@pwendell, @mateiz, check if everything is fine now.
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/813#issuecomment-45186706
Merged build finished.
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:
https://github.com/apache/spark/pull/813#discussion_r13901312
--- Diff: python/pyspark/join.py ---
@@ -79,15 +79,15 @@ def dispatch(seq):
return _do_python_join(rdd, other, numPartitions, dispatch)
-def python_cogroup(rdd, other, numPartitions):
- vs = rdd.map(lambda (k, v): (k, (1, v)))
- ws = other.map(lambda (k, v): (k, (2, v)))
+def python_cogroup(rdds, numPartitions):
--- End diff --
Will this break compatibility for users who were building against the previous API?
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/813#issuecomment-46400984
Merged build triggered.
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/813#issuecomment-46148950
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15811/
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:
https://github.com/apache/spark/pull/813#issuecomment-44480565
I'd be okay adding this, but it can be a bit of a slippery slope because people may then want it for joins, etc as well. But maybe we can just limit it to cogroup right now.
Regarding the pull request though, we should add this API to Python as well. Can you look into what that will take?
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/813#issuecomment-46660385
Merged build finished. All automated tests passed.
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/813#issuecomment-45183553
Merged build triggered.
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/813#issuecomment-45681520
Jenkins, retest this please.
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by douglaz <gi...@git.apache.org>.
Github user douglaz commented on the pull request:
https://github.com/apache/spark/pull/813#issuecomment-45178088
I'm having no luck running the python tests on my machine. I'll try again later.
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by douglaz <gi...@git.apache.org>.
Github user douglaz commented on the pull request:
https://github.com/apache/spark/pull/813#issuecomment-44554444
I'll take a look at the python interface soon.
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/813#issuecomment-46654310
Jenkins, test this please.
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/813#issuecomment-45681955
Merged build started.
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/813
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/813#issuecomment-46654474
Merged build started.
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:
https://github.com/apache/spark/pull/813#discussion_r13901535
--- Diff: python/pyspark/join.py ---
@@ -79,15 +79,15 @@ def dispatch(seq):
return _do_python_join(rdd, other, numPartitions, dispatch)
-def python_cogroup(rdd, other, numPartitions):
- vs = rdd.map(lambda (k, v): (k, (1, v)))
- ws = other.map(lambda (k, v): (k, (2, v)))
+def python_cogroup(rdds, numPartitions):
--- End diff --
Okay, I looked yet again; this entire file is not exposed in, e.g., the docs, so I guess this isn't public.
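For readers following the diff, the change from `python_cogroup(rdd, other, numPartitions)` to `python_cogroup(rdds, numPartitions)` generalizes the tagging trick visible in the removed lines, where each value is paired with a number identifying the RDD it came from. Below is a minimal sketch of that idea on plain in-memory lists, not the PySpark implementation: `cogroup_local` is a hypothetical name, it does no partitioning, and it uses 0-based tags where the removed lines use 1 and 2.

```python
from collections import defaultdict


def cogroup_local(*datasets):
    """Group values by key across n in-memory lists of (key, value) pairs.

    Hypothetical helper for illustration only; it is not part of PySpark.
    """
    n = len(datasets)
    # Tag every value with the index of the dataset it came from,
    # mirroring the (k, (1, v)) / (k, (2, v)) pairs in the removed lines.
    tagged = [(k, (i, v)) for i, data in enumerate(datasets) for k, v in data]
    # Each key maps to a tuple of n lists, one list per input dataset.
    grouped = defaultdict(lambda: tuple([] for _ in range(n)))
    for k, (i, v) in tagged:
        grouped[k][i].append(v)
    return dict(grouped)
```

Called with four lists, every key maps to a 4-tuple of lists, and a dataset that lacks the key contributes an empty list.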
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:
https://github.com/apache/spark/pull/813#issuecomment-46401560
Hey @douglaz, thanks for updating this. One thing missing here is tests in each of the languages -- please add them so that this code will be tested later.
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/813#issuecomment-43430263
Can one of the admins verify this patch?
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/813#issuecomment-46144569
Build triggered.
---
[GitHub] spark pull request: SPARK-1868: Users should be allowed to cogroup...
Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/813#issuecomment-43581215
Isn't it possible to just do `new CoGroupedRDD(Seq(rdd1, rdd2, rdd3... rddn))`? That seems like the same number of lines of code as `rdd1.cogroup(rdd2, rdd3... rddn)`.
We have many functions like this, including `union` - I'm not sure we want to create many definitions of each of these functions.
---