Posted to reviews@spark.apache.org by mccheah <gi...@git.apache.org> on 2015/02/16 22:52:14 UTC

[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

GitHub user mccheah opened a pull request:

    https://github.com/apache/spark/pull/4634

    [SPARK-5843] Allowing map-side combine to be specified in Java.

    Specifically, when calling JavaPairRDD.combineByKey(), there is a new
    five-parameter method that exposes the map-side-combine boolean as the
    fifth parameter.
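
    For reference, the Scala API's PairRDDFunctions.combineByKey already takes
    these extra arguments; the sketch below shows roughly the call shape the new
    Java overload mirrors (variable names are made up, and an existing
    SparkContext `sc` is assumed):

        import org.apache.spark.SparkContext._
        import org.apache.spark.HashPartitioner
        import org.apache.spark.rdd.RDD
        import org.apache.spark.serializer.KryoSerializer

        val pairs: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

        // Full-signature combineByKey: partitioner, map-side combine and serializer
        // are spelled out explicitly instead of left to their defaults.
        val summed = pairs.combineByKey[Int](
          (v: Int) => v,                    // createCombiner
          (c: Int, v: Int) => c + v,        // mergeValue
          (c1: Int, c2: Int) => c1 + c2,    // mergeCombiners
          new HashPartitioner(4),
          mapSideCombine = false,
          serializer = new KryoSerializer(sc.getConf))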

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mccheah/spark pair-rdd-map-side-combine

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4634.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4634
    
----
commit 6ddd72959dccf5a84d936e716b0bfb7952473977
Author: mcheah <mc...@palantir.com>
Date:   2015-02-16T21:51:42Z

    [SPARK-5843] Allowing map-side combine to be specified in Java.
    
    Specifically, when calling JavaPairRDD.combineByKey(), there is a new
    five-parameter method that exposes the map-side-combine boolean as the
    fifth parameter.

----




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-74627399
  
    Are there ever situations where `combineByKey` should be used instead of `aggregateByKey`?  I tend to think of `combineByKey` as an internal API that's exposed for historical reasons.




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by mccheah <gi...@git.apache.org>.
Github user mccheah commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-74580824
  
    groupBy and reduceByKey in the Scala API are actually just convenience methods that call through to combineByKey with parameters that make sense. Given that, perhaps just having the single 6-parameter method in the Java API makes sense. combineByKey will be used when customization/configuration is required beyond what reduceByKey and groupBy offer (but those latter two methods exist for convenience and readability, and rightly so).
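
    As a rough illustration of that relationship (simplified; the real
    implementations in PairRDDFunctions also handle ClassTags and serializers,
    and `sc` here is an assumed existing SparkContext):

        import org.apache.spark.HashPartitioner
        import org.apache.spark.rdd.RDD

        val pairs: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
        val part = new HashPartitioner(4)

        // reduceByKey is essentially combineByKey with the reduce function used
        // as both mergeValue and mergeCombiners.
        val reduced = pairs.combineByKey[Int](
          (v: Int) => v,
          (c: Int, v: Int) => c + v,
          (c1: Int, c2: Int) => c1 + c2,
          part)  // same result as pairs.reduceByKey(part, _ + _)

        // groupByKey is essentially combineByKey with a buffer combiner and
        // map-side combine turned off, since grouping does not shrink the data.
        val grouped = pairs.combineByKey[Vector[Int]](
          (v: Int) => Vector(v),
          (buf: Vector[Int], v: Int) => buf :+ v,
          (b1: Vector[Int], b2: Vector[Int]) => b1 ++ b2,
          part,
          mapSideCombine = false)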




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-83189878
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28834/
    Test FAILed.




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-77903115
  
    Serializer seems ok to add.
    
    One thing I am not sure about is the mapSideCombine thing -- I'm never a fan of that parameter even though I added it myself, for the following reasons:
    
    1. mapSideCombine is an MR term used in Hive that doesn't mean much outside of MR. A more proper name is partialAggregation.
    2. The underlying implementation should be able to avoid partial aggregation if it finds that partial aggregation is expensive (i.e. after trying 10000 records, check whether the hash table size is less than a specific threshold). It is one of the things we can easily auto tune. 
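
    To illustrate the second point, a rough, Spark-agnostic sketch of such a
    heuristic (names and thresholds are made up; this is not Spark's actual
    implementation):

        import scala.collection.mutable

        // Sample the first `sampleSize` records into a hash map and keep doing
        // partial aggregation only if the map stays meaningfully smaller than
        // the number of records seen, i.e. aggregation is actually shrinking data.
        def shouldPartialAggregate[K, V](
            records: Iterator[(K, V)],
            merge: (V, V) => V,
            sampleSize: Int = 10000,
            maxRatio: Double = 0.5): Boolean = {
          val partial = mutable.HashMap.empty[K, V]
          var seen = 0
          while (records.hasNext && seen < sampleSize) {
            val (k, v) = records.next()
            partial(k) = partial.get(k).map(merge(_, v)).getOrElse(v)
            seen += 1
          }
          partial.size.toDouble / math.max(seen, 1) < maxRatio
        }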





[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-83189850
  
      [Test build #28834 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28834/consoleFull) for   PR 4634 at commit [`3ce7deb`](https://github.com/apache/spark/commit/3ce7debcac9a9b68ef8aba3687c4e43e1c1ab45a).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-74591345
  
      [Test build #27589 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27589/consoleFull) for   PR 4634 at commit [`7455c7a`](https://github.com/apache/spark/commit/7455c7ac69f4aef474aeaae55923b7f97a173aec).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4634#discussion_r26056981
  
    --- Diff: core/src/main/scala/org/apache/spark/api/java/JavaPairRDD.scala ---
    @@ -233,18 +235,44 @@ class JavaPairRDD[K, V](val rdd: RDD[(K, V)])
       def combineByKey[C](createCombiner: JFunction[V, C],
         mergeValue: JFunction2[C, V, C],
         mergeCombiners: JFunction2[C, C, C],
    -    partitioner: Partitioner): JavaPairRDD[K, C] = {
    +    partitioner: Partitioner,
    +    mapSideCombine: Boolean,
    +    serializer: Serializer): JavaPairRDD[K, C] = {
    --- End diff --
    
    looks ok. it would be better to add serializer to the doc if possible.
    
    also style wise, can you indent 4 spaces for the function parameters?




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-74581613
  
      [Test build #27589 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27589/consoleFull) for   PR 4634 at commit [`7455c7a`](https://github.com/apache/spark/commit/7455c7ac69f4aef474aeaae55923b7f97a173aec).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by mccheah <gi...@git.apache.org>.
Github user mccheah commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-83158020
  
    I'll get around to this today.




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-74710776
  
    Is there any reason to control map side combine? That seems to be the original motivation. @mccheah? Any third opinions? Although I didn't perceive this method as something that should have been hidden, I agree with the logic, _if_ there's really no good reason to call it.




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-76772994
  
    I don't know if there's an explicit guideline, but I assume the goal is to provide as much uniformity across languages as makes sense, yes. 
    
    What if someone really does want to make an initial value per key a function of one of the input values? If `combineByKey` goes away, you can't do that anymore, it seems. Or is the argument that this just is never needed?




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-78262325
  
    @rxin I agree about the name itself, though that by itself is minor.
    
    Yeah it seems like, if it were done again, this param wouldn't maybe exist. Given that it does, what do you think about adding it to the Java API in the name of consistency? Or is it more important to discourage its use?
    
    That is, would you merge this or no? I don't feel strongly but am hoping to resolve the question.




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by mccheah <gi...@git.apache.org>.
Github user mccheah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4634#discussion_r26704396
  
    --- Diff: core/src/main/scala/org/apache/spark/api/java/JavaPairRDD.scala ---
    @@ -233,18 +235,44 @@ class JavaPairRDD[K, V](val rdd: RDD[(K, V)])
       def combineByKey[C](createCombiner: JFunction[V, C],
         mergeValue: JFunction2[C, V, C],
         mergeCombiners: JFunction2[C, C, C],
    -    partitioner: Partitioner): JavaPairRDD[K, C] = {
    +    partitioner: Partitioner,
    +    mapSideCombine: Boolean,
    +    serializer: Serializer): JavaPairRDD[K, C] = {
    --- End diff --
    
    It's weird because we don't specify the serializer in the docs for PairRDDFunctions.




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4634#discussion_r24782348
  
    --- Diff: core/src/main/scala/org/apache/spark/api/java/JavaPairRDD.scala ---
    @@ -233,18 +235,44 @@ class JavaPairRDD[K, V](val rdd: RDD[(K, V)])
       def combineByKey[C](createCombiner: JFunction[V, C],
         mergeValue: JFunction2[C, V, C],
         mergeCombiners: JFunction2[C, C, C],
    -    partitioner: Partitioner): JavaPairRDD[K, C] = {
    +    partitioner: Partitioner,
    +    mapSideCombine: Boolean,
    +    serializer: Serializer): JavaPairRDD[K, C] = {
    --- End diff --
    
    Does the scaladoc need a note about the new serializer then?




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-81977217
  
    Ok upon further thought, I think we should just make it consistent across languages for now. In the future, we can have the default to be "auto tune".





[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-81977973
  
    @mccheah can you make the couple minor changes I suggested? Other than that, this change lgtm.




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-74750902
  
    I don't entirely understand the advantage of `combineByKey` without a combiner vs. `groupByKey`. Can't whatever aggregation function would be used inside the `combineByKey` be placed in a `map` after the `groupByKey`?
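
    To make the comparison concrete, the two shapes being contrasted might look
    roughly like this (computing a per-key average; names are illustrative and an
    existing SparkContext `sc` is assumed):

        import org.apache.spark.rdd.RDD

        val pairs: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("a", 3), ("b", 2)))

        // Aggregate inside the shuffle: partial (sum, count) pairs are merged
        // both before and after the data moves.
        val avgViaCombine = pairs.combineByKey[(Long, Long)](
          (v: Int) => (v.toLong, 1L),
          (acc: (Long, Long), v: Int) => (acc._1 + v, acc._2 + 1),
          (a: (Long, Long), b: (Long, Long)) => (a._1 + b._1, a._2 + b._2)
        ).mapValues { case (sum, n) => sum.toDouble / n }

        // Group first, then aggregate each key's values in a single task.
        val avgViaGroup = pairs.groupByKey().mapValues(vs => vs.sum.toDouble / vs.size)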




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-74583452
  
      [Test build #27580 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27580/consoleFull) for   PR 4634 at commit [`6ddd729`](https://github.com/apache/spark/commit/6ddd72959dccf5a84d936e716b0bfb7952473977).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-74704920
  
    @srowen `aggregateByKey` will already make a copy of the object for each key, so a mutable zero value is fine. The `seqOp` argument to `aggregateByKey`




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by mccheah <gi...@git.apache.org>.
Github user mccheah commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-74579047
  
    Can we also allow map-side-combine to be specified without specifying the partitioner?
    
    In general, since Java doesn't offer the luxury of default values, the number of methods can explode if we allow every combination of parameters that exists in the Scala API. Not sure to what extent we want to be adding methods to the Java API to mirror Scala's API.




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4634#discussion_r24782383
  
    --- Diff: core/src/test/java/org/apache/spark/JavaAPISuite.java ---
    @@ -25,17 +25,17 @@
     import java.util.concurrent.*;
     
     import org.apache.spark.input.PortableDataStream;
    +import org.apache.spark.rdd.RDD;
    +import org.apache.spark.serializer.KryoSerializer;
    +import scala.collection.JavaConversions;
     import scala.Tuple2;
     import scala.Tuple3;
     import scala.Tuple4;
     
    -import com.google.common.collect.Iterables;
    -import com.google.common.collect.Iterators;
    -import com.google.common.collect.Lists;
    -import com.google.common.collect.Maps;
     import com.google.common.base.Throwables;
     import com.google.common.base.Optional;
     import com.google.common.base.Charsets;
    +import com.google.common.collect.*;
    --- End diff --
    
    I'd reverse this ... I don't know whether we have a strong convention either way about * and _ imports but I wouldn't change it just to change it. Generally * is avoided in Java.




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-76991315
  
    Yeah, the argument is that's never needed.




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by mccheah <gi...@git.apache.org>.
Github user mccheah commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-74752839
  
    We want to take advantage of the distributed reduce functionality of combineByKey when computing the other aggregation metrics as well. Is this not lost if we do a map on each list created by groupByKey?




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by mccheah <gi...@git.apache.org>.
Github user mccheah commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-76252771
  
    @pwendell Do you have any opinions on this?




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-74774029
  
    If I understand correctly, the need is for an operator with an optional map-side combine.  The 
    alternative is to force the user to choose between `groupByKey` and `aggregateByKey`.  @pwendell maybe you have an opinion on whether we should expose something like this?
    
    If we do choose to expose this, I think it would be better to allow specifying no map-side combine in `aggregateByKey` than to encourage use of `combineByKey`.




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-83220200
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28838/
    Test PASSed.




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-83196904
  
      [Test build #28838 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28838/consoleFull) for   PR 4634 at commit [`5c58319`](https://github.com/apache/spark/commit/5c5831941613afd3226f4805f3e574a2b7e63059).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-76740039
  
    If our language compatibility guidelines mandate that all public RDD methods should be available in all languages, I can't really argue against adding `combineByKey` to Java.
    
    That said, for the future, I think it's probably worth deprecating `combineByKey` and adding any needed functionality to `aggregateByKey`.  This part probably makes sense in a different JIRA.




[GitHub] spark pull request: [SPARK-5843] [API] Allowing map-side combine t...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/4634




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4634#discussion_r26056992
  
    --- Diff: core/src/main/scala/org/apache/spark/api/java/JavaPairRDD.scala ---
    @@ -233,18 +235,44 @@ class JavaPairRDD[K, V](val rdd: RDD[(K, V)])
       def combineByKey[C](createCombiner: JFunction[V, C],
         mergeValue: JFunction2[C, V, C],
         mergeCombiners: JFunction2[C, C, C],
    -    partitioner: Partitioner): JavaPairRDD[K, C] = {
    +    partitioner: Partitioner,
    +    mapSideCombine: Boolean,
    +    serializer: Serializer): JavaPairRDD[K, C] = {
         implicit val ctag: ClassTag[C] = fakeClassTag
         fromRDD(rdd.combineByKey(
           createCombiner,
           mergeValue,
           mergeCombiners,
    -      partitioner
    +      partitioner,
    +      mapSideCombine,
    +      serializer
         ))
       }
     
       /**
    -   * Simplified version of combineByKey that hash-partitions the output RDD.
    +   * Generic function to combine the elements for each key using a custom set of aggregation
    +   * functions. Turns a JavaPairRDD[(K, V)] into a result of type JavaPairRDD[(K, C)], for a
    +   * "combined type" C * Note that V and C can be different -- for example, one might group an
    +   * RDD of type (Int, Int) into an RDD of type (Int, List[Int]). Users provide three
    +   * functions:
    +   *
    +   * - `createCombiner`, which turns a V into a C (e.g., creates a one-element list)
    +   * - `mergeValue`, to merge a V into a C (e.g., adds it to the end of a list)
    +   * - `mergeCombiners`, to combine two C's into a single one.
    +   *
    +   * In addition, users can control the partitioning of the output RDD. This method automatically
    +   * uses map-side aggregation in shuffling the RDD.
    +   */
    +  def combineByKey[C](createCombiner: JFunction[V, C],
    +    mergeValue: JFunction2[C, V, C],
    --- End diff --
    
    4 space indent here




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by mccheah <gi...@git.apache.org>.
Github user mccheah commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-74732535
  
    There is a case where map-side-combine is actually not the right thing to do in some of my workflows. Map-side-combine makes sense if the overall amount of data is shrinking as the aggregation is being computed. If the overall size of the data is the same or larger, then it just creates GC stress, since the amount of data being shuffled will be the same regardless of whether map-side-combine is used or not.
    
    groupBy is the classic example of this, and I have an aggregation function that can potentially do a group-by along with multiple other aggregations at the same time. So we can't use groupBy, because we're computing both the per-key list and other statistics (say, an average per key) at the same time, which requires a custom aggregation function. But if we can't use groupBy, then I'm forced into map-side-combine right now, and I would like to be able to turn that off when it doesn't make sense.
    
    cc @MingyuKim
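
    A minimal sketch of that kind of combined aggregation (the Agg class and
    names below are hypothetical, not from this PR; `sc` is an assumed existing
    SparkContext):

        import org.apache.spark.HashPartitioner
        import org.apache.spark.rdd.RDD

        // Per-key values plus a running (sum, count), computed in one pass.
        // Map-side combine is off because the retained list does not shrink
        // before the shuffle.
        case class Agg(values: Vector[Int], sum: Long, count: Long)

        val pairs: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("a", 3), ("b", 2)))

        val combined = pairs.combineByKey[Agg](
          (v: Int) => Agg(Vector(v), v, 1L),
          (a: Agg, v: Int) => Agg(a.values :+ v, a.sum + v, a.count + 1),
          (x: Agg, y: Agg) => Agg(x.values ++ y.values, x.sum + y.sum, x.count + y.count),
          new HashPartitioner(4),
          mapSideCombine = false)

        val avgPerKey = combined.mapValues(a => a.sum.toDouble / a.count)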




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4634#discussion_r26535681
  
    --- Diff: core/src/test/java/org/apache/spark/JavaAPISuite.java ---
    @@ -25,17 +25,17 @@
     import java.util.concurrent.*;
     
     import org.apache.spark.input.PortableDataStream;
    +import org.apache.spark.rdd.RDD;
    --- End diff --
    
    let's sort the imports here too




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4634#discussion_r26535654
  
    --- Diff: core/src/main/scala/org/apache/spark/api/java/JavaPairRDD.scala ---
    @@ -20,6 +20,8 @@ package org.apache.spark.api.java
     import java.util.{Comparator, List => JList, Map => JMap}
     import java.lang.{Iterable => JIterable}
     
    +import org.apache.spark.serializer.Serializer
    --- End diff --
    
    mind moving this import to the correct place?




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-74579515
  
    Yes, there is also a `Serializer` argument exposed in the Scala API, and not yet in the Java API. My general theory is to expose one method in Java with all parameters, and then only expose convenience methods that set defaults which seem like they'll be frequently used. Hence I could see adding the Serializer to the method you've added, to avoid having to add another method later if someone wants to set that.
    
    I haven't even surveyed the other similar by-key methods in the Java API to see what they do, to try to be consistent. That might be a good quick exercise, if you have a minute.




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-74578758
  
    ok to test




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-80680853
  
    Going once going twice for final comments as I'd like to resolve this one and move on. Right now it's two mild thumbs up, one thumbs down, and a neutral (?) from @rxin who is probably most authoritative here. There's not obvious consensus to make this change, and unless that changes or I misunderstand, I think we should soon resolve as WontFix. I think there are not-too-bad ways to access this from Java anyway if it matters.




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-74578694
  
      [Test build #27580 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27580/consoleFull) for   PR 4634 at commit [`6ddd729`](https://github.com/apache/spark/commit/6ddd72959dccf5a84d936e716b0bfb7952473977).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-74591357
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27589/
    Test PASSed.




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-74583455
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27580/
    Test FAILed.




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-74643730
  
    @sryza The logical difference is small. `aggregateByKey` is for when you have a single immutable 'zero' value to start from for each key. `combineByKey` lets the initial value be a function of the first value. That is useful, for example, if I were trying to combine into a `mutable.Set`, since I need to make a different one for each key. Whether or not it was worth different methods in retrospect, I don't know, but that much seems OK since they're there already.
    
    The rest of the difference is just that `combineByKey` exposes control over map-side combine and the serializer. That is a little more internal. If there is clear evidence this should have been a developer API, then I'd say we could at least avoid opening it up in the Java API. But is that clear? Otherwise I'd say, well, let's at least shoot for consistency.
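
    For concreteness, the two forms look roughly like this when building a
    per-key set (a sketch only; `sc` is an assumed existing SparkContext):

        import org.apache.spark.rdd.RDD
        import scala.collection.mutable

        val pairs: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("a", 1), ("b", 2)))

        // aggregateByKey: a single zero value (Spark copies it per key).
        val setsViaAggregate = pairs.aggregateByKey(mutable.Set.empty[Int])(
          (s, v) => s += v,          // seqOp
          (s1, s2) => s1 ++= s2)     // combOp

        // combineByKey: the initial combiner is a function of the first value,
        // so each key starts from a set built around that value.
        val setsViaCombine = pairs.combineByKey[mutable.Set[Int]](
          (v: Int) => mutable.Set(v),
          (s: mutable.Set[Int], v: Int) => s += v,
          (s1: mutable.Set[Int], s2: mutable.Set[Int]) => s1 ++= s2)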




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by mccheah <gi...@git.apache.org>.
Github user mccheah commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-74756966
  
    You lose the parallelism that's inherent in computing the reduce as a parallel operation, as opposed to computing it on a list in a single task.
    
    For more context, I'm exposing an aggregation semantic to users where they can specify any number of arbitrary aggregations to be computed on a dataset, and group-by is only one of those possible aggregations. We take all of the aggregations and compute them in the same combineByKey call. So we can't rely on the user always wanting a group-by, although I could special-case and introspect the aggregations requested by the user to see if group-by is among them. However, what if the user wants an aggregation on another metric where map-side-combine is again not optimal? It seems better to let my end user toggle map-side-combine, and I'll just pass that through to the combineByKey call.




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by mccheah <gi...@git.apache.org>.
Github user mccheah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4634#discussion_r24786524
  
    --- Diff: core/src/test/java/org/apache/spark/JavaAPISuite.java ---
    @@ -25,17 +25,17 @@
     import java.util.concurrent.*;
     
     import org.apache.spark.input.PortableDataStream;
    +import org.apache.spark.rdd.RDD;
    +import org.apache.spark.serializer.KryoSerializer;
    +import scala.collection.JavaConversions;
     import scala.Tuple2;
     import scala.Tuple3;
     import scala.Tuple4;
     
    -import com.google.common.collect.Iterables;
    -import com.google.common.collect.Iterators;
    -import com.google.common.collect.Lists;
    -import com.google.common.collect.Maps;
     import com.google.common.base.Throwables;
     import com.google.common.base.Optional;
     import com.google.common.base.Charsets;
    +import com.google.common.collect.*;
    --- End diff --
    
    Yeah, generally I don't like it either. I do see a bunch of import *s in this particular unit test so I went with that assuming it's the convention, but that practice in itself is debatable.




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-76695972
  
    So, it seems like there's an argument here that `combineByKey` doesn't add much over `aggregateByKey`. I agree, although it is slightly more general, letting you make an initial value as a function of an input, instead of providing a zero value. But `combineByKey` has all of the advanced options like `mapSideCombine`.
    
    So if you just need `aggregateByKey`, but do need to control advanced settings, you have to go down a step to use `combineByKey`. You have to provide a function to make a zero value, instead of a zero value, which isn't a big deal. Of course, I don't think the API can be changed in the short term. Removing `combineByKey` would lose one little bit of control too: zero value as a function, and as it happens now, control over things like map side combine.
    
    We're left with an argument for API consistency between Java and Scala, which is compelling. That is, they should at least match, irrespective of what changes may happen later.
    
    `groupByKey` vs `aggregateByKey` seems like a slightly different question that results in an alternative suggestion: add this `mapSideCombine` flag to `aggregateByKey`.
    
    1. Don't change Scala API. Make `combineByKey` consistent in Java API and expose `mapSideCombine`
    2. Add new optional param to Scala `aggregateByKey`. Add to Java `aggregateByKey` as well.
    
    I slightly prefer 1 because it's a strictly smaller change and leaves things more API-consistent. It seems like the purpose of 2 is to fix things by removing the need for `combineByKey` to exist, but it does exist, so that's moot to me.
    
    I'd like to proceed with this change, then. It passes tests and does not affect the API. I'd like to wait a couple days for @pwendell or @rxin since it has a core API question.




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-83220182
  
      [Test build #28838 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28838/consoleFull) for   PR 4634 at commit [`5c58319`](https://github.com/apache/spark/commit/5c5831941613afd3226f4805f3e574a2b7e63059).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-74756176
  
    What exactly do you mean by the distributed reduce functionality?




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-74709911
  
    If there's no conceivable reason why someone would want to use `combineByKey`, isn't including it in the Java API just going to confuse developers?




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-77862472
  
    @pwendell @rxin I'd like to merge this, and while I'm all but sure the API change question is OK, I'd feel better if a maintainer could give it a look.




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-80753791
  
    Give me until Monday. Want to think a little bit more about this.




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-83170584
  
      [Test build #28834 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28834/consoleFull) for   PR 4634 at commit [`3ce7deb`](https://github.com/apache/spark/commit/3ce7debcac9a9b68ef8aba3687c4e43e1c1ab45a).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-74705635
  
    Ah, true that. Well, there's not much difference at all eh. `combineByKey` is the lowest-level method and its separate utility is marginal; I suppose it gives access to exactly these settings like map-side combine. It's public, for better or worse, and doesn't do much harm other than taking up room in the API. In the name of consistency it seems OK to make it available equally in the Java API. If starting over, yeah, I'd question why both of these exist.




[GitHub] spark pull request: [SPARK-5843] Allowing map-side combine to be s...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/4634#issuecomment-74711187
  
    `aggregateByKey` lets you control the map-side combine with its `seqOp` argument.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org