Posted to reviews@spark.apache.org by sryza <gi...@git.apache.org> on 2014/07/16 23:09:56 UTC

[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

GitHub user sryza opened a pull request:

    https://github.com/apache/spark/pull/1447

    SPARK-2519 part 2. Remove pattern matching on Tuple2 in critical section...

    ...s of CoGroupedRDD and PairRDDFunctions
    
    This also removes an unnecessary tuple creation in cogroup.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sryza/spark sandy-spark-2519-2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1447.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1447
    
----
commit 906e6b362db11a4ac5a3f8096668c6f968dc48aa
Author: Sandy Ryza <sa...@cloudera.com>
Date:   2014-07-16T21:00:17Z

    SPARK-2519 part 2. Remove pattern matching on Tuple2 in critical sections of CoGroupedRDD and PairRDDFunctions

----



[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1447#discussion_r15042054
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -589,9 +584,7 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
           throw new SparkException("Default partitioner cannot partition array keys.")
         }
         val cg = new CoGroupedRDD[K](Seq(self, other), partitioner)
    -    cg.mapValues { case Seq(vs, ws) =>
    -      (vs.asInstanceOf[Seq[V]], ws.asInstanceOf[Seq[W]])
    -    }
    +    cg.mapValues { pair => pair.asInstanceOf[(Seq[V], Seq[W])] }
    --- End diff --
    
    Same here, should be Iterable (although the types work okay now, it's confusing)
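
    For context, a minimal sketch of the Iterable-typed cast being suggested here (an illustration, not code from the PR; it assumes the surrounding cogroup body shown in the diff):

    ```scala
    cg.mapValues { pair => pair.asInstanceOf[(Iterable[V], Iterable[W])] }
    ```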



[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1447#discussion_r15037123
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -430,11 +430,11 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
        */
       def rightOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner)
           : RDD[(K, (Option[V], W))] = {
    -    this.cogroup(other, partitioner).flatMapValues { case (vs, ws) =>
    -      if (vs.isEmpty) {
    -        ws.map(w => (None, w))
    +    this.cogroup(other, partitioner).flatMapValues { pair =>
    +      if (pair._1.isEmpty) {
    --- End diff --
    
    here too



[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

Posted by aarondav <gi...@git.apache.org>.
Github user aarondav commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1447#discussion_r15042631
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -216,17 +216,17 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
     
         def reducePartition(iter: Iterator[(K, V)]): Iterator[JHashMap[K, V]] = {
           val map = new JHashMap[K, V]
    -      iter.foreach { case (k, v) =>
    -        val old = map.get(k)
    -        map.put(k, if (old == null) v else func(old, v))
    +      iter.foreach { pair =>
    +        val old = map.get(pair._1)
    --- End diff --
    
    This I'm sure has been done a million times, but one more for good luck -- here is a `map { case (x, y) => x + y}` versus `foreach { xy => xy._1 + xy._2}`:
    http://www.diffchecker.com/vtv6cptx
    
    No new virtual function calls, but one new branch (trivially branch predictable) and one new throw (which will never be invoked), and several store/loads. (Perhaps I'm misreading the bytecode, but isn't `astore_2` followed by `aload_2` a no-op?)
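
    To make that comparison concrete, here is a self-contained version of the two closure styles (illustrative only; the values and names are made up):

    ```scala
    object TupleAccessStyles {
      def main(args: Array[String]): Unit = {
        val pairs = Seq((1, 2), (3, 4))
        // Pattern-matching style: the compiler wraps the extractor in a null
        // check and an unreachable MatchError throw site.
        pairs.foreach { case (x, y) => println(x + y) }
        // Accessor style: plain _1/_2 calls, no extra branch or throw site.
        pairs.foreach { xy => println(xy._1 + xy._2) }
      }
    }
    ```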



[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/1447



[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1447#discussion_r15042061
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -604,11 +597,7 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
           throw new SparkException("Default partitioner cannot partition array keys.")
         }
         val cg = new CoGroupedRDD[K](Seq(self, other1, other2), partitioner)
    -    cg.mapValues { case Seq(vs, w1s, w2s) =>
    -      (vs.asInstanceOf[Seq[V]],
    -       w1s.asInstanceOf[Seq[W1]],
    -       w2s.asInstanceOf[Seq[W2]])
    -    }
    +    cg.mapValues { seq  => seq.asInstanceOf[(Seq[V], Seq[W1], Seq[W2])] }
    --- End diff --
    
    Same here



[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1447#discussion_r15044028
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -712,8 +701,8 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
             val index = p.getPartition(key)
             def process(it: Iterator[(K, V)]): Seq[V] = {
               val buf = new ArrayBuffer[V]
    -          for ((k, v) <- it if k == key) {
    --- End diff --
    
    Wait, actually, my understanding of this loop is that it's iterating over every record within the partition.  Am I missing something? 



[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1447#discussion_r15042897
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -216,17 +216,17 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
     
         def reducePartition(iter: Iterator[(K, V)]): Iterator[JHashMap[K, V]] = {
           val map = new JHashMap[K, V]
    -      iter.foreach { case (k, v) =>
    -        val old = map.get(k)
    -        map.put(k, if (old == null) v else func(old, v))
    +      iter.foreach { pair =>
    +        val old = map.get(pair._1)
    --- End diff --
    
    There are all kinds of things that are really hard to profile at runtime. For example, the JVM might decide not to JIT the code because it is longer, or because of the exceptions. Branch prediction might stop working because there are too many branches. Etc.
    
    Again, I'm only in favor of this change because it is small, and it might have impact. If it were a big ugly change, I would be against it without profiling.



[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1447#discussion_r15042905
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -571,12 +571,7 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
           throw new SparkException("Default partitioner cannot partition array keys.")
         }
         val cg = new CoGroupedRDD[K](Seq(self, other1, other2, other3), partitioner)
    -    cg.mapValues { case Seq(vs, w1s, w2s, w3s) =>
    -      (vs.asInstanceOf[Seq[V]],
    -        w1s.asInstanceOf[Seq[W1]],
    -        w2s.asInstanceOf[Seq[W2]],
    -        w3s.asInstanceOf[Seq[W3]])
    -    }
    +    cg.mapValues { seq  => seq.asInstanceOf[(Seq[V], Seq[W1], Seq[W2], Seq[W3])] }
    --- End diff --
    
    And yeah as Reynold pointed out, this won't actually cast. A Seq is not a tuple4, you need to make a new one. In this case it might be okay to leave the code as is, because the non-pattern-matching way will be hairier. Basically you'd have to do `seq => (seq(0).asInstanceOf[Seq[V]], seq(1).asInstanceOf[Seq[W1]], etc)`.
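
    Spelled out in full, that index-based alternative would look roughly like this (a sketch, assuming the cogrouped values arrive as a Seq of four Seqs, as in the pre-patch pattern match):

    ```scala
    cg.mapValues { seq =>
      (seq(0).asInstanceOf[Seq[V]],
       seq(1).asInstanceOf[Seq[W1]],
       seq(2).asInstanceOf[Seq[W2]],
       seq(3).asInstanceOf[Seq[W3]])
    }
    ```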



[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1447#discussion_r15043849
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -571,12 +571,7 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
           throw new SparkException("Default partitioner cannot partition array keys.")
         }
         val cg = new CoGroupedRDD[K](Seq(self, other1, other2, other3), partitioner)
    -    cg.mapValues { case Seq(vs, w1s, w2s, w3s) =>
    -      (vs.asInstanceOf[Seq[V]],
    -        w1s.asInstanceOf[Seq[W1]],
    -        w2s.asInstanceOf[Seq[W2]],
    -        w3s.asInstanceOf[Seq[W3]])
    -    }
    +    cg.mapValues { seq  => seq.asInstanceOf[(Seq[V], Seq[W1], Seq[W2], Seq[W3])] }
    --- End diff --
    
    Ah, right.  I'll leave these as they are for now.



[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1447#issuecomment-49272425
  
    QA tests have started for PR 1447. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16773/consoleFull



[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1447#discussion_r15037689
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -712,8 +701,8 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
             val index = p.getPartition(key)
             def process(it: Iterator[(K, V)]): Seq[V] = {
               val buf = new ArrayBuffer[V]
    -          for ((k, v) <- it if k == key) {
    --- End diff --
    
    Ah, misread



[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1447#discussion_r15043860
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -216,17 +216,17 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
     
         def reducePartition(iter: Iterator[(K, V)]): Iterator[JHashMap[K, V]] = {
           val map = new JHashMap[K, V]
    -      iter.foreach { case (k, v) =>
    -        val old = map.get(k)
    -        map.put(k, if (old == null) v else func(old, v))
    +      iter.foreach { pair =>
    +        val old = map.get(pair._1)
    --- End diff --
    
    Also on removing the calls to _1,  these values should be in cache, so accesses will be really fast.  Will hold off on these for now, but happy to make the change if y'all want.



[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1447#discussion_r15042857
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -216,17 +216,17 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
     
         def reducePartition(iter: Iterator[(K, V)]): Iterator[JHashMap[K, V]] = {
           val map = new JHashMap[K, V]
    -      iter.foreach { case (k, v) =>
    -        val old = map.get(k)
    -        map.put(k, if (old == null) v else func(old, v))
    +      iter.foreach { pair =>
    +        val old = map.get(pair._1)
    --- End diff --
    
    I think we tested this in the past and saw a difference, perhaps because it can now throw. But I don't remember the details. Weird that it's also doing those stores.



[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1447#discussion_r15037121
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -413,11 +413,11 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
        * partition the output RDD.
        */
       def leftOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, Option[W]))] = {
    -    this.cogroup(other, partitioner).flatMapValues { case (vs, ws) =>
    -      if (ws.isEmpty) {
    -        vs.map(v => (v, None))
    +    this.cogroup(other, partitioner).flatMapValues { pair =>
    +      if (pair._2.isEmpty) {
    --- End diff --
    
    save the key and value, i.e.
    ```scala
    this.cogroup(other, partitioner).flatMapValues { pair =>
      val vs = pair._1
      val ws = pair._2
      ... same old code ...
    }
    ```
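
    Filled in against the leftOuterJoin body shown in the diff, that suggestion would read roughly as follows (a sketch; the else branch is reconstructed from the pre-patch code and may not match the final commit):

    ```scala
    this.cogroup(other, partitioner).flatMapValues { pair =>
      val vs = pair._1
      val ws = pair._2
      if (ws.isEmpty) {
        vs.map(v => (v, None))
      } else {
        for (v <- vs; w <- ws) yield (v, Some(w))
      }
    }
    ```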



[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1447#discussion_r15044062
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -712,8 +701,8 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
             val index = p.getPartition(key)
             def process(it: Iterator[(K, V)]): Seq[V] = {
               val buf = new ArrayBuffer[V]
    -          for ((k, v) <- it if k == key) {
    --- End diff --
    
    Actually yes my bad :)



[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

Posted by aarondav <gi...@git.apache.org>.
Github user aarondav commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1447#discussion_r15038052
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -216,17 +216,17 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
     
         def reducePartition(iter: Iterator[(K, V)]): Iterator[JHashMap[K, V]] = {
           val map = new JHashMap[K, V]
    -      iter.foreach { case (k, v) =>
    -        val old = map.get(k)
    -        map.put(k, if (old == null) v else func(old, v))
    +      iter.foreach { pair =>
    +        val old = map.get(pair._1)
    --- End diff --
    
    If we're using functional style anyway, do we really expect to see any sort of remotely noticeable gain from doing these kinds of optimizations?



[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1447#discussion_r15042052
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -571,12 +571,7 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
           throw new SparkException("Default partitioner cannot partition array keys.")
         }
         val cg = new CoGroupedRDD[K](Seq(self, other1, other2, other3), partitioner)
    -    cg.mapValues { case Seq(vs, w1s, w2s, w3s) =>
    -      (vs.asInstanceOf[Seq[V]],
    -        w1s.asInstanceOf[Seq[W1]],
    -        w2s.asInstanceOf[Seq[W2]],
    -        w3s.asInstanceOf[Seq[W3]])
    -    }
    +    cg.mapValues { seq  => seq.asInstanceOf[(Seq[V], Seq[W1], Seq[W2], Seq[W3])] }
    --- End diff --
    
    This should actually return Iterable, not Seq



[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1447#discussion_r15037091
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -216,17 +216,17 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
     
         def reducePartition(iter: Iterator[(K, V)]): Iterator[JHashMap[K, V]] = {
           val map = new JHashMap[K, V]
    -      iter.foreach { case (k, v) =>
    -        val old = map.get(k)
    -        map.put(k, if (old == null) v else func(old, v))
    +      iter.foreach { pair =>
    +        val old = map.get(pair._1)
    --- End diff --
    
    can we save the key into a variable here so we don't call _1 twice?
    
    I was looking at the bytecode generated by pattern matching versus non-pattern-matching, and the main difference is that we bypass some unnecessary branches:
    {code}
        final def apply(x0$1: Tuple2): Unit = {
          case <synthetic> val x1: Tuple2 = x0$1;
          case4(){
            if (x1.ne(null))
              {
                val k: Int = x1._1$mcI$sp();
                val v: Int = x1._2$mcI$sp();
                matchEnd3({
                  scala.this.Predef.println(scala.Int.box(k.+(v)));
                  scala.runtime.BoxedUnit.UNIT
                })
              }
            else
              case5()
          };
          case5(){
            matchEnd3(throw new MatchError(x1))
          };
          matchEnd3(x: runtime.BoxedUnit){
            ()
          }
    {code}
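
    Applied to the foreach in the diff, the variable-hoisting rewrite being requested would look something like this (a sketch, not necessarily the final committed code):

    ```scala
    iter.foreach { pair =>
      val k = pair._1
      val old = map.get(k)
      map.put(k, if (old == null) pair._2 else func(old, pair._2))
    }
    ```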



[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1447#discussion_r15038340
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -216,17 +216,17 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
     
         def reducePartition(iter: Iterator[(K, V)]): Iterator[JHashMap[K, V]] = {
           val map = new JHashMap[K, V]
    -      iter.foreach { case (k, v) =>
    -        val old = map.get(k)
    -        map.put(k, if (old == null) v else func(old, v))
    +      iter.foreach { pair =>
    +        val old = map.get(pair._1)
    --- End diff --
    
    It's really hard to tell. Microbenchmarks in particular are unreliable here because branch prediction takes care of the extra branches, but I don't know how this impacts real runtime in a real workload without a lot more instrumentation. My opinion is that since it is a small change, it's ok to just do it this way. If it were a big refactoring, that would be a very different story and would deserve more profiling.



[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1447#issuecomment-49287721
  
    QA results for PR 1447:
    - This patch PASSES unit tests.
    - This patch merges cleanly
    - This patch adds no public classes

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16773/consoleFull



[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1447#issuecomment-49407225
  
    QA tests have started for PR 1447. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16821/consoleFull



[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1447#issuecomment-49401225
  
    This looks okay to me as is, @rxin what do you think?



[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1447#discussion_r15042032
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -216,17 +216,17 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
     
         def reducePartition(iter: Iterator[(K, V)]): Iterator[JHashMap[K, V]] = {
           val map = new JHashMap[K, V]
    -      iter.foreach { case (k, v) =>
    -        val old = map.get(k)
    -        map.put(k, if (old == null) v else func(old, v))
    +      iter.foreach { pair =>
    +        val old = map.get(pair._1)
    --- End diff --
    
    Removing the call to _1 isn't hugely helpful IMO; accessor methods get inlined very fast. However, pattern matching adds a whole bunch of potential throw sites and branches and things like that, and we've noticed problems in the past.



[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1447#issuecomment-49415317
  
    QA results for PR 1447:
    - This patch PASSES unit tests.
    - This patch merges cleanly
    - This patch adds no public classes

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16821/consoleFull



[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1447#discussion_r15037164
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -712,8 +701,8 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
             val index = p.getPartition(key)
             def process(it: Iterator[(K, V)]): Seq[V] = {
               val buf = new ArrayBuffer[V]
    -          for ((k, v) <- it if k == key) {
    --- End diff --
    
    we can probably revert this change since this only works on the partitions rather than per record
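
    The replacement being discussed is not quoted in the thread; presumably it is the tuple-accessor form of the same per-record loop, something like this (a guess, not the confirmed diff):

    ```scala
    for (pair <- it if pair._1 == key) {
      buf += pair._2
    }
    ```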



[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1447#discussion_r15037098
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -216,17 +216,17 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
     
         def reducePartition(iter: Iterator[(K, V)]): Iterator[JHashMap[K, V]] = {
           val map = new JHashMap[K, V]
    -      iter.foreach { case (k, v) =>
    -        val old = map.get(k)
    -        map.put(k, if (old == null) v else func(old, v))
    +      iter.foreach { pair =>
    +        val old = map.get(pair._1)
    +        map.put(pair._1, if (old == null) pair._2 else func(old, pair._2))
           }
           Iterator(map)
         }
     
         def mergeMaps(m1: JHashMap[K, V], m2: JHashMap[K, V]): JHashMap[K, V] = {
    -      m2.foreach { case (k, v) =>
    -        val old = m1.get(k)
    -        m1.put(k, if (old == null) v else func(old, v))
    +      m2.foreach { pair =>
    +        val old = m1.get(pair._1)
    --- End diff --
    
    again, create a variable to save the key



[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/1447#issuecomment-49540356
  
    Merging this in master.



[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1447#discussion_r15037134
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -571,12 +571,7 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
           throw new SparkException("Default partitioner cannot partition array keys.")
         }
         val cg = new CoGroupedRDD[K](Seq(self, other1, other2, other3), partitioner)
    -    cg.mapValues { case Seq(vs, w1s, w2s, w3s) =>
    -      (vs.asInstanceOf[Seq[V]],
    -        w1s.asInstanceOf[Seq[W1]],
    -        w2s.asInstanceOf[Seq[W2]],
    -        w3s.asInstanceOf[Seq[W3]])
    -    }
    +    cg.mapValues { seq  => seq.asInstanceOf[(Seq[V], Seq[W1], Seq[W2], Seq[W3])] }
    --- End diff --
    
    extra space here. Does the type casting work?
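
    It does not: asInstanceOf against a tuple type compiles to a real checkcast, so casting a Seq to a Tuple4 fails at runtime, as Matei notes elsewhere in the thread. A minimal illustration with hypothetical values:

    ```scala
    val groups: Seq[Seq[Any]] = Seq(Seq(1, 2), Seq("a", "b"))
    // A Seq is not a Tuple2, so this checkcast throws ClassCastException.
    val pair = groups.asInstanceOf[(Seq[Int], Seq[String])]
    ```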



[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/1447#issuecomment-49407083
  
    Upmerged



[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1447#issuecomment-49267874
  
    QA tests have started for PR 1447. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16771/consoleFull



[GitHub] spark pull request: SPARK-2519 part 2. Remove pattern matching on ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1447#issuecomment-49277859
  
    QA results for PR 1447:
    - This patch PASSES unit tests.
    - This patch merges cleanly
    - This patch adds no public classes

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16771/consoleFull

