You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by ash211 <gi...@git.apache.org> on 2014/04/09 16:40:06 UTC

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

GitHub user ash211 opened a pull request:

    https://github.com/apache/spark/pull/369

    SPARK-1063 Add .sortBy(f) method on RDD

    This never got merged from the apache/incubator-spark repo (which is now deleted) but there had been several rounds of code review on this PR there.
    
    I think this is ready for merging.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ash211/spark sortby

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/369.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #369
    
----
commit ca4490da536578ef4650039b099db8dafc9d6b66
Author: Andrew Ash <an...@andrewash.com>
Date:   2014-01-24T08:26:55Z

    Add .sortBy(f) method on RDD

commit 0f685fd17584061d4b18419234bedb79843a0813
Author: Andrew Ash <an...@andrewash.com>
Date:   2014-02-14T06:11:27Z

    Merge remote-tracking branch 'origin/master' into sortby
    
    Conflicts:
    	core/src/main/scala/org/apache/spark/rdd/RDD.scala
    	core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala

commit 7db3e849c5a9e4a3189ea594e349835cef6d307e
Author: Andrew Ash <an...@andrewash.com>
Date:   2014-02-14T06:27:06Z

    Support ascending and numPartitions params in sortBy()

commit 381eef23f59a44b0555de9bb63fc8e598595ef32
Author: Andrew Ash <an...@andrewash.com>
Date:   2014-02-14T06:32:19Z

    Correct silly typo

commit 8c53298cfeebcba7e08ef8c586816e7513daf11b
Author: Andrew Ash <an...@andrewash.com>
Date:   2014-02-25T00:19:02Z

    Actually use ascending and numPartitions parameters

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by ash211 <gi...@git.apache.org>.

Github user ash211 commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-40031981
  
    That gets me closer, but it looks like you then have to pass all the implicit parameters or none of them.  I tried passing in the ord parameter by name as well, but that still didn't work.
    
    Any other ideas?
    
    Personally I would have encoded my Ordering requirement as the output of my sortBy function rather than passing in a key function as well as an ordering on that key.  I can see that some people might prefer to have both if they're re-using an Ordering from elsewhere.
    
    ```
    [error] /Users/aash/git/spark/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala:556: not enough arguments for method sortBy: (implicit ord: Ordering[Person], implicit ctag: scala.reflect.ClassTag[Person])org.apache.spark.rdd.RDD[String].
    [error] Unspecified value parameter ctag.
    [error]     assert(data.sortBy(parse, false, 2)(NameOrdering) === nameOrdered)
    [error]
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-43569435
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/369#discussion_r12621340
  
    --- Diff: core/src/main/scala/org/apache/spark/api/java/JavaRDD.scala ---
    @@ -150,6 +153,18 @@ class JavaRDD[T](val rdd: RDD[T])(implicit val classTag: ClassTag[T])
         rdd.setName(name)
         this
       }
    +
    +  /**
    +   * Return this RDD sorted by the given key function.
    +   */
    +  def sortBy[S](f: JFunction[T, S], ascending: Boolean, numPartitions: Int): JavaRDD[T] = {
    +    import scala.collection.JavaConverters._
    +    def fn = (x: T) => f.call(x)
    +    implicit val ordering = com.google.common.collect.Ordering.natural().asInstanceOf[Ordering[S]]
    --- End diff --
    
    I think the problem here is that Ordering in this scope is just scala's Ordering


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-39971959
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-39976772
  
    Merged build finished. All automated tests passed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-46348842
  
    There was a conflict that I had to merge manually. Take a look at master to make sure everything is ok. I did compile and ran a couple things.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-43572005
  
    Merged build finished. All automated tests passed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-42904711
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14915/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by markhamstra <gi...@git.apache.org>.

Github user markhamstra commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-42905001
  
    I'll say it again: When you are working on preparing a PR, you're better off rebasing than merging.
    
    within a clone of your github repo:
    
    git pull --rebase git@github.com:apache/spark.git master
    git push +YourBranchName



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-42906068
  
    Merged build finished. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-40267040
  
    Merged build finished. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by ash211 <gi...@git.apache.org>.

Github user ash211 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/369#discussion_r12621569
  
    --- Diff: core/src/main/scala/org/apache/spark/api/java/JavaRDD.scala ---
    @@ -150,6 +153,18 @@ class JavaRDD[T](val rdd: RDD[T])(implicit val classTag: ClassTag[T])
         rdd.setName(name)
         this
       }
    +
    +  /**
    +   * Return this RDD sorted by the given key function.
    +   */
    +  def sortBy[S](f: JFunction[T, S], ascending: Boolean, numPartitions: Int): JavaRDD[T] = {
    +    import scala.collection.JavaConverters._
    +    def fn = (x: T) => f.call(x)
    +    implicit val ordering = com.google.common.collect.Ordering.natural().asInstanceOf[Ordering[S]]
    --- End diff --
    
    Thank you!  I'm not incredibly familiar with Java/Scala interactions so
    would appreciate the assistance.  Also, didn't know Scala had an Ordering.
    
    Using Guava for the default comparator was based on the
    `sortByKey(Boolean)` method in `JavaPairRDD.scala`
    
    
    On Tue, May 13, 2014 at 10:35 PM, Reynold Xin <no...@github.com>wrote:
    
    > In core/src/main/scala/org/apache/spark/api/java/JavaRDD.scala:
    >
    > > @@ -150,6 +153,18 @@ class JavaRDD[T](val rdd: RDD[T])(implicit val classTag: ClassTag[T])
    > >      rdd.setName(name)
    > >      this
    > >    }
    > > +
    > > +  /**
    > > +   * Return this RDD sorted by the given key function.
    > > +   */
    > > +  def sortBy[S](f: JFunction[T, S], ascending: Boolean, numPartitions: Int): JavaRDD[T] = {
    > > +    import scala.collection.JavaConverters._
    > > +    def fn = (x: T) => f.call(x)
    > > +    implicit val ordering = com.google.common.collect.Ordering.natural().asInstanceOf[Ordering[S]]
    >
    > I will submit a pr to fix these.
    >
    > —
    > Reply to this email directly or view it on GitHub<https://github.com/apache/spark/pull/369/files#r12621500>
    > .
    >


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-42391094
  
    Merged build finished. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-43052151
  
    Merged build finished. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-40036252
  
    Merged build finished. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/369#discussion_r13376621
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -423,6 +423,18 @@ abstract class RDD[T: ClassTag](
       def ++(other: RDD[T]): RDD[T] = this.union(other)
     
       /**
    +   * Return this RDD sorted by the given key function.
    +   */
    +  def sortBy[K](
    +      f: (T) ⇒ K,
    +      ascending: Boolean = true,
    +      numPartitions: Int = this.partitions.size)
    +      (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] =
    --- End diff --
    
    why not using context bound?  like `def sortBy[K : Ordering : ClassTag]`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by markhamstra <gi...@git.apache.org>.

Github user markhamstra commented on a diff in the pull request:

    https://github.com/apache/spark/pull/369#discussion_r13419732
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -423,6 +423,18 @@ abstract class RDD[T: ClassTag](
       def ++(other: RDD[T]): RDD[T] = this.union(other)
     
       /**
    +   * Return this RDD sorted by the given key function.
    +   */
    +  def sortBy[K](
    +      f: (T) ⇒ K,
    +      ascending: Boolean = true,
    +      numPartitions: Int = this.partitions.size)
    +      (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] =
    --- End diff --
    
    If I understand what you are claiming, then I think you are mistaken.  There is only one implicit parameter list, though, so I don't think you can supply an argument for only one of the implicits:
    ```
    scala> def foo[A: ClassTag](implicit ord: Ordering[A]) = (implicitly[ClassTag[A]], ord)
    foo: [A](implicit evidence$1: scala.reflect.ClassTag[A], implicit ord: scala.math.Ordering[A])(scala.reflect.ClassTag[A], scala.math.Ordering[A])
    
    scala> foo[Int]
    res1: (scala.reflect.ClassTag[Int], scala.math.Ordering[Int]) = (Int,scala.math.Ordering$Int$@5e10a811)
    
    scala> foo[Int](scala.reflect.classTag[Int], Ordering[Int])
    res2: (scala.reflect.ClassTag[Int], scala.math.Ordering[Int]) = (Int,scala.math.Ordering$Int$@5e10a811)
    
    scala> foo[Int](scala.reflect.classTag[Int], Ordering[Int].reverse)
    res3: (scala.reflect.ClassTag[Int], scala.math.Ordering[Int]) = (Int,scala.math.Ordering$$anon$4@7293296)
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-42906072
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by ash211 <gi...@git.apache.org>.

Github user ash211 commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-46396473
  
    Took a look through and didn't see any issues with the merge.  Thanks to
    everyone who helped me get this in!
    
    
    On Tue, Jun 17, 2014 at 2:48 PM, Reynold Xin <no...@github.com>
    wrote:
    
    > There was a conflict that I had to merge manually. Take a look at master
    > to make sure everything is ok. I did compile and ran a couple things.
    >
    > —
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/369#issuecomment-46348842>.
    >


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by ash211 <gi...@git.apache.org>.

Github user ash211 commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-42391069
  
    Would appreciate getting this into 1.0 but it can wait until later as well.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-42904709
  
    Merged build finished. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by ash211 <gi...@git.apache.org>.

Github user ash211 commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-43574435
  
    Success!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-42391059
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by ash211 <gi...@git.apache.org>.

Github user ash211 commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-42905647
  
    Fixed up with an unfortunate force push.  I typically use merges for these sorts of things but forgot how bad they appear in GitHub.  Will remember to use rebase for GitHub projects in the future.
    
    Any comments on the Java/Python bridges?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by ash211 <gi...@git.apache.org>.

Github user ash211 commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-46255197
  
    Ping I think this is ready for merging.  There's been some discussion about how to pass the implicit parameters to the methods (in two phases of discussion!) but I think we're in a good place with these right now and can expect to keep this API going forward.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-42891085
  
    This looks good. This would be a useful API to add to Spark to facilitate sorting. Do you mind looking into Java and Python API for this as well? 



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-40031354
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-40031345
  
     Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-39971934
  
     Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/369#discussion_r12554718
  
    --- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
    @@ -541,6 +543,46 @@ class RDDSuite extends FunSuite with SharedSparkContext {
         }
       }
     
    +  test("sortByKey") {
    +    val data = sc.parallelize(Seq("5|50|A","4|60|C", "6|40|B"))
    +
    +    val col1 = Array("4|60|C", "5|50|A", "6|40|B")
    +    val col2 = Array("6|40|B", "5|50|A", "4|60|C")
    +    val col3 = Array("5|50|A", "6|40|B", "4|60|C")
    +
    +    assert(data.sortBy(_.split("\\|")(0)).collect === col1)
    +    assert(data.sortBy(_.split("\\|")(1)).collect === col2)
    +    assert(data.sortBy(_.split("\\|")(2)).collect === col3)
    +  }
    +
    +  test("sortByKey ascending parameter") {
    +    val data = sc.parallelize(Seq("5|50|A","4|60|C", "6|40|B"))
    +
    +    val asc = Array("4|60|C", "5|50|A", "6|40|B")
    +    val desc = Array("6|40|B", "5|50|A", "4|60|C")
    +
    +    assert(data.sortBy(_.split("\\|")(0), true).collect === asc)
    +    assert(data.sortBy(_.split("\\|")(0), false).collect === desc)
    +  }
    +
    +  // issues with serialization of Ordering in the test
    +  ignore("sortByKey with explicit ordering") {
    +    val data = sc.parallelize(Seq("Bob|Smith|50", "Jane|Smith|40", "Thomas|Williams|30", "Karen|Williams|60"))
    --- End diff --
    
    a few of the lines in this function exceed 100 chars


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-40010754
  
    Through this pull request, I just realized how inefficient the sorting code
    was (one that takes Ordered). I'm glad you are using Ordering instead. We
    should change the sortByKey implementation ...
    
    For the test code, do you mind removing the extra spaces you used to align
    =s?
    
    And can we take an implicit Ordering like the way Scala lib does? This way,
    the user can also change the ordering.
    
    
    On Wed, Apr 9, 2014 at 7:40 AM, Andrew Ash <no...@github.com> wrote:
    
    > This never got merged from the apache/incubator-spark repo (which is now
    > deleted) but there had been several rounds of code review on this PR there.
    >
    > I think this is ready for merging.
    > ------------------------------
    > You can merge this Pull Request by running
    >
    >   git pull https://github.com/ash211/spark sortby
    >
    > Or view, comment on, or merge it at:
    >
    >   https://github.com/apache/spark/pull/369
    > Commit Summary
    >
    >    - Add .sortBy(f) method on RDD
    >    - Merge remote-tracking branch 'origin/master' into sortby
    >    - Support ascending and numPartitions params in sortBy()
    >    - Correct silly typo
    >    - Actually use ascending and numPartitions parameters
    >
    > File Changes
    >
    >    - *M* core/src/main/scala/org/apache/spark/rdd/RDD.scala<https://github.com/apache/spark/pull/369/files#diff-0>(11)
    >    - *M* core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala<https://github.com/apache/spark/pull/369/files#diff-1>(12)
    >
    > Patch Links:
    >
    >    - https://github.com/apache/spark/pull/369.patch
    >    - https://github.com/apache/spark/pull/369.diff
    >
    > --
    > Reply to this email directly or view it on GitHub<https://github.com/apache/spark/pull/369>
    > .
    >


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-43572007
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15087/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by ash211 <gi...@git.apache.org>.

Github user ash211 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/369#discussion_r12621279
  
    --- Diff: core/src/main/scala/org/apache/spark/api/java/JavaRDD.scala ---
    @@ -150,6 +153,18 @@ class JavaRDD[T](val rdd: RDD[T])(implicit val classTag: ClassTag[T])
         rdd.setName(name)
         this
       }
    +
    +  /**
    +   * Return this RDD sorted by the given key function.
    +   */
    +  def sortBy[S](f: JFunction[T, S], ascending: Boolean, numPartitions: Int): JavaRDD[T] = {
    +    import scala.collection.JavaConverters._
    +    def fn = (x: T) => f.call(x)
    +    implicit val ordering = com.google.common.collect.Ordering.natural().asInstanceOf[Ordering[S]]
    --- End diff --
    
    Ordering.natural() returns an Ordering object, which implements the
    Comparator interface.
    
    http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/collect/Ordering.html#natural()
    
    
    On Tue, May 13, 2014 at 10:16 PM, Reynold Xin <no...@github.com>wrote:
    
    > In core/src/main/scala/org/apache/spark/api/java/JavaRDD.scala:
    >
    > > @@ -150,6 +153,18 @@ class JavaRDD[T](val rdd: RDD[T])(implicit val classTag: ClassTag[T])
    > >      rdd.setName(name)
    > >      this
    > >    }
    > > +
    > > +  /**
    > > +   * Return this RDD sorted by the given key function.
    > > +   */
    > > +  def sortBy[S](f: JFunction[T, S], ascending: Boolean, numPartitions: Int): JavaRDD[T] = {
    > > +    import scala.collection.JavaConverters._
    > > +    def fn = (x: T) => f.call(x)
    > > +    implicit val ordering = com.google.common.collect.Ordering.natural().asInstanceOf[Ordering[S]]
    >
    > How does this actually work? Shouldn't natural() return a Comparator?
    >
    > —
    > Reply to this email directly or view it on GitHub<https://github.com/apache/spark/pull/369/files#r12621237>
    > .
    >


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-42905778
  
     Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by aarondav <gi...@git.apache.org>.

Github user aarondav commented on a diff in the pull request:

    https://github.com/apache/spark/pull/369#discussion_r13418102
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -423,6 +423,18 @@ abstract class RDD[T: ClassTag](
       def ++(other: RDD[T]): RDD[T] = this.union(other)
     
       /**
    +   * Return this RDD sorted by the given key function.
    +   */
    +  def sortBy[K](
    +      f: (T) ⇒ K,
    +      ascending: Boolean = true,
    +      numPartitions: Int = this.partitions.size)
    +      (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] =
    --- End diff --
    
    Look at the discussion at the beginning of this PR, but the reason for this is related to the fact that the user may want to actually specify the Ordering explicitly, so we didn't want the sugar. Also you can't include both a context bound for ClassTag and an implicit argument list, I believe.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-40034670
  
     Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by aarondav <gi...@git.apache.org>.

Github user aarondav commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-40031655
  
    It's likely you'll have to specify the classtag as an implicit parameter of its own, by appending the following argument: `(implicit ctag: ClassTag[K])`, and remove the context bound in the type signature.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/369#discussion_r13422474
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -423,6 +423,18 @@ abstract class RDD[T: ClassTag](
       def ++(other: RDD[T]): RDD[T] = this.union(other)
     
       /**
    +   * Return this RDD sorted by the given key function.
    +   */
    +  def sortBy[K](
    +      f: (T) ⇒ K,
    +      ascending: Boolean = true,
    +      numPartitions: Int = this.partitions.size)
    +      (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] =
    --- End diff --
    
    @markhamstra I checked RDD.scala and found the style you talked about. But one thing I don't understand is:`def countByValue()(implicit ord: Ordering[T] = null)`. I can't find the use of implicit `Ordering[T]` anywhere inside the method, and I compiled spark successfully with this implicit Ordering[T] removed. Did I miss something here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/369#discussion_r12621764
  
    --- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
    @@ -541,14 +543,64 @@ class RDDSuite extends FunSuite with SharedSparkContext {
         }
       }
     
    +  test("sortByKey") {
    +    val data = sc.parallelize(Seq("5|50|A","4|60|C", "6|40|B"))
    +
    +    val col1 = Array("4|60|C", "5|50|A", "6|40|B")
    +    val col2 = Array("6|40|B", "5|50|A", "4|60|C")
    +    val col3 = Array("5|50|A", "6|40|B", "4|60|C")
    +
    +    assert(data.sortBy(_.split("\\|")(0)).collect() === col1)
    +    assert(data.sortBy(_.split("\\|")(1)).collect() === col2)
    +    assert(data.sortBy(_.split("\\|")(2)).collect() === col3)
    +  }
    +
    +  test("sortByKey ascending parameter") {
    +    val data = sc.parallelize(Seq("5|50|A","4|60|C", "6|40|B"))
    +
    +    val asc = Array("4|60|C", "5|50|A", "6|40|B")
    +    val desc = Array("6|40|B", "5|50|A", "4|60|C")
    +
    +    assert(data.sortBy(_.split("\\|")(0), true).collect() === asc)
    +    assert(data.sortBy(_.split("\\|")(0), false).collect() === desc)
    +  }
    +
    +  // issues with serialization of Ordering in the test
    +  ignore("sortByKey with explicit ordering") {
    +    val data = sc.parallelize(Seq("Bob|Smith|50",
    +                                  "Jane|Smith|40",
    +                                  "Thomas|Williams|30",
    +                                  "Karen|Williams|60"))
    +
    +    val ageOrdered = Array("Thomas|Williams|30",
    +                           "Jane|Smith|40",
    +                           "Bob|Smith|50",
    +                           "Karen|Williams|60")
    +
    +    // last name, then first name
    +    val nameOrdered = Array("Bob|Smith|50",
    +                            "Jane|Smith|40",
    +                            "Karen|Williams|60",
    +                            "Thomas|Williams|30")
    +
    +    def parse(s: String): Person = {
    +      val split = s.split("\\|")
    +      Person(split(0), split(1), split(2).toInt)
    +    }
    +
    +    import scala.reflect.classTag
    +    assert(data.sortBy(parse, false, 2)(AgeOrdering, classTag[Person]) === ageOrdered)
    --- End diff --
    
    ascending should be true, not false


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by ash211 <gi...@git.apache.org>.

Github user ash211 commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-40266965
  
    Moved the Ordering outside of the test class, but it still brings in a SparkConf from somewhere that fails to serialize.  Sorry, I'm not much of a Scala expert so will need some handholding here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-42904649
  
     Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by ash211 <gi...@git.apache.org>.

Github user ash211 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/369#discussion_r12621830
  
    --- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
    @@ -541,14 +543,64 @@ class RDDSuite extends FunSuite with SharedSparkContext {
         }
       }
     
    +  test("sortByKey") {
    +    val data = sc.parallelize(Seq("5|50|A","4|60|C", "6|40|B"))
    +
    +    val col1 = Array("4|60|C", "5|50|A", "6|40|B")
    +    val col2 = Array("6|40|B", "5|50|A", "4|60|C")
    +    val col3 = Array("5|50|A", "6|40|B", "4|60|C")
    +
    +    assert(data.sortBy(_.split("\\|")(0)).collect() === col1)
    +    assert(data.sortBy(_.split("\\|")(1)).collect() === col2)
    +    assert(data.sortBy(_.split("\\|")(2)).collect() === col3)
    +  }
    +
    +  test("sortByKey ascending parameter") {
    +    val data = sc.parallelize(Seq("5|50|A","4|60|C", "6|40|B"))
    +
    +    val asc = Array("4|60|C", "5|50|A", "6|40|B")
    +    val desc = Array("6|40|B", "5|50|A", "4|60|C")
    +
    +    assert(data.sortBy(_.split("\\|")(0), true).collect() === asc)
    +    assert(data.sortBy(_.split("\\|")(0), false).collect() === desc)
    +  }
    +
    +  // issues with serialization of Ordering in the test
    +  ignore("sortByKey with explicit ordering") {
    +    val data = sc.parallelize(Seq("Bob|Smith|50",
    +                                  "Jane|Smith|40",
    +                                  "Thomas|Williams|30",
    +                                  "Karen|Williams|60"))
    +
    +    val ageOrdered = Array("Thomas|Williams|30",
    +                           "Jane|Smith|40",
    +                           "Bob|Smith|50",
    +                           "Karen|Williams|60")
    +
    +    // last name, then first name
    +    val nameOrdered = Array("Bob|Smith|50",
    +                            "Jane|Smith|40",
    +                            "Karen|Williams|60",
    +                            "Thomas|Williams|30")
    +
    +    def parse(s: String): Person = {
    --- End diff --
    
    I'll take a pass through and see what I come up with.  Thanks!
    
    
    On Tue, May 13, 2014 at 10:52 PM, Reynold Xin <no...@github.com>wrote:
    
    > In core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala:
    >
    > > +                                  "Jane|Smith|40",
    > > +                                  "Thomas|Williams|30",
    > > +                                  "Karen|Williams|60"))
    > > +
    > > +    val ageOrdered = Array("Thomas|Williams|30",
    > > +                           "Jane|Smith|40",
    > > +                           "Bob|Smith|50",
    > > +                           "Karen|Williams|60")
    > > +
    > > +    // last name, then first name
    > > +    val nameOrdered = Array("Bob|Smith|50",
    > > +                            "Jane|Smith|40",
    > > +                            "Karen|Williams|60",
    > > +                            "Thomas|Williams|30")
    > > +
    > > +    def parse(s: String): Person = {
    >
    > if you change this to a closure, then it will pass without the
    > serialization problem, e.g.
    >
    > val parse = (s: String) => {
    >
    >   val split = s.split("\\|")
    >
    >   Person(split(0), split(1), split(2).toInt)}
    >
    > —
    > Reply to this email directly or view it on GitHub<https://github.com/apache/spark/pull/369/files#r12621775>
    > .
    >


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by markhamstra <gi...@git.apache.org>.

Github user markhamstra commented on a diff in the pull request:

    https://github.com/apache/spark/pull/369#discussion_r13387198
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -423,6 +423,18 @@ abstract class RDD[T: ClassTag](
       def ++(other: RDD[T]): RDD[T] = this.union(other)
     
       /**
    +   * Return this RDD sorted by the given key function.
    +   */
    +  def sortBy[K](
    +      f: (T) ⇒ K,
    +      ascending: Boolean = true,
    +      numPartitions: Int = this.partitions.size)
    +      (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] =
    --- End diff --
    
    That substitution certainly could be made since it is functionally equivalent.  Whether it should be made is mostly a question of style.  Since `ord` and `ctag` are not used explicitly here, there is no real use to enumerating them as implicit parameters; but the rest of RDD.scala doesn't use the context bound sugar, so there is some value in consistently maintaining that style. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by pwendell <gi...@git.apache.org>.

Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-45053945
  
    @rxin want to give this a final sign off?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by ash211 <gi...@git.apache.org>.

Github user ash211 commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-44148572
  
    Any objections to merging this `.sortBy()` method?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-43049575
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-40036254
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13973/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-40267016
  
     Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-40267020
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-43052152
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14969/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-42906147
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14920/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-39976773
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13950/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-46343296
  
    I will test this today.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-42906146
  
    Merged build finished. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-42906069
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14919/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/369#discussion_r12621500
  
    --- Diff: core/src/main/scala/org/apache/spark/api/java/JavaRDD.scala ---
    @@ -150,6 +153,18 @@ class JavaRDD[T](val rdd: RDD[T])(implicit val classTag: ClassTag[T])
         rdd.setName(name)
         this
       }
    +
    +  /**
    +   * Return this RDD sorted by the given key function.
    +   */
    +  def sortBy[S](f: JFunction[T, S], ascending: Boolean, numPartitions: Int): JavaRDD[T] = {
    +    import scala.collection.JavaConverters._
    +    def fn = (x: T) => f.call(x)
    +    implicit val ordering = com.google.common.collect.Ordering.natural().asInstanceOf[Ordering[S]]
    --- End diff --
    
    I will submit a pr to fix these.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-42391055
  
     Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-42391095
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14758/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-46345169
  
    This looks good to me. I will merge it.
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-43049565
  
     Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-40267041
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14067/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by aarondav <gi...@git.apache.org>.

Github user aarondav commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-40034991
  
    You'll have to define your Ordering outside of the test class, otherwise it will try to bring in the test class's implicit "RDDSuite.this" into the closure. You can define it inside an RDDSuite companion object, that probably preserves the desired visibility semantics.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by markhamstra <gi...@git.apache.org>.

Github user markhamstra commented on a diff in the pull request:

    https://github.com/apache/spark/pull/369#discussion_r13423508
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -423,6 +423,18 @@ abstract class RDD[T: ClassTag](
       def ++(other: RDD[T]): RDD[T] = this.union(other)
     
       /**
    +   * Return this RDD sorted by the given key function.
    +   */
    +  def sortBy[K](
    +      f: (T) ⇒ K,
    +      ascending: Boolean = true,
    +      numPartitions: Int = this.partitions.size)
    +      (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] =
    --- End diff --
    
    https://issues.apache.org/jira/browse/SPARK-1540
    https://github.com/apache/spark/commit/640f9a0efefd42cff86aecd4878a3a57f5ae85fa
    
    
    On Wed, Jun 4, 2014 at 8:14 PM, Wenchen Fan <no...@github.com>
    wrote:
    
    > In core/src/main/scala/org/apache/spark/rdd/RDD.scala:
    >
    > > @@ -423,6 +423,18 @@ abstract class RDD[T: ClassTag](
    > >    def ++(other: RDD[T]): RDD[T] = this.union(other)
    > >
    > >    /**
    > > +   * Return this RDD sorted by the given key function.
    > > +   */
    > > +  def sortBy[K](
    > > +      f: (T) ⇒ K,
    > > +      ascending: Boolean = true,
    > > +      numPartitions: Int = this.partitions.size)
    > > +      (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] =
    >
    > @markhamstra <https://github.com/markhamstra> I checked RDD.scala and
    > found the style you talked about. But one thing I don't understand is:def
    > countByValue()(implicit ord: Ordering[T] = null). I can't find the use of
    > implicit Ordering[T] anywhere inside the method, and I compiled spark
    > successfully with this implicit Ordering[T] removed. Did I miss something
    > here?
    >
    > —
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/369/files#r13422474>.
    >


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by ash211 <gi...@git.apache.org>.

Github user ash211 commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-42873882
  
    Ping


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-42905496
  
     Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/369#discussion_r12621237
  
    --- Diff: core/src/main/scala/org/apache/spark/api/java/JavaRDD.scala ---
    @@ -150,6 +153,18 @@ class JavaRDD[T](val rdd: RDD[T])(implicit val classTag: ClassTag[T])
         rdd.setName(name)
         this
       }
    +
    +  /**
    +   * Return this RDD sorted by the given key function.
    +   */
    +  def sortBy[S](f: JFunction[T, S], ascending: Boolean, numPartitions: Int): JavaRDD[T] = {
    +    import scala.collection.JavaConverters._
    +    def fn = (x: T) => f.call(x)
    +    implicit val ordering = com.google.common.collect.Ordering.natural().asInstanceOf[Ordering[S]]
    --- End diff --
    
    How does this actually work? Shouldn't natural() return a Comparator?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-42905502
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by aarondav <gi...@git.apache.org>.

Github user aarondav commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-40033173
  
    Ah, alas, looks like the context bound just desugars to the exact same thing here, except we've just reordered the implicit arguments. We might be able to get around this by clever usage of implicit conversions, but it almost certainly wouldn't be worth it. At least having the whole parameter list explicitly listed out makes it more obvious that you can specify the Ordering, even if we can't remove the ugliness of also specifying a ClassTag too.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-40034678
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/369#discussion_r12554704
  
    --- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
    @@ -541,6 +543,46 @@ class RDDSuite extends FunSuite with SharedSparkContext {
         }
       }
     
    +  test("sortByKey") {
    +    val data = sc.parallelize(Seq("5|50|A","4|60|C", "6|40|B"))
    +
    +    val col1 = Array("4|60|C", "5|50|A", "6|40|B")
    +    val col2 = Array("6|40|B", "5|50|A", "4|60|C")
    +    val col3 = Array("5|50|A", "6|40|B", "4|60|C")
    +
    +    assert(data.sortBy(_.split("\\|")(0)).collect === col1)
    --- End diff --
    
    mind adding parenthesis to all the collect's? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/369#discussion_r12621775
  
    --- Diff: core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala ---
    @@ -541,14 +543,64 @@ class RDDSuite extends FunSuite with SharedSparkContext {
         }
       }
     
    +  test("sortByKey") {
    +    val data = sc.parallelize(Seq("5|50|A","4|60|C", "6|40|B"))
    +
    +    val col1 = Array("4|60|C", "5|50|A", "6|40|B")
    +    val col2 = Array("6|40|B", "5|50|A", "4|60|C")
    +    val col3 = Array("5|50|A", "6|40|B", "4|60|C")
    +
    +    assert(data.sortBy(_.split("\\|")(0)).collect() === col1)
    +    assert(data.sortBy(_.split("\\|")(1)).collect() === col2)
    +    assert(data.sortBy(_.split("\\|")(2)).collect() === col3)
    +  }
    +
    +  test("sortByKey ascending parameter") {
    +    val data = sc.parallelize(Seq("5|50|A","4|60|C", "6|40|B"))
    +
    +    val asc = Array("4|60|C", "5|50|A", "6|40|B")
    +    val desc = Array("6|40|B", "5|50|A", "4|60|C")
    +
    +    assert(data.sortBy(_.split("\\|")(0), true).collect() === asc)
    +    assert(data.sortBy(_.split("\\|")(0), false).collect() === desc)
    +  }
    +
    +  // issues with serialization of Ordering in the test
    +  ignore("sortByKey with explicit ordering") {
    +    val data = sc.parallelize(Seq("Bob|Smith|50",
    +                                  "Jane|Smith|40",
    +                                  "Thomas|Williams|30",
    +                                  "Karen|Williams|60"))
    +
    +    val ageOrdered = Array("Thomas|Williams|30",
    +                           "Jane|Smith|40",
    +                           "Bob|Smith|50",
    +                           "Karen|Williams|60")
    +
    +    // last name, then first name
    +    val nameOrdered = Array("Bob|Smith|50",
    +                            "Jane|Smith|40",
    +                            "Karen|Williams|60",
    +                            "Thomas|Williams|30")
    +
    +    def parse(s: String): Person = {
    --- End diff --
    
    if you change this to a closure, then it will pass without the serialization problem, e.g.
    ```scala
    val parse = (s: String) => {
      val split = s.split("\\|")
      Person(split(0), split(1), split(2).toInt)
    }
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/369


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/369#discussion_r12621760
  
    --- Diff: core/src/main/scala/org/apache/spark/api/java/JavaRDD.scala ---
    @@ -150,6 +153,18 @@ class JavaRDD[T](val rdd: RDD[T])(implicit val classTag: ClassTag[T])
         rdd.setName(name)
         this
       }
    +
    +  /**
    +   * Return this RDD sorted by the given key function.
    +   */
    +  def sortBy[S](f: JFunction[T, S], ascending: Boolean, numPartitions: Int): JavaRDD[T] = {
    +    import scala.collection.JavaConverters._
    +    def fn = (x: T) => f.call(x)
    +    implicit val ordering = com.google.common.collect.Ordering.natural().asInstanceOf[Ordering[S]]
    --- End diff --
    
    Actually I gotta focus on something else and considering this is might be too last minute for 1.0 given it introduces new APIs, we have more time to push it into 1.1.  Do you mind fixing this yourself? The problem is you should've cast the type to Comparator[S] instead of Ordering[S] (which is Scala's ordering in this scope).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-40033139
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13970/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-40033138
  
    Merged build finished. All automated tests passed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-42904665
  
    Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by ash211 <gi...@git.apache.org>.

Github user ash211 commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-40031343
  
    How would you write the signature with the implicit Ordering parameter as well?
    
    I tried with the below definition and below test
    
    ```
      def sortBy[K: ClassTag](
          f: (T) ⇒ K,
          ascending: Boolean = true,
          numPartitions: Int = this.partitions.size)
          (implicit ord: Ordering[K]): RDD[T] =
        this.keyBy[K](f)
            .sortByKey(ascending, numPartitions)
            .values
    ```
    
    ```
      test("sortByKey with explicit ordering") {
        val data = sc.parallelize(Seq("Bob|Smith|50", "Jane|Smith|40", "Thomas|Williams|30", "Karen|Williams|60"))
    
        val ageOrdered = Array("Thomas|Williams|30", "Jane|Smith|40", "Bob|Smith|50", "Karen|Williams|60")
        // last name, then first name
        val nameOrdered = Array("Bob|Smith|50", "Jane|Smith|40", "Karen|Williams|60", "Thomas|Williams|30")
    
        case class Person(first: String, last: String, age: Int)
    
        def parse(s: String): Person = {
          val split = s.split("\\|")
          Person(split(0), split(1), split(2).toInt)
        }
    
        object AgeOrdering extends Ordering[Person] {
          def compare(a:Person, b:Person) = a.age compare b.age
        }
    
        object NameOrdering extends Ordering[Person] {
          def compare(a:Person, b:Person) =
            implicitly[Ordering[Tuple2[String,String]]].compare((a.last, a.first), (b.last, b.first))
        }
    
        assert(data.sortBy(parse, false, 2)(AgeOrdering) === ageOrdered)
        assert(data.sortBy(parse, false, 2)(NameOrdering) === nameOrdered)
      }
    ```
    
    But got an error that I'd have to pass in the ClassTag implicitly as well:
    
    ```
    [error] /Users/aash/git/spark/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala:555: not enough arguments for method sortBy: (implicit evidence$5: scala.reflect.ClassTag[Person], implicit ord: Ordering[Person])org.apache.spark.rdd.RDD[String].
    [error] Unspecified value parameter ord.
    [error]     assert(data.sortBy(parse, false, 2)(AgeOrdering) === ageOrdered)
    [error]                                        ^
    [error] /Users/aash/git/spark/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala:556: not enough arguments for method sortBy: (implicit evidence$5: scala.reflect.ClassTag[Person], implicit ord: Ordering[Person])org.apache.spark.rdd.RDD[String].
    [error] Unspecified value parameter ord.
    [error]     assert(data.sortBy(parse, false, 2)(NameOrdering) === nameOrdered)
    [error]                                        ^
    [error] two errors found
    [error] (core/test:compile) Compilation failed
    [error] Total time: 9 s, completed Apr 10, 2014 1:13:27 AM
    aash@aash-mbp ~/git/spark$
    ```
    
    Is there a way to set the ordering of the implicit parameters so that the Ordering goes before the ClassTag ?
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by ash211 <gi...@git.apache.org>.

Github user ash211 commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-40034695
  
    I tried writing a test for the Ordering parameter but am getting issues with SparkConf not being serializable.  I suspect that the surrounding context of the Ordering parameter is causing issues.  When I comment out the last two lines that use the Ordering, the class compiles and works (though obviously doesn't actually test anything).
    
    ```
    [info] - sortByKey with explicit ordering *** FAILED *** (11 milliseconds)
    [info]   org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException: org.apache.spark.SparkConf
    [info]   at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1017)
    [info]   at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1015)
    [info]   at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    [info]   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    [info]   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1015)
    [info]   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:778)
    [info]   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:721)
    [info]   at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:551)
    [info]   at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)
    [info]   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
    [info]   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
    [info]   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
    [info]   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
    [info]   at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    [info]   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    [info]   at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    [info]   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    [info]   at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
    ```
    
    Any ideas on how to properly call it with this extra implicit parameter?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by ash211 <gi...@git.apache.org>.

Github user ash211 commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-42904551
  
    That merge makes the GitHub diffs really unclean, one sec and I'll make it a nicer clean merge


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/369#issuecomment-43569426
  
     Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD

Posted by ash211 <gi...@git.apache.org>.

Github user ash211 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/369#discussion_r13530130
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -423,6 +423,18 @@ abstract class RDD[T: ClassTag](
       def ++(other: RDD[T]): RDD[T] = this.union(other)
     
       /**
    +   * Return this RDD sorted by the given key function.
    +   */
    +  def sortBy[K](
    +      f: (T) ⇒ K,
    +      ascending: Boolean = true,
    +      numPartitions: Int = this.partitions.size)
    +      (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] =
    --- End diff --
    
    Do we need to further discuss the way the parameter is passed or is the current method sufficient?  I'm not sure the implications for binary compatibility in the future if we switch between the two (context bound vs implicits).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---