Posted to dev@spark.apache.org by Michael Malak <mi...@yahoo.com> on 2014/05/14 00:30:45 UTC

Class-based key in groupByKey?

Is it permissible to use a custom class (as opposed to e.g. the built-in String or Int) for the key in groupByKey? It doesn't seem to be working for me on Spark 0.9.0/Scala 2.10.3:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

class C(val s:String) extends Serializable {
  override def equals(o: Any) = if (o.isInstanceOf[C]) o.asInstanceOf[C].s == s else false
  override def toString = s
}

object SimpleApp {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "Simple App", null, null)
    val r1 = sc.parallelize(Array((new C("a"),11),(new C("a"),12)))
    println("r1=" + r1.groupByKey.collect.mkString(";"))
    val r2 = sc.parallelize(Array(("a",11),("a",12)))
    println("r2=" + r2.groupByKey.collect.mkString(";"))
  }
}


Output
======
r1=(a,ArrayBuffer(11));(a,ArrayBuffer(12))
r2=(a,ArrayBuffer(11, 12))

Re: Class-based key in groupByKey?

Posted by Matei Zaharia <ma...@gmail.com>.
Your key needs to implement hashCode in addition to equals.

Matei
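
A minimal sketch of the fix Matei describes (an illustration, not code from the thread): give C a hashCode that is consistent with equals by delegating to the wrapped String.

class C(val s: String) extends Serializable {
  override def equals(o: Any): Boolean = o match {
    case that: C => that.s == s
    case _       => false
  }
  // Equal objects must produce equal hash codes; without this override,
  // hash-based operations like groupByKey can place two equal C("a")
  // instances in different buckets and treat them as different keys.
  override def hashCode: Int = s.hashCode
  override def toString: String = s
}

With hashCode in place, r1 should group both values under a single key and print r1=(a,ArrayBuffer(11, 12)), matching r2.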

Re: Class-based key in groupByKey?

Posted by Andrew Ash <an...@andrewash.com>.
In Scala, if you override .equals() you also need to override .hashCode(),
just like in Java:

http://www.scala-lang.org/api/2.10.3/index.html#scala.AnyRef

I suspect that if your .hashCode() delegates to just the hashCode of s,
you'd be good.
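
An alternative sketch (not suggested in the thread; it reuses the sc from the original program): declaring C as a case class makes the compiler derive equals and hashCode from the constructor parameter, so no hand-written overrides are needed.

// Case classes automatically get equals, hashCode, and toString derived
// from their constructor parameters, and implement Serializable.
case class C(s: String)

val r1 = sc.parallelize(Array((C("a"), 11), (C("a"), 12)))
// expected: r1=(C(a),ArrayBuffer(11, 12)) -- the derived toString prints C(a), not a
println("r1=" + r1.groupByKey.collect.mkString(";"))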

