Posted to user@spark.apache.org by Jay Hutfles <ja...@gmail.com> on 2014/12/19 19:55:50 UTC

spark-shell bug with RDD distinct?

I've found what looks like a problem in the spark-shell, but I can't confirm
whether it's related to any open issue on Spark's JIRA page.  I was wondering
if anyone could help identify whether this is a new issue or something that's
already being addressed.

Test (in spark-shell):
case class Person(name: String, age: Int)
// four people with two duplicates, so three distinct entries
val peopleList = List(Person("Alice", 35), Person("Bob", 47),
                      Person("Alice", 35), Person("Bob", 15))
val peopleRDD = sc.parallelize(peopleList)
// the local distinct gives 3; the RDD's distinct should match, but this assert fails
assert(peopleList.distinct.size == peopleRDD.distinct.count)


At first I thought it was related to SPARK-2620 (
https://issues.apache.org/jira/browse/SPARK-2620), which reports that case
classes can't be used as keys in spark-shell because of how the REPL compiles
them.  It lists .reduceByKey, .groupByKey and .distinct as affected, and since
.distinct de-duplicates by shuffling records on their hashCode/equals, that
would explain the behavior above.  But the associated pull request adding
tests to cover this (https://github.com/apache/spark/pull/1588) was closed.
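
If it is SPARK-2620, the only workaround I can think of is to keep the
REPL-defined case class out of the shuffle entirely, e.g. by doing the
distinct on plain tuples and mapping back afterwards.  A rough sketch,
assuming tuples' equals/hashCode aren't affected by the REPL compilation
issue:

// workaround sketch: shuffle on plain tuples instead of the REPL-compiled
// case class, then rebuild the Person instances afterwards
val distinctPeople = peopleRDD
  .map(p => (p.name, p.age))
  .distinct()
  .map { case (name, age) => Person(name, age) }

assert(peopleList.distinct.size == distinctPeople.count)

That works for this small example, but it's awkward for case classes with
more than a few fields.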

Is this something I just have to live with when using the REPL?  Or is this
covered by something bigger that's being addressed?

Thanks in advance
   -Jay