Posted to user@spark.apache.org by Jay Hutfles <ja...@gmail.com> on 2014/12/19 20:21:13 UTC

spark-shell bug with RDDs and case classes?

I found a problem in spark-shell, but can't confirm whether it's related to
any open issue on Spark's JIRA page.  I was wondering if anyone could help
identify whether this is a new issue or one that's already being addressed.

Test:  (in spark-shell)
case class Person(name: String, age: Int)
val peopleList = List(Person("Alice", 35), Person("Bob", 47),
  Person("Alice", 35), Person("Bob", 15))
val peopleRDD = sc.parallelize(peopleList)
assert(peopleList.distinct.size == peopleRDD.distinct.count)

The assertion fails: the local List deduplicates to 3 people, but the RDD's
distinct does not remove the duplicates.
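For comparison, plain Scala collections (no Spark involved) show the expected structural-equality behavior, along with a tuple-based round trip that avoids keying on the REPL-defined case class itself. This is only a sketch of a possible workaround; whether it actually helps inside spark-shell would need testing there:

```scala
case class Person(name: String, age: Int)

val peopleList = List(Person("Alice", 35), Person("Bob", 47),
  Person("Alice", 35), Person("Bob", 15))

// Case classes get a structural equals/hashCode, so distinct deduplicates:
assert(peopleList.distinct.size == 3)

// Possible workaround sketch: key on tuples (standard-library classes the
// REPL does not recompile) instead of the case class, then rebuild.
val viaTuples = peopleList
  .map(p => (p.name, p.age))
  .distinct
  .map { case (n, a) => Person(n, a) }
assert(viaTuples.size == 3)
```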


At first I thought it was related to issue SPARK-2620
(https://issues.apache.org/jira/browse/SPARK-2620), which says case classes
can't be used as keys in spark-shell due to how case classes are compiled by
the REPL.  It lists .reduceByKey, .groupByKey and .distinct as being
affected.  But the associated pull request for adding tests to cover this
(https://github.com/apache/spark/pull/1588) was closed.  
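As an illustration of the mechanism SPARK-2620 describes (this is my reading of it, not verified against the REPL's actual code generation): each spark-shell line is compiled inside its own wrapper object, so the "same" case class can end up as two different runtime classes, and instances of different classes are never equal, which breaks hashing-based operations like distinct. Plain Scala shows the effect with two hypothetical wrapper objects:

```scala
// WrapperA and WrapperB stand in for the per-line wrapper objects the REPL
// generates; the two Person classes are textually identical but distinct.
object WrapperA { case class Person(name: String, age: Int) }
object WrapperB { case class Person(name: String, age: Int) }

val a = WrapperA.Person("Alice", 35)
val b = WrapperB.Person("Alice", 35)

// Structurally identical, but the generated equals checks the class first:
assert(a != b)
assert(a.getClass != b.getClass)
```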

Is this something I just have to live with when using the REPL?  Or is this
covered by something bigger that's being addressed?

Thanks in advance



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-shell-bug-with-RDDs-and-case-classes-tp20789.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: spark-shell bug with RDDs and case classes?

Posted by Sean Owen <so...@cloudera.com>.
AFAIK it's a known issue in the Scala REPL, which is what the Spark REPL
is built on. The PR that was closed was just adding tests to demonstrate
the bug. I don't know whether there is any workaround at the moment.

On Fri, Dec 19, 2014 at 7:21 PM, Jay Hutfles <ja...@gmail.com> wrote:
> Found a problem in the spark-shell, but can't confirm that it's related to
> open issues on Spark's JIRA page.
> [...]
