Posted to user@spark.apache.org by Rahul Bindlish <ra...@nectechnologies.in> on 2014/12/04 08:54:15 UTC

Serialization issue when there is more than one case class

Hi,

I am a newbie in Spark and performed the following steps during a POC:

1. Map each csv file to an object file (after some transformations), once.
2. Deserialize the object file back into an RDD for operations, as needed.

With 2 csv/object files, the first object file is deserialized into an RDD
successfully, but an error appears while deserializing the second object
file. This error occurs only when spark-shell is restarted between step 1
and step 2.

Please suggest how to deserialize 2 object files.

Also find below the code executed in spark-shell:
*******************************************
//#1// Start spark-shell and create object files from the csv files
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

case class person(id: Int, name: String, fathername: String, officeid: Int)
val baseperson = sc.textFile("person_csv").flatMap(line => line.split("\n")).map(_.split(","))
baseperson.map(p => person(p(0).trim.toInt, p(1), p(2), p(3).trim.toInt)).saveAsObjectFile("person_obj")

case class office(id: Int, name: String, landmark: String, areacode: String)
val baseoffice = sc.textFile("office_csv").flatMap(line => line.split("\n")).map(_.split(","))
baseoffice.map(p => office(p(0).trim.toInt, p(1), p(2), p(3))).saveAsObjectFile("office_obj")

//#2// Stop spark-shell
//#3// Start spark-shell and load the object files
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
case class person(id: Int, name: String, fathername: String, officeid: Int)
case class office(id: Int, name: String, landmark: String, areacode: String)

sc.objectFile[person]("person_obj").count [OK]
sc.objectFile[office]("office_obj").count *[FAILS]*
*******************************************
The stack trace is attached:
stacktrace.txt
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n20334/stacktrace.txt> 
*******************************************

Regards,
Rahul		







--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/serialization-issue-in-case-of-case-class-is-more-than-1-tp20334.html


Re: SPARK LIMITATION - more than one case class is not allowed !!

Posted by Rahul Bindlish <ra...@nectechnologies.in>.
Tobias,

Understood, and thanks for the quick resolution of the problem.

Thanks
~Rahul



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/serialization-issue-in-case-of-case-class-is-more-than-1-tp20334p20446.html


Re: SPARK LIMITATION - more than one case class is not allowed !!

Posted by Tobias Pfeiffer <tg...@preferred.jp>.
Rahul,

On Fri, Dec 5, 2014 at 3:51 PM, Rahul Bindlish <rahul.bindlish@nectechnologies.in> wrote:
>
> 1. Copy the csv files into the current directory.
> 2. Open spark-shell from this directory.
> 3. Run the "one_scala" file, which creates object files from the csv files
> in the current directory.
> 4. Restart spark-shell.
> 5. a. Run the "two_scala" file; it gives an error while loading office_obj.
>     b. If we edit the "two_scala" file to contain the following:
>
> -----------------------------------------------------------------------------------
> case class person(id: Int, name: String, fathername: String, officeid: Int)
> case class office(id: Int, name: String, landmark: String, areacode: String)
> sc.objectFile[office]("office_obj").count
> sc.objectFile[person]("person_obj").count
>
> --------------------------------------------------------------------------------
> it gives an error while loading person_obj.
>

The good news: I can reproduce the error you see.

More good news: I can tell you how to fix it. In your one.scala file,
define all case classes *before* you use saveAsObjectFile() for the first
time. With

  case class person(id: Int, name: String, fathername: String, officeid: Int)
  case class office(id: Int, name: String, landmark: String, areacode: String)
  val baseperson = sc.textFile("person_csv")....saveAsObjectFile("person_obj")
  val baseoffice = sc.textFile("office_csv")....saveAsObjectFile("office_obj")
I can deserialize the obj files (in any order).
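
Concretely, filling in the transformations from your original post, the
complete one.scala would read like this (a minimal sketch, assuming the
same CSV layout as in your first mail):

  // Define ALL case classes before the first saveAsObjectFile() call.
  case class person(id: Int, name: String, fathername: String, officeid: Int)
  case class office(id: Int, name: String, landmark: String, areacode: String)

  // Parse the CSVs and write both object files.
  val baseperson = sc.textFile("person_csv").flatMap(line => line.split("\n")).map(_.split(","))
  baseperson.map(p => person(p(0).trim.toInt, p(1), p(2), p(3).trim.toInt)).saveAsObjectFile("person_obj")

  val baseoffice = sc.textFile("office_csv").flatMap(line => line.split("\n")).map(_.split(","))
  baseoffice.map(p => office(p(0).trim.toInt, p(1), p(2), p(3))).saveAsObjectFile("office_obj")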

The bad news is: I have no idea about the reason for this. I blame it on
the REPL/shell and assume it would not happen for a compiled application.

Tobias

Re: SPARK LIMITATION - more than one case class is not allowed !!

Posted by Rahul Bindlish <ra...@nectechnologies.in>.
Tobias,

Please find the csv and scala files attached; the steps are below:

1. Copy the csv files into the current directory.
2. Open spark-shell from this directory.
3. Run the "one_scala" file, which creates object files from the csv files
in the current directory.
4. Restart spark-shell.
5. a. Run the "two_scala" file; it gives an error while loading office_obj.
    b. If we edit the "two_scala" file to contain the following:

-----------------------------------------------------------------------------------
case class person(id: Int, name: String, fathername: String, officeid: Int)
case class office(id: Int, name: String, landmark: String, areacode: String)
sc.objectFile[office]("office_obj").count
sc.objectFile[person]("person_obj").count
--------------------------------------------------------------------------------
it gives an error while loading person_obj.

Regards,
Rahul

sample.gz
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n20435/sample.gz>  



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/serialization-issue-in-case-of-case-class-is-more-than-1-tp20334p20435.html


Re: SPARK LIMITATION - more than one case class is not allowed !!

Posted by Imran Rashid <im...@therashids.com>.
> It's an easy mistake to make... I wonder if an assertion could be
> implemented that makes sure the type parameter is present.

We could use the "NotNothing" pattern

http://blog.evilmonkeylabs.com/2012/05/31/Forcing_Compiler_Nothing_checks/

but I wonder if it would just make the method signature very confusing for
the average user ...
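
For reference, a minimal sketch of that pattern (hypothetical names, not
Spark's actual API):

  import scala.annotation.implicitNotFound

  // Implicit resolution is ambiguous exactly when T is inferred as Nothing,
  // so a missing type parameter becomes a compile-time error.
  @implicitNotFound("Type parameter missing: call it as objectFile[YourClass](path)")
  trait NotNothing[T]
  object NotNothing {
    implicit def anyType[T]: NotNothing[T] = new NotNothing[T] {}
    implicit def nothing1: NotNothing[Nothing] = new NotNothing[Nothing] {}
    implicit def nothing2: NotNothing[Nothing] = new NotNothing[Nothing] {}
  }

  // Hypothetical wrapper showing how the constraint would read on a method:
  def typedObjectFile[T: NotNothing](path: String): Unit =
    println(s"would load $path as an RDD of the given type")

  typedObjectFile[String]("ok")   // compiles
  // typedObjectFile("oops")      // does not compile: T is inferred as Nothing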

Re: SPARK LIMITATION - more than one case class is not allowed !!

Posted by Daniel Darabos <da...@lynxanalytics.com>.
On Fri, Dec 5, 2014 at 7:12 AM, Tobias Pfeiffer <tg...@preferred.jp> wrote:

> Rahul,
>
> On Fri, Dec 5, 2014 at 2:50 PM, Rahul Bindlish <rahul.bindlish@nectechnologies.in> wrote:
>>
>> I have done so; that's why spark is able to load the objectfile [e.g.
>> person_obj] and spark has maintained its serialVersionUID.
>>
>> Next time, when I try to load another objectfile [e.g. office_obj],
>> I think spark is matching its serialVersionUID [office_obj] against the
>> previous serialVersionUID [person_obj] and giving a mismatch error.
>>
>> In my first post, I have given statements which can be executed easily to
>> replicate this issue.
>>
>
> Can you post the Scala source for your case classes? I have tried the
> following in spark-shell:
>
> case class Dog(name: String)
> case class Cat(age: Int)
> val dogs = sc.parallelize(Dog("foo") :: Dog("bar") :: Nil)
> val cats = sc.parallelize(Cat(1) :: Cat(2) :: Nil)
> dogs.saveAsObjectFile("test_dogs")
> cats.saveAsObjectFile("test_cats")
>
> This gives two directories "test_dogs/" and "test_cats/". Then I restarted
> spark-shell and entered:
>
> case class Dog(name: String)
> case class Cat(age: Int)
> val dogs = sc.objectFile("test_dogs")
> val cats = sc.objectFile("test_cats")
>
> I don't get an exception, but:
>
> dogs: org.apache.spark.rdd.RDD[Nothing] = FlatMappedRDD[1] at objectFile
> at <console>:12
>

You need to specify the type of the RDD. The compiler does not know what is
in "test_dogs".

val dogs = sc.objectFile[Dog]("test_dogs")
val cats = sc.objectFile[Cat]("test_cats")

It's an easy mistake to make... I wonder if an assertion could be
implemented that makes sure the type parameter is present.

Re: SPARK LIMITATION - more than one case class is not allowed !!

Posted by Tobias Pfeiffer <tg...@preferred.jp>.
Rahul,

On Fri, Dec 5, 2014 at 2:50 PM, Rahul Bindlish <rahul.bindlish@nectechnologies.in> wrote:
>
> I have done so; that's why spark is able to load the objectfile [e.g.
> person_obj] and spark has maintained its serialVersionUID.
>
> Next time, when I try to load another objectfile [e.g. office_obj], I think
> spark is matching its serialVersionUID [office_obj] against the previous
> serialVersionUID [person_obj] and giving a mismatch error.
>
> In my first post, I have given statements which can be executed easily to
> replicate this issue.
>

Can you post the Scala source for your case classes? I have tried the
following in spark-shell:

case class Dog(name: String)
case class Cat(age: Int)
val dogs = sc.parallelize(Dog("foo") :: Dog("bar") :: Nil)
val cats = sc.parallelize(Cat(1) :: Cat(2) :: Nil)
dogs.saveAsObjectFile("test_dogs")
cats.saveAsObjectFile("test_cats")

This gives two directories "test_dogs/" and "test_cats/". Then I restarted
spark-shell and entered:

case class Dog(name: String)
case class Cat(age: Int)
val dogs = sc.objectFile("test_dogs")
val cats = sc.objectFile("test_cats")

I don't get an exception, but:

dogs: org.apache.spark.rdd.RDD[Nothing] = FlatMappedRDD[1] at objectFile at
<console>:12

Trying to access the elements of the RDD gave:

scala> dogs.collect()
14/12/05 15:08:58 INFO FileInputFormat: Total input paths to process : 8
...
org.apache.spark.SparkDriverExecutionException: Execution error
at
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:980)
...
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.ArrayStoreException: [Ljava.lang.Object;
at scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:88)
at
org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1129)
...
org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:976)
... 10 more

So even in the simplest of cases, this doesn't work for me in the
spark-shell, but with a different error. I guess we need to see more of
your code to help.

Tobias

Re: SPARK LIMITATION - more than one case class is not allowed !!

Posted by Rahul Bindlish <ra...@nectechnologies.in>.
Tobias,

Thanks for quick reply.

Definitely, after a restart the case classes need to be defined again.

I have done so; that's why spark is able to load the objectfile [e.g.
person_obj] and spark has maintained its serialVersionUID.

Next time, when I try to load another objectfile [e.g. office_obj], I think
spark is matching its serialVersionUID [office_obj] against the previous
serialVersionUID [person_obj] and giving a mismatch error.

In my first post, I have given statements which can be executed easily to
replicate this issue.

Thanks
~Rahul








--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/serialization-issue-in-case-of-case-class-is-more-than-1-tp20334p20428.html


Re: SPARK LIMITATION - more than one case class is not allowed !!

Posted by Tobias Pfeiffer <tg...@preferred.jp>.
Rahul,

On Fri, Dec 5, 2014 at 1:29 PM, Rahul Bindlish <rahul.bindlish@nectechnologies.in> wrote:
>
> I have created objectfiles [person_obj, office_obj] from csv files
> [person_csv, office_csv] using case classes [person, office] with the
> saveAsObjectFile API.
>
> Now I restarted spark-shell and loaded the objectfiles using the
> objectFile API.
>
> *Once any one object-class is loaded successfully, the rest of the
> object-classes give a serialization error.*
>

I have not used saveAsObjectFile, but I think that if you define your case
classes in the spark-shell and serialize the objects, and then restart the
spark-shell, the *classes* (structure, names, etc.) will no longer be known
to the JVM. So if you try to restore the *objects* from a file, the JVM may
fail to restore them, because there is no class it could create instances
of. Just a guess. Try writing a Scala program, compiling it, and seeing if
it still fails when executed.
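
For example, something like this compiled application (a minimal sketch
with hypothetical names; it assumes the object files were also written by
a compiled program using the same class definitions):

  import org.apache.spark.{SparkConf, SparkContext}

  // Top-level compiled case classes: their definitions are on the classpath
  // in every run, unlike classes (re)defined inside the spark-shell REPL.
  case class Person(id: Int, name: String, fathername: String, officeid: Int)
  case class Office(id: Int, name: String, landmark: String, areacode: String)

  object LoadObjectFiles {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("LoadObjectFiles"))
      // Deserialize both object files; note the explicit type parameters.
      println(sc.objectFile[Person]("person_obj").count())
      println(sc.objectFile[Office]("office_obj").count())
      sc.stop()
    }
  }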

Tobias

Re: SPARK LIMITATION - more than one case class is not allowed !!

Posted by Rahul Bindlish <ra...@nectechnologies.in>.
Hi Tobias,

Thanks for your response.

I have created objectfiles [person_obj, office_obj] from csv files
[person_csv, office_csv] using case classes [person, office] with the
saveAsObjectFile API.

Now I restarted spark-shell and loaded the objectfiles using the
objectFile API.

*Once any one object-class is loaded successfully, the rest of the
object-classes give a serialization error.*

So my understanding is that more than one case class is not allowed.

I hope I have been able to clarify myself.

Regards,
Rahul





--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/serialization-issue-in-case-of-case-class-is-more-than-1-tp20334p20421.html


Re: SPARK LIMITATION - more than one case class is not allowed !!

Posted by Tobias Pfeiffer <tg...@preferred.jp>.
On Fri, Dec 5, 2014 at 12:53 PM, Rahul Bindlish <rahul.bindlish@nectechnologies.in> wrote:

> Is it a limitation that spark does not support more than one case class
> at a time?
>

What do you mean? I do not have the slightest idea what you *could*
possibly mean by "to support a case class".

Tobias

SPARK LIMITATION - more than one case class is not allowed !!

Posted by Rahul Bindlish <ra...@nectechnologies.in>.
Is it a limitation that spark does not support more than one case class at a
time?

Regards,
Rahul



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/serialization-issue-in-case-of-case-class-is-more-than-1-tp20334p20415.html