Posted to user@spark.apache.org by Madhu <ma...@madhu.com> on 2014/05/15 03:02:02 UTC

Re: Hadoop Writable and Spark serialization

I have done this kind of thing successfully using Hadoop serialization, e.g.
SessionContainer implements Writable and overrides write/readFields. I didn't
try Kryo.
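
For reference, here is a minimal sketch of the write/readFields pattern. It is not the original SessionContainer code: the fields sessionId and eventCount are made up, and WritableLike is a local stand-in that declares the same two methods as org.apache.hadoop.io.Writable so the example compiles without Hadoop on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Demo entry point: round-trips an object through write/readFields.
class WritableSketchDemo {
    static SessionContainerSketch roundTrip(SessionContainerSketch original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            SessionContainerSketch copy = new SessionContainerSketch();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        SessionContainerSketch copy = roundTrip(new SessionContainerSketch("abc", 3));
        System.out.println(copy.sessionId + " " + copy.eventCount);
    }
}

// Assumption: local stand-in with the same method signatures as Hadoop's Writable.
interface WritableLike {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical fields; the real SessionContainer is not shown in this thread.
class SessionContainerSketch implements WritableLike {
    String sessionId = "";
    int eventCount;

    // A no-arg constructor lets a framework create an empty instance
    // and then fill it in with readFields.
    SessionContainerSketch() {}

    SessionContainerSketch(String sessionId, int eventCount) {
        this.sessionId = sessionId;
        this.eventCount = eventCount;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(sessionId);
        out.writeInt(eventCount);
    }

    // Fields must be read back in exactly the order they were written.
    @Override
    public void readFields(DataInput in) throws IOException {
        sessionId = in.readUTF();
        eventCount = in.readInt();
    }
}
```

With the real Hadoop interface you would swap WritableLike for Writable; the method bodies would stay the same.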

It's fairly straightforward. I'll see if I can dig up the code if you really
need it.
I remember that I had to add a map transformation or something to that
effect since Hadoop sometimes gives you a mutated reference to a previous
object rather than a new one :-(
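
To show why the copy matters, here is a small simulation in plain Java. It does not use Hadoop or Spark; the Record class and the two read methods are hypothetical stand-ins for a reader that reuses one mutable object per record, which is the behavior described above.

```java
import java.util.ArrayList;
import java.util.List;

class ReuseDemo {
    // Mutable record, standing in for a Writable value.
    static final class Record {
        int value;
        Record(int value) { this.value = value; }
        Record copy() { return new Record(value); }
    }

    // Simulated reader: overwrites and returns the SAME instance for every
    // element, so every list entry ends up pointing at the last value.
    static List<Record> readWithoutCopy(int[] values) {
        Record shared = new Record(0);
        List<Record> out = new ArrayList<>();
        for (int v : values) {
            shared.value = v;
            out.add(shared);
        }
        return out;
    }

    // Same reader, but each element is copied before it is kept. This is the
    // role the extra map transformation plays.
    static List<Record> readWithCopy(int[] values) {
        Record shared = new Record(0);
        List<Record> out = new ArrayList<>();
        for (int v : values) {
            shared.value = v;
            out.add(shared.copy());
        }
        return out;
    }

    static List<Integer> values(List<Record> records) {
        List<Integer> out = new ArrayList<>();
        for (Record r : records) {
            out.add(r.value);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] input = {1, 2, 3};
        System.out.println(values(readWithoutCopy(input))); // prints [3, 3, 3]
        System.out.println(values(readWithCopy(input)));    // prints [1, 2, 3]
    }
}
```

In Spark the copy would go in a map right after reading the sequence file, before anything collects or caches the records.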

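Also, I don't think you need to parallelize sampledSessions in your code
snippet, as long as the sample stays an RDD. takeSample returns a local
Array, which has no saveAsSequenceFile, so use sample instead.
I think something like this will work (the 0.01 fraction is only a
placeholder; pick one that gives roughly the sample size you want):

   val sampledSessions = sc.sequenceFile[Text, SessionContainer](inputPath).sample(false, 0.01, 0)
   sampledSessions.saveAsSequenceFile("sampledSessions")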

How many small files are you getting?
I tend to think you will get as many files as partitions, which is usually
not that high.



-----
Madhu
https://www.linkedin.com/in/msiddalingaiah
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Hadoop-Writable-and-Spark-serialization-tp5721p5729.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Hadoop Writable and Spark serialization

Posted by Madhu <ma...@madhu.com>.
Have you tried implementing Serializable?

This is similar to what I did:

public class MySequenceFileClass implements WritableComparable, Serializable

I read it as a sequence file and tried takeSample; it works for me.

I found that if I didn't implement Serializable, I got a serialization
exception.

I didn't have to do any registration of the class.
Of course, all referenced classes must also implement Serializable.
Is that a problem in your application?
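
Here is a quick way to check the point about referenced classes outside Spark, using plain Java serialization. All class names below are made up for illustration; the idea is just that a Serializable class holding a non-Serializable field fails at write time.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class SerializableCheckDemo {
    // Hypothetical referenced classes: one is Serializable, one is not.
    static class PlainDetail { int n = 1; }
    static class SerializableDetail implements Serializable { int n = 1; }

    // Implements Serializable itself but references a class that does not.
    static class BadContainer implements Serializable {
        PlainDetail detail = new PlainDetail();
    }

    // Every referenced class is Serializable.
    static class GoodContainer implements Serializable {
        SerializableDetail detail = new SerializableDetail();
    }

    // Returns true if Java serialization can write the whole object graph.
    static boolean canSerialize(Object o) {
        try (ObjectOutputStream out =
                     new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(canSerialize(new GoodContainer())); // prints true
        System.out.println(canSerialize(new BadContainer()));  // prints false
    }
}
```

Running a check like this on your own class before submitting a job can tell you which field is the problem; marking such a field transient is one option if you don't need it on the other side.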



-----
Madhu
https://www.linkedin.com/in/msiddalingaiah
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Hadoop-Writable-and-Spark-serialization-tp5721p5962.html