Posted to user@spark.apache.org by suman bharadwaj <su...@gmail.com> on 2014/01/22 21:27:04 UTC

KRYO usage details: Need Help

Hi,

I'm using the Spark code below. Currently I have a file of size 25 MB, and
I'm trying to do a comparative study of Kryo and Java serialization.

I had a couple of questions:

1. How do you know which classes to register with Kryo? [see the
registerClasses body below]
2. When data is small, I'm seeing that Java serialization performs better
than Kryo, so I was wondering whether the code below represents the correct
usage of Kryo?

import org.apache.spark._
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
import org.apache.spark.storage.StorageLevel

// Registers the classes that Kryo will serialize.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[LongWritable])
    kryo.register(classOf[Text])
    kryo.register(classOf[Integer])
    kryo.register(classOf[Array[String]])
  }
}

object HTest {

  def main(args: Array[String]) {
    // Configure the serializer before the SparkContext is created.
    System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    System.setProperty("spark.kryo.registrator", "MyRegistrator")
    val sc = new SparkContext("local[4]", "Test")
    // Cache in serialized form so the choice of serializer actually matters.
    val input = sc.textFile("/home/Test/DataSet/cd7a58dc-2053-4811-8463-b144781352ac_000004.csv")
      .persist(StorageLevel.MEMORY_ONLY_SER)
    println(input.count())
    Thread.sleep(30000L)
    println(input.count())
    Thread.sleep(30000L)
  }
}
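
(For the timing comparison itself, a small helper like the following can bracket
each count instead of sleeping between them; timed is just a made-up name for
illustration, not part of Spark. The first count builds the serialized cache,
the second reads back from it:)

def timed[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  // Report elapsed wall-clock time in milliseconds.
  println(label + ": " + (System.nanoTime() - start) / 1e6 + " ms")
  result
}

timed("first count (builds the serialized cache)") { input.count() }
timed("second count (reads from the cache)") { input.count() }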

Your help is highly appreciated.

Regards,
SB

Re: KRYO usage details: Need Help

Posted by suman bharadwaj <su...@gmail.com>.
Brilliant. Thanks !!

Regards,
SB


Re: KRYO usage details: Need Help

Posted by Matei Zaharia <ma...@gmail.com>.
You should register each class you plan to use within your RDDs. In your case you only have an RDD of Strings, so you don’t even really need a registrator (strings are registered by default). But if you made custom objects you would use one.
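
For example, if the records were instances of a custom class, the registrator
might look like this (Person is a made-up example type, just to illustrate):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical record type stored in an RDD.
case class Person(name: String, age: Int)

class PersonRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    // Register every type that actually ends up inside a cached or shuffled RDD.
    kryo.register(classOf[Person])
    kryo.register(classOf[Array[Person]])
  }
}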

To speed up Kryo you can also add kryo.setReferences(false) or set spark.kryo.referenceTracking = false. This disables tracking of circular references. But in general, when benchmarking this small an amount of data, you’ll probably have noise from the JVM starting up.
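
Concretely, either of these should have the same effect; this is only a sketch
against the same System.setProperty-style configuration used in the code above:

// Option 1: set the property before creating the SparkContext.
System.setProperty("spark.kryo.referenceTracking", "false")

// Option 2: turn reference tracking off inside the registrator itself.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.setReferences(false) // safe only if the data contains no circular references
    kryo.register(classOf[Array[String]])
  }
}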

Matei
