Posted to user@spark.apache.org by Gavin Liu <il...@gmail.com> on 2015/07/05 08:31:06 UTC

Why Kryo Serializer is slower than Java Serializer in TeraSort

Hi,

I am using the TeraSort benchmark from ehiggs's branch,
https://github.com/ehiggs/spark-terasort . I noticed that
TeraSort.scala uses the Kryo serializer, so I made a small change from
"org.apache.spark.serializer.KryoSerializer" to
"org.apache.spark.serializer.JavaSerializer" to see the time difference.
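
For readers following along, a change like this usually comes down to a single
configuration value. The following is only a minimal sketch, assuming the
serializer is set through the "spark.serializer" key on a SparkConf; the object
name and app name are hypothetical, and the actual TeraSort.scala may build its
configuration differently.

```scala
import org.apache.spark.SparkConf

// Hypothetical stand-alone object, not taken from TeraSort.scala.
object SerializerSwitch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SerializerSwitch") // illustrative name only
      // Kryo variant would be:
      //   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer")

    // Print the value that a SparkContext built from this conf would use.
    println(conf.get("spark.serializer"))
  }
}
```

Printing the effective value makes it easy to confirm which serializer a given
run actually used before comparing timings.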

Curiously, the run with the Java serializer is much quicker than the run with
Kryo, and no error is reported when I run the program. Here are the records
from the history server: the first is Kryo, the second is the Java default.

1.
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n23621/kryo.png> 

2.
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n23621/java.png> 

I am wondering whether I did something wrong or whether there is some other
reason behind this result.

Thanks for any help,
Gavin



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Why-Kryo-Serializer-is-slower-than-Java-Serializer-in-TeraSort-tp23621.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Why Kryo Serializer is slower than Java Serializer in TeraSort

Posted by Gylfi <gy...@berkeley.edu>.
Hi. 

Just a few quick comments on your question. 

If you drill into the stages (click the links of the subtasks), you can get a
more detailed view of the tasks. 
One of the things reported there is the time spent on serialization. 
If that is your dominant factor, it should be reflected there, right? 

Are you sure the input data is not getting cached between runs? That is, does
the order of the experiments matter, and did you explicitly flush the
operating system memory between runs? 
If you now run the old experiment again, does it take the same amount of
time as before? 

Did you validate that the results were actually correct? 

Hope this helps..

Regards, 
    Gylfi.  





Re: Why Kryo Serializer is slower than Java Serializer in TeraSort

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
It looks like it spent more time writing/transferring the 40GB of shuffle
data when you used Kryo. And, surprisingly, JavaSerializer has only 700MB of
shuffle?

Thanks
Best Regards
