Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2017/10/27 08:04:00 UTC

[jira] [Commented] (SPARK-22367) Separate the serialization of class and object for iterator

    [ https://issues.apache.org/jira/browse/SPARK-22367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16221899#comment-16221899 ] 

Apache Spark commented on SPARK-22367:
--------------------------------------

User 'ConeyLiu' has created a pull request for this issue:
https://github.com/apache/spark/pull/19586

> Separate the serialization of class and object for iterator
> -----------------------------------------------------------
>
>                 Key: SPARK-22367
>                 URL: https://issues.apache.org/jira/browse/SPARK-22367
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Xianyang Liu
>
> The records in an iterator all have the same class, so there is no need to write class information for every record. We only need to write the class information once at the start of serialization, and read it once at the start of deserialization.
> In this patch, we separate the serialization of the class from the serialization of the objects for an iterator serialized by Kryo. This improves the performance of serialization and deserialization, and reduces the size of the serialized data.
> Test case:
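> ```scala
>     import scala.util.Random
>     import org.apache.spark.{SparkConf, SparkContext}
>     import org.apache.spark.storage.StorageLevel
>
>     // Assumed record type; its definition is not shown in the issue.
>     case class Person(id: String, age: Int)
>
>     val conf = new SparkConf().setAppName("Test for serialization")
>     val sc = new SparkContext(conf)
>     val random = new Random(1)
>     val data = sc.parallelize(1 to 1000000000).map { i =>
>       Person("id-" + i, random.nextInt(Integer.MAX_VALUE))
>     }.persist(StorageLevel.OFF_HEAP)
>     // First count: computes the RDD and serializes it into off-heap storage.
>     var start = System.currentTimeMillis()
>     data.count()
>     println("First time: " + (System.currentTimeMillis() - start))
>     // Second count: only deserializes the cached data.
>     start = System.currentTimeMillis()
>     data.count()
>     println("Second time: " + (System.currentTimeMillis() - start))
> ```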
> Test result:
> Size of the serialized data:
> before: 34.3 GB
> after: 17.5 GB
> | before (calculation + serialization, ms) | before (deserialization, ms) | after (calculation + serialization, ms) | after (deserialization, ms) |
> | ------ | ------ | ------ | ------ |
> | 63869 | 21882 | 45513 | 15158 |
> | 59368 | 21507 | 51683 | 15524 |
> | 66230 | 21481 | 62163 | 14903 |
> | 62399 | 22529 | 52400 | 16255 |
> | 137564.2 | 136990.8 | 1.004186 | 
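
The idea described in the issue (write the class information once, then only the objects) can be illustrated with a minimal sketch. This is not the code from the pull request and it does not use Kryo: it swaps in plain java.io streams so that it runs with the Scala standard library alone. The object name ClassOnceSketch, the Person fields, and the stream layout (a boolean marker before each record) are all assumptions made for illustration.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

object ClassOnceSketch {
  // Hypothetical record type; the issue does not show its Person definition.
  final case class Person(id: String, age: Int)

  private val className: String = classOf[Person].getName

  // Baseline: the class name is written in front of every record.
  def writePerRecord(records: Iterator[Person]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(bytes)
    records.foreach { p =>
      out.writeBoolean(true) // marker: another record follows
      out.writeUTF(className)
      out.writeUTF(p.id)
      out.writeInt(p.age)
    }
    out.writeBoolean(false) // marker: end of stream
    out.flush()
    bytes.toByteArray
  }

  // Idea from the issue: the class name is written once, then only the fields.
  def writeClassOnce(records: Iterator[Person]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(bytes)
    out.writeUTF(className) // header, written a single time
    records.foreach { p =>
      out.writeBoolean(true)
      out.writeUTF(p.id)
      out.writeInt(p.age)
    }
    out.writeBoolean(false)
    out.flush()
    bytes.toByteArray
  }

  // Reads the class name once, then decodes every record with that class.
  def readClassOnce(data: Array[Byte]): Vector[Person] = {
    val in = new DataInputStream(new ByteArrayInputStream(data))
    val name = in.readUTF()
    require(name == className, s"unexpected class: $name")
    val builder = Vector.newBuilder[Person]
    while (in.readBoolean()) {
      val id = in.readUTF()
      val age = in.readInt()
      builder += Person(id, age)
    }
    builder.result()
  }
}
```

Comparing `ClassOnceSketch.writePerRecord(people.iterator).length` with `ClassOnceSketch.writeClassOnce(people.iterator).length` for the same records shows where the space saving comes from: the per-record class name disappears and only a one-time header remains. The scheme relies on every record having the same class, which is the assumption the issue states for iterators.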



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org