Posted to issues@spark.apache.org by "Joseph Batchik (JIRA)" <ji...@apache.org> on 2015/06/23 19:12:01 UTC

[jira] [Commented] (SPARK-746) Automatically Use Avro Serialization for Avro Objects

    [ https://issues.apache.org/jira/browse/SPARK-746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597973#comment-14597973 ] 

Joseph Batchik commented on SPARK-746:
--------------------------------------

Spark can currently serialize all three types of Avro records if the user specifies Kryo. Specific and Reflect records serialize just fine, provided the user registers them ahead of time, since Kryo handles ordinary classes efficiently. The problem lies in generic records, which Kryo cannot serialize without a large amount of overhead. This causes issues for users who want to use Avro records during a shuffle. To alleviate this, I implemented a custom Kryo serializer for generic records that tries to reduce the amount of network IO.

https://github.com/JDrit/spark/commit/6f1106bc20eb670e963d45a191dfc4517d46543b

This works by sending a compressed form of the schema with each message, rather than having Kryo serialize the in-memory representation itself. Since the same schema is going to be sent numerous times, it caches previously seen values to reduce the computation needed. It also allows users to register their schemas ahead of time, so that only the schema’s unique ID is sent with each message instead of the entire schema.
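
To make the idea concrete, here is a rough, standalone sketch of the framing described above. It is not the code from the linked commit: it uses plain Python with zlib instead of Kryo and Avro, treats the record body as opaque bytes, and the class name SchemaCodec, the one-byte flag, and the 8-byte SHA-256-derived ID are all assumptions made for illustration.

```python
import hashlib
import struct
import zlib


class SchemaCodec:
    """Toy framing layer (hypothetical): schema header + opaque record payload."""

    def __init__(self, registered_schemas=()):
        # ID -> schema JSON, for schemas registered ahead of time
        self._registered = {self.fingerprint(s): s for s in registered_schemas}
        self._compress_cache = {}    # schema JSON -> compressed bytes
        self._decompress_cache = {}  # compressed bytes -> schema JSON

    @staticmethod
    def fingerprint(schema_json):
        # Assumed ID scheme: first 8 bytes of SHA-256 over the schema text.
        return hashlib.sha256(schema_json.encode("utf-8")).digest()[:8]

    def encode(self, schema_json, payload):
        fp = self.fingerprint(schema_json)
        if fp in self._registered:
            # Flag 1: only the 8-byte ID travels with the record.
            return b"\x01" + fp + payload
        compressed = self._compress_cache.get(schema_json)
        if compressed is None:
            # Compress each distinct schema once, then reuse the result.
            compressed = zlib.compress(schema_json.encode("utf-8"))
            self._compress_cache[schema_json] = compressed
        # Flag 0: length-prefixed compressed schema, then the record.
        return b"\x00" + struct.pack(">I", len(compressed)) + compressed + payload

    def decode(self, message):
        flag, body = message[0], message[1:]
        if flag == 1:
            return self._registered[body[:8]], body[8:]
        (n,) = struct.unpack(">I", body[:4])
        compressed = body[4:4 + n]
        schema_json = self._decompress_cache.get(compressed)
        if schema_json is None:
            # Decompress each distinct schema once on the receiving side.
            schema_json = zlib.decompress(compressed).decode("utf-8")
            self._decompress_cache[compressed] = schema_json
        return schema_json, body[4 + n:]
```

In this sketch, a codec built with SchemaCodec([schema]) on both sides sends 9 header bytes per record, while an unregistered schema costs its compressed size plus 5 bytes. A real serializer would also need both ends to agree on the registered set.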

Could I get some feedback on this approach? Please let me know if I am missing anything important.

> Automatically Use Avro Serialization for Avro Objects
> -----------------------------------------------------
>
>                 Key: SPARK-746
>                 URL: https://issues.apache.org/jira/browse/SPARK-746
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Patrick Cogan
>
> All generated objects extend org.apache.avro.specific.SpecificRecordBase (or there may be a higher up class as well).
> Since Avro records aren't JavaSerializable by default people currently have to wrap their records. It would be good if we could use an implicit conversion to do this for them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org