Posted to issues@spark.apache.org by "Baohe Zhang (Jira)" <ji...@apache.org> on 2021/02/02 21:49:00 UTC
[jira] [Created] (SPARK-34336) Use GenericData as Avro serialization data model can improve Avro write/read performance
Baohe Zhang created SPARK-34336:
-----------------------------------
Summary: Use GenericData as Avro serialization data model can improve Avro write/read performance
Key: SPARK-34336
URL: https://issues.apache.org/jira/browse/SPARK-34336
Project: Spark
Issue Type: Improvement
Components: Input/Output, SQL
Affects Versions: 3.1.2
Reporter: Baohe Zhang
We found that using "org.apache.avro.generic.GenericData" as the Avro serialization data model in the Avro writer significantly improves Avro write performance and slightly improves Avro read performance.
This optimization was originally proposed by [~samkhan] in PR https://github.com/apache/spark/pull/29354.
We re-evaluated the change "Use GenericData instead of ReflectData when writing Avro data" from that PR and verified that it improves performance in the Avro write/read benchmarks.
The base branch is branch-3.1 as of today (2/2/21).
Besides the current Avro read/write benchmarks, I also ran extra benchmarks for reading and writing nested structs and arrays. These benchmarks were proposed in PR https://github.com/apache/spark/pull/29352 but have not been merged.
Benchmark results are posted in a comment on this issue.