Posted to issues@spark.apache.org by "Baohe Zhang (Jira)" <ji...@apache.org> on 2021/02/02 21:49:00 UTC

[jira] [Created] (SPARK-34336) Use GenericData as Avro serialization data model can improve Avro write/read performance

Baohe Zhang created SPARK-34336:
-----------------------------------

             Summary: Use GenericData as Avro serialization data model can improve Avro write/read performance
                 Key: SPARK-34336
                 URL: https://issues.apache.org/jira/browse/SPARK-34336
             Project: Spark
          Issue Type: Improvement
          Components: Input/Output, SQL
    Affects Versions: 3.1.2
            Reporter: Baohe Zhang


We found that using "org.apache.avro.generic.GenericData" as the Avro serialization data model in the Avro writer significantly improves Avro write performance and slightly improves Avro read performance.

This optimization was originally proposed by [~samkhan] in PR https://github.com/apache/spark/pull/29354.

We re-evaluated the change "Use GenericData instead of ReflectData when writing Avro data" from that PR and verified that it improves performance in the Avro write/read benchmarks.

The base branch is branch-3.1 as of today (2/2/21).

Besides the existing Avro read/write benchmarks, I also ran extra benchmarks for reading and writing nested structs and arrays. These benchmarks were proposed in PR https://github.com/apache/spark/pull/29352 but have not been merged.

Benchmark results are posted in the comments.



