Posted to dev@parquet.apache.org by "Yuming Wang (JIRA)" <ji...@apache.org> on 2018/07/24 02:44:00 UTC

[jira] [Assigned] (PARQUET-1355) Improvement parquet Binary write performance

     [ https://issues.apache.org/jira/browse/PARQUET-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang reassigned PARQUET-1355:
------------------------------------

    Assignee: Yuming Wang

> Improvement parquet Binary write performance
> --------------------------------------------
>
>                 Key: PARQUET-1355
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1355
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.10.0
>            Reporter: Yuming Wang
>            Assignee: Yuming Wang
>            Priority: Major
>
> *Benchmark code*:
> {code:java}
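> test("Parquet write benchmark") {
>   // Rows written per case: 100 * 1024 * 1024 = 104,857,600.
>   val count = 100 * 1024 * 1024
>   val numIters = 5
>   withTempPath { path =>
>     // The second constructor argument is valuesPerIteration. It is 5 here, not count,
>     // so Rate(M/s) prints as 0.0 and Per Row(ns) is the best time divided by 5.
>     val benchmark = new Benchmark(s"Parquet write benchmark ${spark.sparkContext.version}", 5)
>     // One case per data type: cast the generated ids and overwrite the same path.
>     Seq("long", "string", "decimal(18, 0)", "decimal(38, 18)", "timestamp").foreach { dt =>
>       benchmark.addCase(s"$dt type", numIters = numIters) { iter =>
>         spark.range(count).selectExpr(s"cast(id as $dt) as id")
>           .write.mode("overwrite").parquet(path.getAbsolutePath)
>       }
>     }
>     benchmark.run()
>   }
> }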
> {code}
> *Result*:
> {noformat}
> -- Spark 2.3.3-SNAPSHOT with Parquet 1.8.3
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
> Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
> Parquet write benchmark 2.3.3-SNAPSHOT:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
> ------------------------------------------------------------------------------------------------
> long type                                   10963 / 11344          0.0  2192675973.8       1.0X
> string type                                 28423 / 29437          0.0  5684553922.2       0.4X
> decimal(18, 0) type                         11558 / 11696          0.0  2311587203.6       0.9X
> decimal(38, 18) type                        43858 / 44432          0.0  8771537663.4       0.2X
> -- Spark 2.4.0-SNAPSHOT with Parquet 1.10.0
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
> Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
> Parquet write benchmark 2.4.0-SNAPSHOT:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
> ------------------------------------------------------------------------------------------------
> long type                                   11633 / 12070          0.0  2326572295.8       1.0X
> string type                                 31374 / 32178          0.0  6274760187.4       0.4X
> decimal(18, 0) type                         13019 / 13294          0.0  2603841925.4       0.9X
> decimal(38, 18) type                        50719 / 50983          0.0 10143775007.6       0.2X
> {noformat}
> The slowdown comes mainly from [toByteBuffer|https://github.com/apache/parquet-mr/blob/d61d221c9e752ce2cc0da65ede8b55653b3ae21f/parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java#L83].
> If {{toByteBuffer}} is not used when comparing Binary values, the result is:
> {noformat}
> -- Spark 2.4.0-SNAPSHOT with Parquet 1.10.0
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
> Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
> Parquet write benchmark 2.4.0-SNAPSHOT:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
> ------------------------------------------------------------------------------------------------
> long type                                   11171 / 11508          0.0  2234189382.0       1.0X
> string type                                 30072 / 30290          0.0  6014346455.4       0.4X
> decimal(18, 0) type                         12150 / 12239          0.0  2430052708.8       0.9X
> decimal(38, 18) type                        44974 / 45423          0.0  8994773738.8       0.2X
> {noformat}
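
To illustrate the idea in the description, here is a minimal sketch of comparing two byte arrays directly versus wrapping each one in a ByteBuffer first. This is not parquet-mr's actual code: the class BinaryCompareSketch and both method names are hypothetical, and the unsigned lexicographic ordering is an assumption chosen for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not taken from parquet-mr.
final class BinaryCompareSketch {

    // Direct comparison: walks both arrays in place, treating each byte as
    // unsigned, and creates no intermediate objects.
    static int compareArrays(byte[] a, byte[] b) {
        int min = Math.min(a.length, b.length);
        for (int i = 0; i < min; i++) {
            int x = a[i] & 0xFF;
            int y = b[i] & 0xFF;
            if (x != y) {
                return x - y;
            }
        }
        // All shared bytes are equal, so the shorter array sorts first.
        return a.length - b.length;
    }

    // Same ordering, but each operand is first wrapped in a ByteBuffer,
    // which adds a wrapper object per operand on every call.
    static int compareViaByteBuffer(byte[] a, byte[] b) {
        ByteBuffer x = ByteBuffer.wrap(a);
        ByteBuffer y = ByteBuffer.wrap(b);
        int min = Math.min(x.remaining(), y.remaining());
        for (int i = 0; i < min; i++) {
            // Absolute get: reads index i without moving the buffer position.
            int u = x.get(i) & 0xFF;
            int v = y.get(i) & 0xFF;
            if (u != v) {
                return u - v;
            }
        }
        return x.remaining() - y.remaining();
    }

    public static void main(String[] args) {
        byte[] left = "parquet".getBytes(StandardCharsets.UTF_8);
        byte[] right = "parquets".getBytes(StandardCharsets.UTF_8);
        // Both paths should agree on the sign of the result.
        System.out.println(Integer.signum(compareArrays(left, right)));
        System.out.println(Integer.signum(compareViaByteBuffer(left, right)));
    }
}
```

Both methods produce the same ordering; the difference is only the per-call wrapping, which can add up when a comparison runs for every value written. Whether the JIT removes those allocations depends on the JVM, so the benefit should be measured rather than assumed.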



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)