Posted to dev@parquet.apache.org by "Yuming Wang (JIRA)" <ji...@apache.org> on 2018/07/24 02:44:00 UTC
[jira] [Assigned] (PARQUET-1355) Improvement parquet Binary write performance
[ https://issues.apache.org/jira/browse/PARQUET-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yuming Wang reassigned PARQUET-1355:
------------------------------------
Assignee: Yuming Wang
> Improvement parquet Binary write performance
> --------------------------------------------
>
> Key: PARQUET-1355
> URL: https://issues.apache.org/jira/browse/PARQUET-1355
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Affects Versions: 1.10.0
> Reporter: Yuming Wang
> Assignee: Yuming Wang
> Priority: Major
>
> *Benchmark code*:
> {code:java}
> test("Parquet write benchmark") {
>   val count = 100 * 1024 * 1024
>   val numIters = 5
>   withTempPath { path =>
>     val benchmark = new Benchmark(s"Parquet write benchmark ${spark.sparkContext.version}", 5)
>     Seq("long", "string", "decimal(18, 0)", "decimal(38, 18)", "timestamp").foreach { dt =>
>       benchmark.addCase(s"$dt type", numIters = numIters) { iter =>
>         spark.range(count).selectExpr(s"cast(id as $dt) as id")
>           .write.mode("overwrite").parquet(path.getAbsolutePath)
>       }
>     }
>     benchmark.run()
>   }
> }
> {code}
> *Result*:
> {noformat}
> -- Spark 2.3.3-SNAPSHOT with Parquet 1.8.3
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
> Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
> Parquet write benchmark 2.3.3-SNAPSHOT:  Best/Avg Time(ms)    Rate(M/s)     Per Row(ns)   Relative
> --------------------------------------------------------------------------------------------------
> long type                                    10963 / 11344          0.0    2192675973.8       1.0X
> string type                                  28423 / 29437          0.0    5684553922.2       0.4X
> decimal(18, 0) type                          11558 / 11696          0.0    2311587203.6       0.9X
> decimal(38, 18) type                         43858 / 44432          0.0    8771537663.4       0.2X
> -- Spark 2.4.0-SNAPSHOT with Parquet 1.10.0
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
> Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
> Parquet write benchmark 2.4.0-SNAPSHOT:  Best/Avg Time(ms)    Rate(M/s)     Per Row(ns)   Relative
> --------------------------------------------------------------------------------------------------
> long type                                    11633 / 12070          0.0    2326572295.8       1.0X
> string type                                  31374 / 32178          0.0    6274760187.4       0.4X
> decimal(18, 0) type                          13019 / 13294          0.0    2603841925.4       0.9X
> decimal(38, 18) type                         50719 / 50983          0.0   10143775007.6       0.2X
> {noformat}
> The main cause is [toByteBuffer|https://github.com/apache/parquet-mr/blob/d61d221c9e752ce2cc0da65ede8b55653b3ae21f/parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java#L83], which hurts write performance.
> If {{toByteBuffer}} is not used when comparing binaries, the result is:
> {noformat}
> -- Spark 2.4.0-SNAPSHOT with Parquet 1.10.0
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
> Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
> Parquet write benchmark 2.4.0-SNAPSHOT:  Best/Avg Time(ms)    Rate(M/s)     Per Row(ns)   Relative
> --------------------------------------------------------------------------------------------------
> long type                                    11171 / 11508          0.0    2234189382.0       1.0X
> string type                                  30072 / 30290          0.0    6014346455.4       0.4X
> decimal(18, 0) type                          12150 / 12239          0.0    2430052708.8       0.9X
> decimal(38, 18) type                         44974 / 45423          0.0    8994773738.8       0.2X
> {noformat}
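The idea behind the last measurement can be illustrated with a minimal Java sketch. It is not the parquet-mr implementation: the class and method names below are hypothetical, and it assumes the overhead comes from wrapping each operand in a ByteBuffer on every comparison. Both methods produce the same unsigned lexicographic ordering; the first works directly on the backing arrays, the second allocates two ByteBuffer views per call.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only; not taken from parquet-mr.
final class BinaryCompareSketch {

    // Unsigned lexicographic comparison directly on byte[] slices.
    // No per-call object allocation.
    static int compareArrays(byte[] a, int aOff, int aLen,
                             byte[] b, int bOff, int bLen) {
        int min = Math.min(aLen, bLen);
        for (int i = 0; i < min; i++) {
            int x = a[aOff + i] & 0xFF; // treat bytes as unsigned
            int y = b[bOff + i] & 0xFF;
            if (x != y) {
                return x - y;
            }
        }
        // Equal common prefix: the shorter value sorts first.
        return aLen - bLen;
    }

    // Same ordering, but each call first wraps both arrays in a ByteBuffer,
    // which is the assumed source of the extra cost.
    static int compareViaByteBuffer(byte[] a, byte[] b) {
        ByteBuffer x = ByteBuffer.wrap(a);
        ByteBuffer y = ByteBuffer.wrap(b);
        int min = Math.min(x.remaining(), y.remaining());
        for (int i = 0; i < min; i++) {
            int cmp = (x.get(x.position() + i) & 0xFF)
                    - (y.get(y.position() + i) & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return x.remaining() - y.remaining();
    }

    public static void main(String[] args) {
        byte[] left = {1, 2, 3};
        byte[] right = {1, 2, (byte) 0x80};
        System.out.println(Integer.signum(compareArrays(left, 0, 3, right, 0, 3)));
        System.out.println(Integer.signum(compareViaByteBuffer(left, right)));
    }
}
```

Because the comparison runs for every written value, any per-call allocation is multiplied by the row count, which would be consistent with the binary-backed types (string and decimal(38, 18)) showing the largest change in the tables above.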
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)