Posted to dev@parquet.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2019/05/14 09:38:00 UTC
[jira] [Commented] (PARQUET-1355) Improvement Binary write performance
[ https://issues.apache.org/jira/browse/PARQUET-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16839258#comment-16839258 ]
ASF GitHub Bot commented on PARQUET-1355:
-----------------------------------------
wangyum commented on pull request #505: PARQUET-1355: Improvement Binary write performance
URL: https://github.com/apache/parquet-mr/pull/505
----------------------------------------------------------------
> Improvement Binary write performance
> ------------------------------------
>
> Key: PARQUET-1355
> URL: https://issues.apache.org/jira/browse/PARQUET-1355
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Affects Versions: 1.10.0
> Reporter: Yuming Wang
> Assignee: Yuming Wang
> Priority: Major
> Labels: pull-request-available
>
> *Benchmark code*:
> {code:java}
> test("Parquet write benchmark") {
>   val count = 100 * 1024 * 1024
>   val numIters = 5
>   withTempPath { path =>
>     val benchmark = new Benchmark(s"Parquet write benchmark ${spark.sparkContext.version}", 5)
>     Seq("long", "string", "decimal(18, 0)", "decimal(38, 18)").foreach { dt =>
>       benchmark.addCase(s"$dt type", numIters = numIters) { iter =>
>         spark.range(count).selectExpr(s"cast(id as $dt) as id")
>           .write.mode("overwrite").parquet(path.getAbsolutePath)
>       }
>     }
>     benchmark.run()
>   }
> }
> {code}
> *Result*:
> {noformat}
> -- Spark 2.3.3-SNAPSHOT with Parquet 1.8.3
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
> Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
> Parquet write benchmark 2.3.3-SNAPSHOT:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
> ------------------------------------------------------------------------------------------------
> long type                                    10963 / 11344          0.0  2192675973.8       1.0X
> string type                                  28423 / 29437          0.0  5684553922.2       0.4X
> decimal(18, 0) type                          11558 / 11696          0.0  2311587203.6       0.9X
> decimal(38, 18) type                         43858 / 44432          0.0  8771537663.4       0.2X
> -- Spark 2.4.0-SNAPSHOT with Parquet 1.10.0
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
> Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
> Parquet write benchmark 2.4.0-SNAPSHOT:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
> ------------------------------------------------------------------------------------------------
> long type                                    11633 / 12070          0.0  2326572295.8       1.0X
> string type                                  31374 / 32178          0.0  6274760187.4       0.4X
> decimal(18, 0) type                          13019 / 13294          0.0  2603841925.4       0.9X
> decimal(38, 18) type                         50719 / 50983          0.0 10143775007.6       0.2X
> {noformat}
> The main factor affecting performance is [toByteBuffer|https://github.com/apache/parquet-mr/blob/d61d221c9e752ce2cc0da65ede8b55653b3ae21f/parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java#L83].
> If {{toByteBuffer}} is not used when comparing binary values, the result is:
> {noformat}
> -- Spark 2.4.0-SNAPSHOT with Parquet 1.10.0
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
> Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
> Parquet write benchmark 2.4.0-SNAPSHOT:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
> ------------------------------------------------------------------------------------------------
> long type                                    11171 / 11508          0.0  2234189382.0       1.0X
> string type                                  30072 / 30290          0.0  6014346455.4       0.4X
> decimal(18, 0) type                          12150 / 12239          0.0  2430052708.8       0.9X
> decimal(38, 18) type                         44974 / 45423          0.0  8994773738.8       0.2X
> {noformat}
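The idea in the description above, comparing binary values without going through toByteBuffer, can be illustrated with a minimal sketch. This is not the code from pull request #505 and not parquet-mr's actual comparator: the class and method names below are hypothetical, and the sketch only assumes that the comparison is an unsigned lexicographic byte comparison. One variant wraps each operand in a ByteBuffer first, the other walks the backing arrays directly and allocates nothing.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only; names are not taken from parquet-mr.
final class BinaryCompareSketch {

    // Unsigned lexicographic comparison that first wraps both operands in
    // ByteBuffer views, creating two wrapper objects per call.
    static int compareViaByteBuffer(byte[] a, byte[] b) {
        ByteBuffer x = ByteBuffer.wrap(a);
        ByteBuffer y = ByteBuffer.wrap(b);
        int n = Math.min(x.remaining(), y.remaining());
        for (int i = 0; i < n; i++) {
            int c = (x.get(x.position() + i) & 0xFF) - (y.get(y.position() + i) & 0xFF);
            if (c != 0) {
                return c;
            }
        }
        return x.remaining() - y.remaining();
    }

    // Same ordering, computed directly on the arrays without any wrapper.
    static int compareArrays(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (c != 0) {
                return c;
            }
        }
        // A shorter value that is a prefix of the longer one sorts first.
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] low = {0x01};
        byte[] high = {(byte) 0x80}; // 128 when read as an unsigned byte
        // Both variants should agree on the sign of the result.
        System.out.println(Integer.signum(compareArrays(low, high)));
        System.out.println(Integer.signum(compareViaByteBuffer(low, high)));
    }
}
```

Both methods produce the same ordering; the sketch only shows where the per-comparison wrapper objects come from. Whether removing them accounts for the whole difference in the tables above is something only the benchmark itself can confirm.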