Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/02/18 13:58:14 UTC
[GitHub] [iceberg] RussellSpitzer edited a comment on pull request #3983: Spark: Spark3 ZOrder Rewrite Strategy
URL: https://github.com/apache/iceberg/pull/3983#issuecomment-1044564560
66c77fad0 - Benchmark
IcebergSortCompactionBenchmark.java
https://github.com/apache/iceberg/blob/66c77fad0c9c14b479909f0c40e8a222d35c00b2/spark/v3.2/spark/src/jmh/java/org/apache/iceberg/spark/action/IcebergSortCompactionBenchmark.java
g :iceberg-spark:iceberg-spark-3.2:jmh -PjmhIncludeRegex=IcebergSortCompactionBenchmark -PjmhOutputPath=benchmark/SortResult.txt
ZOrder has minimal effect on the timing of non-string compactions. String columns are currently given a buffer length of 128 bytes, which probably explains why ZOrder is dramatically more expensive for operations containing Strings. In addition, I think the Spark version here can bail on the first misaligned byte, while we read the entire buffer before we make any decision.
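To illustrate why a fixed 128-byte string buffer costs so much more than a 4-byte int, here is a minimal sketch of bit interleaving over fixed-width byte arrays. It is not the code from this PR: the class name `ZOrderSketch`, the helper names, and the round-robin layout are assumptions made for illustration only.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ZOrderSketch {
  // Assumed fixed width for string columns, matching the 128 bytes mentioned above.
  static final int STRING_WIDTH = 128;

  // Flip the sign bit so that unsigned byte-wise order matches signed int order.
  static byte[] orderedBytes(int value) {
    return ByteBuffer.allocate(Integer.BYTES).putInt(value ^ Integer.MIN_VALUE).array();
  }

  // Truncate or zero-pad the UTF-8 bytes to the fixed width.
  static byte[] orderedBytes(String value) {
    return Arrays.copyOf(value.getBytes(StandardCharsets.UTF_8), STRING_WIDTH);
  }

  // Interleave bits round-robin across columns, most significant bit first.
  // Every bit of every column is visited, so work grows with the total width.
  static byte[] interleave(byte[]... columns) {
    int totalBits = 0;
    int maxBits = 0;
    for (byte[] column : columns) {
      totalBits += column.length * 8;
      maxBits = Math.max(maxBits, column.length * 8);
    }
    byte[] out = new byte[totalBits / 8];
    int outBit = 0;
    for (int bit = 0; bit < maxBits; bit++) {
      for (byte[] column : columns) {
        if (bit >= column.length * 8) {
          continue; // this column has no bits left
        }
        int b = (column[bit / 8] >> (7 - bit % 8)) & 1;
        out[outBit / 8] |= (byte) (b << (7 - outBit % 8));
        outBit++;
      }
    }
    return out;
  }
}
```

Under these assumptions, an int plus a string yields a 132-byte key per row, and all 1056 bits are touched even when the first byte alone would settle a plain comparison.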
Memory is managed in the ZOrder method via thread-local ByteBuffers owned by the ZOrder UDFs created in the sort. I need to check this over, but I believe I did it correctly.
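The general pattern of reusing one buffer per thread can be sketched as follows. This is only an illustration of the idea, not the PR's code: `ThreadLocalBufferSketch`, `borrow`, and the 128-byte capacity are hypothetical names and values.

```java
import java.nio.ByteBuffer;

class ThreadLocalBufferSketch {
  // Assumed capacity, matching the 128-byte string buffer mentioned above.
  private static final int STRING_WIDTH = 128;

  // Each thread lazily allocates its own buffer on first use and then reuses it.
  private static final ThreadLocal<ByteBuffer> BUFFER =
      ThreadLocal.withInitial(() -> ByteBuffer.allocate(STRING_WIDTH));

  // Returns the calling thread's buffer, rewound and ready for writing.
  static ByteBuffer borrow() {
    ByteBuffer buffer = BUFFER.get();
    buffer.clear(); // resets position and limit only; old bytes remain in place
    return buffer;
  }
}
```

One thing worth checking with this pattern is that `clear()` does not zero the contents, so a shorter value written after a longer one leaves stale trailing bytes unless the caller overwrites or pads the remainder.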
The test makes 8 files of 10 million records each and runs the Iceberg compaction algorithm.
The schema of the table is:
```
.withColumnRenamed("id", "longCol")
.withColumn("intCol", expr("CAST(longCol AS INT)"))
.withColumn("intCol2", expr("CAST(longCol AS INT)"))
.withColumn("intCol3", expr("CAST(longCol AS INT)"))
.withColumn("intCol4", expr("CAST(longCol AS INT)"))
.withColumn("floatCol", expr("CAST(longCol AS FLOAT)"))
.withColumn("doubleCol", expr("CAST(longCol AS DOUBLE)"))
.withColumn("dateCol", date_add(current_date(), col("intCol").mod(NUM_FILES)))
.withColumn("timestampCol", expr("TO_TIMESTAMP(dateCol)"))
.withColumn("stringCol", expr("CAST(dateCol AS STRING)"));
```
Test rewrites
```
Benchmark                                        Mode  Cnt     Score      Error  Units
// Use 1 Int column for sorting
IcebergSortCompactionBenchmark.sortInt             ss    3   344.850 ±  146.488  s/op
IcebergSortCompactionBenchmark.zSortInt            ss    3   370.162 ±   23.263  s/op
// Use 2 Int columns for sorting
IcebergSortCompactionBenchmark.sortInt2            ss    3   331.688 ±   64.011  s/op
IcebergSortCompactionBenchmark.zSortInt2           ss    3   384.922 ±  141.313  s/op
// Use 3 Int columns for sorting
IcebergSortCompactionBenchmark.sortInt3            ss    3   331.971 ±   91.621  s/op
IcebergSortCompactionBenchmark.zSortInt3           ss    3   398.508 ±   40.745  s/op
// Use 4 Int columns for sorting
IcebergSortCompactionBenchmark.sortInt4            ss    3   345.414 ±   54.801  s/op
IcebergSortCompactionBenchmark.zSortInt4           ss    3   403.732 ±   43.947  s/op
// Use just a String column for sorting
IcebergSortCompactionBenchmark.sortString          ss    3   449.647 ±  874.281  s/op
IcebergSortCompactionBenchmark.zSortString         ss    3   823.717 ± 1306.732  s/op
// Use four columns, including a String column: String, Int, Date, Double
IcebergSortCompactionBenchmark.sortFourColumns     ss    3   292.972 ± 1978.359  s/op
IcebergSortCompactionBenchmark.zSortFourColumns    ss    3   913.779 ±  717.318  s/op
// Use six columns: "stringCol", "intCol", "dateCol", "timestampCol", "doubleCol", "longCol"
IcebergSortCompactionBenchmark.sortSixColumns      ss    3   419.047 ±  113.040  s/op
IcebergSortCompactionBenchmark.zSortSixColumns     ss    3  1024.332 ±  568.328  s/op
```
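For a quick read of the table, the ZOrder-to-sort ratio per scenario can be computed from the scores above. This is a small sketch with a hypothetical class name (`SlowdownSketch`). With only 3 samples per benchmark and error bars as wide as ±1978 s, the ratios are indicative at best.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SlowdownSketch {
  // Scores in s/op copied from the table above, as {sort, zSort} pairs.
  static Map<String, double[]> scores() {
    Map<String, double[]> scores = new LinkedHashMap<>();
    scores.put("Int", new double[] {344.850, 370.162});
    scores.put("Int2", new double[] {331.688, 384.922});
    scores.put("Int3", new double[] {331.971, 398.508});
    scores.put("Int4", new double[] {345.414, 403.732});
    scores.put("String", new double[] {449.647, 823.717});
    scores.put("FourColumns", new double[] {292.972, 913.779});
    scores.put("SixColumns", new double[] {419.047, 1024.332});
    return scores;
  }

  // Ratio of the ZOrder score to the plain sort score for one scenario.
  static double ratio(double[] pair) {
    return pair[1] / pair[0];
  }

  public static void main(String[] args) {
    scores().forEach(
        (name, pair) -> System.out.printf("%-11s %.2fx%n", name, ratio(pair)));
  }
}
```

The int-only scenarios stay within roughly 1.1x to 1.2x, while every scenario that includes the String column is well above that.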