Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/02/18 13:58:14 UTC

[GitHub] [iceberg] RussellSpitzer edited a comment on pull request #3983: Spark: Spark3 ZOrder Rewrite Strategy

RussellSpitzer edited a comment on pull request #3983:
URL: https://github.com/apache/iceberg/pull/3983#issuecomment-1044564560


   66c77fad0 - Benchmark
   IcebergSortCompactionBenchmark.java 
   https://github.com/apache/iceberg/blob/66c77fad0c9c14b479909f0c40e8a222d35c00b2/spark/v3.2/spark/src/jmh/java/org/apache/iceberg/spark/action/IcebergSortCompactionBenchmark.java
    
    g :iceberg-spark:iceberg-spark-3.2:jmh -PjmhIncludeRegex=IcebergSortCompactionBenchmark -PjmhOutputPath=benchmark/SortResult.txt
   
   Minimal effect on the timing of non-string compactions. String columns are currently given a fixed buffer length of 128 bytes, which probably explains why ZOrder is dramatically more expensive for operations that include Strings. I also think the Spark version here can bail on the first misaligned byte, while we read the entire buffer before we make any decision.
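
   To illustrate why a fixed 128-byte string buffer is costly, here is a minimal sketch of bit interleaving over fixed-width byte arrays. It is not the code from this PR: `ZOrderSketch`, `fixedWidth`, and `interleave` are hypothetical names, and the padding scheme (truncate or zero-pad the UTF-8 bytes) is an assumption. The point is only that every one of the 128 bytes is visited, even when the string itself is a 10-character date.

   ```java
   import java.nio.charset.StandardCharsets;
   import java.util.Arrays;

   // Hypothetical sketch, not the PR's implementation.
   final class ZOrderSketch {
     // Assumed width, taken from the 128-byte figure mentioned above.
     static final int STRING_WIDTH = 128;

     // Truncates or zero-pads the UTF-8 bytes of a string to a fixed width.
     static byte[] fixedWidth(String value, int width) {
       byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
       return Arrays.copyOf(utf8, width);
     }

     // Interleaves the bits of all columns round-robin, most significant bit first.
     // Columns that run out of bits are skipped for the remaining rounds.
     static byte[] interleave(byte[][] columns) {
       int totalBytes = 0;
       int maxBits = 0;
       for (byte[] column : columns) {
         totalBytes += column.length;
         maxBits = Math.max(maxBits, column.length * 8);
       }
       byte[] out = new byte[totalBytes];
       int outBit = 0;
       for (int bit = 0; bit < maxBits; bit++) {
         for (byte[] column : columns) {
           if (bit >= column.length * 8) {
             continue;
           }
           int value = (column[bit / 8] >> (7 - bit % 8)) & 1;
           out[outBit / 8] |= (byte) (value << (7 - outBit % 8));
           outBit++;
         }
       }
       return out;
     }
   }
   ```

   With one string column and one int column, the output is 132 bytes per row, and the loop touches all 1056 bits regardless of how early two rows would differ in a plain comparison.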
   
   Memory is managed in the ZOrder method via thread-local ByteBuffers owned by the ZOrder UDFs created in the sort. I need to check this over, but I believe I did it correctly.
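
   For readers unfamiliar with the pattern, a thread-local buffer can be sketched roughly as below. This is an illustration under assumptions, not the UDF code in this PR: `ThreadLocalBufferSketch` and `orderedBytes` are hypothetical names, and the sign-bit flip is a common way to make signed longs compare correctly as unsigned bytes, included here only to give the buffer something to hold.

   ```java
   import java.nio.ByteBuffer;

   // Hypothetical sketch, not the PR's implementation.
   final class ThreadLocalBufferSketch {
     // One 8-byte buffer per thread, allocated lazily on first use and then reused.
     private final ThreadLocal<ByteBuffer> buffer =
         ThreadLocal.withInitial(() -> ByteBuffer.allocate(Long.BYTES));

     // Writes the value big-endian with the sign bit flipped, so that unsigned
     // byte-wise comparison of the result matches signed numeric order.
     // Returns the buffer's backing array: the contents are overwritten by the
     // next call on the same thread, so callers must consume or copy them first.
     byte[] orderedBytes(long value) {
       ByteBuffer reused = buffer.get();
       reused.clear();
       reused.putLong(value ^ Long.MIN_VALUE);
       return reused.array();
     }
   }
   ```

   The trade-off is that no allocation happens per row, at the cost of the returned array being valid only until the next call on that thread.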
   
    The test makes 8 files of 10 million records each and runs the Iceberg compaction algorithm.
    The schema of the table is:
    
    ```
           .withColumnRenamed("id", "longCol")
           .withColumn("intCol", expr("CAST(longCol AS INT)"))
           .withColumn("intCol2", expr("CAST(longCol AS INT)"))
           .withColumn("intCol3", expr("CAST(longCol AS INT)"))
           .withColumn("intCol4", expr("CAST(longCol AS INT)"))
           .withColumn("floatCol", expr("CAST(longCol AS FLOAT)"))
           .withColumn("doubleCol", expr("CAST(longCol AS DOUBLE)"))
           .withColumn("dateCol", date_add(current_date(), col("intCol").mod(NUM_FILES)))
           .withColumn("timestampCol", expr("TO_TIMESTAMP(dateCol)"))
           .withColumn("stringCol", expr("CAST(dateCol AS STRING)"));
   ```
   
   Rewrite results:
   
   ```
   Benchmark                                        Mode  Cnt     Score      Error  Units
   // Use 1 Int Column for sorting
   IcebergSortCompactionBenchmark.sortInt             ss    3   344.850 ±  146.488   s/op
   IcebergSortCompactionBenchmark.zSortInt            ss    3   370.162 ±   23.263   s/op
   
   // Use 2 Int Columns for sorting
   IcebergSortCompactionBenchmark.sortInt2            ss    3   331.688 ±   64.011   s/op
   IcebergSortCompactionBenchmark.zSortInt2           ss    3   384.922 ±  141.313   s/op
   
   // Use 3 Int Columns for sorting
   IcebergSortCompactionBenchmark.sortInt3            ss    3   331.971 ±   91.621   s/op
   IcebergSortCompactionBenchmark.zSortInt3           ss    3   398.508 ±   40.745   s/op
   
   // Use 4 Int Columns for sorting
   IcebergSortCompactionBenchmark.sortInt4            ss    3   345.414 ±   54.801   s/op
   IcebergSortCompactionBenchmark.zSortInt4           ss    3   403.732 ±   43.947   s/op
   
   // Use just a string column for sorting
   IcebergSortCompactionBenchmark.sortString          ss    3   449.647 ±  874.281   s/op 
   IcebergSortCompactionBenchmark.zSortString         ss    3   823.717 ± 1306.732   s/op
   
   // Use 4 columns for sorting: String, Int, Date, Double
   IcebergSortCompactionBenchmark.sortFourColumns     ss    3   292.972 ± 1978.359   s/op
   IcebergSortCompactionBenchmark.zSortFourColumns    ss    3   913.779 ±  717.318   s/op
   
   // Use 6 columns for sorting: "stringCol", "intCol", "dateCol", "timestampCol", "doubleCol", "longCol"
   IcebergSortCompactionBenchmark.sortSixColumns      ss    3   419.047 ±  113.040   s/op
   IcebergSortCompactionBenchmark.zSortSixColumns     ss    3  1024.332 ±  568.328   s/op 
   ```
   



