Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/08/12 06:15:01 UTC
[GitHub] [spark] sadikovi commented on pull request #37485: [SPARK-40052][SQL] Handle direct byte buffers in VectorizedDeltaBinaryPackedReader
sadikovi commented on PR #37485:
URL: https://github.com/apache/spark/pull/37485#issuecomment-1212765192
I reran the benchmarks on a dataset 4x larger (I changed the size in DataSourceReadBenchmark). The numbers are still very similar, with the patch performing slightly better than the current code. I don't quite understand how that is possible, unless the benchmark does not exercise the encoding.
### Before
```
OpenJDK 64-Bit Server VM 1.8.0_312-8u312-b07-0ubuntu1~18.04-b07 on Linux 5.4.0-1071-aws
Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
Parquet Reader Single INT Column Scan:       Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
---------------------------------------------------------------------------------------------------------------------
ParquetReader Vectorized: DataPageV1                   672           707         45       93.6         10.7      1.0X
ParquetReader Vectorized: DataPageV2                   945          1012         95       66.6         15.0      0.7X
ParquetReader Vectorized -> Row: DataPageV1            383           432         28      164.4          6.1      1.8X
ParquetReader Vectorized -> Row: DataPageV2            670           678          8       93.9         10.6      1.0X

OpenJDK 64-Bit Server VM 1.8.0_312-8u312-b07-0ubuntu1~18.04-b07 on Linux 5.4.0-1071-aws
Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
Parquet Reader Single BIGINT Column Scan:    Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
---------------------------------------------------------------------------------------------------------------------
ParquetReader Vectorized: DataPageV1                   931           935          4       67.6         14.8      1.0X
ParquetReader Vectorized: DataPageV2                  1475          1477          4       42.7         23.4      0.6X
ParquetReader Vectorized -> Row: DataPageV1            638           650         14       98.5         10.1      1.5X
ParquetReader Vectorized -> Row: DataPageV2           1172          1173          2       53.7         18.6      0.8X
```
### After
```
[info] OpenJDK 64-Bit Server VM 1.8.0_312-8u312-b07-0ubuntu1~18.04-b07 on Linux 5.4.0-1071-aws
[info] Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
[info] Parquet Reader Single INT Column Scan:       Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
[info] ---------------------------------------------------------------------------------------------------------------------
[info] ParquetReader Vectorized: DataPageV1                   656           704         60       95.9         10.4      1.0X
[info] ParquetReader Vectorized: DataPageV2                   888           898         12       70.9         14.1      0.7X
[info] ParquetReader Vectorized -> Row: DataPageV1            393           435         24      160.2          6.2      1.7X
[info] ParquetReader Vectorized -> Row: DataPageV2            667           681         12       94.3         10.6      1.0X

[info] OpenJDK 64-Bit Server VM 1.8.0_312-8u312-b07-0ubuntu1~18.04-b07 on Linux 5.4.0-1071-aws
[info] Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
[info] Parquet Reader Single BIGINT Column Scan:    Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
[info] ---------------------------------------------------------------------------------------------------------------------
[info] ParquetReader Vectorized: DataPageV1                   935           953         16       67.3         14.9      1.0X
[info] ParquetReader Vectorized: DataPageV2                  1437          1440          4       43.8         22.8      0.7X
[info] ParquetReader Vectorized -> Row: DataPageV1            717           731         12       87.7         11.4      1.3X
[info] ParquetReader Vectorized -> Row: DataPageV2           1176          1185         13       53.5         18.7      0.8X
```
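
To make the before/after comparison easier to read, here is a small sketch that computes the percent change in Best Time(ms) for each case. The values are copied from the two tables in this comment; the names `BEST_MS`, `percent_change`, and `summarize` are illustrative only and are not part of Spark or of the benchmark.

```python
# Best Time (ms) copied from the Before/After tables in this comment,
# stored as (before, after). Names here are illustrative, not Spark APIs.
BEST_MS = {
    ("INT", "Vectorized: DataPageV1"): (672, 656),
    ("INT", "Vectorized: DataPageV2"): (945, 888),
    ("INT", "Vectorized -> Row: DataPageV1"): (383, 393),
    ("INT", "Vectorized -> Row: DataPageV2"): (670, 667),
    ("BIGINT", "Vectorized: DataPageV1"): (931, 935),
    ("BIGINT", "Vectorized: DataPageV2"): (1475, 1437),
    ("BIGINT", "Vectorized -> Row: DataPageV1"): (638, 717),
    ("BIGINT", "Vectorized -> Row: DataPageV2"): (1172, 1176),
}


def percent_change(before, after):
    """Relative change in percent; negative means the patched run was faster."""
    return (after - before) / before * 100.0


def summarize(best_ms):
    """Map each (column type, case) to its percent change, rounded to 0.1."""
    return {
        case: round(percent_change(before, after), 1)
        for case, (before, after) in best_ms.items()
    }


if __name__ == "__main__":
    for (column, case), change in summarize(BEST_MS).items():
        print(f"{column:6s} {case:32s} {change:+.1f}%")
```

On these numbers, the best time for the vectorized DataPageV2 case drops by about 6% for INT and about 2.6% for BIGINT, while the BIGINT `Vectorized -> Row: DataPageV1` case is about 12% slower. The INT DataPageV2 row before the patch has a Stdev of 95 ms, which is larger than its 57 ms change in best time, so that difference could be run-to-run variation.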