You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@spark.apache.org by su...@apache.org on 2022/03/31 20:11:17 UTC
[spark] branch branch-3.3 updated: [SPARK-37974][SQL] Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support
This is an automated email from the ASF dual-hosted git repository.
sunchao pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-3.3 by this push:
new 4e787dc [SPARK-37974][SQL] Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support
4e787dc is described below
commit 4e787dc0fc9e23a27f7dfdd5cf1c5bdf91e4775f
Author: Parth Chandra <pa...@apache.org>
AuthorDate: Thu Mar 31 13:06:50 2022 -0700
[SPARK-37974][SQL] Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support
### What changes were proposed in this pull request?
This PR provides a vectorized implementation of the DELTA_BYTE_ARRAY encoding of Parquet V2. The PR also implements the DELTA_LENGTH_BYTE_ARRAY encoding which is needed by the former.
### Why are the changes needed?
The current support for Parquet V2 in the vectorized reader uses a non-vectorized version of the above encoding and needs to be vectorized.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Reproduces all the tests for the encodings from the Parquet implementation. Also adds more cases to the Parquet Encoding test suite.
Closes #35262 from parthchandra/SPARK-36879-PR3.
Lead-authored-by: Parth Chandra <pa...@apache.org>
Co-authored-by: Chao Sun <su...@apache.org>
Signed-off-by: Chao Sun <su...@apple.com>
---
.../DataSourceReadBenchmark-jdk11-results.txt | 424 +++++++++----------
.../DataSourceReadBenchmark-jdk17-results.txt | 470 ++++++++++-----------
.../benchmarks/DataSourceReadBenchmark-results.txt | 424 +++++++++----------
.../parquet/SpecificParquetRecordReaderBase.java | 11 +
.../parquet/VectorizedColumnReader.java | 15 +-
.../parquet/VectorizedDeltaBinaryPackedReader.java | 6 +
.../parquet/VectorizedDeltaByteArrayReader.java | 113 ++++-
.../VectorizedDeltaLengthByteArrayReader.java | 89 ++++
.../parquet/VectorizedParquetRecordReader.java | 3 +-
.../parquet/VectorizedValuesReader.java | 16 +
.../execution/vectorized/OffHeapColumnVector.java | 5 +
.../execution/vectorized/OnHeapColumnVector.java | 6 +
.../execution/vectorized/WritableColumnVector.java | 7 +
.../ParquetDeltaByteArrayEncodingSuite.scala | 143 +++++++
.../ParquetDeltaLengthByteArrayEncodingSuite.scala | 142 +++++++
.../datasources/parquet/ParquetEncodingSuite.scala | 93 ++--
16 files changed, 1250 insertions(+), 717 deletions(-)
diff --git a/sql/core/benchmarks/DataSourceReadBenchmark-jdk11-results.txt b/sql/core/benchmarks/DataSourceReadBenchmark-jdk11-results.txt
index 25c43d8..11fc934 100644
--- a/sql/core/benchmarks/DataSourceReadBenchmark-jdk11-results.txt
+++ b/sql/core/benchmarks/DataSourceReadBenchmark-jdk11-results.txt
@@ -2,322 +2,322 @@
SQL Single Numeric Column Scan
================================================================================================
-OpenJDK 64-Bit Server VM 11.0.13+8-LTS on Linux 5.11.0-1025-azure
+OpenJDK 64-Bit Server VM 11.0.14+9-LTS on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
SQL Single BOOLEAN Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 9636 9771 191 1.6 612.6 1.0X
-SQL Json 7960 8227 378 2.0 506.1 1.2X
-SQL Parquet Vectorized: DataPageV1 113 129 12 139.7 7.2 85.6X
-SQL Parquet Vectorized: DataPageV2 84 93 12 186.6 5.4 114.3X
-SQL Parquet MR: DataPageV1 1466 1470 6 10.7 93.2 6.6X
-SQL Parquet MR: DataPageV2 1334 1347 18 11.8 84.8 7.2X
-SQL ORC Vectorized 163 197 27 96.3 10.4 59.0X
-SQL ORC MR 1554 1558 6 10.1 98.8 6.2X
-
-OpenJDK 64-Bit Server VM 11.0.13+8-LTS on Linux 5.11.0-1025-azure
+SQL CSV 11809 12046 335 1.3 750.8 1.0X
+SQL Json 8588 8592 7 1.8 546.0 1.4X
+SQL Parquet Vectorized: DataPageV1 140 162 18 112.0 8.9 84.1X
+SQL Parquet Vectorized: DataPageV2 103 117 12 152.6 6.6 114.6X
+SQL Parquet MR: DataPageV1 1634 1648 20 9.6 103.9 7.2X
+SQL Parquet MR: DataPageV2 1495 1501 9 10.5 95.1 7.9X
+SQL ORC Vectorized 180 224 42 87.4 11.4 65.6X
+SQL ORC MR 1536 1576 57 10.2 97.7 7.7X
+
+OpenJDK 64-Bit Server VM 11.0.14+9-LTS on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Parquet Reader Single BOOLEAN Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
---------------------------------------------------------------------------------------------------------------------------
-ParquetReader Vectorized: DataPageV1 94 103 13 167.1 6.0 1.0X
-ParquetReader Vectorized: DataPageV2 77 86 11 204.3 4.9 1.2X
-ParquetReader Vectorized -> Row: DataPageV1 44 47 4 357.0 2.8 2.1X
-ParquetReader Vectorized -> Row: DataPageV2 35 37 3 445.2 2.2 2.7X
+ParquetReader Vectorized: DataPageV1 109 114 10 144.3 6.9 1.0X
+ParquetReader Vectorized: DataPageV2 90 93 3 175.3 5.7 1.2X
+ParquetReader Vectorized -> Row: DataPageV1 58 60 4 271.9 3.7 1.9X
+ParquetReader Vectorized -> Row: DataPageV2 39 41 3 404.0 2.5 2.8X
-OpenJDK 64-Bit Server VM 11.0.13+8-LTS on Linux 5.11.0-1025-azure
+OpenJDK 64-Bit Server VM 11.0.14+9-LTS on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
SQL Single TINYINT Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 11479 11919 622 1.4 729.8 1.0X
-SQL Json 9894 9922 39 1.6 629.1 1.2X
-SQL Parquet Vectorized: DataPageV1 123 156 30 128.3 7.8 93.6X
-SQL Parquet Vectorized: DataPageV2 126 138 19 125.2 8.0 91.4X
-SQL Parquet MR: DataPageV1 1986 2500 726 7.9 126.3 5.8X
-SQL Parquet MR: DataPageV2 1810 1898 126 8.7 115.1 6.3X
-SQL ORC Vectorized 174 210 30 90.5 11.0 66.1X
-SQL ORC MR 1645 1652 9 9.6 104.6 7.0X
-
-OpenJDK 64-Bit Server VM 11.0.13+8-LTS on Linux 5.11.0-1025-azure
+SQL CSV 14515 14526 16 1.1 922.8 1.0X
+SQL Json 9862 9863 2 1.6 627.0 1.5X
+SQL Parquet Vectorized: DataPageV1 144 167 31 109.5 9.1 101.1X
+SQL Parquet Vectorized: DataPageV2 139 159 27 113.4 8.8 104.6X
+SQL Parquet MR: DataPageV1 1777 1780 3 8.8 113.0 8.2X
+SQL Parquet MR: DataPageV2 1690 1691 2 9.3 107.4 8.6X
+SQL ORC Vectorized 201 238 46 78.3 12.8 72.2X
+SQL ORC MR 1513 1522 14 10.4 96.2 9.6X
+
+OpenJDK 64-Bit Server VM 11.0.14+9-LTS on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Parquet Reader Single TINYINT Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
---------------------------------------------------------------------------------------------------------------------------
-ParquetReader Vectorized: DataPageV1 166 177 14 94.9 10.5 1.0X
-ParquetReader Vectorized: DataPageV2 165 172 11 95.3 10.5 1.0X
-ParquetReader Vectorized -> Row: DataPageV1 95 100 5 165.7 6.0 1.7X
-ParquetReader Vectorized -> Row: DataPageV2 85 89 6 186.0 5.4 2.0X
+ParquetReader Vectorized: DataPageV1 182 192 11 86.6 11.5 1.0X
+ParquetReader Vectorized: DataPageV2 181 188 7 86.9 11.5 1.0X
+ParquetReader Vectorized -> Row: DataPageV1 96 99 4 163.3 6.1 1.9X
+ParquetReader Vectorized -> Row: DataPageV2 96 99 3 163.4 6.1 1.9X
-OpenJDK 64-Bit Server VM 11.0.13+8-LTS on Linux 5.11.0-1025-azure
+OpenJDK 64-Bit Server VM 11.0.14+9-LTS on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
SQL Single SMALLINT Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 12176 12646 664 1.3 774.1 1.0X
-SQL Json 9696 9729 46 1.6 616.5 1.3X
-SQL Parquet Vectorized: DataPageV1 151 201 33 103.9 9.6 80.4X
-SQL Parquet Vectorized: DataPageV2 216 235 15 72.7 13.8 56.3X
-SQL Parquet MR: DataPageV1 1915 2017 145 8.2 121.8 6.4X
-SQL Parquet MR: DataPageV2 1954 1978 33 8.0 124.3 6.2X
-SQL ORC Vectorized 197 235 25 79.7 12.6 61.7X
-SQL ORC MR 1769 1829 85 8.9 112.5 6.9X
-
-OpenJDK 64-Bit Server VM 11.0.13+8-LTS on Linux 5.11.0-1025-azure
+SQL CSV 15326 15437 156 1.0 974.4 1.0X
+SQL Json 10281 10290 13 1.5 653.7 1.5X
+SQL Parquet Vectorized: DataPageV1 164 212 36 95.9 10.4 93.4X
+SQL Parquet Vectorized: DataPageV2 230 244 11 68.5 14.6 66.7X
+SQL Parquet MR: DataPageV1 2108 2111 4 7.5 134.0 7.3X
+SQL Parquet MR: DataPageV2 1940 1963 33 8.1 123.3 7.9X
+SQL ORC Vectorized 229 279 34 68.7 14.6 66.9X
+SQL ORC MR 1903 1906 3 8.3 121.0 8.1X
+
+OpenJDK 64-Bit Server VM 11.0.14+9-LTS on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Parquet Reader Single SMALLINT Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
---------------------------------------------------------------------------------------------------------------------------
-ParquetReader Vectorized: DataPageV1 230 237 12 68.5 14.6 1.0X
-ParquetReader Vectorized: DataPageV2 293 298 9 53.6 18.7 0.8X
-ParquetReader Vectorized -> Row: DataPageV1 215 265 23 73.2 13.7 1.1X
-ParquetReader Vectorized -> Row: DataPageV2 279 301 32 56.3 17.8 0.8X
+ParquetReader Vectorized: DataPageV1 253 262 10 62.2 16.1 1.0X
+ParquetReader Vectorized: DataPageV2 323 327 9 48.8 20.5 0.8X
+ParquetReader Vectorized -> Row: DataPageV1 280 288 8 56.3 17.8 0.9X
+ParquetReader Vectorized -> Row: DataPageV2 301 314 21 52.2 19.1 0.8X
-OpenJDK 64-Bit Server VM 11.0.13+8-LTS on Linux 5.11.0-1025-azure
+OpenJDK 64-Bit Server VM 11.0.14+9-LTS on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
SQL Single INT Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 13069 13409 482 1.2 830.9 1.0X
-SQL Json 10599 10621 32 1.5 673.9 1.2X
-SQL Parquet Vectorized: DataPageV1 142 177 34 110.6 9.0 91.9X
-SQL Parquet Vectorized: DataPageV2 313 359 28 50.2 19.9 41.7X
-SQL Parquet MR: DataPageV1 1979 2044 92 7.9 125.8 6.6X
-SQL Parquet MR: DataPageV2 1958 2030 101 8.0 124.5 6.7X
-SQL ORC Vectorized 277 303 21 56.7 17.6 47.1X
-SQL ORC MR 1692 1782 128 9.3 107.6 7.7X
-
-OpenJDK 64-Bit Server VM 11.0.13+8-LTS on Linux 5.11.0-1025-azure
+SQL CSV 16756 16776 28 0.9 1065.3 1.0X
+SQL Json 10690 10692 3 1.5 679.6 1.6X
+SQL Parquet Vectorized: DataPageV1 160 208 45 98.1 10.2 104.5X
+SQL Parquet Vectorized: DataPageV2 390 423 23 40.3 24.8 43.0X
+SQL Parquet MR: DataPageV1 2196 2201 8 7.2 139.6 7.6X
+SQL Parquet MR: DataPageV2 2065 2072 10 7.6 131.3 8.1X
+SQL ORC Vectorized 323 338 10 48.7 20.5 51.9X
+SQL ORC MR 1899 1906 11 8.3 120.7 8.8X
+
+OpenJDK 64-Bit Server VM 11.0.14+9-LTS on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Parquet Reader Single INT Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
---------------------------------------------------------------------------------------------------------------------------
-ParquetReader Vectorized: DataPageV1 253 269 18 62.1 16.1 1.0X
-ParquetReader Vectorized: DataPageV2 1197 1199 3 13.1 76.1 0.2X
-ParquetReader Vectorized -> Row: DataPageV1 273 361 110 57.7 17.3 0.9X
-ParquetReader Vectorized -> Row: DataPageV2 379 438 37 41.5 24.1 0.7X
+ParquetReader Vectorized: DataPageV1 278 285 9 56.6 17.7 1.0X
+ParquetReader Vectorized: DataPageV2 514 518 2 30.6 32.7 0.5X
+ParquetReader Vectorized -> Row: DataPageV1 308 316 11 51.0 19.6 0.9X
+ParquetReader Vectorized -> Row: DataPageV2 498 525 27 31.6 31.6 0.6X
-OpenJDK 64-Bit Server VM 11.0.13+8-LTS on Linux 5.11.0-1025-azure
+OpenJDK 64-Bit Server VM 11.0.14+9-LTS on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
SQL Single BIGINT Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 17143 17467 458 0.9 1089.9 1.0X
-SQL Json 11507 12198 977 1.4 731.6 1.5X
-SQL Parquet Vectorized: DataPageV1 238 253 19 66.0 15.2 71.9X
-SQL Parquet Vectorized: DataPageV2 502 567 48 31.3 31.9 34.1X
-SQL Parquet MR: DataPageV1 2333 2335 3 6.7 148.4 7.3X
-SQL Parquet MR: DataPageV2 1948 1972 34 8.1 123.8 8.8X
-SQL ORC Vectorized 389 408 20 40.5 24.7 44.1X
-SQL ORC MR 1726 1817 128 9.1 109.7 9.9X
-
-OpenJDK 64-Bit Server VM 11.0.13+8-LTS on Linux 5.11.0-1025-azure
+SQL CSV 21841 21851 14 0.7 1388.6 1.0X
+SQL Json 12828 12843 21 1.2 815.6 1.7X
+SQL Parquet Vectorized: DataPageV1 241 279 19 65.2 15.3 90.6X
+SQL Parquet Vectorized: DataPageV2 554 596 29 28.4 35.2 39.5X
+SQL Parquet MR: DataPageV1 2404 2428 34 6.5 152.8 9.1X
+SQL Parquet MR: DataPageV2 2153 2166 18 7.3 136.9 10.1X
+SQL ORC Vectorized 417 464 62 37.7 26.5 52.4X
+SQL ORC MR 2136 2146 14 7.4 135.8 10.2X
+
+OpenJDK 64-Bit Server VM 11.0.14+9-LTS on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Parquet Reader Single BIGINT Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
---------------------------------------------------------------------------------------------------------------------------
-ParquetReader Vectorized: DataPageV1 289 340 43 54.4 18.4 1.0X
-ParquetReader Vectorized: DataPageV2 572 609 27 27.5 36.4 0.5X
-ParquetReader Vectorized -> Row: DataPageV1 329 353 48 47.8 20.9 0.9X
-ParquetReader Vectorized -> Row: DataPageV2 639 654 18 24.6 40.6 0.5X
+ParquetReader Vectorized: DataPageV1 324 357 34 48.6 20.6 1.0X
+ParquetReader Vectorized: DataPageV2 694 702 11 22.6 44.2 0.5X
+ParquetReader Vectorized -> Row: DataPageV1 378 385 8 41.6 24.0 0.9X
+ParquetReader Vectorized -> Row: DataPageV2 701 708 8 22.4 44.6 0.5X
-OpenJDK 64-Bit Server VM 11.0.13+8-LTS on Linux 5.11.0-1025-azure
+OpenJDK 64-Bit Server VM 11.0.14+9-LTS on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
SQL Single FLOAT Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 13721 13812 129 1.1 872.4 1.0X
-SQL Json 12147 17632 2196 1.3 772.3 1.1X
-SQL Parquet Vectorized: DataPageV1 138 164 25 113.9 8.8 99.4X
-SQL Parquet Vectorized: DataPageV2 151 180 26 104.4 9.6 91.1X
-SQL Parquet MR: DataPageV1 2006 2078 101 7.8 127.6 6.8X
-SQL Parquet MR: DataPageV2 2038 2040 2 7.7 129.6 6.7X
-SQL ORC Vectorized 465 475 10 33.8 29.6 29.5X
-SQL ORC MR 1814 1860 64 8.7 115.4 7.6X
-
-OpenJDK 64-Bit Server VM 11.0.13+8-LTS on Linux 5.11.0-1025-azure
+SQL CSV 17238 17239 2 0.9 1096.0 1.0X
+SQL Json 12295 12307 18 1.3 781.7 1.4X
+SQL Parquet Vectorized: DataPageV1 162 203 27 96.8 10.3 106.1X
+SQL Parquet Vectorized: DataPageV2 157 194 32 100.4 10.0 110.0X
+SQL Parquet MR: DataPageV1 2163 2165 3 7.3 137.5 8.0X
+SQL Parquet MR: DataPageV2 2014 2014 1 7.8 128.0 8.6X
+SQL ORC Vectorized 458 462 5 34.4 29.1 37.7X
+SQL ORC MR 1984 1984 0 7.9 126.1 8.7X
+
+OpenJDK 64-Bit Server VM 11.0.14+9-LTS on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Parquet Reader Single FLOAT Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
---------------------------------------------------------------------------------------------------------------------------
-ParquetReader Vectorized: DataPageV1 275 404 187 57.2 17.5 1.0X
-ParquetReader Vectorized: DataPageV2 275 287 12 57.2 17.5 1.0X
-ParquetReader Vectorized -> Row: DataPageV1 227 265 24 69.2 14.4 1.2X
-ParquetReader Vectorized -> Row: DataPageV2 228 259 28 69.1 14.5 1.2X
+ParquetReader Vectorized: DataPageV1 252 259 10 62.3 16.0 1.0X
+ParquetReader Vectorized: DataPageV2 252 256 9 62.3 16.0 1.0X
+ParquetReader Vectorized -> Row: DataPageV1 259 307 40 60.7 16.5 1.0X
+ParquetReader Vectorized -> Row: DataPageV2 260 295 25 60.5 16.5 1.0X
-OpenJDK 64-Bit Server VM 11.0.13+8-LTS on Linux 5.11.0-1025-azure
+OpenJDK 64-Bit Server VM 11.0.14+9-LTS on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
SQL Single DOUBLE Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 17269 17620 496 0.9 1097.9 1.0X
-SQL Json 15636 15952 447 1.0 994.1 1.1X
-SQL Parquet Vectorized: DataPageV1 238 267 18 66.0 15.1 72.5X
-SQL Parquet Vectorized: DataPageV2 222 260 21 70.9 14.1 77.9X
-SQL Parquet MR: DataPageV1 2418 2457 56 6.5 153.7 7.1X
-SQL Parquet MR: DataPageV2 2194 2207 18 7.2 139.5 7.9X
-SQL ORC Vectorized 519 528 14 30.3 33.0 33.3X
-SQL ORC MR 1760 1770 14 8.9 111.9 9.8X
-
-OpenJDK 64-Bit Server VM 11.0.13+8-LTS on Linux 5.11.0-1025-azure
+SQL CSV 22485 22536 72 0.7 1429.5 1.0X
+SQL Json 16281 16286 8 1.0 1035.1 1.4X
+SQL Parquet Vectorized: DataPageV1 232 288 35 67.9 14.7 97.1X
+SQL Parquet Vectorized: DataPageV2 277 290 9 56.8 17.6 81.2X
+SQL Parquet MR: DataPageV1 2331 2341 15 6.7 148.2 9.6X
+SQL Parquet MR: DataPageV2 2216 2229 18 7.1 140.9 10.1X
+SQL ORC Vectorized 561 569 9 28.0 35.7 40.1X
+SQL ORC MR 2118 2137 27 7.4 134.6 10.6X
+
+OpenJDK 64-Bit Server VM 11.0.14+9-LTS on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Parquet Reader Single DOUBLE Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
---------------------------------------------------------------------------------------------------------------------------
-ParquetReader Vectorized: DataPageV1 284 305 30 55.3 18.1 1.0X
-ParquetReader Vectorized: DataPageV2 286 286 1 55.1 18.2 1.0X
-ParquetReader Vectorized -> Row: DataPageV1 325 337 16 48.4 20.6 0.9X
-ParquetReader Vectorized -> Row: DataPageV2 346 361 16 45.5 22.0 0.8X
+ParquetReader Vectorized: DataPageV1 355 356 1 44.3 22.6 1.0X
+ParquetReader Vectorized: DataPageV2 355 356 1 44.3 22.6 1.0X
+ParquetReader Vectorized -> Row: DataPageV1 379 386 9 41.5 24.1 0.9X
+ParquetReader Vectorized -> Row: DataPageV2 379 389 10 41.5 24.1 0.9X
================================================================================================
Int and String Scan
================================================================================================
-OpenJDK 64-Bit Server VM 11.0.13+8-LTS on Linux 5.11.0-1025-azure
+OpenJDK 64-Bit Server VM 11.0.14+9-LTS on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Int and String Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 12428 12714 405 0.8 1185.2 1.0X
-SQL Json 11088 11251 231 0.9 1057.4 1.1X
-SQL Parquet Vectorized: DataPageV1 1990 1997 10 5.3 189.8 6.2X
-SQL Parquet Vectorized: DataPageV2 2551 2618 95 4.1 243.3 4.9X
-SQL Parquet MR: DataPageV1 3903 3913 15 2.7 372.2 3.2X
-SQL Parquet MR: DataPageV2 3734 3920 263 2.8 356.1 3.3X
-SQL ORC Vectorized 2153 2155 3 4.9 205.3 5.8X
-SQL ORC MR 3485 3549 91 3.0 332.4 3.6X
+SQL CSV 15733 15738 8 0.7 1500.4 1.0X
+SQL Json 11953 11969 22 0.9 1140.0 1.3X
+SQL Parquet Vectorized: DataPageV1 2100 2137 52 5.0 200.2 7.5X
+SQL Parquet Vectorized: DataPageV2 2525 2535 14 4.2 240.8 6.2X
+SQL Parquet MR: DataPageV1 4075 4110 49 2.6 388.6 3.9X
+SQL Parquet MR: DataPageV2 3991 4014 34 2.6 380.6 3.9X
+SQL ORC Vectorized 2323 2355 45 4.5 221.5 6.8X
+SQL ORC MR 3776 3882 150 2.8 360.1 4.2X
================================================================================================
Repeated String Scan
================================================================================================
-OpenJDK 64-Bit Server VM 11.0.13+8-LTS on Linux 5.11.0-1025-azure
+OpenJDK 64-Bit Server VM 11.0.14+9-LTS on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Repeated String: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 7116 7167 72 1.5 678.7 1.0X
-SQL Json 6700 6741 58 1.6 639.0 1.1X
-SQL Parquet Vectorized: DataPageV1 526 556 36 19.9 50.1 13.5X
-SQL Parquet Vectorized: DataPageV2 518 533 15 20.2 49.4 13.7X
-SQL Parquet MR: DataPageV1 1504 1656 216 7.0 143.4 4.7X
-SQL Parquet MR: DataPageV2 1676 1676 1 6.3 159.8 4.2X
-SQL ORC Vectorized 497 518 20 21.1 47.4 14.3X
-SQL ORC MR 1657 1787 183 6.3 158.1 4.3X
+SQL CSV 8921 8966 63 1.2 850.7 1.0X
+SQL Json 7215 7218 5 1.5 688.1 1.2X
+SQL Parquet Vectorized: DataPageV1 604 627 23 17.3 57.6 14.8X
+SQL Parquet Vectorized: DataPageV2 606 620 18 17.3 57.8 14.7X
+SQL Parquet MR: DataPageV1 1686 1693 10 6.2 160.8 5.3X
+SQL Parquet MR: DataPageV2 1660 1665 8 6.3 158.3 5.4X
+SQL ORC Vectorized 541 548 7 19.4 51.6 16.5X
+SQL ORC MR 1920 1930 13 5.5 183.1 4.6X
================================================================================================
Partitioned Table Scan
================================================================================================
-OpenJDK 64-Bit Server VM 11.0.13+8-LTS on Linux 5.11.0-1025-azure
+OpenJDK 64-Bit Server VM 11.0.14+9-LTS on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Partitioned Table: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
---------------------------------------------------------------------------------------------------------------------------------
-Data column - CSV 18247 18411 232 0.9 1160.1 1.0X
-Data column - Json 10860 11264 571 1.4 690.5 1.7X
-Data column - Parquet Vectorized: DataPageV1 223 274 26 70.6 14.2 81.9X
-Data column - Parquet Vectorized: DataPageV2 537 559 23 29.3 34.1 34.0X
-Data column - Parquet MR: DataPageV1 2411 2517 150 6.5 153.3 7.6X
-Data column - Parquet MR: DataPageV2 2299 2356 81 6.8 146.2 7.9X
-Data column - ORC Vectorized 417 433 11 37.7 26.5 43.8X
-Data column - ORC MR 2107 2178 101 7.5 134.0 8.7X
-Partition column - CSV 6090 6186 136 2.6 387.2 3.0X
-Partition column - Json 9479 9603 176 1.7 602.7 1.9X
-Partition column - Parquet Vectorized: DataPageV1 49 69 28 322.0 3.1 373.6X
-Partition column - Parquet Vectorized: DataPageV2 49 63 23 322.1 3.1 373.7X
-Partition column - Parquet MR: DataPageV1 1200 1225 36 13.1 76.3 15.2X
-Partition column - Parquet MR: DataPageV2 1199 1240 57 13.1 76.3 15.2X
-Partition column - ORC Vectorized 53 77 26 295.0 3.4 342.2X
-Partition column - ORC MR 1287 1346 83 12.2 81.8 14.2X
-Both columns - CSV 17671 18140 663 0.9 1123.5 1.0X
-Both columns - Json 11675 12167 696 1.3 742.3 1.6X
-Both columns - Parquet Vectorized: DataPageV1 298 303 9 52.9 18.9 61.3X
-Both columns - Parquet Vectorized: DataPageV2 541 580 36 29.1 34.4 33.7X
-Both columns - Parquet MR: DataPageV1 2448 2491 60 6.4 155.6 7.5X
-Both columns - Parquet MR: DataPageV2 2303 2352 69 6.8 146.4 7.9X
-Both columns - ORC Vectorized 385 406 25 40.9 24.5 47.4X
-Both columns - ORC MR 2118 2202 120 7.4 134.6 8.6X
+Data column - CSV 21951 21976 36 0.7 1395.6 1.0X
+Data column - Json 12896 12905 14 1.2 819.9 1.7X
+Data column - Parquet Vectorized: DataPageV1 247 307 48 63.6 15.7 88.7X
+Data column - Parquet Vectorized: DataPageV2 657 686 25 23.9 41.8 33.4X
+Data column - Parquet MR: DataPageV1 2705 2708 3 5.8 172.0 8.1X
+Data column - Parquet MR: DataPageV2 2621 2621 0 6.0 166.6 8.4X
+Data column - ORC Vectorized 440 468 30 35.7 28.0 49.9X
+Data column - ORC MR 2553 2565 17 6.2 162.3 8.6X
+Partition column - CSV 6640 6641 1 2.4 422.2 3.3X
+Partition column - Json 10499 10512 19 1.5 667.5 2.1X
+Partition column - Parquet Vectorized: DataPageV1 60 79 24 261.4 3.8 364.8X
+Partition column - Parquet Vectorized: DataPageV2 58 81 26 270.2 3.7 377.0X
+Partition column - Parquet MR: DataPageV1 1387 1412 35 11.3 88.2 15.8X
+Partition column - Parquet MR: DataPageV2 1383 1407 34 11.4 87.9 15.9X
+Partition column - ORC Vectorized 61 85 25 256.8 3.9 358.4X
+Partition column - ORC MR 1552 1553 1 10.1 98.7 14.1X
+Both columns - CSV 21896 21919 32 0.7 1392.1 1.0X
+Both columns - Json 13645 13664 27 1.2 867.5 1.6X
+Both columns - Parquet Vectorized: DataPageV1 307 351 33 51.3 19.5 71.6X
+Both columns - Parquet Vectorized: DataPageV2 698 740 36 22.5 44.4 31.4X
+Both columns - Parquet MR: DataPageV1 2804 2821 24 5.6 178.3 7.8X
+Both columns - Parquet MR: DataPageV2 2624 2636 16 6.0 166.8 8.4X
+Both columns - ORC Vectorized 462 521 53 34.0 29.4 47.5X
+Both columns - ORC MR 2564 2580 22 6.1 163.0 8.6X
================================================================================================
String with Nulls Scan
================================================================================================
-OpenJDK 64-Bit Server VM 11.0.13+8-LTS on Linux 5.11.0-1025-azure
+OpenJDK 64-Bit Server VM 11.0.14+9-LTS on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
String with Nulls Scan (0.0%): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 7966 12723 2892 1.3 759.7 1.0X
-SQL Json 9897 10008 157 1.1 943.9 0.8X
-SQL Parquet Vectorized: DataPageV1 1176 1264 125 8.9 112.1 6.8X
-SQL Parquet Vectorized: DataPageV2 2224 2326 144 4.7 212.1 3.6X
-SQL Parquet MR: DataPageV1 3431 3483 73 3.1 327.2 2.3X
-SQL Parquet MR: DataPageV2 3845 4043 280 2.7 366.7 2.1X
-ParquetReader Vectorized: DataPageV1 1055 1056 2 9.9 100.6 7.6X
-ParquetReader Vectorized: DataPageV2 2093 2119 37 5.0 199.6 3.8X
-SQL ORC Vectorized 1129 1217 125 9.3 107.7 7.1X
-SQL ORC MR 2931 2982 72 3.6 279.5 2.7X
-
-OpenJDK 64-Bit Server VM 11.0.13+8-LTS on Linux 5.11.0-1025-azure
+SQL CSV 10818 10826 11 1.0 1031.6 1.0X
+SQL Json 10812 10833 29 1.0 1031.2 1.0X
+SQL Parquet Vectorized: DataPageV1 1301 1312 15 8.1 124.1 8.3X
+SQL Parquet Vectorized: DataPageV2 1953 1982 42 5.4 186.2 5.5X
+SQL Parquet MR: DataPageV1 3677 3680 5 2.9 350.6 2.9X
+SQL Parquet MR: DataPageV2 3970 3972 2 2.6 378.6 2.7X
+ParquetReader Vectorized: DataPageV1 1004 1016 16 10.4 95.8 10.8X
+ParquetReader Vectorized: DataPageV2 1606 1622 22 6.5 153.2 6.7X
+SQL ORC Vectorized 1160 1182 30 9.0 110.7 9.3X
+SQL ORC MR 3266 3330 90 3.2 311.4 3.3X
+
+OpenJDK 64-Bit Server VM 11.0.14+9-LTS on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
String with Nulls Scan (50.0%): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 6338 6508 240 1.7 604.4 1.0X
-SQL Json 7149 7247 138 1.5 681.8 0.9X
-SQL Parquet Vectorized: DataPageV1 937 984 45 11.2 89.3 6.8X
-SQL Parquet Vectorized: DataPageV2 1582 1608 37 6.6 150.9 4.0X
-SQL Parquet MR: DataPageV1 2525 2721 277 4.2 240.8 2.5X
-SQL Parquet MR: DataPageV2 2969 2974 7 3.5 283.1 2.1X
-ParquetReader Vectorized: DataPageV1 933 940 12 11.2 88.9 6.8X
-ParquetReader Vectorized: DataPageV2 1535 1549 20 6.8 146.4 4.1X
-SQL ORC Vectorized 1144 1204 86 9.2 109.1 5.5X
-SQL ORC MR 2816 2822 8 3.7 268.6 2.3X
-
-OpenJDK 64-Bit Server VM 11.0.13+8-LTS on Linux 5.11.0-1025-azure
+SQL CSV 7971 7981 15 1.3 760.2 1.0X
+SQL Json 8266 8269 3 1.3 788.4 1.0X
+SQL Parquet Vectorized: DataPageV1 1025 1036 15 10.2 97.8 7.8X
+SQL Parquet Vectorized: DataPageV2 1432 1440 11 7.3 136.6 5.6X
+SQL Parquet MR: DataPageV1 2792 2806 20 3.8 266.3 2.9X
+SQL Parquet MR: DataPageV2 2958 2992 47 3.5 282.1 2.7X
+ParquetReader Vectorized: DataPageV1 1010 1024 20 10.4 96.3 7.9X
+ParquetReader Vectorized: DataPageV2 1331 1335 4 7.9 127.0 6.0X
+SQL ORC Vectorized 1266 1271 6 8.3 120.8 6.3X
+SQL ORC MR 3032 3089 81 3.5 289.2 2.6X
+
+OpenJDK 64-Bit Server VM 11.0.14+9-LTS on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
String with Nulls Scan (95.0%): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 4443 4504 86 2.4 423.7 1.0X
-SQL Json 4528 4563 49 2.3 431.8 1.0X
-SQL Parquet Vectorized: DataPageV1 213 233 15 49.2 20.3 20.8X
-SQL Parquet Vectorized: DataPageV2 267 294 22 39.3 25.4 16.7X
-SQL Parquet MR: DataPageV1 1691 1700 13 6.2 161.2 2.6X
-SQL Parquet MR: DataPageV2 1515 1565 70 6.9 144.5 2.9X
-ParquetReader Vectorized: DataPageV1 228 231 2 46.0 21.7 19.5X
-ParquetReader Vectorized: DataPageV2 285 296 9 36.8 27.1 15.6X
-SQL ORC Vectorized 369 425 82 28.4 35.2 12.1X
-SQL ORC MR 1457 1463 9 7.2 138.9 3.0X
+SQL CSV 5829 5833 5 1.8 555.9 1.0X
+SQL Json 4966 4978 17 2.1 473.6 1.2X
+SQL Parquet Vectorized: DataPageV1 236 244 7 44.5 22.5 24.7X
+SQL Parquet Vectorized: DataPageV2 305 315 13 34.4 29.1 19.1X
+SQL Parquet MR: DataPageV1 1777 1784 10 5.9 169.5 3.3X
+SQL Parquet MR: DataPageV2 1635 1637 4 6.4 155.9 3.6X
+ParquetReader Vectorized: DataPageV1 242 246 2 43.2 23.1 24.0X
+ParquetReader Vectorized: DataPageV2 309 313 7 34.0 29.5 18.9X
+SQL ORC Vectorized 391 419 53 26.8 37.3 14.9X
+SQL ORC MR 1686 1687 1 6.2 160.8 3.5X
================================================================================================
Single Column Scan From Wide Columns
================================================================================================
-OpenJDK 64-Bit Server VM 11.0.13+8-LTS on Linux 5.11.0-1025-azure
+OpenJDK 64-Bit Server VM 11.0.14+9-LTS on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Single Column Scan from 10 columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 2374 2377 5 0.4 2264.2 1.0X
-SQL Json 2693 2726 46 0.4 2568.5 0.9X
-SQL Parquet Vectorized: DataPageV1 44 62 16 23.8 42.0 54.0X
-SQL Parquet Vectorized: DataPageV2 63 81 21 16.5 60.5 37.5X
-SQL Parquet MR: DataPageV1 173 198 27 6.1 164.6 13.8X
-SQL Parquet MR: DataPageV2 161 193 30 6.5 153.5 14.8X
-SQL ORC Vectorized 53 71 18 19.9 50.2 45.1X
-SQL ORC MR 149 182 34 7.0 142.3 15.9X
-
-OpenJDK 64-Bit Server VM 11.0.13+8-LTS on Linux 5.11.0-1025-azure
+SQL CSV 2301 2305 6 0.5 2194.0 1.0X
+SQL Json 2874 2895 29 0.4 2741.1 0.8X
+SQL Parquet Vectorized: DataPageV1 47 66 20 22.3 44.8 48.9X
+SQL Parquet Vectorized: DataPageV2 74 90 25 14.2 70.5 31.1X
+SQL Parquet MR: DataPageV1 198 219 26 5.3 189.0 11.6X
+SQL Parquet MR: DataPageV2 178 207 45 5.9 170.1 12.9X
+SQL ORC Vectorized 59 76 20 17.6 56.7 38.7X
+SQL ORC MR 173 193 24 6.1 164.6 13.3X
+
+OpenJDK 64-Bit Server VM 11.0.14+9-LTS on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Single Column Scan from 50 columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 5149 5193 62 0.2 4910.9 1.0X
-SQL Json 10556 10891 475 0.1 10066.5 0.5X
-SQL Parquet Vectorized: DataPageV1 64 96 28 16.3 61.3 80.1X
-SQL Parquet Vectorized: DataPageV2 83 106 22 12.6 79.1 62.0X
-SQL Parquet MR: DataPageV1 196 232 25 5.3 187.4 26.2X
-SQL Parquet MR: DataPageV2 184 221 28 5.7 175.1 28.0X
-SQL ORC Vectorized 74 98 31 14.1 70.8 69.3X
-SQL ORC MR 182 214 38 5.8 173.9 28.2X
-
-OpenJDK 64-Bit Server VM 11.0.13+8-LTS on Linux 5.11.0-1025-azure
+SQL CSV 5418 5425 9 0.2 5167.2 1.0X
+SQL Json 11463 11574 156 0.1 10932.3 0.5X
+SQL Parquet Vectorized: DataPageV1 66 103 28 15.8 63.4 81.5X
+SQL Parquet Vectorized: DataPageV2 90 115 27 11.7 85.5 60.4X
+SQL Parquet MR: DataPageV1 218 234 23 4.8 208.3 24.8X
+SQL Parquet MR: DataPageV2 199 225 29 5.3 190.1 27.2X
+SQL ORC Vectorized 76 106 31 13.7 72.8 71.0X
+SQL ORC MR 193 216 28 5.4 184.2 28.0X
+
+OpenJDK 64-Bit Server VM 11.0.14+9-LTS on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Single Column Scan from 100 columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 9077 9107 43 0.1 8656.2 1.0X
-SQL Json 20131 20886 1067 0.1 19198.5 0.5X
-SQL Parquet Vectorized: DataPageV1 93 124 26 11.3 88.8 97.5X
-SQL Parquet Vectorized: DataPageV2 103 128 29 10.2 98.5 87.9X
-SQL Parquet MR: DataPageV1 218 257 35 4.8 207.6 41.7X
-SQL Parquet MR: DataPageV2 213 255 29 4.9 202.7 42.7X
-SQL ORC Vectorized 80 95 20 13.0 76.6 112.9X
-SQL ORC MR 187 207 20 5.6 178.0 48.6X
+SQL CSV 9430 9430 0 0.1 8993.3 1.0X
+SQL Json 21268 21347 111 0.0 20283.1 0.4X
+SQL Parquet Vectorized: DataPageV1 97 124 24 10.9 92.1 97.6X
+SQL Parquet Vectorized: DataPageV2 119 136 19 8.8 113.6 79.2X
+SQL Parquet MR: DataPageV1 254 285 35 4.1 242.1 37.1X
+SQL Parquet MR: DataPageV2 231 260 30 4.5 220.0 40.9X
+SQL ORC Vectorized 95 119 31 11.1 90.4 99.5X
+SQL ORC MR 214 219 5 4.9 203.6 44.2X
diff --git a/sql/core/benchmarks/DataSourceReadBenchmark-jdk17-results.txt b/sql/core/benchmarks/DataSourceReadBenchmark-jdk17-results.txt
index ecba57c..8ff1764 100644
--- a/sql/core/benchmarks/DataSourceReadBenchmark-jdk17-results.txt
+++ b/sql/core/benchmarks/DataSourceReadBenchmark-jdk17-results.txt
@@ -2,322 +2,322 @@
SQL Single Numeric Column Scan
================================================================================================
-OpenJDK 64-Bit Server VM 17.0.1+12-LTS on Linux 5.11.0-1025-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 17.0.2+8-LTS on Linux 5.11.0-1028-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
SQL Single BOOLEAN Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 15972 16369 561 1.0 1015.5 1.0X
-SQL Json 9543 9580 54 1.6 606.7 1.7X
-SQL Parquet Vectorized: DataPageV1 115 144 19 136.3 7.3 138.4X
-SQL Parquet Vectorized: DataPageV2 95 109 15 165.1 6.1 167.6X
-SQL Parquet MR: DataPageV1 2098 2119 30 7.5 133.4 7.6X
-SQL Parquet MR: DataPageV2 2007 2012 6 7.8 127.6 8.0X
-SQL ORC Vectorized 211 225 16 74.5 13.4 75.7X
-SQL ORC MR 2077 2103 36 7.6 132.1 7.7X
-
-OpenJDK 64-Bit Server VM 17.0.1+12-LTS on Linux 5.11.0-1025-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+SQL CSV 9610 10067 646 1.6 611.0 1.0X
+SQL Json 8316 8410 133 1.9 528.7 1.2X
+SQL Parquet Vectorized: DataPageV1 123 145 10 127.7 7.8 78.0X
+SQL Parquet Vectorized: DataPageV2 93 108 12 170.0 5.9 103.8X
+SQL Parquet MR: DataPageV1 1766 1768 2 8.9 112.3 5.4X
+SQL Parquet MR: DataPageV2 1540 1543 3 10.2 97.9 6.2X
+SQL ORC Vectorized 175 182 6 89.6 11.2 54.8X
+SQL ORC MR 1517 1533 22 10.4 96.5 6.3X
+
+OpenJDK 64-Bit Server VM 17.0.2+8-LTS on Linux 5.11.0-1028-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Parquet Reader Single BOOLEAN Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
---------------------------------------------------------------------------------------------------------------------------
-ParquetReader Vectorized: DataPageV1 43 47 2 369.4 2.7 1.0X
-ParquetReader Vectorized: DataPageV2 30 34 2 518.5 1.9 1.4X
-ParquetReader Vectorized -> Row: DataPageV1 47 50 2 333.6 3.0 0.9X
-ParquetReader Vectorized -> Row: DataPageV2 31 35 2 504.8 2.0 1.4X
+ParquetReader Vectorized: DataPageV1 61 63 2 256.3 3.9 1.0X
+ParquetReader Vectorized: DataPageV2 44 45 2 356.3 2.8 1.4X
+ParquetReader Vectorized -> Row: DataPageV1 51 51 1 311.3 3.2 1.2X
+ParquetReader Vectorized -> Row: DataPageV2 32 33 2 492.4 2.0 1.9X
-OpenJDK 64-Bit Server VM 17.0.1+12-LTS on Linux 5.11.0-1025-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 17.0.2+8-LTS on Linux 5.11.0-1028-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
SQL Single TINYINT Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 17468 17543 105 0.9 1110.6 1.0X
-SQL Json 11059 11065 8 1.4 703.1 1.6X
-SQL Parquet Vectorized: DataPageV1 128 142 15 123.1 8.1 136.7X
-SQL Parquet Vectorized: DataPageV2 126 141 8 125.2 8.0 139.1X
-SQL Parquet MR: DataPageV1 2305 2331 36 6.8 146.5 7.6X
-SQL Parquet MR: DataPageV2 2075 2095 28 7.6 131.9 8.4X
-SQL ORC Vectorized 172 191 16 91.5 10.9 101.6X
-SQL ORC MR 1777 1796 26 8.8 113.0 9.8X
-
-OpenJDK 64-Bit Server VM 17.0.1+12-LTS on Linux 5.11.0-1025-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+SQL CSV 14866 14885 26 1.1 945.2 1.0X
+SQL Json 9585 9586 3 1.6 609.4 1.6X
+SQL Parquet Vectorized: DataPageV1 119 131 12 132.4 7.6 125.2X
+SQL Parquet Vectorized: DataPageV2 119 125 5 132.0 7.6 124.7X
+SQL Parquet MR: DataPageV1 1954 2025 101 8.0 124.2 7.6X
+SQL Parquet MR: DataPageV2 1800 1824 35 8.7 114.4 8.3X
+SQL ORC Vectorized 169 176 6 93.0 10.8 87.9X
+SQL ORC MR 1432 1467 50 11.0 91.0 10.4X
+
+OpenJDK 64-Bit Server VM 17.0.2+8-LTS on Linux 5.11.0-1028-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Parquet Reader Single TINYINT Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
---------------------------------------------------------------------------------------------------------------------------
-ParquetReader Vectorized: DataPageV1 72 77 5 219.4 4.6 1.0X
-ParquetReader Vectorized: DataPageV2 72 77 3 217.9 4.6 1.0X
-ParquetReader Vectorized -> Row: DataPageV1 76 83 6 206.6 4.8 0.9X
-ParquetReader Vectorized -> Row: DataPageV2 75 80 3 210.3 4.8 1.0X
+ParquetReader Vectorized: DataPageV1 118 120 2 133.0 7.5 1.0X
+ParquetReader Vectorized: DataPageV2 119 120 2 132.6 7.5 1.0X
+ParquetReader Vectorized -> Row: DataPageV1 72 73 2 218.1 4.6 1.6X
+ParquetReader Vectorized -> Row: DataPageV2 72 74 2 217.7 4.6 1.6X
-OpenJDK 64-Bit Server VM 17.0.1+12-LTS on Linux 5.11.0-1025-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 17.0.2+8-LTS on Linux 5.11.0-1028-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
SQL Single SMALLINT Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 18330 18332 3 0.9 1165.4 1.0X
-SQL Json 11383 11429 66 1.4 723.7 1.6X
-SQL Parquet Vectorized: DataPageV1 179 197 13 88.0 11.4 102.5X
-SQL Parquet Vectorized: DataPageV2 239 263 18 65.7 15.2 76.6X
-SQL Parquet MR: DataPageV1 2552 2567 21 6.2 162.3 7.2X
-SQL Parquet MR: DataPageV2 2389 2436 67 6.6 151.9 7.7X
-SQL ORC Vectorized 246 263 14 64.0 15.6 74.6X
-SQL ORC MR 1965 2002 52 8.0 124.9 9.3X
-
-OpenJDK 64-Bit Server VM 17.0.1+12-LTS on Linux 5.11.0-1025-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+SQL CSV 14601 14699 139 1.1 928.3 1.0X
+SQL Json 9446 9517 101 1.7 600.5 1.5X
+SQL Parquet Vectorized: DataPageV1 156 168 15 101.1 9.9 93.8X
+SQL Parquet Vectorized: DataPageV2 197 213 15 79.6 12.6 73.9X
+SQL Parquet MR: DataPageV1 2113 2130 23 7.4 134.4 6.9X
+SQL Parquet MR: DataPageV2 1739 1784 64 9.0 110.5 8.4X
+SQL ORC Vectorized 192 205 10 81.9 12.2 76.0X
+SQL ORC MR 1518 1588 100 10.4 96.5 9.6X
+
+OpenJDK 64-Bit Server VM 17.0.2+8-LTS on Linux 5.11.0-1028-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Parquet Reader Single SMALLINT Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
---------------------------------------------------------------------------------------------------------------------------
-ParquetReader Vectorized: DataPageV1 253 263 11 62.2 16.1 1.0X
-ParquetReader Vectorized: DataPageV2 306 317 7 51.4 19.4 0.8X
-ParquetReader Vectorized -> Row: DataPageV1 246 250 4 64.0 15.6 1.0X
-ParquetReader Vectorized -> Row: DataPageV2 316 321 4 49.8 20.1 0.8X
+ParquetReader Vectorized: DataPageV1 215 221 6 73.2 13.7 1.0X
+ParquetReader Vectorized: DataPageV2 269 278 8 58.5 17.1 0.8X
+ParquetReader Vectorized -> Row: DataPageV1 206 208 2 76.2 13.1 1.0X
+ParquetReader Vectorized -> Row: DataPageV2 244 262 10 64.4 15.5 0.9X
-OpenJDK 64-Bit Server VM 17.0.1+12-LTS on Linux 5.11.0-1025-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 17.0.2+8-LTS on Linux 5.11.0-1028-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
SQL Single INT Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 19573 19822 352 0.8 1244.4 1.0X
-SQL Json 12141 12217 107 1.3 771.9 1.6X
-SQL Parquet Vectorized: DataPageV1 192 222 28 81.8 12.2 101.8X
-SQL Parquet Vectorized: DataPageV2 345 373 24 45.6 21.9 56.7X
-SQL Parquet MR: DataPageV1 2736 2741 7 5.7 173.9 7.2X
-SQL Parquet MR: DataPageV2 2467 2536 97 6.4 156.9 7.9X
-SQL ORC Vectorized 332 356 20 47.4 21.1 59.0X
-SQL ORC MR 2188 2193 7 7.2 139.1 8.9X
-
-OpenJDK 64-Bit Server VM 17.0.1+12-LTS on Linux 5.11.0-1025-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+SQL CSV 15886 16086 282 1.0 1010.0 1.0X
+SQL Json 9872 9880 12 1.6 627.6 1.6X
+SQL Parquet Vectorized: DataPageV1 174 195 22 90.4 11.1 91.3X
+SQL Parquet Vectorized: DataPageV2 393 409 16 40.0 25.0 40.4X
+SQL Parquet MR: DataPageV1 1953 2064 157 8.1 124.2 8.1X
+SQL Parquet MR: DataPageV2 2215 2231 23 7.1 140.8 7.2X
+SQL ORC Vectorized 280 314 22 56.1 17.8 56.7X
+SQL ORC MR 1681 1706 35 9.4 106.9 9.5X
+
+OpenJDK 64-Bit Server VM 17.0.2+8-LTS on Linux 5.11.0-1028-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Parquet Reader Single INT Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
---------------------------------------------------------------------------------------------------------------------------
-ParquetReader Vectorized: DataPageV1 291 295 4 54.1 18.5 1.0X
-ParquetReader Vectorized: DataPageV2 493 518 39 31.9 31.3 0.6X
-ParquetReader Vectorized -> Row: DataPageV1 300 306 8 52.5 19.1 1.0X
-ParquetReader Vectorized -> Row: DataPageV2 471 483 11 33.4 30.0 0.6X
+ParquetReader Vectorized: DataPageV1 253 263 8 62.1 16.1 1.0X
+ParquetReader Vectorized: DataPageV2 450 461 15 34.9 28.6 0.6X
+ParquetReader Vectorized -> Row: DataPageV1 241 253 12 65.2 15.3 1.1X
+ParquetReader Vectorized -> Row: DataPageV2 437 448 14 36.0 27.8 0.6X
-OpenJDK 64-Bit Server VM 17.0.1+12-LTS on Linux 5.11.0-1025-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 17.0.2+8-LTS on Linux 5.11.0-1028-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
SQL Single BIGINT Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 24692 24718 37 0.6 1569.9 1.0X
-SQL Json 14839 14875 50 1.1 943.5 1.7X
-SQL Parquet Vectorized: DataPageV1 295 316 29 53.3 18.7 83.7X
-SQL Parquet Vectorized: DataPageV2 477 505 24 32.9 30.4 51.7X
-SQL Parquet MR: DataPageV1 2841 2981 197 5.5 180.6 8.7X
-SQL Parquet MR: DataPageV2 2616 2632 23 6.0 166.3 9.4X
-SQL ORC Vectorized 388 403 11 40.5 24.7 63.6X
-SQL ORC MR 2274 2372 138 6.9 144.6 10.9X
-
-OpenJDK 64-Bit Server VM 17.0.1+12-LTS on Linux 5.11.0-1025-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+SQL CSV 20641 20744 145 0.8 1312.3 1.0X
+SQL Json 13055 13122 95 1.2 830.0 1.6X
+SQL Parquet Vectorized: DataPageV1 246 267 16 63.8 15.7 83.8X
+SQL Parquet Vectorized: DataPageV2 513 532 16 30.7 32.6 40.2X
+SQL Parquet MR: DataPageV1 2354 2387 47 6.7 149.7 8.8X
+SQL Parquet MR: DataPageV2 2118 2148 43 7.4 134.6 9.7X
+SQL ORC Vectorized 418 437 17 37.6 26.6 49.4X
+SQL ORC MR 1808 1852 61 8.7 115.0 11.4X
+
+OpenJDK 64-Bit Server VM 17.0.2+8-LTS on Linux 5.11.0-1028-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Parquet Reader Single BIGINT Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
---------------------------------------------------------------------------------------------------------------------------
-ParquetReader Vectorized: DataPageV1 376 387 9 41.9 23.9 1.0X
-ParquetReader Vectorized: DataPageV2 585 591 6 26.9 37.2 0.6X
-ParquetReader Vectorized -> Row: DataPageV1 377 387 9 41.8 23.9 1.0X
-ParquetReader Vectorized -> Row: DataPageV2 576 586 10 27.3 36.6 0.7X
+ParquetReader Vectorized: DataPageV1 306 315 5 51.5 19.4 1.0X
+ParquetReader Vectorized: DataPageV2 584 591 11 26.9 37.1 0.5X
+ParquetReader Vectorized -> Row: DataPageV1 288 299 14 54.6 18.3 1.1X
+ParquetReader Vectorized -> Row: DataPageV2 549 557 8 28.6 34.9 0.6X
-OpenJDK 64-Bit Server VM 17.0.1+12-LTS on Linux 5.11.0-1025-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 17.0.2+8-LTS on Linux 5.11.0-1028-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
SQL Single FLOAT Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 20566 20651 119 0.8 1307.6 1.0X
-SQL Json 14337 14409 101 1.1 911.5 1.4X
-SQL Parquet Vectorized: DataPageV1 154 167 8 101.9 9.8 133.2X
-SQL Parquet Vectorized: DataPageV2 157 178 14 99.9 10.0 130.6X
-SQL Parquet MR: DataPageV1 2730 2730 1 5.8 173.5 7.5X
-SQL Parquet MR: DataPageV2 2459 2491 45 6.4 156.3 8.4X
-SQL ORC Vectorized 479 501 15 32.9 30.4 43.0X
-SQL ORC MR 2293 2343 71 6.9 145.8 9.0X
-
-OpenJDK 64-Bit Server VM 17.0.1+12-LTS on Linux 5.11.0-1025-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+SQL CSV 17024 17292 378 0.9 1082.4 1.0X
+SQL Json 11724 11904 255 1.3 745.4 1.5X
+SQL Parquet Vectorized: DataPageV1 174 186 11 90.6 11.0 98.1X
+SQL Parquet Vectorized: DataPageV2 173 189 14 90.9 11.0 98.4X
+SQL Parquet MR: DataPageV1 1932 2037 148 8.1 122.9 8.8X
+SQL Parquet MR: DataPageV2 1947 1976 41 8.1 123.8 8.7X
+SQL ORC Vectorized 432 459 36 36.4 27.5 39.4X
+SQL ORC MR 1984 1985 1 7.9 126.1 8.6X
+
+OpenJDK 64-Bit Server VM 17.0.2+8-LTS on Linux 5.11.0-1028-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Parquet Reader Single FLOAT Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
---------------------------------------------------------------------------------------------------------------------------
-ParquetReader Vectorized: DataPageV1 272 283 9 57.9 17.3 1.0X
-ParquetReader Vectorized: DataPageV2 250 288 27 62.9 15.9 1.1X
-ParquetReader Vectorized -> Row: DataPageV1 291 301 6 54.1 18.5 0.9X
-ParquetReader Vectorized -> Row: DataPageV2 293 305 14 53.6 18.6 0.9X
+ParquetReader Vectorized: DataPageV1 257 259 2 61.2 16.3 1.0X
+ParquetReader Vectorized: DataPageV2 239 254 10 65.8 15.2 1.1X
+ParquetReader Vectorized -> Row: DataPageV1 259 260 1 60.8 16.4 1.0X
+ParquetReader Vectorized -> Row: DataPageV2 258 262 6 61.0 16.4 1.0X
-OpenJDK 64-Bit Server VM 17.0.1+12-LTS on Linux 5.11.0-1025-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 17.0.2+8-LTS on Linux 5.11.0-1028-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
SQL Single DOUBLE Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 25753 25874 171 0.6 1637.3 1.0X
-SQL Json 19097 19391 416 0.8 1214.2 1.3X
-SQL Parquet Vectorized: DataPageV1 273 288 11 57.6 17.4 94.3X
-SQL Parquet Vectorized: DataPageV2 240 277 25 65.5 15.3 107.3X
-SQL Parquet MR: DataPageV1 2969 3042 103 5.3 188.8 8.7X
-SQL Parquet MR: DataPageV2 2692 2747 78 5.8 171.1 9.6X
-SQL ORC Vectorized 601 626 20 26.2 38.2 42.8X
-SQL ORC MR 2458 2467 13 6.4 156.3 10.5X
-
-OpenJDK 64-Bit Server VM 17.0.1+12-LTS on Linux 5.11.0-1025-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+SQL CSV 22592 22594 4 0.7 1436.3 1.0X
+SQL Json 16252 16271 26 1.0 1033.3 1.4X
+SQL Parquet Vectorized: DataPageV1 247 271 22 63.6 15.7 91.3X
+SQL Parquet Vectorized: DataPageV2 252 266 14 62.4 16.0 89.6X
+SQL Parquet MR: DataPageV1 2337 2352 21 6.7 148.6 9.7X
+SQL Parquet MR: DataPageV2 2187 2223 50 7.2 139.1 10.3X
+SQL ORC Vectorized 489 526 25 32.2 31.1 46.2X
+SQL ORC MR 1816 1892 107 8.7 115.5 12.4X
+
+OpenJDK 64-Bit Server VM 17.0.2+8-LTS on Linux 5.11.0-1028-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Parquet Reader Single DOUBLE Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
---------------------------------------------------------------------------------------------------------------------------
-ParquetReader Vectorized: DataPageV1 354 363 7 44.4 22.5 1.0X
-ParquetReader Vectorized: DataPageV2 345 359 12 45.5 22.0 1.0X
-ParquetReader Vectorized -> Row: DataPageV1 337 345 8 46.7 21.4 1.1X
-ParquetReader Vectorized -> Row: DataPageV2 335 364 21 46.9 21.3 1.1X
+ParquetReader Vectorized: DataPageV1 291 304 8 54.0 18.5 1.0X
+ParquetReader Vectorized: DataPageV2 298 309 7 52.9 18.9 1.0X
+ParquetReader Vectorized -> Row: DataPageV1 330 338 16 47.7 21.0 0.9X
+ParquetReader Vectorized -> Row: DataPageV2 331 338 12 47.5 21.1 0.9X
================================================================================================
Int and String Scan
================================================================================================
-OpenJDK 64-Bit Server VM 17.0.1+12-LTS on Linux 5.11.0-1025-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 17.0.2+8-LTS on Linux 5.11.0-1028-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Int and String Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 18074 18101 37 0.6 1723.7 1.0X
-SQL Json 13211 13214 5 0.8 1259.9 1.4X
-SQL Parquet Vectorized: DataPageV1 2249 2286 53 4.7 214.5 8.0X
-SQL Parquet Vectorized: DataPageV2 2804 2818 20 3.7 267.4 6.4X
-SQL Parquet MR: DataPageV1 4708 4779 100 2.2 449.0 3.8X
-SQL Parquet MR: DataPageV2 4868 5046 251 2.2 464.3 3.7X
-SQL ORC Vectorized 2145 2160 20 4.9 204.6 8.4X
-SQL ORC MR 4180 4308 182 2.5 398.6 4.3X
+SQL CSV 14365 14780 587 0.7 1369.9 1.0X
+SQL Json 10718 10772 76 1.0 1022.2 1.3X
+SQL Parquet Vectorized: DataPageV1 1932 1988 80 5.4 184.2 7.4X
+SQL Parquet Vectorized: DataPageV2 2298 2317 27 4.6 219.2 6.2X
+SQL Parquet MR: DataPageV1 3829 3957 181 2.7 365.1 3.8X
+SQL Parquet MR: DataPageV2 4176 4208 46 2.5 398.3 3.4X
+SQL ORC Vectorized 2026 2046 28 5.2 193.2 7.1X
+SQL ORC MR 3566 3580 21 2.9 340.0 4.0X
================================================================================================
Repeated String Scan
================================================================================================
-OpenJDK 64-Bit Server VM 17.0.1+12-LTS on Linux 5.11.0-1025-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 17.0.2+8-LTS on Linux 5.11.0-1028-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Repeated String: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 11320 11376 78 0.9 1079.6 1.0X
-SQL Json 7593 7664 101 1.4 724.1 1.5X
-SQL Parquet Vectorized: DataPageV1 633 639 9 16.6 60.3 17.9X
-SQL Parquet Vectorized: DataPageV2 621 644 20 16.9 59.2 18.2X
-SQL Parquet MR: DataPageV1 2111 2157 65 5.0 201.3 5.4X
-SQL Parquet MR: DataPageV2 2018 2064 65 5.2 192.4 5.6X
-SQL ORC Vectorized 505 540 36 20.8 48.2 22.4X
-SQL ORC MR 2302 2360 82 4.6 219.5 4.9X
+SQL CSV 9372 9373 1 1.1 893.8 1.0X
+SQL Json 6862 6865 4 1.5 654.4 1.4X
+SQL Parquet Vectorized: DataPageV1 606 613 8 17.3 57.8 15.5X
+SQL Parquet Vectorized: DataPageV2 611 615 3 17.2 58.3 15.3X
+SQL Parquet MR: DataPageV1 1713 1721 11 6.1 163.3 5.5X
+SQL Parquet MR: DataPageV2 1721 1724 4 6.1 164.1 5.4X
+SQL ORC Vectorized 467 469 2 22.5 44.5 20.1X
+SQL ORC MR 1816 1818 2 5.8 173.2 5.2X
================================================================================================
Partitioned Table Scan
================================================================================================
-OpenJDK 64-Bit Server VM 17.0.1+12-LTS on Linux 5.11.0-1025-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 17.0.2+8-LTS on Linux 5.11.0-1028-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Partitioned Table: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
---------------------------------------------------------------------------------------------------------------------------------
-Data column - CSV 24867 25261 556 0.6 1581.0 1.0X
-Data column - Json 13937 13987 70 1.1 886.1 1.8X
-Data column - Parquet Vectorized: DataPageV1 252 264 8 62.3 16.0 98.5X
-Data column - Parquet Vectorized: DataPageV2 547 560 13 28.8 34.7 45.5X
-Data column - Parquet MR: DataPageV1 3492 3509 25 4.5 222.0 7.1X
-Data column - Parquet MR: DataPageV2 3148 3208 84 5.0 200.2 7.9X
-Data column - ORC Vectorized 493 512 21 31.9 31.3 50.5X
-Data column - ORC MR 2925 2943 26 5.4 185.9 8.5X
-Partition column - CSV 7847 7851 5 2.0 498.9 3.2X
-Partition column - Json 11759 11908 210 1.3 747.6 2.1X
-Partition column - Parquet Vectorized: DataPageV1 60 67 7 262.3 3.8 414.7X
-Partition column - Parquet Vectorized: DataPageV2 57 65 9 274.2 3.6 433.5X
-Partition column - Parquet MR: DataPageV1 1762 1768 8 8.9 112.1 14.1X
-Partition column - Parquet MR: DataPageV2 1742 1783 59 9.0 110.7 14.3X
-Partition column - ORC Vectorized 59 71 7 265.6 3.8 419.9X
-Partition column - ORC MR 1743 1764 29 9.0 110.8 14.3X
-Both columns - CSV 25859 25924 92 0.6 1644.1 1.0X
-Both columns - Json 14693 14764 101 1.1 934.2 1.7X
-Both columns - Parquet Vectorized: DataPageV1 341 395 66 46.2 21.7 73.0X
-Both columns - Parquet Vectorized: DataPageV2 624 643 13 25.2 39.7 39.9X
-Both columns - Parquet MR: DataPageV1 3541 3611 99 4.4 225.2 7.0X
-Both columns - Parquet MR: DataPageV2 3279 3301 32 4.8 208.4 7.6X
-Both columns - ORC Vectorized 434 483 40 36.2 27.6 57.3X
-Both columns - ORC MR 2946 2964 26 5.3 187.3 8.4X
+Data column - CSV 21799 22053 360 0.7 1385.9 1.0X
+Data column - Json 12978 12985 10 1.2 825.1 1.7X
+Data column - Parquet Vectorized: DataPageV1 261 277 15 60.4 16.6 83.7X
+Data column - Parquet Vectorized: DataPageV2 601 647 42 26.2 38.2 36.3X
+Data column - Parquet MR: DataPageV1 2796 2798 2 5.6 177.8 7.8X
+Data column - Parquet MR: DataPageV2 2595 2626 43 6.1 165.0 8.4X
+Data column - ORC Vectorized 428 449 25 36.8 27.2 50.9X
+Data column - ORC MR 2162 2274 159 7.3 137.5 10.1X
+Partition column - CSV 5804 5922 167 2.7 369.0 3.8X
+Partition column - Json 10410 10455 64 1.5 661.8 2.1X
+Partition column - Parquet Vectorized: DataPageV1 56 60 6 280.9 3.6 389.3X
+Partition column - Parquet Vectorized: DataPageV2 55 59 5 286.5 3.5 397.1X
+Partition column - Parquet MR: DataPageV1 1357 1357 1 11.6 86.3 16.1X
+Partition column - Parquet MR: DataPageV2 1339 1339 0 11.7 85.1 16.3X
+Partition column - ORC Vectorized 57 61 5 276.3 3.6 382.9X
+Partition column - ORC MR 1346 1351 7 11.7 85.6 16.2X
+Both columns - CSV 20812 21349 759 0.8 1323.2 1.0X
+Both columns - Json 13061 13372 440 1.2 830.4 1.7X
+Both columns - Parquet Vectorized: DataPageV1 265 275 6 59.3 16.9 82.1X
+Both columns - Parquet Vectorized: DataPageV2 619 637 20 25.4 39.4 35.2X
+Both columns - Parquet MR: DataPageV1 2827 2830 4 5.6 179.8 7.7X
+Both columns - Parquet MR: DataPageV2 2593 2603 14 6.1 164.8 8.4X
+Both columns - ORC Vectorized 391 432 37 40.2 24.9 55.7X
+Both columns - ORC MR 2438 2455 25 6.5 155.0 8.9X
================================================================================================
String with Nulls Scan
================================================================================================
-OpenJDK 64-Bit Server VM 17.0.1+12-LTS on Linux 5.11.0-1025-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 17.0.2+8-LTS on Linux 5.11.0-1028-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
String with Nulls Scan (0.0%): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 13698 13783 121 0.8 1306.3 1.0X
-SQL Json 11030 11144 161 1.0 1051.9 1.2X
-SQL Parquet Vectorized: DataPageV1 1695 1699 7 6.2 161.6 8.1X
-SQL Parquet Vectorized: DataPageV2 2740 2744 5 3.8 261.3 5.0X
-SQL Parquet MR: DataPageV1 4547 4594 66 2.3 433.7 3.0X
-SQL Parquet MR: DataPageV2 5382 5455 103 1.9 513.3 2.5X
-ParquetReader Vectorized: DataPageV1 1238 1238 0 8.5 118.0 11.1X
-ParquetReader Vectorized: DataPageV2 2312 2325 19 4.5 220.5 5.9X
-SQL ORC Vectorized 1134 1147 18 9.2 108.2 12.1X
-SQL ORC MR 3966 4015 69 2.6 378.2 3.5X
-
-OpenJDK 64-Bit Server VM 17.0.1+12-LTS on Linux 5.11.0-1025-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+SQL CSV 10697 10736 56 1.0 1020.1 1.0X
+SQL Json 9722 9963 341 1.1 927.2 1.1X
+SQL Parquet Vectorized: DataPageV1 1337 1342 6 7.8 127.6 8.0X
+SQL Parquet Vectorized: DataPageV2 1731 1757 38 6.1 165.1 6.2X
+SQL Parquet MR: DataPageV1 3581 3584 4 2.9 341.5 3.0X
+SQL Parquet MR: DataPageV2 3996 4001 7 2.6 381.1 2.7X
+ParquetReader Vectorized: DataPageV1 1006 1015 13 10.4 96.0 10.6X
+ParquetReader Vectorized: DataPageV2 1476 1477 2 7.1 140.7 7.2X
+SQL ORC Vectorized 957 1042 120 11.0 91.3 11.2X
+SQL ORC MR 3060 3068 11 3.4 291.8 3.5X
+
+OpenJDK 64-Bit Server VM 17.0.2+8-LTS on Linux 5.11.0-1028-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
String with Nulls Scan (50.0%): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 10613 10658 64 1.0 1012.1 1.0X
-SQL Json 8973 8996 33 1.2 855.7 1.2X
-SQL Parquet Vectorized: DataPageV1 1208 1221 18 8.7 115.2 8.8X
-SQL Parquet Vectorized: DataPageV2 1949 1950 1 5.4 185.9 5.4X
-SQL Parquet MR: DataPageV1 3701 3716 21 2.8 353.0 2.9X
-SQL Parquet MR: DataPageV2 4150 4192 60 2.5 395.8 2.6X
-ParquetReader Vectorized: DataPageV1 1191 1192 1 8.8 113.6 8.9X
-ParquetReader Vectorized: DataPageV2 1874 1917 61 5.6 178.7 5.7X
-SQL ORC Vectorized 1338 1365 38 7.8 127.6 7.9X
-SQL ORC MR 3659 3674 21 2.9 349.0 2.9X
-
-OpenJDK 64-Bit Server VM 17.0.1+12-LTS on Linux 5.11.0-1025-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+SQL CSV 7299 7300 1 1.4 696.1 1.0X
+SQL Json 7453 7659 292 1.4 710.8 1.0X
+SQL Parquet Vectorized: DataPageV1 896 916 32 11.7 85.4 8.1X
+SQL Parquet Vectorized: DataPageV2 1282 1283 1 8.2 122.3 5.7X
+SQL Parquet MR: DataPageV1 2586 2678 130 4.1 246.6 2.8X
+SQL Parquet MR: DataPageV2 3061 3066 6 3.4 291.9 2.4X
+ParquetReader Vectorized: DataPageV1 913 915 3 11.5 87.0 8.0X
+ParquetReader Vectorized: DataPageV2 1181 1183 3 8.9 112.6 6.2X
+SQL ORC Vectorized 1102 1111 13 9.5 105.1 6.6X
+SQL ORC MR 2916 3002 121 3.6 278.1 2.5X
+
+OpenJDK 64-Bit Server VM 17.0.2+8-LTS on Linux 5.11.0-1028-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
String with Nulls Scan (95.0%): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 8714 8809 134 1.2 831.0 1.0X
-SQL Json 5801 5819 25 1.8 553.2 1.5X
-SQL Parquet Vectorized: DataPageV1 297 316 11 35.3 28.3 29.3X
-SQL Parquet Vectorized: DataPageV2 363 382 12 28.9 34.6 24.0X
-SQL Parquet MR: DataPageV1 2350 2366 22 4.5 224.1 3.7X
-SQL Parquet MR: DataPageV2 2132 2183 73 4.9 203.3 4.1X
-ParquetReader Vectorized: DataPageV1 296 310 13 35.4 28.2 29.4X
-ParquetReader Vectorized: DataPageV2 368 372 3 28.5 35.1 23.7X
-SQL ORC Vectorized 474 487 10 22.1 45.2 18.4X
-SQL ORC MR 2025 2031 9 5.2 193.1 4.3X
+SQL CSV 4615 4619 6 2.3 440.1 1.0X
+SQL Json 4926 4927 1 2.1 469.8 0.9X
+SQL Parquet Vectorized: DataPageV1 240 246 5 43.8 22.9 19.3X
+SQL Parquet Vectorized: DataPageV2 287 295 4 36.5 27.4 16.1X
+SQL Parquet MR: DataPageV1 1774 1781 10 5.9 169.2 2.6X
+SQL Parquet MR: DataPageV2 1772 1773 1 5.9 169.0 2.6X
+ParquetReader Vectorized: DataPageV1 238 240 2 44.0 22.7 19.4X
+ParquetReader Vectorized: DataPageV2 285 288 3 36.8 27.2 16.2X
+SQL ORC Vectorized 382 392 6 27.4 36.5 12.1X
+SQL ORC MR 1616 1617 2 6.5 154.1 2.9X
================================================================================================
Single Column Scan From Wide Columns
================================================================================================
-OpenJDK 64-Bit Server VM 17.0.1+12-LTS on Linux 5.11.0-1025-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+OpenJDK 64-Bit Server VM 17.0.2+8-LTS on Linux 5.11.0-1028-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Single Column Scan from 10 columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 2677 2687 14 0.4 2553.2 1.0X
-SQL Json 3581 3588 10 0.3 3414.8 0.7X
-SQL Parquet Vectorized: DataPageV1 52 59 7 20.2 49.6 51.5X
-SQL Parquet Vectorized: DataPageV2 68 75 7 15.4 65.0 39.3X
-SQL Parquet MR: DataPageV1 245 257 9 4.3 233.6 10.9X
-SQL Parquet MR: DataPageV2 224 237 8 4.7 213.7 11.9X
-SQL ORC Vectorized 64 70 5 16.3 61.3 41.7X
-SQL ORC MR 208 216 8 5.0 198.2 12.9X
-
-OpenJDK 64-Bit Server VM 17.0.1+12-LTS on Linux 5.11.0-1025-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+SQL CSV 2051 2052 2 0.5 1956.1 1.0X
+SQL Json 3230 3232 3 0.3 3080.6 0.6X
+SQL Parquet Vectorized: DataPageV1 45 50 7 23.2 43.2 45.3X
+SQL Parquet Vectorized: DataPageV2 67 72 8 15.6 64.1 30.5X
+SQL Parquet MR: DataPageV1 191 198 8 5.5 181.9 10.8X
+SQL Parquet MR: DataPageV2 176 181 6 6.0 167.7 11.7X
+SQL ORC Vectorized 55 60 6 19.0 52.7 37.1X
+SQL ORC MR 164 168 4 6.4 156.1 12.5X
+
+OpenJDK 64-Bit Server VM 17.0.2+8-LTS on Linux 5.11.0-1028-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Single Column Scan from 50 columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 5753 5771 25 0.2 5486.7 1.0X
-SQL Json 13801 13851 71 0.1 13161.9 0.4X
-SQL Parquet Vectorized: DataPageV1 75 83 9 14.1 71.1 77.2X
-SQL Parquet Vectorized: DataPageV2 84 93 7 12.4 80.6 68.1X
-SQL Parquet MR: DataPageV1 269 280 7 3.9 256.5 21.4X
-SQL Parquet MR: DataPageV2 251 258 8 4.2 238.9 23.0X
-SQL ORC Vectorized 82 88 6 12.8 78.3 70.1X
-SQL ORC MR 223 239 8 4.7 213.0 25.8X
-
-OpenJDK 64-Bit Server VM 17.0.1+12-LTS on Linux 5.11.0-1025-azure
-Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
+SQL CSV 4530 4530 0 0.2 4320.0 1.0X
+SQL Json 12530 12536 9 0.1 11949.2 0.4X
+SQL Parquet Vectorized: DataPageV1 60 65 6 17.4 57.6 75.0X
+SQL Parquet Vectorized: DataPageV2 83 91 8 12.6 79.1 54.6X
+SQL Parquet MR: DataPageV1 211 216 7 5.0 201.2 21.5X
+SQL Parquet MR: DataPageV2 195 204 12 5.4 186.0 23.2X
+SQL ORC Vectorized 70 75 5 14.9 67.1 64.4X
+SQL ORC MR 182 191 11 5.8 173.5 24.9X
+
+OpenJDK 64-Bit Server VM 17.0.2+8-LTS on Linux 5.11.0-1028-azure
+Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Single Column Scan from 100 columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 9487 9503 24 0.1 9047.1 1.0X
-SQL Json 26109 26240 186 0.0 24899.2 0.4X
-SQL Parquet Vectorized: DataPageV1 100 110 10 10.4 95.8 94.5X
-SQL Parquet Vectorized: DataPageV2 113 119 6 9.3 107.3 84.3X
-SQL Parquet MR: DataPageV1 280 296 11 3.7 267.2 33.9X
-SQL Parquet MR: DataPageV2 281 321 68 3.7 268.0 33.8X
-SQL ORC Vectorized 92 101 8 11.4 87.5 103.4X
-SQL ORC MR 228 245 10 4.6 217.7 41.6X
+SQL CSV 7758 7763 7 0.1 7398.8 1.0X
+SQL Json 24530 24546 23 0.0 23393.2 0.3X
+SQL Parquet Vectorized: DataPageV1 91 96 6 11.5 87.1 84.9X
+SQL Parquet Vectorized: DataPageV2 113 118 6 9.2 108.1 68.4X
+SQL Parquet MR: DataPageV1 246 254 8 4.3 234.2 31.6X
+SQL Parquet MR: DataPageV2 229 235 6 4.6 218.7 33.8X
+SQL ORC Vectorized 88 92 6 11.9 83.8 88.3X
+SQL ORC MR 205 214 9 5.1 195.2 37.9X
diff --git a/sql/core/benchmarks/DataSourceReadBenchmark-results.txt b/sql/core/benchmarks/DataSourceReadBenchmark-results.txt
index 6a2b6bf..1a7ebe5 100644
--- a/sql/core/benchmarks/DataSourceReadBenchmark-results.txt
+++ b/sql/core/benchmarks/DataSourceReadBenchmark-results.txt
@@ -2,322 +2,322 @@
SQL Single Numeric Column Scan
================================================================================================
-OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 5.11.0-1025-azure
+OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
SQL Single BOOLEAN Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 11570 12144 812 1.4 735.6 1.0X
-SQL Json 7542 7568 37 2.1 479.5 1.5X
-SQL Parquet Vectorized: DataPageV1 129 144 16 121.9 8.2 89.7X
-SQL Parquet Vectorized: DataPageV2 92 106 20 170.3 5.9 125.2X
-SQL Parquet MR: DataPageV1 1416 1419 3 11.1 90.0 8.2X
-SQL Parquet MR: DataPageV2 1281 1359 110 12.3 81.4 9.0X
-SQL ORC Vectorized 161 176 10 97.4 10.3 71.6X
-SQL ORC MR 1525 1545 29 10.3 96.9 7.6X
-
-OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 5.11.0-1025-azure
+SQL CSV 12972 13210 337 1.2 824.8 1.0X
+SQL Json 7440 7634 275 2.1 473.0 1.7X
+SQL Parquet Vectorized: DataPageV1 125 137 10 125.8 8.0 103.7X
+SQL Parquet Vectorized: DataPageV2 93 103 20 168.4 5.9 138.9X
+SQL Parquet MR: DataPageV1 1621 1657 52 9.7 103.0 8.0X
+SQL Parquet MR: DataPageV2 1396 1420 34 11.3 88.7 9.3X
+SQL ORC Vectorized 178 186 16 88.5 11.3 73.0X
+SQL ORC MR 1501 1503 4 10.5 95.4 8.6X
+
+OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Parquet Reader Single BOOLEAN Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
---------------------------------------------------------------------------------------------------------------------------
-ParquetReader Vectorized: DataPageV1 111 118 6 142.3 7.0 1.0X
-ParquetReader Vectorized: DataPageV2 116 117 2 135.7 7.4 1.0X
-ParquetReader Vectorized -> Row: DataPageV1 48 49 1 324.9 3.1 2.3X
-ParquetReader Vectorized -> Row: DataPageV2 39 39 1 405.8 2.5 2.9X
+ParquetReader Vectorized: DataPageV1 132 134 4 119.3 8.4 1.0X
+ParquetReader Vectorized: DataPageV2 115 117 3 136.7 7.3 1.1X
+ParquetReader Vectorized -> Row: DataPageV1 57 58 1 275.1 3.6 2.3X
+ParquetReader Vectorized -> Row: DataPageV2 41 41 1 387.9 2.6 3.3X
-OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 5.11.0-1025-azure
+OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
SQL Single TINYINT Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 13807 14535 1030 1.1 877.8 1.0X
-SQL Json 8079 8094 21 1.9 513.6 1.7X
-SQL Parquet Vectorized: DataPageV1 139 152 12 113.0 8.9 99.2X
-SQL Parquet Vectorized: DataPageV2 140 147 5 112.5 8.9 98.7X
-SQL Parquet MR: DataPageV1 1637 1741 148 9.6 104.1 8.4X
-SQL Parquet MR: DataPageV2 1522 1636 161 10.3 96.8 9.1X
-SQL ORC Vectorized 147 160 10 106.9 9.4 93.8X
-SQL ORC MR 1542 1545 4 10.2 98.1 9.0X
-
-OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 5.11.0-1025-azure
+SQL CSV 15808 15867 83 1.0 1005.0 1.0X
+SQL Json 9119 9174 78 1.7 579.8 1.7X
+SQL Parquet Vectorized: DataPageV1 157 163 7 100.2 10.0 100.7X
+SQL Parquet Vectorized: DataPageV2 156 161 5 100.6 9.9 101.1X
+SQL Parquet MR: DataPageV1 1846 1871 36 8.5 117.4 8.6X
+SQL Parquet MR: DataPageV2 1702 1707 7 9.2 108.2 9.3X
+SQL ORC Vectorized 130 134 2 120.7 8.3 121.3X
+SQL ORC MR 1536 1542 9 10.2 97.7 10.3X
+
+OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Parquet Reader Single TINYINT Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
---------------------------------------------------------------------------------------------------------------------------
-ParquetReader Vectorized: DataPageV1 166 171 8 94.7 10.6 1.0X
-ParquetReader Vectorized: DataPageV2 166 169 4 94.7 10.6 1.0X
-ParquetReader Vectorized -> Row: DataPageV1 156 157 2 100.7 9.9 1.1X
-ParquetReader Vectorized -> Row: DataPageV2 156 157 2 100.7 9.9 1.1X
+ParquetReader Vectorized: DataPageV1 198 202 5 79.3 12.6 1.0X
+ParquetReader Vectorized: DataPageV2 197 199 3 79.8 12.5 1.0X
+ParquetReader Vectorized -> Row: DataPageV1 188 190 3 83.4 12.0 1.1X
+ParquetReader Vectorized -> Row: DataPageV2 188 190 3 83.5 12.0 1.1X
-OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 5.11.0-1025-azure
+OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
SQL Single SMALLINT Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 15327 15421 133 1.0 974.5 1.0X
-SQL Json 8564 8799 332 1.8 544.5 1.8X
-SQL Parquet Vectorized: DataPageV1 202 219 11 77.8 12.8 75.8X
-SQL Parquet Vectorized: DataPageV2 203 210 8 77.7 12.9 75.7X
-SQL Parquet MR: DataPageV1 1874 2004 183 8.4 119.2 8.2X
-SQL Parquet MR: DataPageV2 1606 1709 146 9.8 102.1 9.5X
-SQL ORC Vectorized 167 179 10 94.1 10.6 91.7X
-SQL ORC MR 1404 1408 6 11.2 89.3 10.9X
-
-OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 5.11.0-1025-azure
+SQL CSV 16474 16493 27 1.0 1047.4 1.0X
+SQL Json 9477 9478 1 1.7 602.6 1.7X
+SQL Parquet Vectorized: DataPageV1 211 216 7 74.4 13.4 77.9X
+SQL Parquet Vectorized: DataPageV2 215 221 5 73.0 13.7 76.5X
+SQL Parquet MR: DataPageV1 2114 2133 28 7.4 134.4 7.8X
+SQL Parquet MR: DataPageV2 1792 1808 22 8.8 113.9 9.2X
+SQL ORC Vectorized 179 182 4 88.0 11.4 92.2X
+SQL ORC MR 1586 1588 2 9.9 100.8 10.4X
+
+OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Parquet Reader Single SMALLINT Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
---------------------------------------------------------------------------------------------------------------------------
-ParquetReader Vectorized: DataPageV1 222 236 13 70.7 14.1 1.0X
-ParquetReader Vectorized: DataPageV2 259 268 14 60.8 16.5 0.9X
-ParquetReader Vectorized -> Row: DataPageV1 228 248 11 68.9 14.5 1.0X
-ParquetReader Vectorized -> Row: DataPageV2 264 293 13 59.5 16.8 0.8X
+ParquetReader Vectorized: DataPageV1 254 257 5 62.0 16.1 1.0X
+ParquetReader Vectorized: DataPageV2 299 302 4 52.6 19.0 0.8X
+ParquetReader Vectorized -> Row: DataPageV1 236 238 4 66.7 15.0 1.1X
+ParquetReader Vectorized -> Row: DataPageV2 281 283 4 56.0 17.9 0.9X
-OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 5.11.0-1025-azure
+OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
SQL Single INT Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 17479 17651 243 0.9 1111.3 1.0X
-SQL Json 9565 9582 25 1.6 608.1 1.8X
-SQL Parquet Vectorized: DataPageV1 152 159 8 103.2 9.7 114.7X
-SQL Parquet Vectorized: DataPageV2 290 308 18 54.2 18.4 60.3X
-SQL Parquet MR: DataPageV1 1861 1980 169 8.5 118.3 9.4X
-SQL Parquet MR: DataPageV2 1647 1748 142 9.5 104.7 10.6X
-SQL ORC Vectorized 230 251 12 68.3 14.6 75.9X
-SQL ORC MR 1645 1648 3 9.6 104.6 10.6X
-
-OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 5.11.0-1025-azure
+SQL CSV 18049 18086 52 0.9 1147.5 1.0X
+SQL Json 10073 10074 1 1.6 640.4 1.8X
+SQL Parquet Vectorized: DataPageV1 177 184 9 89.1 11.2 102.3X
+SQL Parquet Vectorized: DataPageV2 301 306 6 52.2 19.1 59.9X
+SQL Parquet MR: DataPageV1 2120 2134 21 7.4 134.8 8.5X
+SQL Parquet MR: DataPageV2 1855 1893 54 8.5 117.9 9.7X
+SQL ORC Vectorized 246 249 1 63.8 15.7 73.2X
+SQL ORC MR 1655 1660 6 9.5 105.2 10.9X
+
+OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Parquet Reader Single INT Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
---------------------------------------------------------------------------------------------------------------------------
-ParquetReader Vectorized: DataPageV1 208 213 9 75.7 13.2 1.0X
-ParquetReader Vectorized: DataPageV2 355 382 14 44.3 22.6 0.6X
-ParquetReader Vectorized -> Row: DataPageV1 212 233 8 74.1 13.5 1.0X
-ParquetReader Vectorized -> Row: DataPageV2 350 353 7 45.0 22.2 0.6X
+ParquetReader Vectorized: DataPageV1 239 243 5 65.8 15.2 1.0X
+ParquetReader Vectorized: DataPageV2 384 387 4 40.9 24.4 0.6X
+ParquetReader Vectorized -> Row: DataPageV1 223 224 3 70.7 14.2 1.1X
+ParquetReader Vectorized -> Row: DataPageV2 366 370 7 43.0 23.3 0.7X
-OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 5.11.0-1025-azure
+OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
SQL Single BIGINT Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 21825 21944 169 0.7 1387.6 1.0X
-SQL Json 11877 11927 71 1.3 755.1 1.8X
-SQL Parquet Vectorized: DataPageV1 229 242 18 68.8 14.5 95.5X
-SQL Parquet Vectorized: DataPageV2 435 452 23 36.1 27.7 50.1X
-SQL Parquet MR: DataPageV1 2050 2184 190 7.7 130.3 10.6X
-SQL Parquet MR: DataPageV2 1829 1927 138 8.6 116.3 11.9X
-SQL ORC Vectorized 287 308 14 54.8 18.3 76.0X
-SQL ORC MR 1579 1603 34 10.0 100.4 13.8X
-
-OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 5.11.0-1025-azure
+SQL CSV 22703 22737 48 0.7 1443.4 1.0X
+SQL Json 12723 12743 28 1.2 808.9 1.8X
+SQL Parquet Vectorized: DataPageV1 228 261 76 69.1 14.5 99.7X
+SQL Parquet Vectorized: DataPageV2 465 472 7 33.8 29.5 48.9X
+SQL Parquet MR: DataPageV1 2166 2168 3 7.3 137.7 10.5X
+SQL Parquet MR: DataPageV2 1921 1936 21 8.2 122.1 11.8X
+SQL ORC Vectorized 307 313 10 51.2 19.5 73.9X
+SQL ORC MR 1730 1745 21 9.1 110.0 13.1X
+
+OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Parquet Reader Single BIGINT Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
---------------------------------------------------------------------------------------------------------------------------
-ParquetReader Vectorized: DataPageV1 299 341 86 52.6 19.0 1.0X
-ParquetReader Vectorized: DataPageV2 551 607 110 28.5 35.1 0.5X
-ParquetReader Vectorized -> Row: DataPageV1 341 344 4 46.2 21.7 0.9X
-ParquetReader Vectorized -> Row: DataPageV2 508 557 33 31.0 32.3 0.6X
+ParquetReader Vectorized: DataPageV1 309 316 10 51.0 19.6 1.0X
+ParquetReader Vectorized: DataPageV2 559 563 5 28.1 35.5 0.6X
+ParquetReader Vectorized -> Row: DataPageV1 292 296 6 53.9 18.6 1.1X
+ParquetReader Vectorized -> Row: DataPageV2 541 547 8 29.1 34.4 0.6X
-OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 5.11.0-1025-azure
+OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
SQL Single FLOAT Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 17585 17926 482 0.9 1118.0 1.0X
-SQL Json 11927 12180 357 1.3 758.3 1.5X
-SQL Parquet Vectorized: DataPageV1 150 161 11 104.6 9.6 116.9X
-SQL Parquet Vectorized: DataPageV2 150 160 8 104.7 9.5 117.1X
-SQL Parquet MR: DataPageV1 1830 1867 52 8.6 116.4 9.6X
-SQL Parquet MR: DataPageV2 1715 1828 160 9.2 109.1 10.3X
-SQL ORC Vectorized 328 358 15 48.0 20.8 53.6X
-SQL ORC MR 1584 1687 145 9.9 100.7 11.1X
-
-OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 5.11.0-1025-azure
+SQL CSV 18790 18808 25 0.8 1194.6 1.0X
+SQL Json 11572 11579 10 1.4 735.7 1.6X
+SQL Parquet Vectorized: DataPageV1 155 158 5 101.7 9.8 121.6X
+SQL Parquet Vectorized: DataPageV2 158 162 6 99.6 10.0 119.0X
+SQL Parquet MR: DataPageV1 2041 2050 12 7.7 129.8 9.2X
+SQL Parquet MR: DataPageV2 1903 1905 3 8.3 121.0 9.9X
+SQL ORC Vectorized 357 359 2 44.1 22.7 52.7X
+SQL ORC MR 1745 1755 15 9.0 110.9 10.8X
+
+OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Parquet Reader Single FLOAT Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
---------------------------------------------------------------------------------------------------------------------------
-ParquetReader Vectorized: DataPageV1 207 211 8 76.0 13.2 1.0X
-ParquetReader Vectorized: DataPageV2 207 220 11 75.8 13.2 1.0X
-ParquetReader Vectorized -> Row: DataPageV1 208 214 9 75.7 13.2 1.0X
-ParquetReader Vectorized -> Row: DataPageV2 208 213 9 75.6 13.2 1.0X
+ParquetReader Vectorized: DataPageV1 239 243 4 65.7 15.2 1.0X
+ParquetReader Vectorized: DataPageV2 240 243 4 65.7 15.2 1.0X
+ParquetReader Vectorized -> Row: DataPageV1 221 225 4 71.1 14.1 1.1X
+ParquetReader Vectorized -> Row: DataPageV2 223 225 4 70.6 14.2 1.1X
-OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 5.11.0-1025-azure
+OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
SQL Single DOUBLE Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 22569 22614 63 0.7 1434.9 1.0X
-SQL Json 15590 15600 15 1.0 991.2 1.4X
-SQL Parquet Vectorized: DataPageV1 225 241 17 69.9 14.3 100.3X
-SQL Parquet Vectorized: DataPageV2 219 236 13 72.0 13.9 103.3X
-SQL Parquet MR: DataPageV1 2013 2109 136 7.8 128.0 11.2X
-SQL Parquet MR: DataPageV2 1850 1967 165 8.5 117.6 12.2X
-SQL ORC Vectorized 396 416 25 39.7 25.2 56.9X
-SQL ORC MR 1707 1763 79 9.2 108.5 13.2X
-
-OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 5.11.0-1025-azure
+SQL CSV 23476 23478 3 0.7 1492.6 1.0X
+SQL Json 14568 15103 757 1.1 926.2 1.6X
+SQL Parquet Vectorized: DataPageV1 212 230 16 74.2 13.5 110.7X
+SQL Parquet Vectorized: DataPageV2 209 218 8 75.4 13.3 112.5X
+SQL Parquet MR: DataPageV1 1943 2080 194 8.1 123.5 12.1X
+SQL Parquet MR: DataPageV2 1824 1830 9 8.6 116.0 12.9X
+SQL ORC Vectorized 395 419 20 39.9 25.1 59.5X
+SQL ORC MR 1844 1855 15 8.5 117.2 12.7X
+
+OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Parquet Reader Single DOUBLE Column Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
---------------------------------------------------------------------------------------------------------------------------
-ParquetReader Vectorized: DataPageV1 280 298 13 56.2 17.8 1.0X
-ParquetReader Vectorized: DataPageV2 278 300 21 56.6 17.7 1.0X
-ParquetReader Vectorized -> Row: DataPageV1 280 299 13 56.2 17.8 1.0X
-ParquetReader Vectorized -> Row: DataPageV2 304 307 4 51.8 19.3 0.9X
+ParquetReader Vectorized: DataPageV1 280 322 88 56.1 17.8 1.0X
+ParquetReader Vectorized: DataPageV2 282 301 19 55.8 17.9 1.0X
+ParquetReader Vectorized -> Row: DataPageV1 284 290 4 55.3 18.1 1.0X
+ParquetReader Vectorized -> Row: DataPageV2 287 293 9 54.8 18.3 1.0X
================================================================================================
Int and String Scan
================================================================================================
-OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 5.11.0-1025-azure
+OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Int and String Scan: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 15548 16002 641 0.7 1482.8 1.0X
-SQL Json 10801 11108 434 1.0 1030.1 1.4X
-SQL Parquet Vectorized: DataPageV1 1858 1966 152 5.6 177.2 8.4X
-SQL Parquet Vectorized: DataPageV2 2342 2466 175 4.5 223.4 6.6X
-SQL Parquet MR: DataPageV1 3873 3908 49 2.7 369.4 4.0X
-SQL Parquet MR: DataPageV2 3764 3869 148 2.8 358.9 4.1X
-SQL ORC Vectorized 2018 2020 3 5.2 192.5 7.7X
-SQL ORC MR 3247 3450 287 3.2 309.7 4.8X
+SQL CSV 14663 15652 1399 0.7 1398.4 1.0X
+SQL Json 10757 10845 125 1.0 1025.9 1.4X
+SQL Parquet Vectorized: DataPageV1 1815 1933 166 5.8 173.1 8.1X
+SQL Parquet Vectorized: DataPageV2 2244 2297 75 4.7 214.0 6.5X
+SQL Parquet MR: DataPageV1 3491 3685 273 3.0 333.0 4.2X
+SQL Parquet MR: DataPageV2 3600 3627 37 2.9 343.4 4.1X
+SQL ORC Vectorized 1804 1895 129 5.8 172.0 8.1X
+SQL ORC MR 3181 3379 280 3.3 303.4 4.6X
================================================================================================
Repeated String Scan
================================================================================================
-OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 5.11.0-1025-azure
+OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Repeated String: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 8028 8337 436 1.3 765.6 1.0X
-SQL Json 6362 6488 178 1.6 606.7 1.3X
-SQL Parquet Vectorized: DataPageV1 642 673 51 16.3 61.3 12.5X
-SQL Parquet Vectorized: DataPageV2 646 678 40 16.2 61.6 12.4X
-SQL Parquet MR: DataPageV1 1504 1604 141 7.0 143.5 5.3X
-SQL Parquet MR: DataPageV2 1645 1646 1 6.4 156.9 4.9X
-SQL ORC Vectorized 386 415 25 27.2 36.8 20.8X
-SQL ORC MR 1704 1730 37 6.2 162.5 4.7X
+SQL CSV 8466 8778 441 1.2 807.4 1.0X
+SQL Json 6389 6454 93 1.6 609.3 1.3X
+SQL Parquet Vectorized: DataPageV1 644 675 52 16.3 61.4 13.1X
+SQL Parquet Vectorized: DataPageV2 640 668 44 16.4 61.0 13.2X
+SQL Parquet MR: DataPageV1 1579 1602 33 6.6 150.6 5.4X
+SQL Parquet MR: DataPageV2 1536 1539 4 6.8 146.5 5.5X
+SQL ORC Vectorized 439 443 4 23.9 41.9 19.3X
+SQL ORC MR 1787 1806 27 5.9 170.5 4.7X
================================================================================================
Partitioned Table Scan
================================================================================================
-OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 5.11.0-1025-azure
+OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Partitioned Table: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
---------------------------------------------------------------------------------------------------------------------------------
-Data column - CSV 21472 21514 59 0.7 1365.2 1.0X
-Data column - Json 11537 11606 97 1.4 733.5 1.9X
-Data column - Parquet Vectorized: DataPageV1 238 256 11 66.1 15.1 90.2X
-Data column - Parquet Vectorized: DataPageV2 482 507 17 32.6 30.6 44.6X
-Data column - Parquet MR: DataPageV1 2213 2355 200 7.1 140.7 9.7X
-Data column - Parquet MR: DataPageV2 2036 2163 179 7.7 129.4 10.5X
-Data column - ORC Vectorized 289 310 20 54.4 18.4 74.3X
-Data column - ORC MR 1898 1936 54 8.3 120.7 11.3X
-Partition column - CSV 6307 6364 80 2.5 401.0 3.4X
-Partition column - Json 9167 9253 121 1.7 582.8 2.3X
-Partition column - Parquet Vectorized: DataPageV1 62 66 3 253.5 3.9 346.1X
-Partition column - Parquet Vectorized: DataPageV2 61 65 2 259.2 3.9 353.8X
-Partition column - Parquet MR: DataPageV1 1086 1088 3 14.5 69.0 19.8X
-Partition column - Parquet MR: DataPageV2 1091 1146 78 14.4 69.4 19.7X
-Partition column - ORC Vectorized 63 67 2 251.1 4.0 342.9X
-Partition column - ORC MR 1173 1175 3 13.4 74.6 18.3X
-Both columns - CSV 21458 22038 820 0.7 1364.3 1.0X
-Both columns - Json 12697 12712 22 1.2 807.2 1.7X
-Both columns - Parquet Vectorized: DataPageV1 275 288 10 57.2 17.5 78.0X
-Both columns - Parquet Vectorized: DataPageV2 505 525 24 31.2 32.1 42.5X
-Both columns - Parquet MR: DataPageV1 2541 2547 9 6.2 161.5 8.5X
-Both columns - Parquet MR: DataPageV2 2059 2060 2 7.6 130.9 10.4X
-Both columns - ORC Vectorized 326 349 16 48.3 20.7 66.0X
-Both columns - ORC MR 2116 2151 50 7.4 134.5 10.1X
+Data column - CSV 22527 22546 26 0.7 1432.3 1.0X
+Data column - Json 12533 12712 254 1.3 796.8 1.8X
+Data column - Parquet Vectorized: DataPageV1 229 244 14 68.7 14.6 98.3X
+Data column - Parquet Vectorized: DataPageV2 508 519 16 31.0 32.3 44.3X
+Data column - Parquet MR: DataPageV1 2525 2535 13 6.2 160.6 8.9X
+Data column - Parquet MR: DataPageV2 2194 2209 21 7.2 139.5 10.3X
+Data column - ORC Vectorized 315 317 2 50.0 20.0 71.6X
+Data column - ORC MR 2098 2100 3 7.5 133.4 10.7X
+Partition column - CSV 6747 6753 9 2.3 429.0 3.3X
+Partition column - Json 10080 10102 32 1.6 640.8 2.2X
+Partition column - Parquet Vectorized: DataPageV1 60 63 2 262.8 3.8 376.4X
+Partition column - Parquet Vectorized: DataPageV2 58 63 8 270.2 3.7 387.1X
+Partition column - Parquet MR: DataPageV1 1152 1155 4 13.6 73.3 19.5X
+Partition column - Parquet MR: DataPageV2 1149 1149 1 13.7 73.0 19.6X
+Partition column - ORC Vectorized 61 64 3 259.8 3.8 372.1X
+Partition column - ORC MR 1332 1332 0 11.8 84.7 16.9X
+Both columns - CSV 23030 23042 17 0.7 1464.2 1.0X
+Both columns - Json 13569 13581 16 1.2 862.7 1.7X
+Both columns - Parquet Vectorized: DataPageV1 268 277 11 58.7 17.0 84.0X
+Both columns - Parquet Vectorized: DataPageV2 551 557 7 28.6 35.0 40.9X
+Both columns - Parquet MR: DataPageV1 2556 2557 0 6.2 162.5 8.8X
+Both columns - Parquet MR: DataPageV2 2287 2292 7 6.9 145.4 9.9X
+Both columns - ORC Vectorized 361 363 2 43.6 22.9 62.5X
+Both columns - ORC MR 2158 2161 5 7.3 137.2 10.4X
================================================================================================
String with Nulls Scan
================================================================================================
-OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 5.11.0-1025-azure
+OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
String with Nulls Scan (0.0%): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 10074 10372 422 1.0 960.7 1.0X
-SQL Json 10037 10147 156 1.0 957.2 1.0X
-SQL Parquet Vectorized: DataPageV1 1192 1226 47 8.8 113.7 8.4X
-SQL Parquet Vectorized: DataPageV2 2349 2423 105 4.5 224.0 4.3X
-SQL Parquet MR: DataPageV1 2995 3114 168 3.5 285.6 3.4X
-SQL Parquet MR: DataPageV2 3847 3900 75 2.7 366.9 2.6X
-ParquetReader Vectorized: DataPageV1 888 918 51 11.8 84.7 11.3X
-ParquetReader Vectorized: DataPageV2 2128 2159 43 4.9 203.0 4.7X
-SQL ORC Vectorized 837 908 61 12.5 79.8 12.0X
-SQL ORC MR 2792 2882 127 3.8 266.3 3.6X
-
-OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 5.11.0-1025-azure
+SQL CSV 11418 11463 63 0.9 1088.9 1.0X
+SQL Json 9698 9938 339 1.1 924.9 1.2X
+SQL Parquet Vectorized: DataPageV1 1176 1207 45 8.9 112.1 9.7X
+SQL Parquet Vectorized: DataPageV2 1652 1669 24 6.3 157.6 6.9X
+SQL Parquet MR: DataPageV1 3041 3119 109 3.4 290.0 3.8X
+SQL Parquet MR: DataPageV2 4030 4110 114 2.6 384.3 2.8X
+ParquetReader Vectorized: DataPageV1 1008 1014 8 10.4 96.2 11.3X
+ParquetReader Vectorized: DataPageV2 1247 1305 82 8.4 118.9 9.2X
+SQL ORC Vectorized 820 856 56 12.8 78.2 13.9X
+SQL ORC MR 2762 2807 64 3.8 263.4 4.1X
+
+OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
String with Nulls Scan (50.0%): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 7808 7810 3 1.3 744.6 1.0X
-SQL Json 7434 7491 82 1.4 708.9 1.1X
-SQL Parquet Vectorized: DataPageV1 1037 1044 10 10.1 98.9 7.5X
-SQL Parquet Vectorized: DataPageV2 1528 1529 3 6.9 145.7 5.1X
-SQL Parquet MR: DataPageV1 2300 2411 156 4.6 219.4 3.4X
-SQL Parquet MR: DataPageV2 2637 2639 4 4.0 251.5 3.0X
-ParquetReader Vectorized: DataPageV1 843 907 56 12.4 80.4 9.3X
-ParquetReader Vectorized: DataPageV2 1424 1446 30 7.4 135.8 5.5X
-SQL ORC Vectorized 1131 1132 1 9.3 107.8 6.9X
-SQL ORC MR 2781 2856 106 3.8 265.3 2.8X
-
-OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 5.11.0-1025-azure
+SQL CSV 6752 6756 5 1.6 644.0 1.0X
+SQL Json 7469 7549 112 1.4 712.3 0.9X
+SQL Parquet Vectorized: DataPageV1 912 990 67 11.5 87.0 7.4X
+SQL Parquet Vectorized: DataPageV2 1141 1215 104 9.2 108.8 5.9X
+SQL Parquet MR: DataPageV1 2256 2418 229 4.6 215.1 3.0X
+SQL Parquet MR: DataPageV2 2712 2882 241 3.9 258.6 2.5X
+ParquetReader Vectorized: DataPageV1 956 960 6 11.0 91.2 7.1X
+ParquetReader Vectorized: DataPageV2 1211 1211 1 8.7 115.5 5.6X
+SQL ORC Vectorized 1135 1135 1 9.2 108.2 6.0X
+SQL ORC MR 2716 2766 70 3.9 259.0 2.5X
+
+OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
String with Nulls Scan (95.0%): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 5357 5538 255 2.0 510.9 1.0X
-SQL Json 4354 4387 47 2.4 415.2 1.2X
-SQL Parquet Vectorized: DataPageV1 212 226 15 49.5 20.2 25.3X
-SQL Parquet Vectorized: DataPageV2 265 276 16 39.6 25.2 20.2X
-SQL Parquet MR: DataPageV1 1575 1578 4 6.7 150.2 3.4X
-SQL Parquet MR: DataPageV2 1624 1638 21 6.5 154.8 3.3X
-ParquetReader Vectorized: DataPageV1 219 234 14 47.8 20.9 24.4X
-ParquetReader Vectorized: DataPageV2 274 294 17 38.2 26.2 19.5X
-SQL ORC Vectorized 370 393 12 28.4 35.3 14.5X
-SQL ORC MR 1540 1545 7 6.8 146.9 3.5X
+SQL CSV 4496 4710 303 2.3 428.8 1.0X
+SQL Json 4324 4343 28 2.4 412.3 1.0X
+SQL Parquet Vectorized: DataPageV1 221 244 9 47.5 21.0 20.4X
+SQL Parquet Vectorized: DataPageV2 270 288 13 38.8 25.8 16.6X
+SQL Parquet MR: DataPageV1 1451 1461 15 7.2 138.3 3.1X
+SQL Parquet MR: DataPageV2 1364 1368 5 7.7 130.0 3.3X
+ParquetReader Vectorized: DataPageV1 256 258 2 40.9 24.5 17.5X
+ParquetReader Vectorized: DataPageV2 273 291 17 38.4 26.0 16.5X
+SQL ORC Vectorized 345 367 24 30.4 32.9 13.0X
+SQL ORC MR 1508 1509 2 7.0 143.8 3.0X
================================================================================================
Single Column Scan From Wide Columns
================================================================================================
-OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 5.11.0-1025-azure
+OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Single Column Scan from 10 columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 2159 2212 74 0.5 2059.3 1.0X
-SQL Json 2836 2896 84 0.4 2704.5 0.8X
-SQL Parquet Vectorized: DataPageV1 54 59 9 19.5 51.4 40.1X
-SQL Parquet Vectorized: DataPageV2 66 72 8 15.9 63.1 32.7X
-SQL Parquet MR: DataPageV1 173 186 10 6.1 164.5 12.5X
-SQL Parquet MR: DataPageV2 159 172 8 6.6 151.8 13.6X
-SQL ORC Vectorized 54 60 10 19.2 52.0 39.6X
-SQL ORC MR 150 161 7 7.0 143.3 14.4X
-
-OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 5.11.0-1025-azure
+SQL CSV 2036 2140 147 0.5 1941.4 1.0X
+SQL Json 2796 2927 186 0.4 2666.5 0.7X
+SQL Parquet Vectorized: DataPageV1 47 52 7 22.2 45.0 43.1X
+SQL Parquet Vectorized: DataPageV2 64 69 7 16.4 61.2 31.7X
+SQL Parquet MR: DataPageV1 176 190 11 5.9 168.1 11.5X
+SQL Parquet MR: DataPageV2 157 171 6 6.7 149.3 13.0X
+SQL ORC Vectorized 52 56 10 20.3 49.2 39.5X
+SQL ORC MR 142 152 8 7.4 135.9 14.3X
+
+OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Single Column Scan from 50 columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 5877 5883 8 0.2 5605.0 1.0X
-SQL Json 11474 11587 159 0.1 10942.9 0.5X
-SQL Parquet Vectorized: DataPageV1 66 72 7 15.9 63.1 88.9X
-SQL Parquet Vectorized: DataPageV2 83 90 8 12.6 79.4 70.6X
-SQL Parquet MR: DataPageV1 191 201 9 5.5 182.6 30.7X
-SQL Parquet MR: DataPageV2 179 187 9 5.9 170.3 32.9X
-SQL ORC Vectorized 70 76 12 14.9 67.1 83.5X
-SQL ORC MR 167 175 7 6.3 159.2 35.2X
-
-OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 5.11.0-1025-azure
+SQL CSV 5384 5560 249 0.2 5134.8 1.0X
+SQL Json 10934 11224 410 0.1 10427.1 0.5X
+SQL Parquet Vectorized: DataPageV1 62 67 7 16.8 59.5 86.3X
+SQL Parquet Vectorized: DataPageV2 79 85 7 13.3 75.3 68.1X
+SQL Parquet MR: DataPageV1 198 211 9 5.3 188.6 27.2X
+SQL Parquet MR: DataPageV2 177 188 9 5.9 168.7 30.4X
+SQL ORC Vectorized 67 73 10 15.6 64.0 80.2X
+SQL ORC MR 160 172 8 6.6 152.3 33.7X
+
+OpenJDK 64-Bit Server VM 1.8.0_322-b06 on Linux 5.11.0-1028-azure
Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Single Column Scan from 100 columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
-SQL CSV 9695 9965 382 0.1 9245.8 1.0X
-SQL Json 22119 23566 2045 0.0 21094.6 0.4X
-SQL Parquet Vectorized: DataPageV1 96 104 7 10.9 91.6 100.9X
-SQL Parquet Vectorized: DataPageV2 113 121 8 9.3 107.8 85.8X
-SQL Parquet MR: DataPageV1 227 243 9 4.6 216.2 42.8X
-SQL Parquet MR: DataPageV2 210 225 12 5.0 200.2 46.2X
-SQL ORC Vectorized 90 96 10 11.7 85.7 107.9X
-SQL ORC MR 188 199 9 5.6 178.9 51.7X
+SQL CSV 9602 9882 396 0.1 9157.0 1.0X
+SQL Json 21369 21987 874 0.0 20379.5 0.4X
+SQL Parquet Vectorized: DataPageV1 90 97 7 11.7 85.4 107.2X
+SQL Parquet Vectorized: DataPageV2 107 115 7 9.8 102.0 89.8X
+SQL Parquet MR: DataPageV1 227 234 14 4.6 216.1 42.4X
+SQL Parquet MR: DataPageV2 204 216 10 5.1 194.4 47.1X
+SQL ORC Vectorized 81 89 8 12.9 77.6 118.1X
+SQL ORC MR 181 195 12 5.8 172.3 53.2X
diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java
index 07e35c1..5669534 100644
--- a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java
+++ b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java
@@ -29,6 +29,8 @@ import java.util.Map;
import java.util.Set;
import com.google.common.annotations.VisibleForTesting;
+import org.apache.parquet.VersionParser;
+import org.apache.parquet.VersionParser.ParsedVersion;
import org.apache.parquet.column.page.PageReadStore;
import scala.Option;
@@ -69,6 +71,9 @@ public abstract class SpecificParquetRecordReaderBase<T> extends RecordReader<Vo
protected MessageType fileSchema;
protected MessageType requestedSchema;
protected StructType sparkSchema;
+ // Keep track of the version of the parquet writer. An older version wrote
+ // corrupt delta byte arrays, and the version check is needed to detect that.
+ protected ParsedVersion writerVersion;
/**
* The total number of rows this RecordReader will eventually read. The sum of the
@@ -93,6 +98,12 @@ public abstract class SpecificParquetRecordReaderBase<T> extends RecordReader<Vo
HadoopInputFile.fromPath(file, configuration), options);
this.reader = new ParquetRowGroupReaderImpl(fileReader);
this.fileSchema = fileReader.getFileMetaData().getSchema();
+ try {
+ this.writerVersion = VersionParser.parse(fileReader.getFileMetaData().getCreatedBy());
+ } catch (Exception e) {
+ // Swallow any exception, if we cannot parse the version we will revert to a sequential read
+ // if the column is a delta byte array encoding (due to PARQUET-246).
+ }
Map<String, String> fileMetadata = fileReader.getFileMetaData().getKeyValueMetaData();
ReadSupport<T> readSupport = getReadSupportInstance(getReadSupportClass(configuration));
ReadSupport.ReadContext readContext = readSupport.init(new InitContext(
diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java
index 57a307b..ee09d2b 100644
--- a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java
+++ b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java
@@ -21,6 +21,8 @@ import java.io.IOException;
import java.time.ZoneId;
import java.util.PrimitiveIterator;
+import org.apache.parquet.CorruptDeltaByteArrays;
+import org.apache.parquet.VersionParser.ParsedVersion;
import org.apache.parquet.bytes.ByteBufferInputStream;
import org.apache.parquet.bytes.BytesInput;
import org.apache.parquet.bytes.BytesUtils;
@@ -28,6 +30,7 @@ import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.column.Dictionary;
import org.apache.parquet.column.Encoding;
import org.apache.parquet.column.page.*;
+import org.apache.parquet.column.values.RequiresPreviousReader;
import org.apache.parquet.column.values.ValuesReader;
import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.LogicalTypeAnnotation.DateLogicalTypeAnnotation;
@@ -86,6 +89,7 @@ public class VectorizedColumnReader {
private final ColumnDescriptor descriptor;
private final LogicalTypeAnnotation logicalTypeAnnotation;
private final String datetimeRebaseMode;
+ private final ParsedVersion writerVersion;
public VectorizedColumnReader(
ColumnDescriptor descriptor,
@@ -96,7 +100,8 @@ public class VectorizedColumnReader {
String datetimeRebaseMode,
String datetimeRebaseTz,
String int96RebaseMode,
- String int96RebaseTz) throws IOException {
+ String int96RebaseTz,
+ ParsedVersion writerVersion) throws IOException {
this.descriptor = descriptor;
this.pageReader = pageReader;
this.readState = new ParquetReadState(descriptor.getMaxDefinitionLevel(), rowIndexes);
@@ -129,6 +134,7 @@ public class VectorizedColumnReader {
this.datetimeRebaseMode = datetimeRebaseMode;
assert "LEGACY".equals(int96RebaseMode) || "EXCEPTION".equals(int96RebaseMode) ||
"CORRECTED".equals(int96RebaseMode);
+ this.writerVersion = writerVersion;
}
private boolean isLazyDecodingSupported(PrimitiveType.PrimitiveTypeName typeName) {
@@ -259,6 +265,7 @@ public class VectorizedColumnReader {
int pageValueCount,
Encoding dataEncoding,
ByteBufferInputStream in) throws IOException {
+ ValuesReader previousReader = this.dataColumn;
if (dataEncoding.usesDictionary()) {
this.dataColumn = null;
if (dictionary == null) {
@@ -283,6 +290,12 @@ public class VectorizedColumnReader {
} catch (IOException e) {
throw new IOException("could not read page in col " + descriptor, e);
}
+ // for PARQUET-246 (See VectorizedDeltaByteArrayReader.setPreviousValues)
+ if (CorruptDeltaByteArrays.requiresSequentialReads(writerVersion, dataEncoding) &&
+ previousReader instanceof RequiresPreviousReader) {
+ // previousReader can only be set if reading sequentially
+ ((RequiresPreviousReader) dataColumn).setPreviousReader(previousReader);
+ }
}
private ValuesReader getValuesReader(Encoding encoding) {
diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedDeltaBinaryPackedReader.java b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedDeltaBinaryPackedReader.java
index 7b2aac3..9c6596a 100644
--- a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedDeltaBinaryPackedReader.java
+++ b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedDeltaBinaryPackedReader.java
@@ -90,6 +90,7 @@ public class VectorizedDeltaBinaryPackedReader extends VectorizedReaderBase {
Preconditions.checkArgument(miniSize % 8 == 0,
"miniBlockSize must be multiple of 8, but it's " + miniSize);
this.miniBlockSizeInValues = (int) miniSize;
+ // True value count. May be less than valueCount because of nulls
this.totalValueCount = BytesUtils.readUnsignedVarInt(in);
this.bitWidths = new int[miniBlockNumInABlock];
this.unpackedValuesBuffer = new long[miniBlockSizeInValues];
@@ -97,6 +98,11 @@ public class VectorizedDeltaBinaryPackedReader extends VectorizedReaderBase {
firstValue = BytesUtils.readZigZagVarLong(in);
}
+ // True value count. May be less than valueCount because of nulls
+ int getTotalValueCount() {
+ return totalValueCount;
+ }
+
@Override
public byte readByte() {
readValues(1, null, 0, (w, r, v) -> byteVal = (byte) v);
diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedDeltaByteArrayReader.java b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedDeltaByteArrayReader.java
index 72b760d..b3fc54a8 100644
--- a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedDeltaByteArrayReader.java
+++ b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedDeltaByteArrayReader.java
@@ -16,50 +16,127 @@
*/
package org.apache.spark.sql.execution.datasources.parquet;
+import static org.apache.spark.sql.types.DataTypes.BinaryType;
+import static org.apache.spark.sql.types.DataTypes.IntegerType;
+
import org.apache.parquet.bytes.ByteBufferInputStream;
-import org.apache.parquet.column.values.deltastrings.DeltaByteArrayReader;
+import org.apache.parquet.column.values.RequiresPreviousReader;
+import org.apache.parquet.column.values.ValuesReader;
import org.apache.parquet.io.api.Binary;
+import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector;
import org.apache.spark.sql.execution.vectorized.WritableColumnVector;
import java.io.IOException;
import java.nio.ByteBuffer;
/**
- * An implementation of the Parquet DELTA_BYTE_ARRAY decoder that supports the vectorized interface.
+ * An implementation of the Parquet DELTA_BYTE_ARRAY decoder that supports the vectorized
+ * interface.
*/
-public class VectorizedDeltaByteArrayReader extends VectorizedReaderBase {
- private final DeltaByteArrayReader deltaByteArrayReader = new DeltaByteArrayReader();
+public class VectorizedDeltaByteArrayReader extends VectorizedReaderBase
+ implements VectorizedValuesReader, RequiresPreviousReader {
+
+ private final VectorizedDeltaBinaryPackedReader prefixLengthReader;
+ private final VectorizedDeltaLengthByteArrayReader suffixReader;
+ private WritableColumnVector prefixLengthVector;
+ private ByteBuffer previous;
+ private int currentRow = 0;
+
+ // Temporary variable used by readBinary
+ private final WritableColumnVector binaryValVector;
+ // Temporary variable used by skipBinary
+ private final WritableColumnVector tempBinaryValVector;
+
+ VectorizedDeltaByteArrayReader() {
+ this.prefixLengthReader = new VectorizedDeltaBinaryPackedReader();
+ this.suffixReader = new VectorizedDeltaLengthByteArrayReader();
+ binaryValVector = new OnHeapColumnVector(1, BinaryType);
+ tempBinaryValVector = new OnHeapColumnVector(1, BinaryType);
+ }
@Override
public void initFromPage(int valueCount, ByteBufferInputStream in) throws IOException {
- deltaByteArrayReader.initFromPage(valueCount, in);
+ prefixLengthVector = new OnHeapColumnVector(valueCount, IntegerType);
+ prefixLengthReader.initFromPage(valueCount, in);
+ prefixLengthReader.readIntegers(prefixLengthReader.getTotalValueCount(),
+ prefixLengthVector, 0);
+ suffixReader.initFromPage(valueCount, in);
}
@Override
public Binary readBinary(int len) {
- return deltaByteArrayReader.readBytes();
+ readValues(1, binaryValVector, 0);
+ return Binary.fromConstantByteArray(binaryValVector.getBinary(0));
}
- @Override
- public void readBinary(int total, WritableColumnVector c, int rowId) {
+ private void readValues(int total, WritableColumnVector c, int rowId) {
for (int i = 0; i < total; i++) {
- Binary binary = deltaByteArrayReader.readBytes();
- ByteBuffer buffer = binary.toByteBuffer();
- if (buffer.hasArray()) {
- c.putByteArray(rowId + i, buffer.array(), buffer.arrayOffset() + buffer.position(),
- binary.length());
- } else {
- byte[] bytes = new byte[binary.length()];
- buffer.get(bytes);
- c.putByteArray(rowId + i, bytes);
+ // NOTE: due to PARQUET-246, it is important that we
+ // respect prefixLength which was read from prefixLengthReader,
+ // even for the *first* value of a page. Even though the first
+ // value of the page should have an empty prefix, it may not
+ // because of PARQUET-246.
+ int prefixLength = prefixLengthVector.getInt(currentRow);
+ ByteBuffer suffix = suffixReader.getBytes(currentRow);
+ byte[] suffixArray = suffix.array();
+ int suffixLength = suffix.limit() - suffix.position();
+ int length = prefixLength + suffixLength;
+
+ // We have to do this to materialize the output
+ WritableColumnVector arrayData = c.arrayData();
+ int offset = arrayData.getElementsAppended();
+ if (prefixLength != 0) {
+ arrayData.appendBytes(prefixLength, previous.array(), previous.position());
}
+ arrayData.appendBytes(suffixLength, suffixArray, suffix.position());
+ c.putArray(rowId + i, offset, length);
+ previous = arrayData.getByteBuffer(offset, length);
+ currentRow++;
+ }
+ }
+
+ @Override
+ public void readBinary(int total, WritableColumnVector c, int rowId) {
+ readValues(total, c, rowId);
+ }
+
+ /**
+ * There was a bug (PARQUET-246) in which DeltaByteArrayWriter's reset() method did not clear the
+ * previous value state that it tracks internally. This resulted in the first value of all pages
+ * (except for the first page) to be a delta from the last value of the previous page. In order to
+ * read corrupted files written with this bug, when reading a new page we need to recover the
+ * previous page's last value to use it (if needed) to read the first value.
+ */
+ public void setPreviousReader(ValuesReader reader) {
+ if (reader != null) {
+ this.previous = ((VectorizedDeltaByteArrayReader) reader).previous;
}
}
@Override
public void skipBinary(int total) {
+ WritableColumnVector c1 = tempBinaryValVector;
+ WritableColumnVector c2 = binaryValVector;
+
for (int i = 0; i < total; i++) {
- deltaByteArrayReader.skip();
+ int prefixLength = prefixLengthVector.getInt(currentRow);
+ ByteBuffer suffix = suffixReader.getBytes(currentRow);
+ byte[] suffixArray = suffix.array();
+ int suffixLength = suffix.limit() - suffix.position();
+ int length = prefixLength + suffixLength;
+
+ WritableColumnVector arrayData = c1.arrayData();
+ c1.reset();
+ if (prefixLength != 0) {
+ arrayData.appendBytes(prefixLength, previous.array(), previous.position());
+ }
+ arrayData.appendBytes(suffixLength, suffixArray, suffix.position());
+ previous = arrayData.getByteBuffer(0, length);
+ currentRow++;
+
+ WritableColumnVector tmp = c1;
+ c1 = c2;
+ c2 = tmp;
}
}
diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedDeltaLengthByteArrayReader.java b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedDeltaLengthByteArrayReader.java
new file mode 100644
index 0000000..ac5b852
--- /dev/null
+++ b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedDeltaLengthByteArrayReader.java
@@ -0,0 +1,89 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.datasources.parquet;
+
+import static org.apache.spark.sql.types.DataTypes.IntegerType;
+
+import java.io.EOFException;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import org.apache.parquet.bytes.ByteBufferInputStream;
+import org.apache.parquet.io.ParquetDecodingException;
+import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector;
+import org.apache.spark.sql.execution.vectorized.WritableColumnVector;
+
+/**
+ * An implementation of the Parquet DELTA_LENGTH_BYTE_ARRAY decoder that supports the vectorized
+ * interface.
+ */
+public class VectorizedDeltaLengthByteArrayReader extends VectorizedReaderBase implements
+ VectorizedValuesReader {
+
+ private final VectorizedDeltaBinaryPackedReader lengthReader;
+ private ByteBufferInputStream in;
+ private WritableColumnVector lengthsVector;
+ private int currentRow = 0;
+
+ VectorizedDeltaLengthByteArrayReader() {
+ lengthReader = new VectorizedDeltaBinaryPackedReader();
+ }
+
+ @Override
+ public void initFromPage(int valueCount, ByteBufferInputStream in) throws IOException {
+ lengthsVector = new OnHeapColumnVector(valueCount, IntegerType);
+ lengthReader.initFromPage(valueCount, in);
+ lengthReader.readIntegers(lengthReader.getTotalValueCount(), lengthsVector, 0);
+ this.in = in.remainingStream();
+ }
+
+ @Override
+ public void readBinary(int total, WritableColumnVector c, int rowId) {
+ ByteBuffer buffer;
+ ByteBufferOutputWriter outputWriter = ByteBufferOutputWriter::writeArrayByteBuffer;
+ int length;
+ for (int i = 0; i < total; i++) {
+ length = lengthsVector.getInt(rowId + i);
+ try {
+ buffer = in.slice(length);
+ } catch (EOFException e) {
+ throw new ParquetDecodingException("Failed to read " + length + " bytes");
+ }
+ outputWriter.write(c, rowId + i, buffer, length);
+ }
+ currentRow += total;
+ }
+
+ public ByteBuffer getBytes(int rowId) {
+ int length = lengthsVector.getInt(rowId);
+ try {
+ return in.slice(length);
+ } catch (EOFException e) {
+ throw new ParquetDecodingException("Failed to read " + length + " bytes");
+ }
+ }
+
+ @Override
+ public void skipBinary(int total) {
+ for (int i = 0; i < total; i++) {
+ int remaining = lengthsVector.getInt(currentRow + i);
+ while (remaining > 0) {
+ remaining -= in.skip(remaining);
+ }
+ }
+ currentRow += total;
+ }
+}
diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java
index 0e976be..da23b5f 100644
--- a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java
+++ b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java
@@ -367,7 +367,8 @@ public class VectorizedParquetRecordReader extends SpecificParquetRecordReaderBa
datetimeRebaseMode,
datetimeRebaseTz,
int96RebaseMode,
- int96RebaseTz);
+ int96RebaseTz,
+ writerVersion);
}
totalCountLoadedSoFar += pages.getRowCount();
}
diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedValuesReader.java b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedValuesReader.java
index 7ddece0..4308614 100644
--- a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedValuesReader.java
+++ b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedValuesReader.java
@@ -17,6 +17,8 @@
package org.apache.spark.sql.execution.datasources.parquet;
+import java.nio.ByteBuffer;
+
import org.apache.spark.sql.execution.vectorized.WritableColumnVector;
import org.apache.parquet.io.api.Binary;
@@ -86,4 +88,18 @@ public interface VectorizedValuesReader {
void write(WritableColumnVector outputColumnVector, int rowId, long val);
}
+ @FunctionalInterface
+ interface ByteBufferOutputWriter {
+ void write(WritableColumnVector c, int rowId, ByteBuffer val, int length);
+
+ static void writeArrayByteBuffer(WritableColumnVector c, int rowId, ByteBuffer val,
+ int length) {
+ c.putByteArray(rowId,
+ val.array(),
+ val.arrayOffset() + val.position(),
+ length);
+ }
+
+ static void skipWrite(WritableColumnVector c, int rowId, ByteBuffer val, int length) { }
+ }
}
diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OffHeapColumnVector.java b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OffHeapColumnVector.java
index bbe9681..42552c7 100644
--- a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OffHeapColumnVector.java
+++ b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OffHeapColumnVector.java
@@ -221,6 +221,11 @@ public final class OffHeapColumnVector extends WritableColumnVector {
return UTF8String.fromAddress(null, data + rowId, count);
}
+ @Override
+ public ByteBuffer getByteBuffer(int rowId, int count) {
+ return ByteBuffer.wrap(getBytes(rowId, count));
+ }
+
//
// APIs dealing with shorts
//
diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java
index 833a93f2..d246a3c 100644
--- a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java
+++ b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java
@@ -219,6 +219,12 @@ public final class OnHeapColumnVector extends WritableColumnVector {
return UTF8String.fromBytes(byteData, rowId, count);
}
+ @Override
+ public ByteBuffer getByteBuffer(int rowId, int count) {
+ return ByteBuffer.wrap(byteData, rowId, count);
+ }
+
+
//
// APIs dealing with Shorts
//
diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java
index 5e01c37..ae457a1 100644
--- a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java
+++ b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java
@@ -18,6 +18,7 @@ package org.apache.spark.sql.execution.vectorized;
import java.math.BigDecimal;
import java.math.BigInteger;
+import java.nio.ByteBuffer;
import com.google.common.annotations.VisibleForTesting;
@@ -444,6 +445,12 @@ public abstract class WritableColumnVector extends ColumnVector {
}
/**
+ * Gets the values of bytes from [rowId, rowId + count), as a ByteBuffer.
+ * This method is similar to {@link ColumnVector#getBytes(int, int)}, but avoids making a copy.
+ */
+ public abstract ByteBuffer getByteBuffer(int rowId, int count);
+
+ /**
* Append APIs. These APIs all behave similarly and will append data to the current vector. It
* is not valid to mix the put and append APIs. The append APIs are slower and should only be
* used if the sizes are not known up front.
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetDeltaByteArrayEncodingSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetDeltaByteArrayEncodingSuite.scala
new file mode 100644
index 0000000..c54eef3
--- /dev/null
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetDeltaByteArrayEncodingSuite.scala
@@ -0,0 +1,143 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.datasources.parquet
+
+import org.apache.parquet.bytes.DirectByteBufferAllocator
+import org.apache.parquet.column.values.Utils
+import org.apache.parquet.column.values.deltastrings.DeltaByteArrayWriter
+
+import org.apache.spark.sql.execution.vectorized.{OnHeapColumnVector, WritableColumnVector}
+import org.apache.spark.sql.test.SharedSparkSession
+import org.apache.spark.sql.types.{IntegerType, StringType}
+
+/**
+ * Read tests for vectorized Delta byte array reader.
+ * Translated from * org.apache.parquet.column.values.delta.TestDeltaByteArray
+ */
+class ParquetDeltaByteArrayEncodingSuite extends ParquetCompatibilityTest with SharedSparkSession {
+ val values: Array[String] = Array("parquet-mr", "parquet", "parquet-format");
+ val randvalues: Array[String] = Utils.getRandomStringSamples(10000, 32)
+
+ var writer: DeltaByteArrayWriter = _
+ var reader: VectorizedDeltaByteArrayReader = _
+ private var writableColumnVector: WritableColumnVector = _
+
+ protected override def beforeEach(): Unit = {
+ writer = new DeltaByteArrayWriter(64 * 1024, 64 * 1024, new DirectByteBufferAllocator)
+ reader = new VectorizedDeltaByteArrayReader()
+ super.beforeAll()
+ }
+
+ test("test Serialization") {
+ assertReadWrite(writer, reader, values)
+ }
+
+ test("random strings") {
+ assertReadWrite(writer, reader, randvalues)
+ }
+
+ test("random strings with skip") {
+ assertReadWriteWithSkip(writer, reader, randvalues)
+ }
+
+ test("random strings with skipN") {
+ assertReadWriteWithSkipN(writer, reader, randvalues)
+ }
+
+ test("test lengths") {
+ var reader = new VectorizedDeltaBinaryPackedReader
+ Utils.writeData(writer, values)
+ val data = writer.getBytes.toInputStream
+ val length = values.length
+ writableColumnVector = new OnHeapColumnVector(length, IntegerType)
+ reader.initFromPage(length, data)
+ reader.readIntegers(length, writableColumnVector, 0)
+ // test prefix lengths
+ assert(0 == writableColumnVector.getInt(0))
+ assert(7 == writableColumnVector.getInt(1))
+ assert(7 == writableColumnVector.getInt(2))
+
+ reader = new VectorizedDeltaBinaryPackedReader
+ writableColumnVector = new OnHeapColumnVector(length, IntegerType)
+ reader.initFromPage(length, data)
+ reader.readIntegers(length, writableColumnVector, 0)
+ // test suffix lengths
+ assert(10 == writableColumnVector.getInt(0))
+ assert(0 == writableColumnVector.getInt(1))
+ assert(7 == writableColumnVector.getInt(2))
+ }
+
+ private def assertReadWrite(
+ writer: DeltaByteArrayWriter,
+ reader: VectorizedDeltaByteArrayReader,
+ vals: Array[String]): Unit = {
+ Utils.writeData(writer, vals)
+ val length = vals.length
+ val is = writer.getBytes.toInputStream
+
+ writableColumnVector = new OnHeapColumnVector(length, StringType)
+
+ reader.initFromPage(length, is)
+ reader.readBinary(length, writableColumnVector, 0)
+
+ for (i <- 0 until length) {
+ assert(vals(i).getBytes() sameElements writableColumnVector.getBinary(i))
+ }
+ }
+
+ private def assertReadWriteWithSkip(
+ writer: DeltaByteArrayWriter,
+ reader: VectorizedDeltaByteArrayReader,
+ vals: Array[String]): Unit = {
+ Utils.writeData(writer, vals)
+ val length = vals.length
+ val is = writer.getBytes.toInputStream
+ writableColumnVector = new OnHeapColumnVector(length, StringType)
+ reader.initFromPage(length, is)
+ var i = 0
+ while ( {
+ i < vals.length
+ }) {
+ reader.readBinary(1, writableColumnVector, i)
+ assert(vals(i).getBytes() sameElements writableColumnVector.getBinary(i))
+ reader.skipBinary(1)
+ i += 2
+ }
+ }
+
+ private def assertReadWriteWithSkipN(
+ writer: DeltaByteArrayWriter,
+ reader: VectorizedDeltaByteArrayReader,
+ vals: Array[String]): Unit = {
+ Utils.writeData(writer, vals)
+ val length = vals.length
+ val is = writer.getBytes.toInputStream
+ writableColumnVector = new OnHeapColumnVector(length, StringType)
+ reader.initFromPage(length, is)
+ var skipCount = 0
+ var i = 0
+ while ( {
+ i < vals.length
+ }) {
+ skipCount = (vals.length - i) / 2
+ reader.readBinary(1, writableColumnVector, i)
+ assert(vals(i).getBytes() sameElements writableColumnVector.getBinary(i))
+ reader.skipBinary(skipCount)
+ i += skipCount + 1
+ }
+ }
+}
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetDeltaLengthByteArrayEncodingSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetDeltaLengthByteArrayEncodingSuite.scala
new file mode 100644
index 0000000..17dc70d
--- /dev/null
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetDeltaLengthByteArrayEncodingSuite.scala
@@ -0,0 +1,142 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.datasources.parquet
+
+import java.util.Random
+
+import org.apache.commons.lang3.RandomStringUtils
+import org.apache.parquet.bytes.{ByteBufferInputStream, DirectByteBufferAllocator}
+import org.apache.parquet.column.values.Utils
+import org.apache.parquet.column.values.deltalengthbytearray.DeltaLengthByteArrayValuesWriter
+import org.apache.parquet.io.api.Binary
+
+import org.apache.spark.sql.execution.vectorized.{OnHeapColumnVector, WritableColumnVector}
+import org.apache.spark.sql.test.SharedSparkSession
+import org.apache.spark.sql.types.{IntegerType, StringType}
+
+/**
+ * Read tests for vectorized Delta length byte array reader.
+ * Translated from
+ * org.apache.parquet.column.values.delta.TestDeltaLengthByteArray
+ */
+class ParquetDeltaLengthByteArrayEncodingSuite
+ extends ParquetCompatibilityTest
+ with SharedSparkSession {
+ val values: Array[String] = Array("parquet", "hadoop", "mapreduce")
+ var writer: DeltaLengthByteArrayValuesWriter = _
+ var reader: VectorizedDeltaLengthByteArrayReader = _
+ private var writableColumnVector: WritableColumnVector = _
+
+ protected override def beforeEach(): Unit = {
+ writer =
+ new DeltaLengthByteArrayValuesWriter(64 * 1024, 64 * 1024, new DirectByteBufferAllocator)
+ reader = new VectorizedDeltaLengthByteArrayReader()
+ super.beforeAll()
+ }
+
+ test("test serialization") {
+ writeData(writer, values)
+ readAndValidate(reader, writer.getBytes.toInputStream, values.length, values)
+ }
+
+ test("random strings") {
+ val values = Utils.getRandomStringSamples(1000, 32)
+ writeData(writer, values)
+ readAndValidate(reader, writer.getBytes.toInputStream, values.length, values)
+ }
+
+ test("random strings with empty strings") {
+ val values = getRandomStringSamplesWithEmptyStrings(1000, 32)
+ writeData(writer, values)
+ readAndValidate(reader, writer.getBytes.toInputStream, values.length, values)
+ }
+
+ test("skip with random strings") {
+ val values = Utils.getRandomStringSamples(1000, 32)
+ writeData(writer, values)
+ reader.initFromPage(values.length, writer.getBytes.toInputStream)
+ writableColumnVector = new OnHeapColumnVector(values.length, StringType)
+ var i = 0
+ while (i < values.length) {
+ reader.readBinary(1, writableColumnVector, i)
+ assert(values(i).getBytes() sameElements writableColumnVector.getBinary(i))
+ reader.skipBinary(1)
+ i += 2
+ }
+ reader = new VectorizedDeltaLengthByteArrayReader()
+ reader.initFromPage(values.length, writer.getBytes.toInputStream)
+ writableColumnVector = new OnHeapColumnVector(values.length, StringType)
+ var skipCount = 0
+ i = 0
+ while (i < values.length) {
+ skipCount = (values.length - i) / 2
+ reader.readBinary(1, writableColumnVector, i)
+ assert(values(i).getBytes() sameElements writableColumnVector.getBinary(i))
+ reader.skipBinary(skipCount)
+ i += skipCount + 1
+ }
+ }
+
+ // Read the lengths from the beginning of the buffer and compare with the lengths of the values
+ test("test lengths") {
+ val reader = new VectorizedDeltaBinaryPackedReader
+ writeData(writer, values)
+ val length = values.length
+ writableColumnVector = new OnHeapColumnVector(length, IntegerType)
+ reader.initFromPage(length, writer.getBytes.toInputStream)
+ reader.readIntegers(length, writableColumnVector, 0)
+ for (i <- 0 until length) {
+ assert(values(i).length == writableColumnVector.getInt(i))
+ }
+ }
+
+ private def writeData(writer: DeltaLengthByteArrayValuesWriter, values: Array[String]): Unit = {
+ for (i <- values.indices) {
+ writer.writeBytes(Binary.fromString(values(i)))
+ }
+ }
+
+ private def readAndValidate(
+ reader: VectorizedDeltaLengthByteArrayReader,
+ is: ByteBufferInputStream,
+ length: Int,
+ expectedValues: Array[String]): Unit = {
+
+ writableColumnVector = new OnHeapColumnVector(length, StringType)
+
+ reader.initFromPage(length, is)
+ reader.readBinary(length, writableColumnVector, 0)
+
+ for (i <- 0 until length) {
+ assert(expectedValues(i).getBytes() sameElements writableColumnVector.getBinary(i))
+ }
+ }
+
+ def getRandomStringSamplesWithEmptyStrings(numSamples: Int, maxLength: Int): Array[String] = {
+ val randomLen = new Random
+ val randomEmpty = new Random
+ val samples: Array[String] = new Array[String](numSamples)
+ for (i <- 0 until numSamples) {
+ var maxLen: Int = randomLen.nextInt(maxLength)
+ if(randomEmpty.nextInt() % 11 != 0) {
+ maxLen = 0;
+ }
+ samples(i) = RandomStringUtils.randomAlphanumeric(0, maxLen)
+ }
+ samples
+ }
+}
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetEncodingSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetEncodingSuite.scala
index f7100a5..07e2849 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetEncodingSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetEncodingSuite.scala
@@ -27,6 +27,7 @@ import org.apache.parquet.column.{Encoding, ParquetProperties}
import org.apache.parquet.hadoop.ParquetOutputFormat
import org.apache.spark.TestUtils
+import org.apache.spark.memory.MemoryMode
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.util.DateTimeUtils
import org.apache.spark.sql.internal.SQLConf
@@ -47,6 +48,13 @@ class ParquetEncodingSuite extends ParquetCompatibilityTest with SharedSparkSess
null.asInstanceOf[Duration],
null.asInstanceOf[java.lang.Boolean])
+ private def withMemoryModes(f: String => Unit): Unit = {
+ Seq(MemoryMode.OFF_HEAP, MemoryMode.ON_HEAP).foreach(mode => {
+ val offHeap = if (mode == MemoryMode.OFF_HEAP) "true" else "false"
+ f(offHeap)
+ })
+ }
+
test("All Types Dictionary") {
(1 :: 1000 :: Nil).foreach { n => {
withTempPath { dir =>
@@ -141,45 +149,54 @@ class ParquetEncodingSuite extends ParquetCompatibilityTest with SharedSparkSess
)
val hadoopConf = spark.sessionState.newHadoopConfWithOptions(extraOptions)
- withSQLConf(
- SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> "true",
- ParquetOutputFormat.JOB_SUMMARY_LEVEL -> "ALL") {
- withTempPath { dir =>
- val path = s"${dir.getCanonicalPath}/test.parquet"
-
- val data = (1 to 3).map { i =>
- ( i, i.toLong, i.toShort, Array[Byte](i.toByte), s"test_${i}",
- DateTimeUtils.fromJavaDate(Date.valueOf(s"2021-11-0" + i)),
- DateTimeUtils.fromJavaTimestamp(Timestamp.valueOf(s"2020-11-01 12:00:0" + i)),
- Period.of(1, i, 0), Duration.ofMillis(i * 100),
- new BigDecimal(java.lang.Long.toUnsignedString(i*100000))
- )
+ withMemoryModes { offHeapMode =>
+ withSQLConf(
+ SQLConf.COLUMN_VECTOR_OFFHEAP_ENABLED.key -> offHeapMode,
+ SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> "true",
+ ParquetOutputFormat.JOB_SUMMARY_LEVEL -> "ALL") {
+ withTempPath { dir =>
+ val path = s"${dir.getCanonicalPath}/test.parquet"
+ // Have more than 2 * 4096 records (so we have multiple tasks and each task
+ // reads at least twice from the reader). This will catch any issues with state
+ // maintained by the reader(s)
+ // Add at least one string with a null
+ val data = (1 to 8193).map { i =>
+ (i,
+ i.toLong, i.toShort, Array[Byte](i.toByte),
+ if (i % 2 == 1) s"test_$i" else null,
+ DateTimeUtils.fromJavaDate(Date.valueOf(s"2021-11-0" + ((i % 9) + 1))),
+ DateTimeUtils.fromJavaTimestamp(Timestamp.valueOf(s"2020-11-01 12:00:0" + (i % 10))),
+ Period.of(1, (i % 11) + 1, 0),
+ Duration.ofMillis(((i % 9) + 1) * 100),
+ new BigDecimal(java.lang.Long.toUnsignedString(i * 100000))
+ )
+ }
+
+ spark.createDataFrame(data)
+ .write.options(extraOptions).mode("overwrite").parquet(path)
+
+ val blockMetadata = readFooter(new Path(path), hadoopConf).getBlocks.asScala.head
+ val columnChunkMetadataList = blockMetadata.getColumns.asScala
+
+ // Verify that indeed delta encoding is used for each column
+ assert(columnChunkMetadataList.length === 10)
+ assert(columnChunkMetadataList(0).getEncodings.contains(Encoding.DELTA_BINARY_PACKED))
+ assert(columnChunkMetadataList(1).getEncodings.contains(Encoding.DELTA_BINARY_PACKED))
+ assert(columnChunkMetadataList(2).getEncodings.contains(Encoding.DELTA_BINARY_PACKED))
+ // Both fixed-length byte array and variable-length byte array (also called BINARY)
+ // are use DELTA_BYTE_ARRAY for encoding
+ assert(columnChunkMetadataList(3).getEncodings.contains(Encoding.DELTA_BYTE_ARRAY))
+ assert(columnChunkMetadataList(4).getEncodings.contains(Encoding.DELTA_BYTE_ARRAY))
+
+ assert(columnChunkMetadataList(5).getEncodings.contains(Encoding.DELTA_BINARY_PACKED))
+ assert(columnChunkMetadataList(6).getEncodings.contains(Encoding.DELTA_BINARY_PACKED))
+ assert(columnChunkMetadataList(7).getEncodings.contains(Encoding.DELTA_BINARY_PACKED))
+ assert(columnChunkMetadataList(8).getEncodings.contains(Encoding.DELTA_BINARY_PACKED))
+ assert(columnChunkMetadataList(9).getEncodings.contains(Encoding.DELTA_BYTE_ARRAY))
+
+ val actual = spark.read.parquet(path).collect()
+ assert(actual.sortBy(_.getInt(0)) === data.map(Row.fromTuple));
}
-
- spark.createDataFrame(data)
- .write.options(extraOptions).mode("overwrite").parquet(path)
-
- val blockMetadata = readFooter(new Path(path), hadoopConf).getBlocks.asScala.head
- val columnChunkMetadataList = blockMetadata.getColumns.asScala
-
- // Verify that indeed delta encoding is used for each column
- assert(columnChunkMetadataList.length === 10)
- assert(columnChunkMetadataList(0).getEncodings.contains(Encoding.DELTA_BINARY_PACKED))
- assert(columnChunkMetadataList(1).getEncodings.contains(Encoding.DELTA_BINARY_PACKED))
- assert(columnChunkMetadataList(2).getEncodings.contains(Encoding.DELTA_BINARY_PACKED))
- // Both fixed-length byte array and variable-length byte array (also called BINARY)
- // are use DELTA_BYTE_ARRAY for encoding
- assert(columnChunkMetadataList(3).getEncodings.contains(Encoding.DELTA_BYTE_ARRAY))
- assert(columnChunkMetadataList(4).getEncodings.contains(Encoding.DELTA_BYTE_ARRAY))
-
- assert(columnChunkMetadataList(5).getEncodings.contains(Encoding.DELTA_BINARY_PACKED))
- assert(columnChunkMetadataList(6).getEncodings.contains(Encoding.DELTA_BINARY_PACKED))
- assert(columnChunkMetadataList(7).getEncodings.contains(Encoding.DELTA_BINARY_PACKED))
- assert(columnChunkMetadataList(8).getEncodings.contains(Encoding.DELTA_BINARY_PACKED))
- assert(columnChunkMetadataList(9).getEncodings.contains(Encoding.DELTA_BYTE_ARRAY))
-
- val actual = spark.read.parquet(path).collect()
- assert(actual.sortBy(_.getInt(0)) === data.map(Row.fromTuple));
}
}
}
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org