Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2019/09/09 10:30:47 UTC

[GitHub] [incubator-iceberg] prodeezy opened a new pull request #462: Added V1 Vectorized Reader

URL: https://github.com/apache/incubator-iceberg/pull/462
 
 
   Co-authored-by: Xabriel J Collazo Mojica <xc...@adobe.com>
   
   
   **Changes:**
   - Added a new reader, V1VectorizedReader, that internally short-circuits to the V1 codepath [1], which does most of the setup and work to perform vectorization. This is exactly what vanilla Spark's reader does underneath the DSV1 implementation.
   - It builds an iterator that expects the Objects returned by the resolving iterator to be ColumnarBatches.
   - We reorganized and optimized the code that builds ReadTask instances, which considerably improved task initiation and planning time.
   - Setting `iceberg.read.enableV1VectorizedReader` to true enables this reader in IcebergSource.
   - The V1VectorizedReader is an independent class with code copied into some methods, as we didn't want to degrade perf due to inheritance/virtual method calls (we noticed degradation when we tried to re-use code).
   - I've pushed this code to a separate branch [2] in case others want to give this a try.
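   
   As a rough illustration of the iterator and the enabling flag described above, here is a minimal, self-contained sketch. It is not the code in this PR: `V1VectorizedSketch`, `Batch`, `BatchIterator`, `v1VectorizedEnabled` and `countRows` are hypothetical names, `Batch` is a stand-in for Spark's ColumnarBatch so the example runs without Spark on the classpath, and reading the flag from a plain options map (disabled when absent) is an assumption.
   
   ```java
   import java.util.Arrays;
   import java.util.HashMap;
   import java.util.Iterator;
   import java.util.Map;
   
   public class V1VectorizedSketch {
     // Option name taken from the PR description.
     public static final String ENABLE_V1_VECTORIZED = "iceberg.read.enableV1VectorizedReader";
   
     /** Hypothetical stand-in for Spark's ColumnarBatch; it only carries a row count. */
     public static final class Batch {
       public final int numRows;
   
       public Batch(int numRows) {
         this.numRows = numRows;
       }
     }
   
     /** Wraps an untyped iterator and hands out each element as a Batch. */
     public static final class BatchIterator implements Iterator<Batch> {
       private final Iterator<Object> source;
   
       public BatchIterator(Iterator<Object> source) {
         this.source = source;
       }
   
       @Override
       public boolean hasNext() {
         return source.hasNext();
       }
   
       @Override
       public Batch next() {
         Object value = source.next();
         // Fail loudly if the underlying reader produced something other than a batch.
         if (!(value instanceof Batch)) {
           throw new IllegalStateException("Expected a batch but got: " + value);
         }
         return (Batch) value;
       }
     }
   
     /** Assumed behaviour: the reader stays disabled unless the option is "true". */
     public static boolean v1VectorizedEnabled(Map<String, String> options) {
       return Boolean.parseBoolean(options.get(ENABLE_V1_VECTORIZED));
     }
   
     /** Drains the source through BatchIterator and sums the rows of every batch. */
     public static long countRows(Iterator<Object> source) {
       long total = 0;
       Iterator<Batch> batches = new BatchIterator(source);
       while (batches.hasNext()) {
         total += batches.next().numRows;
       }
       return total;
     }
   
     public static void main(String[] args) {
       Map<String, String> options = new HashMap<>();
       options.put(ENABLE_V1_VECTORIZED, "true");
   
       if (v1VectorizedEnabled(options)) {
         Iterator<Object> untyped = Arrays.<Object>asList(new Batch(10), new Batch(5)).iterator();
         System.out.println("rows read: " + countRows(untyped));
       }
     }
   }
   ```
   
   The type check in `next()` is a choice made for this sketch only; it surfaces a misconfigured reader immediately instead of failing later with a ClassCastException.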
   
   
   
   **The Numbers:**
   
   
   Flat Data 10 files 10M rows each
   
   ```
   Benchmark                                                                            Mode  Cnt   Score   Error  Units
   IcebergSourceFlatParquetDataReadBenchmark.readFileSourceNonVectorized                  ss    5  63.631 ± 1.300   s/op
   IcebergSourceFlatParquetDataReadBenchmark.readFileSourceVectorized                     ss    5  28.322 ± 2.400   s/op
   IcebergSourceFlatParquetDataReadBenchmark.readIceberg                                  ss    5  65.862 ± 2.480   s/op
   IcebergSourceFlatParquetDataReadBenchmark.readIcebergV1Vectorized10k                   ss    5  28.199 ± 1.255   s/op
   IcebergSourceFlatParquetDataReadBenchmark.readIcebergV1Vectorized20k                   ss    5  29.822 ± 2.848   s/op
   IcebergSourceFlatParquetDataReadBenchmark.readIcebergV1Vectorized5k                    ss    5  27.953 ± 0.949   s/op
   ```
   
   
   
   
   
   
   Flat Data Projections 10 files 10M rows each
   
   ```
   Benchmark                                                                            Mode  Cnt   Score   Error  Units
   IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionFileSourceNonVectorized    ss    5  11.307 ± 1.791   s/op
   IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionFileSourceVectorized       ss    5   3.480 ± 0.087   s/op
   IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionIceberg                    ss    5  11.057 ± 0.236   s/op
   IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionIcebergV1Vectorized10k     ss    5   3.953 ± 1.592   s/op
   IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionIcebergV1Vectorized20k     ss    5   3.619 ± 1.305   s/op
   IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionIcebergV1Vectorized5k      ss    5   4.109 ± 1.734   s/op
   ```
   
   
   Filtered Data 500 files 10k rows each 
   
   ```
   Benchmark                                                                          Mode  Cnt  Score   Error  Units
   IcebergSourceFlatParquetDataFilterBenchmark.readWithFilterFileSourceNonVectorized    ss    5  2.139 ± 0.719   s/op
   IcebergSourceFlatParquetDataFilterBenchmark.readWithFilterFileSourceVectorized       ss    5  2.213 ± 0.598   s/op
   IcebergSourceFlatParquetDataFilterBenchmark.readWithFilterIcebergNonVectorized       ss    5  0.144 ± 0.029   s/op
   IcebergSourceFlatParquetDataFilterBenchmark.readWithFilterIcebergV1Vectorized100k    ss    5  0.179 ± 0.019   s/op
   IcebergSourceFlatParquetDataFilterBenchmark.readWithFilterIcebergV1Vectorized10k     ss    5  0.189 ± 0.046   s/op
   IcebergSourceFlatParquetDataFilterBenchmark.readWithFilterIcebergV1Vectorized5k      ss    5  0.195 ± 0.137   s/op
   ```
   
   **Perf Notes:**
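   - Iceberg V1 Vectorization's real gain (over the current Iceberg impl) is in flat data scans: 28.199 s/op for readIcebergV1Vectorized10k versus 65.862 s/op for readIceberg. Notice how it's almost exactly the same as vanilla Spark vectorization (28.322 s/op).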
   - Projections work equally well. Nested column projections, however, still do not perform as well, because we need to be able to push nested column projections down to Iceberg.
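   - We saw a slight overhead with Iceberg V1 Vectorization over smaller workloads (0.179 to 0.195 s/op versus 0.144 s/op for the non-vectorized Iceberg read in the filtered benchmark), but this goes away with larger data files.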
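   
   To put the notes above in relative terms, the following sketch computes speedup ratios from scores copied out of the benchmark tables. `SpeedupSketch` and `speedup` are hypothetical names used only for this illustration, and the ratios ignore the reported error margins, so treat them as rough indicators.
   
   ```java
   public class SpeedupSketch {
     /** Baseline time divided by candidate time; a value above 1.0 means the candidate is faster. */
     public static double speedup(double baselineSecondsPerOp, double candidateSecondsPerOp) {
       if (candidateSecondsPerOp <= 0) {
         throw new IllegalArgumentException("candidate time must be positive");
       }
       return baselineSecondsPerOp / candidateSecondsPerOp;
     }
   
     public static void main(String[] args) {
       // readIceberg vs readIcebergV1Vectorized10k
       System.out.printf("flat scan:  %.2fx%n", speedup(65.862, 28.199));
       // readWithProjectionIceberg vs readWithProjectionIcebergV1Vectorized10k
       System.out.printf("projection: %.2fx%n", speedup(11.057, 3.953));
       // readWithFilterIcebergNonVectorized vs readWithFilterIcebergV1Vectorized10k
       System.out.printf("filter:     %.2fx%n", speedup(0.144, 0.189));
     }
   }
   ```
   
   With the 10k batch-size scores this works out to roughly 2.3x for the flat scan and roughly 2.8x for projections, while the filtered read comes in below 1.0x, which is the small-workload overhead mentioned above.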
   
   
   
   Footnotes: 
   [1] - https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L197
   [2] - https://github.com/prodeezy/incubator-iceberg/tree/v1-vectorized-reader
