Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/03/03 07:04:12 UTC

[GitHub] viirya opened a new pull request #23943: [SPARK-27034][SQL] Nested schema pruning for ORC

URL: https://github.com/apache/spark/pull/23943
 
 
   ## What changes were proposed in this pull request?
   
   Previously, nested schema pruning was supported only for Parquet. This PR proposes supporting it for ORC as well.
   
   Note: This covers only ORC v1. Support for ORC v2 is left as a TODO item.
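   
   As a rough illustration of what pruning a nested schema means, here is a minimal, self-contained sketch. It is a toy model and not the code in this PR: `ToyType`, `ToyStruct`, and `prune` are hypothetical names and do not correspond to Spark's `StructType` API.
   
   ```scala
   // Toy schema model (hypothetical; not Spark's StructType API).
   sealed trait ToyType
   case object ToyString extends ToyType
   case object ToyInt extends ToyType
   final case class ToyStruct(fields: List[(String, ToyType)]) extends ToyType

   object SchemaPruning {
     // Keep only the fields reachable through the requested paths,
     // e.g. List("name", "first") for the column name.first.
     def prune(schema: ToyStruct, paths: List[List[String]]): ToyStruct = {
       val kept: List[(String, ToyType)] = schema.fields.flatMap { case (fieldName, fieldType) =>
         // Paths that start at this field, with the leading segment removed.
         val rests = paths.collect { case head :: tail if head == fieldName => tail }
         if (rests.isEmpty) Nil                                    // field is never referenced
         else if (rests.exists(_.isEmpty)) List(fieldName -> fieldType) // whole field requested
         else fieldType match {
           case nested: ToyStruct => List(fieldName -> prune(nested, rests))
           case leaf              => List(fieldName -> leaf)       // path goes past a leaf; keep the leaf
         }
       }
       ToyStruct(kept)
     }
   }

   object PruningDemo {
     def main(args: Array[String]): Unit = {
       val contact = ToyStruct(List(
         "id" -> ToyInt,
         "name" -> ToyStruct(List(
           "first" -> ToyString, "middle" -> ToyString, "last" -> ToyString)),
         "address" -> ToyString))
       // Only name.first is requested, so id, address, name.middle and name.last are dropped.
       println(SchemaPruning.prune(contact, List(List("name", "first"))))
     }
   }
   ```
   
   In this model, a query that touches only `name.first` needs a read schema containing just that one nested field, which is the kind of reduced schema a columnar reader can use to skip the remaining columns.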
   
   ## Benchmark
   
   Ran the benchmark with `OrcNestedSchemaPruningBenchmark`.
   
   Before:
   ```text
   [info] Running benchmark: Selection
   [info]   Running case: Top-level column
   [info]   Stopped after 27 iterations, 2054 ms
   [info]   Running case: Nested column
   [info]   Stopped after 10 iterations, 14384 ms
   [info] 
   [info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.3
   [info] Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
   [info] Selection:                               Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
   [info] ------------------------------------------------------------------------------------------------
   [info] Top-level column                                64 /   76         15.6          63.9       1.0X
   [info] Nested column                                 1300 / 1438          0.8        1299.7       0.0X                               
   ```
   
   After:
   ```text
   [info] Running benchmark: Selection
   [info]   Running case: Top-level column
   [info]   Stopped after 24 iterations, 2051 ms
   [info]   Running case: Nested column
   [info]   Stopped after 10 iterations, 5005 ms
   [info] 
   [info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.3
   [info] Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
   [info] Selection:                               Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
   [info] ------------------------------------------------------------------------------------------------
   [info] Top-level column                                71 /   85         14.2          70.6       1.0X
   [info] Nested column                                  480 /  501          2.1         479.5       0.1X                                   
   ```
                                                                                                                   
   ## How was this patch tested?
   
   Added tests.
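   
   For a quick manual check alongside the added tests, a sketch like the following could be used. It is not taken from the PR's test suite. The case classes, the output path, and the app name are hypothetical, and whether the scan is actually pruned depends on running a build that includes this change and reads ORC through the v1 source.
   
   ```scala
   import org.apache.spark.sql.{DataFrame, SparkSession}

   // Hypothetical data model used only for this sketch.
   case class Name(first: String, last: String)
   case class Contact(id: Int, name: Name)

   object OrcNestedPruningCheck {
     // Select a single nested field from an ORC dataset.
     def selectFirstNames(spark: SparkSession, path: String): DataFrame =
       spark.read.orc(path).select("name.first")

     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder()
         .master("local[*]")
         .appName("orc-nested-pruning-check") // hypothetical app name
         // Flag that gates nested schema pruning in the optimizer.
         .config("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")
         .getOrCreate()
       import spark.implicits._

       val path = "/tmp/contacts_orc" // hypothetical location
       Seq(Contact(1, Name("Jane", "Doe")), Contact(2, Name("John", "Roe")))
         .toDS().write.mode("overwrite").orc(path)

       val query = selectFirstNames(spark, path)
       query.explain() // inspect the ReadSchema entry of the file scan
       query.show()
       spark.stop()
     }
   }
   ```
   
   If pruning takes effect, the `ReadSchema` entry in the printed physical plan is expected to list only `name.first` rather than the full `name` struct.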
   
