Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/03/03 07:04:12 UTC
[GitHub] viirya opened a new pull request #23943: [SPARK-27034][SQL] Nested schema pruning for ORC
URL: https://github.com/apache/spark/pull/23943
## What changes were proposed in this pull request?
Previously, nested schema pruning was supported only for Parquet. This PR proposes to support nested schema pruning for ORC as well.
Note: This covers only ORC v1. Support for ORC v2 is left as a TODO item.
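For readers unfamiliar with the term, the idea of nested schema pruning can be illustrated with a small, self-contained sketch. This is not the code in this PR and does not use any Spark or ORC API; the schema representation (nested dicts), the function name `prune_schema`, and the example `contact_schema` are all assumptions made for illustration. It only shows how a read schema can be narrowed to the nested fields a query actually selects.

```python
from typing import Any, Dict, Iterable

# Hypothetical schema model: a struct is a dict of field name -> type,
# and a leaf type is a plain string. This is NOT Spark's StructType.
Schema = Dict[str, Any]


def prune_schema(schema: Schema, paths: Iterable[str]) -> Schema:
    """Return a copy of `schema` keeping only the dotted field `paths`."""
    pruned: Schema = {}
    for path in paths:
        src, dst = schema, pruned
        parts = path.split(".")
        for i, name in enumerate(parts):
            field = src[name]  # raises KeyError if the path is not in the schema
            if i == len(parts) - 1 or not isinstance(field, dict):
                # Selecting a field keeps it whole, including nested children.
                dst[name] = field
                break
            if dst.get(name) is field:
                # The whole parent struct was already selected earlier.
                break
            # Descend, creating the pruned parent struct on demand.
            dst = dst.setdefault(name, {})
            src = field
    return pruned


# Illustrative schema (made up for this example).
contact_schema: Schema = {
    "id": "int",
    "name": {"first": "string", "middle": "string", "last": "string"},
    "address": "string",
}

# Selecting only `name.first` needs neither `name.middle` nor `name.last`.
print(prune_schema(contact_schema, ["name.first"]))
```

With a pruned read schema like this, a columnar reader only has to materialize the selected leaf columns instead of the whole struct, which is where the nested-column speedup would be expected to come from.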
## Benchmark
Ran the benchmark `OrcNestedSchemaPruningBenchmark`.
Before:
```
[info] Running benchmark: Selection
[info] Running case: Top-level column
[info] Stopped after 27 iterations, 2054 ms
[info] Running case: Nested column
[info] Stopped after 10 iterations, 14384 ms
[info]
[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.3
[info] Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
[info] Selection:              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------
[info] Top-level column                64 /   76         15.6          63.9       1.0X
[info] Nested column                 1300 / 1438          0.8        1299.7       0.0X
```
After:
```
[info] Running benchmark: Selection
[info] Running case: Top-level column
[info] Stopped after 24 iterations, 2051 ms
[info] Running case: Nested column
[info] Stopped after 10 iterations, 5005 ms
[info]
[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.3
[info] Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
[info] Selection:              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------
[info] Top-level column                71 /   85         14.2          70.6       1.0X
[info] Nested column                  480 /  501          2.1         479.5       0.1X
```
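To make the comparison easier to read, the speedup can be worked out directly from the Best/Avg times reported in the two tables above. The snippet below is only a small arithmetic sketch; the dictionary layout and the helper name `speedup` are assumptions for illustration, and the numbers are copied from the benchmark output as is.

```python
from typing import Dict, Tuple

# (best ms, avg ms) copied from the "Before" and "After" benchmark output.
before: Dict[str, Tuple[int, int]] = {
    "Top-level column": (64, 76),
    "Nested column": (1300, 1438),
}
after: Dict[str, Tuple[int, int]] = {
    "Top-level column": (71, 85),
    "Nested column": (480, 501),
}


def speedup(case: str) -> Tuple[float, float]:
    """Return (best, avg) speedup: time before divided by time after."""
    (b_best, b_avg), (a_best, a_avg) = before[case], after[case]
    return round(b_best / a_best, 2), round(b_avg / a_avg, 2)


for case in before:
    print(case, speedup(case))
```

For the nested column case this gives roughly 2.7x on best time and 2.9x on average time.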
## How was this patch tested?
Added tests.