Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/12/12 04:22:18 UTC

[GitHub] [spark] HyukjinKwon opened a new pull request #26857: [SPARK-30162][SQL] Add PushedFilters to metadata in Parquet DSv2 implementation

URL: https://github.com/apache/spark/pull/26857
 
 
   ### What changes were proposed in this pull request?
   
   This PR adds `PushedFilters` to the metadata of the Parquet DSv2 implementation so that the pushed filters are shown. The ORC implementation already does this; see https://github.com/apache/spark/pull/24719/files#diff-0fc82694b20da3cd2cbb07206920eef7R62-R64
   
   ### Why are the changes needed?
   
   So that users can see which filters were pushed when debugging, and for consistency with ORC.
   
   ### Does this PR introduce any user-facing change?
   
   Yes. For a query such as the following, the physical plan now lists the pushed filters:
   
   ```scala
   spark.range(10).write.mode("overwrite").parquet("/tmp/foo")
   spark.read.parquet("/tmp/foo").filter("5 > id").explain()
   ```
   
   **Before:**
   
   ```
   == Physical Plan ==
   *(1) Project [id#20L]
   +- *(1) Filter (isnotnull(id#20L) AND (5 > id#20L))
      +- *(1) ColumnarToRow
         +- BatchScan[id#20L] ParquetScan Location: InMemoryFileIndex[file:/tmp/foo], ReadSchema: struct<id:bigint>
   ```
   
   **After:**
   
   ```
   == Physical Plan ==
   *(1) Project [id#13L]
   +- *(1) Filter (isnotnull(id#13L) AND (5 > id#13L))
      +- *(1) ColumnarToRow
         +- BatchScan[id#13L] ParquetScan Location: InMemoryFileIndex[file:/tmp/foo], ReadSchema: struct<id:bigint>, PushedFilters: [IsNotNull(id), LessThan(id,5)]
   ```
   
   ### How was this patch tested?
   Unit tests were added, and the change was also tested manually.
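   
   A manual check along these lines could be scripted in `spark-shell`. This is only a sketch, not the unit test added in this PR. It assumes a `spark` session is already in scope, that clearing `spark.sql.sources.useV1SourceList` routes Parquet reads through the DSv2 implementation, and that inspecting the rendered plan string is an acceptable stand-in for a proper assertion helper.
   
   ```scala
   // Sketch for spark-shell; `spark` is the session the shell provides.
   // Assumption: an empty V1 source list makes Parquet use the DSv2 reader.
   spark.conf.set("spark.sql.sources.useV1SourceList", "")
   
   // Same data and query as in the example above.
   spark.range(10).write.mode("overwrite").parquet("/tmp/foo")
   val df = spark.read.parquet("/tmp/foo").filter("5 > id")
   
   // Render the executed physical plan and check that the scan node
   // reports its pushed filters in the metadata.
   val plan = df.queryExecution.executedPlan.toString
   assert(plan.contains("PushedFilters"), s"PushedFilters missing from plan:\n$plan")
   ```
   
   Matching on the plan string is brittle if the plan format changes, so this is best treated as a quick sanity check rather than a regression test.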
