You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/10/13 17:14:17 UTC

[GitHub] [spark] sunchao commented on pull request #33639: [SPARK-36645][SQL] Aggregate (Min/Max/Count) push down for Parquet

sunchao commented on pull request #33639:
URL: https://github.com/apache/spark/pull/33639#issuecomment-942538721


   Thanks @sadikovi and @timarmstrong for taking a look! These are great points.
   
   For the case of `UINT_32` and `UINT_64`, I think `parquet-mr` will not emit stats (see [here](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L722)) since the `SortOrder` for unsigned ints is `UNSIGNED`. In this case the code will fail and users should fallback by turning the config off. As an improvement I think we can detect this case and fallback at the compile time instead (see the following)
   
   In the case of binaries, I just realized we should use [`PrimitiveComparator`](https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/schema/PrimitiveComparator.java) since the logic should be different depending on the logical type (e.g., string vs decimal). We should double check on that and add more test coverage.
   
    > it's a bad user experience if they need to turn off a very effective optimisation globally because of a single empty stat column in a (valid!) Parquet file.
   
   Indeed it'll be great if we can have graceful fallback mechanism, by checking the existence of stats at the query compile time instead of runtime. I think this is possible - we can process the block metadata before building the reader in `ParquetPartitionReaderFactory`.
   
   > I believe stats for strings can be truncated
   
   I think this is turned off by default (see [PARQUET-1685](https://issues.apache.org/jira/browse/PARQUET-1685) for more context), but yes it could cause issue if it's enabled. I think to be conservative we can perhaps disable MIN/MAX for binary strings. WDYT?
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org