Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/11/21 22:24:53 UTC

[I] `ParquetExec::statistics()` does not read statistics for many column types (like timestamps, strings, etc) [arrow-datafusion]

alamb opened a new issue, #8295:
URL: https://github.com/apache/arrow-datafusion/issues/8295

   ### Describe the bug
   
   While working on https://github.com/apache/arrow-datafusion/issues/8229 I found another bug that is non-obvious, but that can be clearly seen now thanks to https://github.com/apache/arrow-datafusion/issues/8110 and https://github.com/apache/arrow-datafusion/issues/8111 from @NGA-TRAN
   
   
   
   
   ### To Reproduce
   
   ```sql
   ❯ copy (values ('foo'), ('bar'), ('baz')) to '/tmp/strings.parquet';
   +-------+
   | count |
   +-------+
   | 3     |
   +-------+
   1 row in set. Query took 0.023 seconds.
   ```
   
   Then look at the `explain verbose` output; you can see that no min/max statistics are shown:
   
   ```sql
   ❯ explain verbose select * from '/tmp/strings.parquet';
   
   |                                                            |                                                                                                                                                                |
   | physical_plan_with_stats                                   | ParquetExec: file_groups={1 group: [[private/tmp/strings.parquet]]}, projection=[column1], statistics=[Rows=Exact(3), Bytes=Absent, [(Col[0]: Null=Exact(0))]] |
   |                                                            |                                                                                                                                                                |
   +------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
   80 rows in set. Query took 0.002 seconds.
   ```
   
   ### Expected behavior
   
   I expect there to be min/max values extracted in the statistics for the strings, as there are for integers (`(Col[0]: Min=Exact(Int64(1)) Max=Exact(Int64(3))`)
   
   ```sql
   ❯ copy (values (1), (2), (3)) to '/tmp/ints.parquet';
   +-------+
   | count |
   +-------+
   | 3     |
   +-------+
   1 row in set. Query took 0.023 seconds.
   ```
   
   ```sql
   ❯ explain verbose select * from '/tmp/ints.parquet';
   ...
                                                                                                                  |
   | physical_plan                                              | ParquetExec: file_groups={1 group: [[private/tmp/ints.parquet]]}, projection=[column1]                                                                                                              |
   |                                                            |                                                                                                                                                                                                     |
   | physical_plan_with_stats                                   | ParquetExec: file_groups={1 group: [[private/tmp/ints.parquet]]}, projection=[column1], statistics=[Rows=Exact(3), Bytes=Absent, [(Col[0]: Min=Exact(Int64(1)) Max=Exact(Int64(3)) Null=Exact(0))]] |
   |                                                            |                                                                                                                                                                                                     |
   +------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   ```
   
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] `ParquetExec::statistics()` does not read statistics for many column types (like timestamps, strings, etc) [arrow-datafusion]

Posted by "Weijun-H (via GitHub)" <gi...@apache.org>.
Weijun-H commented on issue #8295:
URL: https://github.com/apache/arrow-datafusion/issues/8295#issuecomment-1925585482

   Could I pick this ticket up?




Re: [I] `ParquetExec::statistics()` does not read statistics for many column types (like timestamps, strings, etc) [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #8295:
URL: https://github.com/apache/arrow-datafusion/issues/8295#issuecomment-1821799616

   I plan to fix this




Re: [I] `ParquetExec::statistics()` does not read statistics for many column types (like timestamps, strings, etc) [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #8295:
URL: https://github.com/apache/arrow-datafusion/issues/8295#issuecomment-1925739871

   At some point there were multiple code paths to extract statistics in parquet (one for file level and one for row group level) that should likely be combined
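
One way to picture a combined code path is to derive the file-level statistics by folding the row-group-level ones. The sketch below is only an illustration under that assumption, not DataFusion's implementation: `ColumnStats` and `merge_row_groups` are hypothetical names, and the rule that a bound is dropped when any row group lacks it is a conservative choice made for this example.

```rust
/// Hypothetical per-row-group summary for one column (not a DataFusion type).
#[derive(Debug, Clone, PartialEq)]
struct ColumnStats<T> {
    min: Option<T>,
    max: Option<T>,
    null_count: Option<u64>,
}

/// Fold row-group statistics into one file-level summary.
/// A bound or count is reported only if every row group supplied it.
/// Returns `None` when there are no row groups at all.
fn merge_row_groups<T: PartialOrd + Clone>(groups: &[ColumnStats<T>]) -> Option<ColumnStats<T>> {
    let (first, rest) = groups.split_first()?;
    let mut acc = first.clone();
    for g in rest {
        // File-level min is the smallest row-group min.
        acc.min = match (acc.min.take(), g.min.clone()) {
            (Some(a), Some(b)) => Some(if b < a { b } else { a }),
            _ => None,
        };
        // File-level max is the largest row-group max.
        acc.max = match (acc.max.take(), g.max.clone()) {
            (Some(a), Some(b)) => Some(if b > a { b } else { a }),
            _ => None,
        };
        // Null counts add up across row groups.
        acc.null_count = match (acc.null_count, g.null_count) {
            (Some(a), Some(b)) => Some(a + b),
            _ => None,
        };
    }
    Some(acc)
}

fn main() {
    let groups = vec![
        ColumnStats { min: Some("bar".to_string()), max: Some("foo".to_string()), null_count: Some(0) },
        ColumnStats { min: Some("baz".to_string()), max: Some("qux".to_string()), null_count: Some(1) },
    ];
    let merged = merge_row_groups(&groups).unwrap();
    println!("{:?}", merged);
}
```

Because the fold only needs `PartialOrd`, the same function works for integers and strings alike once the per-row-group values have been decoded into a typed form.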




Re: [I] `ParquetExec::statistics()` does not read statistics for many column types (like timestamps, strings, etc) [arrow-datafusion]

Posted by "Weijun-H (via GitHub)" <gi...@apache.org>.
Weijun-H commented on issue #8295:
URL: https://github.com/apache/arrow-datafusion/issues/8295#issuecomment-1920898048

   It seems like the issue has been resolved.
   ```
   | physical_plan_with_stats                                   | ParquetExec: file_groups={1 group: [[tmp/ints.parquet]]}, projection=[column1], statistics=[Rows=Exact(3), Bytes=Absent, [(Col[0]: Min=Exact(Int64(1)) Max=Exact(Int64(3)) Null=Exact(0))]] |
   |                                                            |                                                                                                                                                                                             |
   +------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
   ```




Re: [I] `ParquetExec::statistics()` does not read statistics for many column types (like timestamps, strings, etc) [arrow-datafusion]

Posted by "Weijun-H (via GitHub)" <gi...@apache.org>.
Weijun-H commented on issue #8295:
URL: https://github.com/apache/arrow-datafusion/issues/8295#issuecomment-1925687259

   In `fn summarize_min_max`, it cannot handle `ByteArray(ValueStatistics<ByteArray>)` well. Do we need to convert it to a different type like timestamps, strings, etc 🤔 ?
   
   




Re: [I] `ParquetExec::statistics()` does not read statistics for many column types (like timestamps, strings, etc) [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #8295:
URL: https://github.com/apache/arrow-datafusion/issues/8295#issuecomment-1925739693

   > In `fn summarize_min_max`, it cannot handle `ByteArray(ValueStatistics<ByteArray>)` well. Do we need to convert it to a different type like timestamps, strings, etc 🤔 ?
   
   I think there is some subtlety related to decimals as well. The best thing to do is probably to study what the existing code in row_groups does; I think it is here: https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L57
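
To illustrate why a raw byte-array min/max needs type-aware conversion, here is a minimal, standalone sketch. It does not use the `parquet` crate and is not the code in `statistics.rs`; `Logical`, `Value`, `be_bytes_to_i128`, and `decode_stat` are hypothetical names invented for this example. It relies on two facts about the Parquet format: string columns are `BYTE_ARRAY` values holding UTF-8, and a decimal stored in a byte array is a big-endian two's-complement unscaled integer.

```rust
/// Hypothetical logical types that a Parquet BYTE_ARRAY statistic may carry.
enum Logical {
    Utf8,
    /// Decimal with the given scale, stored as a big-endian
    /// two's-complement unscaled integer.
    Decimal { scale: u32 },
}

/// Hypothetical typed result of decoding a min or max statistic.
#[derive(Debug, PartialEq)]
enum Value {
    Utf8(String),
    Decimal { unscaled: i128, scale: u32 },
}

/// Sign-extend 1 to 16 big-endian bytes into an i128.
fn be_bytes_to_i128(bytes: &[u8]) -> Option<i128> {
    if bytes.is_empty() || bytes.len() > 16 {
        return None;
    }
    // Pad on the left with 0xFF for negative values, 0x00 otherwise.
    let fill: u8 = if bytes[0] & 0x80 != 0 { 0xFF } else { 0x00 };
    let mut buf = [fill; 16];
    buf[16 - bytes.len()..].copy_from_slice(bytes);
    Some(i128::from_be_bytes(buf))
}

/// Decode raw statistic bytes according to the column's logical type.
/// Returns `None` when the bytes cannot be interpreted as that type.
fn decode_stat(bytes: &[u8], logical: &Logical) -> Option<Value> {
    match logical {
        Logical::Utf8 => std::str::from_utf8(bytes)
            .ok()
            .map(|s| Value::Utf8(s.to_string())),
        Logical::Decimal { scale } => be_bytes_to_i128(bytes)
            .map(|unscaled| Value::Decimal { unscaled, scale: *scale }),
    }
}

fn main() {
    // The same kind of raw bytes means different things per logical type.
    println!("{:?}", decode_stat(b"bar", &Logical::Utf8));
    println!("{:?}", decode_stat(&[0xFF, 0x85], &Logical::Decimal { scale: 2 }));
}
```

The point of the sketch is that the physical type alone is not enough: the conversion has to consult the column's logical type, which is presumably why a single shared conversion routine is attractive.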




Re: [I] `ParquetExec::statistics()` does not read statistics for many column types (like timestamps, strings, etc) [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #8295:
URL: https://github.com/apache/arrow-datafusion/issues/8295#issuecomment-1821798180

   Note that the pruning predicate code does correctly read the statistics for other types such as strings and timestamps, because it uses a different code path

