Posted to github@arrow.apache.org by "crepererum (via GitHub)" <gi...@apache.org> on 2023/03/03 10:50:07 UTC

[GitHub] [arrow-datafusion] crepererum opened a new issue, #5466: Rework `ParquetExec::metadata_size_hint`

crepererum opened a new issue, #5466:
URL: https://github.com/apache/arrow-datafusion/issues/5466

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   Currently `metadata_size_hint` passed to `ParquetExec` has the following issues:
   
   - **single value:** It is a single value, but a single `ParquetExec` may contain multiple files. These files may require different hints.
   - **serialization:** The value is dropped during protobuf serialization.
   
   Now these two issues are unrelated, but it feels like fixing both in one go makes sense.
   
   **Describe the solution you'd like**
   Move `metadata_size_hint` to `FileScanConfig::file_groups` > `PartitionedFile`:
   
   https://github.com/apache/arrow-datafusion/blob/d11820aa256284f9a817c7b699a548f9c3e1c399/datafusion/core/src/datasource/listing/mod.rs#L51-L64
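   
   A minimal sketch of what this could look like. The struct below is a simplified stand-in for `PartitionedFile`, not the real definition; the `metadata_size_hint` field on it and the `effective_hint` helper are hypothetical names used only for illustration:
   
   ```rust
   /// Simplified stand-in for DataFusion's `PartitionedFile`; only the
   /// fields needed for this illustration are shown.
   #[derive(Debug, Clone)]
   pub struct PartitionedFile {
       /// Path of the file in the object store.
       pub path: String,
       /// Size of the file in bytes.
       pub size: usize,
       /// Hypothetical new field: per-file hint for how many trailing bytes
       /// to fetch when reading the Parquet footer and metadata.
       pub metadata_size_hint: Option<usize>,
   }
   
   /// Hypothetical helper: use the per-file hint if present, otherwise fall
   /// back to a scan-wide default (which may itself be unset).
   pub fn effective_hint(file: &PartitionedFile, scan_default: Option<usize>) -> Option<usize> {
       file.metadata_size_hint.or(scan_default)
   }
   ```
   
   Because the hint would live on each file rather than on `ParquetExec`, it could be serialized together with the rest of the per-file information.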
   
   **Describe alternatives you've considered**
   Leaving the status quo.
   
   **Additional context**
   \-
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on issue #5466: Rework `ParquetExec::metadata_size_hint`

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #5466:
URL: https://github.com/apache/arrow-datafusion/issues/5466#issuecomment-1471818016

   > get metadata_size_hint value in infer_schema and infer_stats functions of ParquetFormat.
   
   I think it is worth distinguishing the logic for catalog inference, i.e. `ListingTable`, from that of query processing, i.e. `FileScanConfig`. Most practical applications will need a `TableProvider` backed by some sort of catalog for reasonable performance. That catalog would be an ideal place to store information such as the footer size, schema, and statistics, which can then be used to populate `FileScanConfig` accurately.
   
   For `TableProvider` that don't have access to this information, such as `ListingTable`, I think it is perfectly acceptable to use a single config value for the metadata size hint
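   
   As a rough sketch of that split (all names here are hypothetical, and this is not DataFusion's API): a catalog-backed provider could look up a recorded footer size per file, while a provider without a catalog falls back to the single config value:
   
   ```rust
   use std::collections::HashMap;
   
   /// Hypothetical catalog entry that records the footer size of one file.
   pub struct CatalogEntry {
       pub footer_size: usize,
   }
   
   /// Hypothetical helper: compute one metadata size hint per file path.
   /// A catalog entry wins when one exists; otherwise the single
   /// configured default is used for that file.
   pub fn hints_for_scan(
       paths: &[&str],
       catalog: Option<&HashMap<String, CatalogEntry>>,
       config_default: Option<usize>,
   ) -> Vec<Option<usize>> {
       paths
           .iter()
           .map(|path| {
               catalog
                   .and_then(|c| c.get(*path))
                   .map(|entry| entry.footer_size)
                   .or(config_default)
           })
           .collect()
   }
   ```
   
   Passing `None` as the catalog corresponds to the `ListingTable` case, where every file gets the same configured value.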



[GitHub] [arrow-datafusion] r4ntix commented on issue #5466: Rework `ParquetExec::metadata_size_hint`

Posted by "r4ntix (via GitHub)" <gi...@apache.org>.
r4ntix commented on issue #5466:
URL: https://github.com/apache/arrow-datafusion/issues/5466#issuecomment-1471796131

   Great suggestion.
   But in addition to moving `metadata_size_hint` to `PartitionedFile`, we also need to consider how to get the `metadata_size_hint` value in the `infer_schema` and `infer_stats` functions of `ParquetFormat`.
   https://github.com/apache/arrow-datafusion/blob/258af4bf69758b6307d191f072ba88a7b847fbe5/datafusion/core/src/datasource/file_format/parquet.rs#L143-L185
   
   Maybe we need to extend `ObjectMeta` in arrow-rs/object_store repo, such as adding an `extensions` field for user defined per object metadata? 
   @tustvold could you provide some suggestions?



[GitHub] [arrow-datafusion] r4ntix commented on issue #5466: Rework `ParquetExec::metadata_size_hint`

Posted by "r4ntix (via GitHub)" <gi...@apache.org>.
r4ntix commented on issue #5466:
URL: https://github.com/apache/arrow-datafusion/issues/5466#issuecomment-1479467422

   > For TableProvider that don't have access to this information, such as ListingTable, I think it is perfectly acceptable to use a single config value for the metadata size hint, without some sort of catalog there isn't really a way around this
   
   @tustvold thanks for the suggestions.
   
   So your suggestion is the following?
   1. Catalog inference: Keep a single config value of `metadata_size_hint` for `infer_schema`.
   2. Query processing: Support `metadata_size_hint` in `FileScanConfig` for `create_physical_plan`.
   
   Also, what about `infer_stats`? When `ListingOptions.collect_stat` is true, each file is scanned in the scan logic of `TableProvider`.

