Posted to issues@arrow.apache.org by "westonpace (via GitHub)" <gi...@apache.org> on 2023/01/26 17:49:29 UTC

[GitHub] [arrow] westonpace opened a new issue, #33888: [Python][C++] Add controls to disable metadata caching in datasets

westonpace opened a new issue, #33888:
URL: https://github.com/apache/arrow/issues/33888

   ### Describe the enhancement requested
   
   Currently, when scanning a dataset, `ParquetFileFragment` caches the file metadata (and, I think, the statistics) as it scans.  The idea is that this makes repeated scans of the dataset faster.
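
   For context, here is a minimal sketch (using the Python bindings; the path is illustrative) of where that cached state lives today: once a `ParquetFileFragment` has parsed its footer, the metadata stays attached to the fragment object for as long as the fragment is alive:

   ```python
   import pyarrow.dataset as ds

   # Open a parquet dataset (path is illustrative).
   dataset = ds.dataset("data/my_table", format="parquet")

   for fragment in dataset.get_fragments():
       # Forces the parquet footer (metadata and statistics) to be read
       # and cached on the fragment object.
       fragment.ensure_complete_metadata()
       print(fragment.path, fragment.metadata.num_row_groups)

   # The cached metadata is retained by the fragments; currently the only
   # way to release it is to drop the fragments/dataset and recreate them.
   ```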
   
   However, in some cases this metadata can be quite large (or there can be a lot of files), and the caching ends up using too much memory.  We should add some kind of option or flag that allows us to disable this caching of metadata, to improve memory usage in these cases.  It could live at the dataset level and doesn't have to be parquet-specific, although parquet would be the only format with a special implementation.
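
   As a sketch of what the control could look like (the `cache_metadata` name is hypothetical, not an existing API):

   ```python
   import pyarrow.dataset as ds

   dataset = ds.dataset("data/my_table", format="parquet")

   # Hypothetical flag: `cache_metadata` does not exist today; it only
   # illustrates the requested option.
   scanner = dataset.scanner(cache_metadata=False)

   for batch in scanner.to_batches():
       ...  # with caching disabled, per-file metadata would be dropped
            # once the file has been scanned
   ```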
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org