Posted to github@arrow.apache.org by "whyzdev (via GitHub)" <gi...@apache.org> on 2023/03/07 01:01:02 UTC

[GitHub] [arrow] whyzdev commented on issue #31174: [C++] Reduce directory and file IO when reading partition parquet dataset with partition key filters

whyzdev commented on issue #31174:
URL: https://github.com/apache/arrow/issues/31174#issuecomment-1457302230

   Looks like this is still an issue as of 11.0.0, but it could perhaps be closed, since
   #16972 is still open; a filtered FileSystemDataset and caching were suggested in the comments there.
   Caching can already be done in Python user code, for example by monkey patching pyarrow dataset._filesystem_dataset. But that works at the level of the full dataset, and it is difficult if not impossible to update incrementally in Python: when one or a few partitions change frequently, there is no way to refresh just those partitions and avoid a full eviction. The FileSystemDataset and its underlying objects live in C++, not Python, so we may need some native support for caching in the Arrow API.
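
   To make the "incremental" part concrete, here is a minimal sketch of the kind of per-partition cache I mean. It is not an Arrow API and touches no Arrow internals: `PartitionListingCache`, `files` and `invalidate` are hypothetical names, and it assumes a local filesystem with one hive-style directory level (e.g. `year=2023`). It only caches directory listings, so a changed partition can be dropped without evicting the others.

```python
import os
from typing import Dict, List


class PartitionListingCache:
    """Hypothetical per-partition cache of file listings (not an Arrow API)."""

    def __init__(self, root: str) -> None:
        self.root = root
        # partition directory -> sorted parquet file paths found in it
        self._entries: Dict[str, List[str]] = {}
        # number of directory listings actually performed (for illustration)
        self.scans = 0

    def files(self, partition: str) -> List[str]:
        """Return the parquet files under one partition, e.g. 'year=2023'."""
        directory = os.path.join(self.root, partition)
        if directory not in self._entries:
            # Cache miss: list only this partition directory.
            self.scans += 1
            self._entries[directory] = sorted(
                entry.path
                for entry in os.scandir(directory)
                if entry.is_file() and entry.name.endswith(".parquet")
            )
        # Return a copy so callers cannot mutate the cached listing.
        return list(self._entries[directory])

    def invalidate(self, partition: str) -> None:
        """Drop one partition's listing; all other partitions stay cached."""
        self._entries.pop(os.path.join(self.root, partition), None)
```

   The caller has to know which partition changed and call `invalidate` for it; until then the listing is stale. The returned paths could then be handed to `pyarrow.dataset.dataset`, which accepts a list of file paths, but that still rebuilds the dataset object on the Python side, which is why native support would be preferable.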
   
   Btw, #9670 (available since 4.0.0) seems to be a separate enhancement for reading a table, not for speeding up the loading of a FileSystemDataset.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org