Posted to github@arrow.apache.org by "westonpace (via GitHub)" <gi...@apache.org> on 2023/03/07 20:52:03 UTC

[GitHub] [arrow] westonpace commented on issue #31174: [C++] Reduce directory and file IO when reading partition parquet dataset with partition key filters

westonpace commented on issue #31174:
URL: https://github.com/apache/arrow/issues/31174#issuecomment-1458858923

   I agree that the ideal place to fix this will probably be in C++. #9670 helps once the dataset has been discovered.  However, it does not help with dataset discovery, which is what I think this issue is referring to.  Dataset discovery is what happens in pyarrow when you run [`pyarrow.dataset.dataset(...)`](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html#pyarrow.dataset.dataset).  Note that this function does not have a filter argument at all.
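   
   For context, here is a minimal sketch of the two phases (the bucket path and partition column are hypothetical, and hive partitioning is assumed):
   
   ```python
   import pyarrow.dataset as ds
   
   # Discovery: walks the directory tree and inspects files up front.
   # There is no way to pass a filter here, so every partition directory
   # gets visited regardless of which partitions you actually need.
   dataset = ds.dataset(
       "s3://my-bucket/my_dataset/",   # hypothetical location
       format="parquet",
       partitioning="hive",
   )
   
   # Filtering only applies afterwards, at scan time; #9670 helps at this
   # stage, but the discovery cost above has already been paid.
   table = dataset.to_table(filter=ds.field("year") == 2023)
   ```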
   
   Setting `exclude_invalid_files=False` should help somewhat (then discovery won't try to open every single file just to check that it is valid).
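   
   A minimal sketch of that, reusing the same hypothetical layout as above:
   
   ```python
   import pyarrow.dataset as ds
   
   # With an explicit format and exclude_invalid_files=False, discovery no
   # longer opens each file to verify it really is parquet; it only needs
   # to list the directory tree.
   dataset = ds.dataset(
       "s3://my-bucket/my_dataset/",   # hypothetical location
       format="parquet",
       partitioning="hive",
       exclude_invalid_files=False,
   )
   ```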
   
   Another issue, specifically related to S3, is that we are listing directories over S3 in an inefficient manner, especially if the script is running in a different region (or outside AWS entirely).  This is tracked by https://github.com/apache/arrow/issues/34213

