Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/02/25 15:30:43 UTC

[GitHub] [arrow] jorisvandenbossche commented on issue #12501: [python] Parallel parquet metadata resolution?

jorisvandenbossche commented on issue #12501:
URL: https://github.com/apache/arrow/issues/12501#issuecomment-1050954366


   It might not be exactly what you need for Ray, but a related issue is that the actual "dataset discovery" (listing all files, etc.) is currently single-threaded, and that is something we might be able to parallelize on Arrow's side: https://issues.apache.org/jira/browse/ARROW-8137
   
   If we had an option to force loading the metadata during dataset discovery (instead of later, when it is first accessed), that could also speed up the serialization of the fragments, since all metadata would already have been read at that point.
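
To make the idea concrete, here is a rough user-side sketch of reading all file metadata up front with a thread pool. This is not how Arrow does discovery and not a proposed API: `load_metadata_eagerly`, its parameters, and the file names in the usage comment are hypothetical, and the reader is passed in as a plain callable so the sketch does not depend on any particular library.

```python
from concurrent.futures import ThreadPoolExecutor


def load_metadata_eagerly(paths, read_footer, max_workers=8):
    """Read per-file metadata for every path up front, in parallel.

    `read_footer` is any callable that takes a path and returns that
    file's metadata (hypothetical parameter name, chosen for this sketch).
    """
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves input order, so each result lines up
        # with the path it was read from.
        footers = list(pool.map(read_footer, paths))
    return dict(zip(paths, footers))


# Possible usage with pyarrow (file names are placeholders):
#   import pyarrow.parquet as pq
#   metadata = load_metadata_eagerly(["a.parquet", "b.parquet"], pq.read_metadata)
```

Threads are used here on the assumption that footer reads are mostly I/O-bound; whether this helps in practice depends on the filesystem and on how the metadata is later attached to the fragments.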

