Posted to issues@arrow.apache.org by "eeroel (via GitHub)" <gi...@apache.org> on 2023/09/25 13:51:25 UTC

[GitHub] [arrow] eeroel opened a new issue, #37857: Allow passing file sizes to FileSystemDataset from Python

eeroel opened a new issue, #37857:
URL: https://github.com/apache/arrow/issues/37857

   ### Describe the enhancement requested
   
   When reading Parquet files from table formats such as Delta Lake, the file sizes are already known from the table format's metadata. However, when building a dataset from fragments with `FileFormat.make_fragment` (https://arrow.apache.org/docs/python/generated/pyarrow.dataset.FileFormat.html#pyarrow.dataset.FileFormat.make_fragment), there is no way to tell PyArrow the file sizes, which leads to unnecessary HEAD requests on S3. Arrow already supports specifying the file size to avoid these requests (https://github.com/apache/arrow/pull/7547), but as far as I can see this is not exposed in PyArrow.
   
   (As a side note, it seems that those HEAD requests in `S3FileSystem` are always executed on the same thread, which leads to poor concurrency when reading multiple files. Is this a known issue?)
   
   I can try to put together a PR with an implementation.
   
   
   ### Component(s)
   
   Parquet, Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] [Python] Allow passing file sizes to FileSystemDataset from Python [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.

pitrou closed issue #37857: [Python] Allow passing file sizes to FileSystemDataset from Python
URL: https://github.com/apache/arrow/issues/37857

