Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/16 10:14:22 UTC

[GitHub] [arrow-datafusion] thinkharderdev commented on issue #2489: Consider adopting IOx ObjectStore abstraction

thinkharderdev commented on issue #2489:
URL: https://github.com/apache/arrow-datafusion/issues/2489#issuecomment-1127484356

   > Yeah as alluded to by @alamb, my plan is to get the iox code released to crates.io so that DataFusion _could_ use it.
   > 
   > There would then be a couple of potential courses of action for DataFusion:
   > 
   > * Do nothing 😄
   > * Migrate to using the `object_store` crate to fetch parquet files to local disk. This would potentially fetch more bytes from object storage, but as described in [RFC: Spill-To-Disk Object Storage Download #2205](https://github.com/apache/arrow-datafusion/issues/2205) this may actually be faster than the current approach. It would also be temporary pending [Push-Based Parquet Reader arrow-rs#1605](https://github.com/apache/arrow-rs/issues/1605)
   > * Wait for [Push-Based Parquet Reader arrow-rs#1605](https://github.com/apache/arrow-rs/issues/1605) and then migrate to using the `object_store` crate
   
   With regard to fetching to local disk: we have an implementation of the (DataFusion) `ObjectStore` in our project that adopts the S3A approach to minimize the number of small range requests. Basically, we set a minimum chunk size for S3 reads (usually 64K). If a read of less than 64K is requested, we go ahead and fetch 64K anyway and buffer it in memory. Subsequent reads that fall within that buffer are served from memory instead of going back to S3. This minimizes the overhead of small range requests from the `PageIterator` while still avoiding reads of columns not required for the query.
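
   To make the buffering scheme concrete, here is a minimal sketch in plain Rust. It is not our actual implementation and it does not use the DataFusion or `object_store` APIs. `RangeSource`, `CountingSource` and `MinChunkReader` are hypothetical names invented for this illustration, and the in-memory source only stands in for S3 so that the number of range requests can be counted. It also assumes that reads at or above the minimum chunk size bypass the buffer, which is a simplification.

```rust
use std::cell::Cell;
use std::io;

/// Hypothetical stand-in for a remote store that can serve byte ranges.
pub trait RangeSource {
    /// Total size of the object in bytes.
    fn size(&self) -> u64;
    /// Fetch `len` bytes starting at `start` (one "range request").
    fn get_range(&self, start: u64, len: usize) -> io::Result<Vec<u8>>;
}

fn out_of_bounds() -> io::Error {
    io::Error::new(io::ErrorKind::UnexpectedEof, "range out of bounds")
}

/// In-memory source that counts how many range requests it has served.
pub struct CountingSource {
    data: Vec<u8>,
    requests: Cell<usize>,
}

impl CountingSource {
    pub fn new(data: Vec<u8>) -> Self {
        Self { data, requests: Cell::new(0) }
    }

    pub fn requests(&self) -> usize {
        self.requests.get()
    }
}

impl RangeSource for CountingSource {
    fn size(&self) -> u64 {
        self.data.len() as u64
    }

    fn get_range(&self, start: u64, len: usize) -> io::Result<Vec<u8>> {
        self.requests.set(self.requests.get() + 1);
        let start = start as usize;
        let end = start
            .checked_add(len)
            .filter(|e| *e <= self.data.len())
            .ok_or_else(out_of_bounds)?;
        Ok(self.data[start..end].to_vec())
    }
}

/// Widens small reads to `min_chunk` bytes and keeps the last widened
/// chunk in memory so that nearby small reads need no further requests.
pub struct MinChunkReader<S: RangeSource> {
    source: S,
    min_chunk: usize,
    buf_start: u64,
    buf: Vec<u8>,
}

impl<S: RangeSource> MinChunkReader<S> {
    pub fn new(source: S, min_chunk: usize) -> Self {
        Self { source, min_chunk, buf_start: 0, buf: Vec::new() }
    }

    pub fn source(&self) -> &S {
        &self.source
    }

    pub fn read_at(&mut self, offset: u64, len: usize) -> io::Result<Vec<u8>> {
        let end = offset
            .checked_add(len as u64)
            .filter(|e| *e <= self.source.size())
            .ok_or_else(out_of_bounds)?;

        // Case 1: the requested range lies entirely inside the buffer.
        let buf_end = self.buf_start + self.buf.len() as u64;
        if offset >= self.buf_start && end <= buf_end {
            let lo = (offset - self.buf_start) as usize;
            return Ok(self.buf[lo..lo + len].to_vec());
        }

        // Case 2: large reads go straight to the source (assumption).
        if len >= self.min_chunk {
            return self.source.get_range(offset, len);
        }

        // Case 3: small read. Fetch a full chunk, clamped to the object end.
        let remaining = self.source.size() - offset;
        let fetch_len = (self.min_chunk as u64).min(remaining) as usize;
        self.buf = self.source.get_range(offset, fetch_len)?;
        self.buf_start = offset;
        Ok(self.buf[..len].to_vec())
    }
}
```

   With a 64K minimum chunk, a 10-byte read at offset 100 issues one 64K request, and a later 1,000-byte read at offset 5,000 is served from memory. A real implementation would also need to handle async I/O and decide how many buffered chunks to keep; this sketch keeps only the most recent one.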


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org