Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/13 10:44:41 UTC

[GitHub] [arrow-datafusion] alamb commented on issue #2205: RFC: Spill-To-Disk Object Storage Download

alamb commented on issue #2205:
URL: https://github.com/apache/arrow-datafusion/issues/2205#issuecomment-1097902278

   I see two major, and somewhat orthogonal, use cases:
   
   *Use case*: Multiple reads of unpredictable column / row group subsets of the same file (e.g. IOx)
   *Optimal*: Read data to local file
   
   *Use case*: Single read of a subset of column / row groups (e.g. Cloud Fuse, other "analytics on S3 parquet files")
   *Optimal*: Read only the subset of the data that is needed into memory, discard after decode
   
   I have been hoping our ObjectStore interface would allow for both use cases.
   
   In terms of the "many small requests to S3" problem, I was imagining that the S3 ObjectStore implementation would implement "pre-fetching" internally (the same way local filesystems do) to coalesce multiple small requests into fewer larger ones.  This strategy is particularly effective if we know what parts of the file are likely to be needed.
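   The coalescing idea can be sketched as a small sort-and-merge pass. This is a minimal sketch, not the actual ObjectStore implementation; `coalesce_ranges` and the `max_gap` tuning knob are hypothetical names for illustration:
   
   ```rust
   use std::ops::Range;
   
   // Sketch: merge many small byte-range requests into fewer larger ones
   // whenever the gap between adjacent ranges is small enough that
   // over-reading a few bytes is cheaper than an extra round trip to S3.
   fn coalesce_ranges(mut ranges: Vec<Range<u64>>, max_gap: u64) -> Vec<Range<u64>> {
       ranges.sort_by_key(|r| r.start);
       let mut out: Vec<Range<u64>> = Vec::new();
       for r in ranges {
           match out.last_mut() {
               // Extend the previous request if this one starts within `max_gap`
               Some(prev) if r.start <= prev.end + max_gap => {
                   prev.end = prev.end.max(r.end);
               }
               // Otherwise start a new request
               _ => out.push(r),
           }
       }
       out
   }
   
   fn main() {
       // Three nearby column-chunk reads collapse into one GET request;
       // the distant one stays separate.
       let merged = coalesce_ranges(vec![0..100, 150..300, 310..400, 10_000..10_500], 1024);
       assert_eq!(merged, vec![0..400, 10_000..10_500]);
   }
   ```
   
   The trade-off is controlled by the gap threshold: a larger value means fewer, bigger requests at the cost of fetching bytes that are discarded.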
   
   Conveniently, the parquet format is quite amenable to this: once the reader has decided to scan a row group, it also knows exactly which byte ranges (offsets) of the file it needs.
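   To illustrate why parquet cooperates here: the footer metadata records each column chunk's offset and length, so the byte ranges for a projected scan are computable before any data is read. The structs and `ranges_for_scan` below are simplified stand-ins for illustration, not the actual parquet-rs API:
   
   ```rust
   use std::ops::Range;
   
   // Simplified stand-ins for the column-chunk metadata parsed from a
   // parquet footer (the real format carries much more per-chunk detail).
   struct ColumnChunkMeta { offset: u64, length: u64 }
   struct RowGroupMeta { columns: Vec<ColumnChunkMeta> }
   
   // Byte ranges needed to scan the projected column indices in one row
   // group -- exactly the hint a pre-fetching ObjectStore could use.
   fn ranges_for_scan(rg: &RowGroupMeta, projection: &[usize]) -> Vec<Range<u64>> {
       projection
           .iter()
           .map(|&i| {
               let c = &rg.columns[i];
               c.offset..c.offset + c.length
           })
           .collect()
   }
   
   fn main() {
       let rg = RowGroupMeta {
           columns: vec![
               ColumnChunkMeta { offset: 4, length: 1000 },
               ColumnChunkMeta { offset: 1004, length: 2000 },
               ColumnChunkMeta { offset: 3004, length: 500 },
           ],
       };
       // Projecting columns 0 and 2: the needed ranges are known up front.
       let ranges = ranges_for_scan(&rg, &[0, 2]);
       assert_eq!(ranges, vec![4..1004, 3004..3504]);
   }
   ```
   
   Those exact ranges are what the coalescing step in the ObjectStore implementation would then fold into a small number of larger requests.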

