Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/15 12:13:46 UTC

[GitHub] [arrow-datafusion] steveloughran commented on issue #2205: RFC: Spill-To-Disk Object Storage Download

steveloughran commented on issue #2205:
URL: https://github.com/apache/arrow-datafusion/issues/2205#issuecomment-1100069800

   Choosing when/how to scan and prefetch in object stores is a genuinely tricky business.
   
   The abfs and gcs connectors do forward prefetching in block sizes you can configure in Hadoop site/job settings, caching the blocks in memory. The more prefetching you do, the more likely a large process is to run out of memory.
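   The memory tradeoff above can be illustrated with a toy model. This is a sketch only: `PrefetchingReader`, `block_size`, and `depth` are invented names for illustration, not Hadoop APIs.

```python
# Toy model of forward prefetching as the abfs/gcs connectors do it:
# every read also fetches the next `depth` blocks of `block_size` bytes
# into an in-memory cache. Memory held grows with block_size * depth.

class PrefetchingReader:
    def __init__(self, data: bytes, block_size: int, depth: int):
        self.data = data          # stands in for the remote object
        self.block_size = block_size
        self.depth = depth
        self.cache = {}           # block index -> bytes held in memory

    def _fetch_block(self, idx: int) -> bytes:
        start = idx * self.block_size
        return self.data[start:start + self.block_size]

    def read(self, pos: int, length: int) -> bytes:
        first = pos // self.block_size
        last = (pos + length - 1) // self.block_size
        # fetch the blocks needed now, plus `depth` blocks of readahead
        for idx in range(first, last + 1 + self.depth):
            if idx * self.block_size < len(self.data) and idx not in self.cache:
                self.cache[idx] = self._fetch_block(idx)
        blob = b"".join(self.cache[i] for i in range(first, last + 1))
        off = pos - first * self.block_size
        return blob[off:off + length]

    def cached_bytes(self) -> int:
        return sum(len(b) for b in self.cache.values())
```

   A single 100-byte read with `block_size=512, depth=2` pins three full blocks (1536 bytes) in memory; scale that to many streams in one process and the OOM risk described above follows.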
   
   s3a doesn't, and we've been getting complaints about the lack of buffering in the client. It does have different seek policies; look at fs.s3a.experimental.fadvise and fs.s3a.readahead.range.
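   For example, in core-site.xml (the values here are illustrative, not recommendations; fadvise accepts normal, sequential, or random):

```xml
<property>
  <name>fs.s3a.experimental.fadvise</name>
  <value>random</value>
</property>
<property>
  <name>fs.s3a.readahead.range</name>
  <value>256K</value>
</property>
```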
   
   You can set the seek policy cluster-wide or, if you use the openFile() API, per file when opening it.
   
   We have two big pieces of work ongoing to help mitigate this, both in feature branches right now:
   * HADOOP-18103: vectored IO API. It will be available on all FSDataInputStream instances; object stores can improve on the base implementation with range coalescing and fetching different ranges in parallel (s3a will be first for this).
   * HADOOP-18028: a high-performance S3A input stream with prefetching & caching to local disk. The feature branch works, but for broader adoption we again need to deal with memory/buffer use and some other issues.
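   The range coalescing mentioned for the vectored IO work can be sketched as follows. This is an illustrative sketch of the general technique, not the actual Hadoop algorithm; `coalesce_ranges` and `max_gap` are names invented here.

```python
def coalesce_ranges(ranges, max_gap):
    """Merge (offset, length) read ranges whose gap is at most max_gap,
    so a vectored read can be served by fewer, larger GET requests."""
    merged = []
    for off, length in sorted(ranges):
        if merged and off - (merged[-1][0] + merged[-1][1]) <= max_gap:
            # close enough to the previous range: extend it
            prev_off, prev_len = merged[-1]
            merged[-1] = (prev_off, max(prev_len, off + length - prev_off))
        else:
            merged.append((off, length))
    return merged
```

   With `max_gap=64`, the ranges `(0, 100)` and `(150, 100)` (a 50-byte gap) coalesce into one 250-byte fetch, trading a little wasted transfer for one fewer round trip, while a distant range stays separate.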
   It would be really good to have you involved in reviewing/testing the vectored IO API (yes, we want a native binding too) and the prefetching work, and indeed to get good traces of how your library reads files.
   
   Note also that the s3a and abfs connectors collect/report statistics through the IOStatistics interface. Even if you build against Hadoop versions which don't have that:
   1. If you call toString() on the streams, you get a good summary of what IO took place in that stream alone; log this at debug.
   2. On Hadoop 3.3.2, set "fs.iostatistics.logging.level" to "info" and you get a full filesystem stats dump when the fs instance is closed.
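   As a configuration fragment, that second option looks like:

```xml
<property>
  <name>fs.iostatistics.logging.level</name>
  <value>info</value>
</property>
```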


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org