Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/04/27 17:35:32 UTC

[GitHub] [arrow] lidavidm edited a comment on pull request #10145: ARROW-12522: [C++] Add ReadRangeCache::WaitFor

lidavidm edited a comment on pull request #10145:
URL: https://github.com/apache/arrow/pull/10145#issuecomment-827784250


   > I wonder if, given a bunch of small record batches, we might sometimes want to coalesce across record batches. I think the current design preempts that. Although I think there would be more challenges than just this tool to tackle that problem.
   
   So overall, the usage pattern for this class is:
   
   1. `Cache()` all byte ranges you expect to read in the future, at the granularity at which you expect to read them. So you'd call `Cache()` for every record batch (IPC) or for every column chunk (Parquet).
   2. `WaitFor()` the ranges that you need. For IPC, this would again be one record batch; for Parquet, this would be one row group's worth of column chunks. This can be done in parallel/reentrantly, which is why we need the lock in the lazy variant.
   3. `Read()` the ranges that you need.
   
   Since all the byte ranges are given up front, you do get coalescing across record batches/column chunks.
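
To make the three steps concrete, here is a minimal sketch using a toy, synchronous, in-memory cache. It is not Arrow's `ReadRangeCache`: `ToyRangeCache`, `Range`, `hole_limit`, and `io_calls()` are hypothetical names invented for illustration, the "file" is a `std::string`, and `WaitFor()` loads data immediately instead of returning a future. The coalescing rule (merge neighbours separated by a small gap) is also an assumption, chosen only to show why registering ranges up front matters.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical byte range; a stand-in, not Arrow's ReadRange.
struct Range {
  int64_t offset;
  int64_t length;
};

// Toy lazy cache over an in-memory "file" (illustration only).
class ToyRangeCache {
 public:
  ToyRangeCache(std::string file, int64_t hole_limit)
      : file_(std::move(file)), hole_limit_(hole_limit) {
    assert(hole_limit >= 0);
  }

  // Step 1: register every range up front. Neighbouring ranges whose gap is
  // at most hole_limit_ are merged into one entry. This toy expects a single
  // Cache() call that covers all ranges.
  void Cache(std::vector<Range> ranges) {
    std::lock_guard<std::mutex> lock(mutex_);
    std::sort(ranges.begin(), ranges.end(),
              [](const Range& a, const Range& b) { return a.offset < b.offset; });
    for (const Range& r : ranges) {
      const int64_t end = r.offset + r.length;
      if (!entries_.empty()) {
        Entry& last = entries_.back();
        const int64_t last_end = last.range.offset + last.range.length;
        if (r.offset - last_end <= hole_limit_) {
          last.range.length = std::max(last_end, end) - last.range.offset;
          continue;
        }
      }
      entries_.push_back(Entry{r, false, std::string()});
    }
  }

  // Step 2: load the entries covering the requested ranges, if not loaded yet.
  // The lock keeps concurrent callers from loading the same entry twice.
  void WaitFor(const std::vector<Range>& ranges) {
    std::lock_guard<std::mutex> lock(mutex_);
    for (const Range& r : ranges) {
      Entry* e = Find(r);
      if (e != nullptr && !e->loaded) {
        e->data = file_.substr(static_cast<size_t>(e->range.offset),
                               static_cast<size_t>(e->range.length));
        e->loaded = true;
        ++io_calls_;  // one simulated I/O per coalesced entry
      }
    }
  }

  // Step 3: slice the requested bytes out of the loaded entry.
  std::string Read(const Range& r) {
    std::lock_guard<std::mutex> lock(mutex_);
    Entry* e = Find(r);
    if (e == nullptr || !e->loaded) {
      throw std::out_of_range("range was not cached or not waited for");
    }
    return e->data.substr(static_cast<size_t>(r.offset - e->range.offset),
                          static_cast<size_t>(r.length));
  }

  int io_calls() const { return io_calls_; }

 private:
  struct Entry {
    Range range;
    bool loaded;
    std::string data;
  };

  // Returns the entry that fully contains r, or nullptr if there is none.
  Entry* Find(const Range& r) {
    for (Entry& e : entries_) {
      if (e.range.offset <= r.offset &&
          r.offset + r.length <= e.range.offset + e.range.length) {
        return &e;
      }
    }
    return nullptr;
  }

  std::string file_;
  int64_t hole_limit_;
  std::vector<Entry> entries_;
  std::mutex mutex_;
  int io_calls_ = 0;
};
```

With a 32-byte file, a hole limit of 2, and the ranges `{0, 4}`, `{4, 4}`, `{8, 4}`, `{28, 4}` passed to `Cache()`, the first three ranges end up in one entry. Calling `WaitFor()` on `{4, 4}` alone then loads that whole entry, so a later `Read()` of `{8, 4}` is served without another simulated I/O, while `{28, 4}` still needs its own `WaitFor()`.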

