Posted to github@arrow.apache.org by "tustvold (via GitHub)" <gi...@apache.org> on 2023/04/24 17:07:30 UTC

[GitHub] [arrow-rs] tustvold commented on issue #4118: `ParquetRecordBatchReader` reads overlapping byte ranges

tustvold commented on issue #4118:
URL: https://github.com/apache/arrow-rs/issues/4118#issuecomment-1520536549

   This is a consequence of #2464, which causes a `ChunkReader` to be created per page instead of per row group. That change was made to enable page-level predicate pushdown. We should definitely improve the documentation around `ChunkReader` and its implicit assumptions regarding buffering at the application and/or OS level. I will add this to my list.
   
   The reason for the overlapping byte ranges is that, if the `OffsetIndex` isn't read, the reader doesn't know where the pages are located or even how many there are; it knows only the end position of the column chunk. It therefore has to assume that any given page may run to the end of that range. If you enable reading the [PageIndex](https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderOptions.html#method.with_page_index), it shouldn't perform overlapping reads (although it will then need to perform additional IO to read the page index).
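   
   To make the difference concrete, here is a simplified, std-only model of the byte ranges requested in each case. It is not the parquet crate's code: the function names and the page offsets are hypothetical, and it assumes the reader opens one range per page as described above.

```rust
use std::ops::Range;

/// Without the offset index the reader only knows where the current page
/// starts and where the column chunk ends, so each request runs to the end.
/// (Hypothetical helper, for illustration only.)
fn ranges_without_offset_index(page_starts: &[u64], chunk_end: u64) -> Vec<Range<u64>> {
    page_starts.iter().map(|&start| start..chunk_end).collect()
}

/// With the offset index each page has a known (offset, compressed length),
/// so each request can cover exactly one page.
/// (Hypothetical helper, for illustration only.)
fn ranges_with_offset_index(page_locations: &[(u64, u64)]) -> Vec<Range<u64>> {
    page_locations
        .iter()
        .map(|&(offset, len)| offset..offset + len)
        .collect()
}

/// Returns true if any range starts before the previous one ends.
/// Assumes the ranges are sorted by start offset.
fn has_overlap(ranges: &[Range<u64>]) -> bool {
    ranges.windows(2).any(|pair| pair[1].start < pair[0].end)
}

fn main() {
    // Made-up column chunk: three pages, chunk ends at byte 1000.
    let locations = [(100, 300), (400, 250), (650, 350)];
    let starts: Vec<u64> = locations.iter().map(|&(offset, _)| offset).collect();

    let without = ranges_without_offset_index(&starts, 1000);
    let with = ranges_with_offset_index(&locations);

    println!("without offset index: {:?} overlap={}", without, has_overlap(&without));
    println!("with offset index:    {:?} overlap={}", with, has_overlap(&with));
}
```

   In this model every request made without the offset index extends to the end of the column chunk, which is why a `ChunkReader` that performs no buffering ends up being asked for the same trailing bytes repeatedly.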
   
   Taking a step back, I wonder if you've considered using the [async_reader](https://docs.rs/parquet/latest/parquet/arrow/async_reader/index.html). Not only does this provide a native async interface, but the [AsyncFileReader](https://docs.rs/parquet/latest/parquet/arrow/async_reader/trait.AsyncFileReader.html) interface naturally lends itself to IO pre-fetching. There is even out-of-the-box integration with [object_store](https://docs.rs/parquet/latest/parquet/arrow/async_reader/struct.ParquetObjectReader.html).

