Posted to github@arrow.apache.org by "suremarc (via GitHub)" <gi...@apache.org> on 2023/04/24 21:57:45 UTC

[GitHub] [arrow-rs] suremarc commented on issue #4090: Preload page index for async ParquetObjectReader

suremarc commented on issue #4090:
URL: https://github.com/apache/arrow-rs/issues/4090#issuecomment-1520882689

   This is true, but there is no easy way to deserialize the page index asynchronously. The easiest way I have found so far is to look up the relevant page index offsets, fetch those bytes, and load them into a custom implementation of [`ChunkReader`](https://docs.rs/parquet/latest/parquet/file/reader/trait.ChunkReader.html) that holds the serialized page index in memory. The synchronous functions in [`index_reader`](https://docs.rs/parquet/latest/parquet/file/page_index/index_reader/index.html) can then read from that structure to deserialize the column and offset indexes.
   
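A rough, std-only sketch of the in-memory reader idea follows. It is not code against the `parquet` crate: `RangeRead`, `PrefetchedRange`, and `covering_range` are hypothetical names made up for illustration, and `RangeRead` is only a stand-in for `ChunkReader`, whose real method signatures may differ between versions. It assumes the bytes have already been fetched asynchronously in one request, and that the (offset, length) pairs come from the column index and offset index locations recorded in the column chunk metadata. The wrapper translates absolute file offsets into positions in the prefetched buffer, so synchronous code can keep asking for file offsets without doing any I/O.

```rust
use std::io;
use std::ops::Range;

/// Hypothetical stand-in for a synchronous range reader.
/// This is NOT the parquet crate's `ChunkReader` trait.
trait RangeRead {
    /// Return `length` bytes starting at absolute file offset `start`.
    fn read_range(&self, start: u64, length: usize) -> io::Result<Vec<u8>>;
}

/// Bytes fetched ahead of time, covering the file range
/// `base_offset .. base_offset + data.len()`.
struct PrefetchedRange {
    base_offset: u64,
    data: Vec<u8>,
}

fn not_prefetched() -> io::Error {
    io::Error::new(
        io::ErrorKind::UnexpectedEof,
        "requested bytes were not prefetched",
    )
}

impl RangeRead for PrefetchedRange {
    fn read_range(&self, start: u64, length: usize) -> io::Result<Vec<u8>> {
        let base = self.base_offset;
        let limit = base.saturating_add(self.data.len() as u64);
        let end = start
            .checked_add(length as u64)
            .ok_or_else(not_prefetched)?;
        // Reject any request that falls outside the prefetched window.
        if start < base || end > limit {
            return Err(not_prefetched());
        }
        // Translate the absolute file offset into a buffer position.
        let lo = (start - base) as usize;
        Ok(self.data[lo..lo + length].to_vec())
    }
}

/// Smallest single byte range covering every (offset, length) pair,
/// so that all index structures can be fetched in one request.
/// Returns `None` when there are no entries.
fn covering_range(entries: &[(u64, u64)]) -> Option<Range<u64>> {
    let start = entries.iter().map(|&(offset, _)| offset).min()?;
    let end = entries
        .iter()
        .map(|&(offset, length)| offset + length)
        .max()?;
    Some(start..end)
}
```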
   The above approach works, although it is extremely hacky, so I'd ask that the maintainers of this library at least consider exposing a built-in way to deserialize the page index in async code. Again, an async counterpart of [`index_reader`](https://docs.rs/parquet/latest/parquet/file/page_index/index_reader/index.html) could work if additional changes to `ParquetObjectReader` are not desired.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org