Posted to github@arrow.apache.org by "suremarc (via GitHub)" <gi...@apache.org> on 2023/04/14 20:23:32 UTC

[GitHub] [arrow-rs] suremarc opened a new issue, #4090: Preload page index for async ParquetObjectReader

suremarc opened a new issue, #4090:
URL: https://github.com/apache/arrow-rs/issues/4090

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   Currently the [`ParquetMetaData`](https://docs.rs/parquet/latest/parquet/file/metadata/struct.ParquetMetaData.html) object has optional fields for the column and offset indexes, which are initially unpopulated. When the `ArrowReaderBuilder` is created with `ArrowReaderOptions::with_page_index(true)`, it loads the page index at query time. This is potentially suboptimal, because every query then pays the latency of an extra request, typically to high-latency object storage.
   
   **Describe the solution you'd like**
   A new method for the `ParquetObjectReader` that toggles loading the page index at construction time, something like this:
   ```rust
   impl ParquetObjectReader {
       /// Toggle whether the page index is loaded at construction time.
       pub fn preload_page_index(mut self, should_preload: bool) -> Self {
           self.preload_page_index = should_preload;
           self
       }
   }
   ```
   
   This would trigger conditional logic in the `get_metadata` function, so that it returns metadata with the page index already loaded.
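
   As a rough illustration of that conditional logic, here is a minimal, synchronous sketch. Every type and method in it (`SketchObjectReader`, `SketchMetadata`, `SketchPageIndex`, `with_preload_page_index`, the `requests` counter) is a hypothetical stand-in rather than the parquet crate's API, and the real `get_metadata` is asynchronous. It only models the number of round trips to object storage.

   ```rust
   /// Hypothetical stand-in for the column and offset indexes.
   #[derive(Debug, Clone, PartialEq)]
   pub struct SketchPageIndex {
       pub column_index: Vec<u8>,
       pub offset_index: Vec<u8>,
   }

   /// Hypothetical stand-in for file metadata with an optional page index.
   #[derive(Debug, Clone, PartialEq)]
   pub struct SketchMetadata {
       pub num_rows: u64,
       pub page_index: Option<SketchPageIndex>,
   }

   /// Simplified model of a reader backed by object storage.
   pub struct SketchObjectReader {
       preload_page_index: bool,
       /// Counts simulated round trips to object storage.
       pub requests: usize,
   }

   impl SketchObjectReader {
       pub fn new() -> Self {
           Self { preload_page_index: false, requests: 0 }
       }

       /// Builder-style toggle, mirroring the proposal above.
       pub fn with_preload_page_index(mut self, should_preload: bool) -> Self {
           self.preload_page_index = should_preload;
           self
       }

       /// Simulated request for the file footer (placeholder values).
       fn fetch_footer(&mut self) -> SketchMetadata {
           self.requests += 1;
           SketchMetadata { num_rows: 100, page_index: None }
       }

       /// Simulated request for the page index bytes (placeholder values).
       fn fetch_page_index(&mut self) -> SketchPageIndex {
           self.requests += 1;
           SketchPageIndex { column_index: vec![1], offset_index: vec![2] }
       }

       /// Attach the page index only when the toggle is set.
       pub fn get_metadata(&mut self) -> SketchMetadata {
           let mut metadata = self.fetch_footer();
           if self.preload_page_index {
               metadata.page_index = Some(self.fetch_page_index());
           }
           metadata
       }
   }
   ```

   With the toggle off, `get_metadata` costs one simulated request and leaves `page_index` as `None`. With it on, the second request happens once up front instead of at query time.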
   
   **Describe alternatives you've considered**
   A public async API for deserializing the column & offset index, similar to [`index_reader`](https://docs.rs/parquet/latest/parquet/file/page_index/index_reader/index.html) but with async support and integrated with `AsyncFileReader` to enable coalescing of multiple fetches. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on issue #4090: Preload page index for async ParquetObjectReader

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #4090:
URL: https://github.com/apache/arrow-rs/issues/4090#issuecomment-1552715735

   `label_issue.py` automatically added labels {'parquet'} from #4142




[GitHub] [arrow-rs] tustvold closed issue #4090: Preload page index for async ParquetObjectReader

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold closed issue #4090: Preload page index for async ParquetObjectReader
URL: https://github.com/apache/arrow-rs/issues/4090




[GitHub] [arrow-rs] tustvold commented on issue #4090: Preload page index for async ParquetObjectReader

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #4090:
URL: https://github.com/apache/arrow-rs/issues/4090#issuecomment-1522109505

   :+1: I will spend some time working out a better API for this; #3851 is also related.




[GitHub] [arrow-rs] alamb commented on issue #4090: Preload page index for async ParquetObjectReader

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #4090:
URL: https://github.com/apache/arrow-rs/issues/4090#issuecomment-1550340125

   PR https://github.com/apache/arrow-rs/pull/4216




[GitHub] [arrow-rs] suremarc commented on issue #4090: Preload page index for async ParquetObjectReader

Posted by "suremarc (via GitHub)" <gi...@apache.org>.
suremarc commented on issue #4090:
URL: https://github.com/apache/arrow-rs/issues/4090#issuecomment-1520882689

   This is true, but there is no easy way to deserialize the page index asynchronously. The easiest way I have found so far is to fetch the byte ranges that hold the page index, store those serialized bytes in a custom in-memory implementation of [`ChunkReader`](https://docs.rs/parquet/latest/parquet/file/reader/trait.ChunkReader.html), and then use the synchronous functions in [`index_reader`](https://docs.rs/parquet/latest/parquet/file/page_index/index_reader/index.html) to deserialize them into the column and offset indexes.
   
   This approach works, although it is extremely hacky, so I'd ask that the maintainers of this library at least consider exposing a built-in way to deserialize the page index in async code. Again, something like [`index_reader`](https://docs.rs/parquet/latest/parquet/file/page_index/index_reader/index.html) with async support could work if additional changes to the `ParquetObjectReader` are not desired.
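
   To make the workaround concrete, here is a minimal sketch of the in-memory holder it relies on. It does not implement the real `ChunkReader` trait. `PrefetchedRange` and `get_bytes` are hypothetical names, and the sketch assumes the page index bytes were already fetched as one contiguous range starting at a known file offset. The only job of the structure is to translate absolute file offsets into positions inside the prefetched buffer, so that synchronous decoding code can read from it.

   ```rust
   use std::convert::TryFrom;

   /// Hypothetical in-memory holder for one prefetched byte range of a file.
   pub struct PrefetchedRange {
       /// Absolute file offset of `data[0]`.
       base_offset: u64,
       data: Vec<u8>,
   }

   impl PrefetchedRange {
       pub fn new(base_offset: u64, data: Vec<u8>) -> Self {
           Self { base_offset, data }
       }

       /// Serve a synchronous read addressed by absolute file offset.
       /// Fails if any part of the request lies outside the prefetched range.
       pub fn get_bytes(&self, start: u64, length: usize) -> Result<&[u8], String> {
           // Translate the absolute offset into an index into `data`.
           let rel = start
               .checked_sub(self.base_offset)
               .ok_or_else(|| format!("offset {} precedes the prefetched range", start))?;
           let rel = usize::try_from(rel).map_err(|e| e.to_string())?;
           let end = rel
               .checked_add(length)
               .ok_or_else(|| "requested range overflows usize".to_string())?;
           self.data
               .get(rel..end)
               .ok_or_else(|| format!("{} bytes at offset {} exceed the prefetched range", length, start))
       }
   }
   ```

   For example, a `PrefetchedRange::new(1000, bytes)` serves `get_bytes(1001, 2)` from `bytes[1..3]` and returns an error for any offset below 1000.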




[GitHub] [arrow-rs] tustvold commented on issue #4090: Preload page index for async ParquetObjectReader

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #4090:
URL: https://github.com/apache/arrow-rs/issues/4090#issuecomment-1509756217

   The design of `AsyncFileReader` already allows this: implementations may override [`get_metadata`](https://docs.rs/parquet/latest/parquet/arrow/async_reader/trait.AsyncFileReader.html#tymethod.get_metadata) and return metadata that already has the page index loaded.
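
   A minimal sketch of that override pattern, using a simplified synchronous trait in place of the real asynchronous `AsyncFileReader`. `SketchFileReader`, `SketchMetadata`, and `PreloadedReader` are hypothetical names used only for illustration. The reader is handed fully loaded metadata once and returns a shared handle to it on every call, so no query has to load the page index again.

   ```rust
   use std::sync::Arc;

   /// Hypothetical stand-in for file metadata; not the parquet crate's type.
   #[derive(Debug)]
   pub struct SketchMetadata {
       pub page_index_loaded: bool,
   }

   /// Hypothetical, synchronous stand-in for a reader trait with a metadata hook.
   pub trait SketchFileReader {
       fn get_metadata(&mut self) -> Arc<SketchMetadata>;
   }

   /// Reader that receives fully loaded metadata up front and reuses it.
   pub struct PreloadedReader {
       metadata: Arc<SketchMetadata>,
   }

   impl PreloadedReader {
       pub fn new(metadata: Arc<SketchMetadata>) -> Self {
           Self { metadata }
       }
   }

   impl SketchFileReader for PreloadedReader {
       fn get_metadata(&mut self) -> Arc<SketchMetadata> {
           // No I/O here: cloning an Arc only bumps a reference count.
           Arc::clone(&self.metadata)
       }
   }
   ```

   Sharing the metadata behind an `Arc` means several readers over the same file can reuse one loaded page index.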

