Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/24 12:14:50 UTC

[GitHub] [arrow-rs] alamb commented on issue #1605: Push-Based Parquet Reader

alamb commented on issue #1605:
URL: https://github.com/apache/arrow-rs/issues/1605#issuecomment-1107829153

   All in all I like this proposal. Thank you for writing it down. I know it is not based on @jorgecarleitao's https://github.com/jorgecarleitao/parquet2, but the API is not dissimilar: the metadata reading / management is handled outside the main parquet decoding logic -- see the [example](https://github.com/jorgecarleitao/parquet2/blob/main/examples/read_metadata.rs#L46-L64). I see this similarity as a good sign. 👍
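   
   To make the "metadata handled outside the decoding logic" idea concrete, here is a minimal std-only sketch. Every name in it (`RowGroupMeta`, `plan_fetches`, `PushDecoder`) is hypothetical and made up for illustration; none of them is the proposal's API or a type from the `parquet` or `parquet2` crates. The caller uses the metadata to decide which bytes to fetch, and the decoder only ever sees bytes that are pushed into it:
   
   ```rust
   use std::ops::Range;
   
   /// Hypothetical stand-in for footer metadata (not the parquet crate's type).
   #[derive(Debug, Clone, PartialEq)]
   pub struct RowGroupMeta {
       /// Byte range this row group occupies in the file.
       pub byte_range: Range<usize>,
       pub num_rows: usize,
   }
   
   /// Caller-side step: pick the byte ranges to fetch. No IO, no decoding.
   /// Indices that are out of bounds are skipped.
   pub fn plan_fetches(row_groups: &[RowGroupMeta], wanted: &[usize]) -> Vec<Range<usize>> {
       wanted
           .iter()
           .filter_map(|&i| row_groups.get(i))
           .map(|rg| rg.byte_range.clone())
           .collect()
   }
   
   /// Hypothetical decoder that never touches the file; bytes are pushed into it.
   #[derive(Default)]
   pub struct PushDecoder {
       pub rows_seen: usize,
   }
   
   impl PushDecoder {
       /// Accept the bytes for one row group. Real decoding is elided;
       /// only the row count is tracked.
       pub fn push(&mut self, meta: &RowGroupMeta, bytes: &[u8]) -> Result<(), String> {
           if bytes.len() != meta.byte_range.len() {
               return Err(format!(
                   "expected {} bytes, got {}",
                   meta.byte_range.len(),
                   bytes.len()
               ));
           }
           self.rows_seen += meta.num_rows;
           Ok(())
       }
   }
   ```
   
   With this split, how the bytes are obtained (local file, object store, cache) is entirely the caller's business, which is what I take to be the appeal of the push-based design.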
   
   I think it is important to sketch out what the interface for existing users of the `ParquetRecordBatchReader` would be, not just to help migration but also to ensure that all use cases are satisfied (I am happy to help with this).
   
   Maybe we can provide functions like the following for basic use to both ease migration and to demonstrate how to use this API:
   ```rust
   fn scan_file(file: impl ChunkReader) -> Result<ParquetRecordBatchReader> {
       // body elided: only the signature is being sketched here
       todo!()
   }
   ```
   
   ```rust
   async fn async_scan_file(file: impl AsyncRead) -> Result<ParquetRecordBatchReader> {
       // buffers / fetches whatever is needed locally to memory
       todo!()
   }
   ```
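   
   As a rough illustration of the "buffer whatever is needed locally" approach, here is a std-only sketch that uses the blocking `std::io::Read` as a stand-in for `AsyncRead`. The names `scan_bytes` and `scan_buffered` are hypothetical, and the in-memory path only checks the `PAR1` magic bytes that open and close a Parquet file rather than doing any real decoding:
   
   ```rust
   use std::io::{self, Read};
   
   /// Parquet files start and end with these four bytes.
   const MAGIC: &[u8; 4] = b"PAR1";
   
   /// Hypothetical in-memory entry point: validates the magic bytes and
   /// returns the buffer length. A real version would set up the scan instead.
   fn scan_bytes(data: &[u8]) -> io::Result<usize> {
       // Require at least 8 bytes so the leading and trailing magic cannot overlap.
       if data.len() < 8 || !data.starts_with(MAGIC) || !data.ends_with(MAGIC) {
           return Err(io::Error::new(
               io::ErrorKind::InvalidData,
               "missing PAR1 magic bytes",
           ));
       }
       Ok(data.len())
   }
   
   /// Buffer the entire source into memory, then delegate to the in-memory path.
   fn scan_buffered(mut source: impl Read) -> io::Result<usize> {
       let mut buf = Vec::new();
       source.read_to_end(&mut buf)?;
       scan_bytes(&buf)
   }
   ```
   
   Buffering everything is the simplest migration path, but it gives up the selective fetching that the proposal enables, so I would see it as a convenience wrapper rather than the recommended way to read large files.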
   
   > This design will only support the arrow use-case, but I couldn't see an easy way to add this at a lower level without introducing API inconsistencies when not scanning the entire file
   
   What about offering a function on `Scan` that returns `SerializedRowGroupReader`s?
   
   https://docs.rs/parquet/12.0.0/parquet/file/serialized_reader/struct.SerializedRowGroupReader.html
   
   ```rust
     /// Perform the scan, returning an iterator of [`SerializedRowGroupReader`]s
     pub fn execute_serialized<R: ChunkReader>(self, reader: R) -> Result<impl Iterator<Item = SerializedRowGroupReader>> {}
   ```

