Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/22 12:50:00 UTC

[GitHub] [arrow-rs] tustvold opened a new issue, #1605: Push-Based Parquet Reader

tustvold opened a new issue, #1605:
URL: https://github.com/apache/arrow-rs/issues/1605

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   `SerializedFileReader` is currently created with a `ChunkReader` which looks like
   
   ```rust
   pub trait ChunkReader: Length + Send + Sync {
       type T: Read + Send;
       /// Get a serially readable slice of the current reader
       /// This should fail if the slice exceeds the current bounds
       fn get_read(&self, start: u64, length: usize) -> Result<Self::T>;
   }
   ```
   
   The process for reading a file is then:
   
   * `SerializedFileReader::new` will call `footer::parse_metadata`
     * `parse_metadata` will
       * Call `ChunkReader::get_read` with the byte range covering the final 64 KB of the file, and read this into a buffer
       * Determine the footer length
       * Potentially call `ChunkReader::get_read` again to read any remaining footer bytes, and read these into a buffer
   * `SerializedFileReader::get_row_iter` will return a `RowIter` which, for each row group, will
     * Call `SerializedRowGroupReader::new`, which will
       * Call `ChunkReader::get_read` with the byte range of each column chunk
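   
   For illustration, a minimal in-memory implementation of this trait might look like the sketch below (assuming the trait shape shown above; exact module paths and error variants may differ between versions):
   
   ```rust
   use std::io::Cursor;
   
   use parquet::errors::{ParquetError, Result};
   use parquet::file::reader::{ChunkReader, Length};
   
   /// A minimal in-memory byte source; `Bytes` or an mmap could equally back this
   struct InMemory(Vec<u8>);
   
   impl Length for InMemory {
       fn len(&self) -> u64 {
           self.0.len() as u64
       }
   }
   
   impl ChunkReader for InMemory {
       type T = Cursor<Vec<u8>>;
   
       fn get_read(&self, start: u64, length: usize) -> Result<Self::T> {
           let start = start as usize;
           let end = start + length;
           // The trait contract requires failing if the slice exceeds the bounds
           if end > self.0.len() {
               return Err(ParquetError::EOF(format!(
                   "requested bytes {}..{} but only {} available",
                   start,
                   end,
                   self.0.len()
               )));
           }
           Ok(Cursor::new(self.0[start..end].to_vec()))
       }
   }
   ```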
   
   There are two major options for applying this to files in object storage:
   
   1. Fetch the entire file to local disk or memory and pass it to `SerializedFileReader`
   2. Convert `ChunkReader::get_read` to a range request to object storage
   
   The first option is problematic as it cannot use pruning logic to reduce the amount of data fetched from object storage.
   
   The second option runs into two problems:
   
   1. The interface is not async, and blocking a thread on network IO is not ideal
   2. It results in lots of small range requests per file, adding cost and latency
   
   **Describe the solution you'd like**
   
   I would like to decouple the parquet reader entirely from IO concerns, allowing downstreams complete freedom to decide how they want to handle this. This will allow the reader to support a wide variety of potential data access patterns:
   
   * Sync/Async Disk IO
   * Sync/Async Network IO
   * In-memory/mmapped parquet files
   * Interleaved row group decode with fetching the next row group
   
   ## Footer Decode
   
   Introduce functions to assist with parsing the parquet metadata:
   
   ```rust
   /// Parses the 8-byte parquet footer and returns the length of the metadata section
   fn parse_footer(footer: [u8; 8]) -> Result<usize> {}
   
   /// Parse metadata payload
   fn parse_metadata(metadata: &[u8]) -> Result<ParquetMetaData> {}
   ```
   
   This will allow callers to obtain `ParquetMetaData` regardless of how they choose to fetch the corresponding bytes.
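   
   For example, a caller backed by object storage could obtain the metadata with two range requests. The sketch below assumes the proposed `parse_footer` / `parse_metadata` functions, and `fetch_range` is a hypothetical stand-in for whatever IO primitive the caller uses (assumed to return exactly the requested bytes):
   
   ```rust
   use std::ops::Range;
   
   use parquet::errors::Result;
   use parquet::file::metadata::ParquetMetaData;
   
   fn read_metadata(
       file_len: usize,
       fetch_range: impl Fn(Range<usize>) -> Vec<u8>,
   ) -> Result<ParquetMetaData> {
       // Fetch the trailing 8 bytes: 4-byte metadata length + `PAR1` magic
       let tail = fetch_range(file_len - 8..file_len);
       let mut footer = [0u8; 8];
       footer.copy_from_slice(&tail);
       let metadata_len = parse_footer(footer)?;
   
       // Fetch exactly the metadata bytes and decode them
       let metadata_start = file_len - 8 - metadata_len;
       parse_metadata(&fetch_range(metadata_start..file_len - 8))
   }
   ```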
   
   ## ScanBuilder / Scan
   
   Next introduce a `ScanBuilder` and accompanying `Scan`.
   
   ```rust
   /// Build a [`Scan`]
   ///
   /// Eventually this will support predicate pushdown (#1191)
   pub struct ScanBuilder {}
   
   impl ScanBuilder {
     pub fn new(metadata: Arc<ParquetMetaData>) -> Self {}
     
     pub fn with_projection(self, projection: Vec<usize>) -> Self {}
     
     pub fn with_row_groups(self, groups: Vec<usize>) -> Self {}
     
     pub fn with_range(self, range: Range<usize>) -> Self {}
     
     pub fn build(self) -> Scan {}
   }
   
   pub struct Scan {}
   
   impl Scan {
     /// Returns a list of byte ranges needed
     pub fn ranges(&self) -> &[Range<usize>] {}
     
     /// Perform the scan returning a [`ParquetRecordBatchReader`] 
     pub fn execute<R: ChunkReader>(self, reader: R) -> Result<ParquetRecordBatchReader> {}
   }
   ```
   
   Where `ParquetRecordBatchReader` is the same type returned by the current `ParquetFileArrowReader::get_record_reader`, and is just an `Iterator<Item=ArrowResult<RecordBatch>>` with a `Schema`.
   
   *This design will only support the arrow use-case, but I couldn't see an easy way to add this at a lower level without introducing strange inconsistencies when not scanning the entire file*
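   
   Putting the pieces together, a caller targeting object storage might drive the proposed API roughly as follows. This is a sketch only: `fetch_ranges` is a hypothetical IO helper, `InMemory` is any `ChunkReader` over the fetched bytes (such as the sketch earlier), and `ScanBuilder` / `Scan` are the proposed types above:
   
   ```rust
   use std::error::Error;
   use std::sync::Arc;
   
   use parquet::file::metadata::ParquetMetaData;
   
   fn scan_object_store(metadata: Arc<ParquetMetaData>) -> Result<(), Box<dyn Error>> {
       let scan = ScanBuilder::new(metadata)
           .with_projection(vec![0, 2]) // only the columns we need
           .with_row_groups(vec![1])    // only the row groups that survived pruning
           .build();
   
       // Caller-controlled IO: coalesce the ranges, fetch them concurrently,
       // serve them from a cache, etc. This hypothetical helper assembles a
       // buffer addressable at the original file offsets.
       let bytes = fetch_ranges(scan.ranges());
   
       // Decode entirely from memory; the reader performs no further IO
       for batch in scan.execute(InMemory(bytes))? {
           println!("read {} rows", batch?.num_rows());
       }
       Ok(())
   }
   ```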
   
   **Describe alternatives you've considered**
   
   #1154 added an async reader that uses the `AsyncRead` and `AsyncSeek` traits to read individual column chunks into memory from an async source. This is the approach taken by arrow2, with its [range_reader](https://docs.rs/range-reader/latest/range_reader/) abstraction. This was not found to perform particularly well (#1473).
   
   #1473 proposed an async reader with prefetch functionality, and was also suggested by @alamb in https://github.com/apache/arrow-datafusion/issues/2205#issuecomment-1097902378. This is similar to the new FSDataInputStream [vectored IO API](https://github.com/apache/arrow-datafusion/issues/2205#issuecomment-1100069800) in the Hadoop ecosystem.  This was implemented in #1509 and found to perform better, but still represented a non-trivial performance regression on local files. 
   
   **Additional Context**
   
   The motivating discussion for this issue can be found at https://github.com/apache/arrow-datafusion/issues/2205
   
   @mateuszkj clearly documented the limitations of the current API in https://github.com/datafusion-contrib/datafusion-objectstore-s3/pull/53



[GitHub] [arrow-rs] alamb commented on issue #1605: Push-Based Parquet Reader

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #1605:
URL: https://github.com/apache/arrow-rs/issues/1605#issuecomment-1107829356

   > Interleaved row group decode with fetching the next row group
   
   Is the idea that the calling code would create a `Scan` for each row group and then execute those `Scan`s in whatever order was desired?


[GitHub] [arrow-rs] alamb commented on issue #1605: Push-Based Parquet Reader

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #1605:
URL: https://github.com/apache/arrow-rs/issues/1605#issuecomment-1107846848

   > As for `SerializedRowGroupReader`, the challenge is that the `RowGroupReader` trait exposes APIs for column projection, etc.
   
   👍  makes sense
   
   > I am somewhat apprehensive about providing an async version, as the whole intent is to let users handle what level of buffering/pre-fetch makes sense for their use case, but I guess with sufficient disclaimers...
   
   Yeah, I am thinking about the "I want to try out reading parquet from s3 just to see if it works" case rather than "I am set on using the parquet library and am now trying to squeeze out the most performance" case
   
   > W.r.t. users constructing per-row-group scans: entirely up to their use-case; if they want to they can, if they don't want to, they don't need to. It was more me trying to demonstrate the flexibility afforded than prescribing anything
   
   Right -- I agree -- I was just confirming my understanding of what would be possible with the API
   


[GitHub] [arrow-rs] tustvold commented on issue #1605: Push-Based Parquet Reader

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #1605:
URL: https://github.com/apache/arrow-rs/issues/1605#issuecomment-1269576252

   The desired IO decoupling has been achieved through `AsyncFileReader` and its integration with `ParquetRecordBatchStream`.



[GitHub] [arrow-rs] tustvold closed issue #1605: Push-Based Parquet Reader

Posted by GitBox <gi...@apache.org>.
tustvold closed issue #1605: Push-Based Parquet Reader
URL: https://github.com/apache/arrow-rs/issues/1605



[GitHub] [arrow-rs] alamb commented on issue #1605: Push-Based Parquet Reader

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #1605:
URL: https://github.com/apache/arrow-rs/issues/1605#issuecomment-1107829153

   All in all I like this proposal. Thank you for writing it down. I know it is not based on @jorgecarleitao's https://github.com/jorgecarleitao/parquet2, but the API is not dissimilar in that the metadata reading / management is handled outside the main parquet decoding logic -- see the [example](https://github.com/jorgecarleitao/parquet2/blob/main/examples/read_metadata.rs#L46-L64). I see this similarity as a good sign. 👍
   
   I think it is important to sketch out what the interface for existing users of the `ParquetRecordBatchReader` would be. Not just for helping migration, but to ensure that all use cases are satisfied (I am happy to help with this).
   
   Maybe we can provide functions like the following for basic use, both to ease migration and to demonstrate how to use this API:
   ```rust
   fn scan_file(file: impl ChunkReader) -> Result<ParquetRecordBatchReader> {
   
   }
   ```
   
   ```rust
   async fn async_scan_file(file: impl AsyncRead) -> Result<ParquetRecordBatchReader> {
   // buffers / fetches whatever is needed locally to memory
   
   }
   ```
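   
   For instance, `scan_file` could be built on the proposed pieces roughly as below (a sketch assuming the hypothetical `parse_footer` / `parse_metadata` / `ScanBuilder` from the proposal above):
   
   ```rust
   use std::io::Read;
   use std::sync::Arc;
   
   fn scan_file<R: ChunkReader>(file: R) -> Result<ParquetRecordBatchReader> {
       let len = file.len();
   
       // Read the trailing 8 bytes and determine the metadata length
       let mut footer = [0u8; 8];
       file.get_read(len - 8, 8)?.read_exact(&mut footer)?;
       let metadata_len = parse_footer(footer)?;
   
       // Read and decode the metadata itself
       let mut buf = vec![0u8; metadata_len];
       file.get_read(len - 8 - metadata_len as u64, metadata_len)?
           .read_exact(&mut buf)?;
       let metadata = Arc::new(parse_metadata(&buf)?);
   
       // No projection or row group selection: scan the entire file
       ScanBuilder::new(metadata).build().execute(file)
   }
   ```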
   
   > This design will only support the arrow use-case, but I couldn't see an easy way to add this at a lower level without introducing strange inconsistencies when not scanning the entire file
   
   What about offering a function on `Scan` that returns `SerializedRowGroupReader`s?
   
   https://docs.rs/parquet/12.0.0/parquet/file/serialized_reader/struct.SerializedRowGroupReader.html
   
   ```rust
     /// Perform the scan, returning an iterator of [`SerializedRowGroupReader`]
     pub fn execute_serialized<R: ChunkReader>(self, reader: R) -> Result<impl Iterator<Item = SerializedRowGroupReader>> {}
   ```



[GitHub] [arrow-rs] tustvold commented on issue #1605: Push-Based Parquet Reader

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #1605:
URL: https://github.com/apache/arrow-rs/issues/1605#issuecomment-1107838816

   Adding some free functions to assist migration makes sense to me. It should be pretty much a drop-in replacement.
   
   I am somewhat apprehensive about providing an async version, as the whole intent is to let users handle what level of buffering/pre-fetch makes sense for their use case, but I guess with sufficient disclaimers...
   
   As for `SerializedRowGroupReader`, the challenge is that the `RowGroupReader` trait exposes APIs for column projection, etc., which then gets rather confusing if you have similar concepts on the `Scan`. I'd rather have a clean break than try to shoehorn in an existing API.
   
   W.r.t. users constructing per-row-group scans: entirely up to their use-case; if they want to they can, if they don't want to, they don't need to

