Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/10/07 09:00:34 UTC

[GitHub] [arrow] rdettai commented on pull request #8300: ARROW-10135: [Rust] [Parquet] Refactor file module to help adding sources

rdettai commented on pull request #8300:
URL: https://github.com/apache/arrow/pull/8300#issuecomment-704796994


   The need for an intermediate layer when reading a parquet file is discussed with @alamb on [JIRA](https://issues.apache.org/jira/browse/ARROW-10135).
   
   The highlights of the current implementation:
   - The public API has changed, but it keeps working for `File` and `Path` thanks to the corresponding trait implementations. `Cursor` can no longer be used because passing it around with `clone()` copies the data (this was already the case in the previous implementation of `TryClone` for `Cursor<Vec<u8>>`).
   - I have added a custom cursor type (`SliceableCursor`) that can produce cursor slices without cloning the underlying data. This can be used to read in-memory files. I guess it could be made more generic, but this would be for convenience only and I find it simple and clear as is.
   - I have separated the implementations (`SerializedFileReader`, `SerializedRowGroupReader`...) from the traits (`FileReader`, `RowGroupReader`...) for more clarity. I know that this is not how the rest of the project's code base is structured, but I tend to get lost in these huge files with countless struct/trait/impl blocks. I'm very much open to suggestions on this point!
   - There is nothing about async/parallelism yet; I still have to think about it a little.
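
To illustrate the idea behind `SliceableCursor`, here is a minimal sketch of a cursor whose slices share one reference-counted buffer instead of copying it. This is not the code from the PR: the field names, the use of `Rc<Vec<u8>>`, and the `slice(offset, len)` signature are assumptions made for illustration only.

```rust
use std::io::{self, Read};
use std::rc::Rc;

/// Hypothetical sketch: a read-only cursor over a shared byte buffer.
/// Slicing creates a new view on the same allocation; no bytes are copied.
pub struct SliceableCursor {
    data: Rc<Vec<u8>>, // shared buffer (assumed representation)
    start: usize,      // first byte of this view inside `data`
    len: usize,        // number of bytes visible through this view
    pos: usize,        // current read position, relative to `start`
}

impl SliceableCursor {
    /// Takes ownership of the buffer once; later slices only clone the `Rc`.
    pub fn new(data: Vec<u8>) -> Self {
        let len = data.len();
        SliceableCursor { data: Rc::new(data), start: 0, len, pos: 0 }
    }

    /// Returns a cursor over `len` bytes starting at `offset` within this view.
    pub fn slice(&self, offset: usize, len: usize) -> io::Result<Self> {
        match offset.checked_add(len) {
            Some(end) if end <= self.len => Ok(SliceableCursor {
                data: Rc::clone(&self.data), // bumps the refcount only
                start: self.start + offset,
                len,
                pos: 0,
            }),
            _ => Err(io::Error::new(
                io::ErrorKind::InvalidInput,
                "slice out of bounds",
            )),
        }
    }
}

impl Read for SliceableCursor {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        // Bytes of this view that have not been read yet.
        let remaining = &self.data[self.start + self.pos..self.start + self.len];
        let n = remaining.len().min(buf.len());
        buf[..n].copy_from_slice(&remaining[..n]);
        self.pos += n;
        Ok(n)
    }
}
```

With this sketch, `SliceableCursor::new(bytes).slice(2, 4)` gives an independent reader over bytes 2 to 5 of the buffer, and creating it costs a reference-count increment rather than a copy, which is the property that `Cursor<Vec<u8>>` with `clone()` lacks.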
   
   @alamb: can you take a look at the new `ChunckReader` trait and how it is integrated into the rest of the reader? What do you think about it?
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org