You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/12 15:33:51 UTC

[GitHub] [arrow-rs] tustvold opened a new issue #1163: Parquet Read/Write Traits

tustvold opened a new issue #1163:
URL: https://github.com/apache/arrow-rs/issues/1163


   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   Currently and `ParquetReader` `ParquetWriter` traits pass around immutable references. They then make use of a combination of `TryClone` and `RefCell` in order to placate the borrow checker.
   
   Aside from being somewhat surprising, it prevents creating a `ParquetWriter` from an `&mut File`.
   
   **Describe the solution you'd like**
   
   I would like to be able to create `ParquetReader` with anything implementing `std::io::Seek` and `std::io::Read`. Similarly I would like to be able to create `ParquetWriter` with anything implementing `std::io::Seek` and `std::io::Write`. These are standard traits within the Rust ecosystem and supporting them will simplify interoperation with it.
   
   The blanket implementations on these traits will for free allow using `ParquetReader` and `ParquetWriter` with mutable references, which as an added bonus avoids the somewhat confusing property where cloned file descriptors share the same seek position.
   
   **Describe alternatives you've considered**
   
   Preserve the current behaviour
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on issue #1163: Use Standard Library IO Abstractions in Parquet

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #1163:
URL: https://github.com/apache/arrow-rs/issues/1163#issuecomment-1086129297


   Hi @rdettai, I realize I never responded directly to your comment. I have been documenting my experiments on the parquet API in #1473, which I think overlaps a fair deal with the points you raise about the object store interface and introducing more parallelism into the parquet reader. I would be very interested in hearing any thoughts you might have


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on issue #1163: Use Standard Library IO Abstractions in Parquet

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #1163:
URL: https://github.com/apache/arrow-rs/issues/1163#issuecomment-1011246439


   Aah yes, the `FileReader` API expects to be able to give out `RowGroupReader` which in turn give out `PageReader`. These are all owned constructs and so expect to be able to pass around `Arc<FileReader>`.
   
   However, looking at the code I'm not sure it can actually be used concurrently. It doesn't appear to have any synchronisation, and yet is using the same underlying file descriptor. I think it might be possible to make the code race if used in such a way... Some testing is needed :thinking: 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] rdettai commented on issue #1163: Use Standard Library IO Abstractions in Parquet

Posted by GitBox <gi...@apache.org>.
rdettai commented on issue #1163:
URL: https://github.com/apache/arrow-rs/issues/1163#issuecomment-1066179351


   Hi @tustvold ! I am the weird mind who introduced the ChunkReader 😄. 
   
   The main benefit compared to a regular reader is that you can specify the size of the chunk you plan to read, which enables you to set the right range when you call GET on the object store. Not sure how we could achieve this with plain `std::io::Read`. But I ended up offloading the download scheduling to a separate module anyway, and I think that this is what you will want to do in most cases to optimize your link with the object store (this is also what is done in IOx, isn't it?).
   
   Another initial goal was indeed to try to achieve parallelism between columns, but I never succeeded because the entire structure of the parquet reader was against it, and I didn't have enough Rust experience to fight it 😉.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on issue #1163: Use Standard Library IO Abstractions in Parquet

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #1163:
URL: https://github.com/apache/arrow-rs/issues/1163#issuecomment-1011227138


   FWIW I think it would be great to avoid all the custom abstractions 👍 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on issue #1163: Use Standard Library IO Abstractions in Parquet

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #1163:
URL: https://github.com/apache/arrow-rs/issues/1163#issuecomment-1012418298


   Also related https://github.com/apache/arrow-rs/issues/937


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on issue #1163: Use Standard Library IO Abstractions in Parquet

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #1163:
URL: https://github.com/apache/arrow-rs/issues/1163#issuecomment-1011957344


   I had a brief play around with this and found the following.
   
   The write side is serial, and so it should be possible to use standard library abstractions. The current trait topology will likely require using `Rc<RefCell<W>>` or similar. I tried using mutable borrows, but this runs into issues as the types need to be boxed in order to be used in traits (due to lack of GATs) but by value trait methods (i.e. `fn close(self)`) aren't object safe, which makes the ergonomics of such an API suck as you need to manually `std::mem::drop`.
   
   The read side is more complicated, the problem can be seen in `SerializedRowGroupReader::get_column_page_reader`. This wants to return a `Box<dyn PageReader>` which can be used asynchronously (although not concurrently) with respect to others from the same row group. This is what requires `FileSource`, we want buffered reads on a shared file descriptor.
   
   It occurs to me that one of the things an async API able to support object stores will need is a sparse file abstraction, ultimately this is what the read path wants. I'll therefore park this for now, and see what shakes out of that.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] tustvold edited a comment on issue #1163: Use Standard Library IO Abstractions in Parquet

Posted by GitBox <gi...@apache.org>.
tustvold edited a comment on issue #1163:
URL: https://github.com/apache/arrow-rs/issues/1163#issuecomment-1011246439


   Aah yes, the `FileReader` API expects to be able to give out `RowGroupReader` which in turn give out `PageReader`. These are all owned constructs and so expect to be able to pass around `Arc<FileReader>`.
   
   However, looking at the code I'm not sure it can actually be used concurrently. It doesn't appear to have any synchronisation, and yet is using the same underlying file descriptor. I think it might be possible to make the code race if used in such a way... Some testing is needed :thinking: 
   
   Edit: In fact it is impossible to use the reader in this way as the `FileReader` and friends return boxed trait objects, without Send bounds. This is how `FileSource` can get away with using a `RefCell`, the whole chain isn't `Send` and therefore cannot be sent to another thread.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on issue #1163: Use Standard Library IO Abstractions in Parquet

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #1163:
URL: https://github.com/apache/arrow-rs/issues/1163#issuecomment-1011226718


   I think the difference I have observed in the past was that the parquet abstractions also allow `Clone` -- as I recall it was to allow clients to read from different columns concurrently (each column would get its own Clone of the I/O object -- aka it would get its own file descriptor) so they could be read / advanced concurrently
   
   There may well be a better way to handle this


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org