Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/12 06:05:30 UTC

[GitHub] [arrow-rs] pacman82 opened a new issue, #1691: Make current position available in `FileWriter`.

pacman82 opened a new issue, #1691:
URL: https://github.com/apache/arrow-rs/issues/1691

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   I would like a way to track the file size of a parquet file I am writing, so I can split my dataset into chunks of roughly the same size. For more context, please see this issue in the downstream `odbc2parquet` crate: <https://github.com/pacman82/odbc2parquet/issues/190>
   
   **Describe the solution you'd like**
   Make the current stream position (i.e. the number of bytes written so far into the inner `io::Write`) available in the implementation of `SerializedFileWriter`, or even through the `FileWriter` trait.
   
   **Describe alternatives you've considered**
   As a workaround I could create a wrapper of `File` which shares an `Rc<usize>` counter with the application logic.
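
A minimal sketch of that workaround, using only the standard library. The name `CountingWriter` is hypothetical and not part of the parquet crate. The sketch shares an `Rc<Cell<usize>>` rather than a bare `Rc<usize>`, since the counter has to be updated through a shared handle. Whether the parquet writer accepts such a wrapper depends on the trait bounds it places on its sink in a given version, which is not verified here.

```rust
use std::cell::Cell;
use std::io::{self, Write};
use std::rc::Rc;

/// Hypothetical wrapper (not part of the parquet crate): forwards writes to
/// `inner` and adds the number of bytes accepted to a shared counter.
pub struct CountingWriter<W: Write> {
    inner: W,
    bytes_written: Rc<Cell<usize>>,
}

impl<W: Write> CountingWriter<W> {
    /// Returns the wrapper plus a handle the application keeps to read the count.
    pub fn new(inner: W) -> (Self, Rc<Cell<usize>>) {
        let counter = Rc::new(Cell::new(0));
        let writer = CountingWriter {
            inner,
            bytes_written: Rc::clone(&counter),
        };
        (writer, counter)
    }
}

impl<W: Write> Write for CountingWriter<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        // Count only the bytes the inner writer actually accepted.
        let n = self.inner.write(buf)?;
        self.bytes_written.set(self.bytes_written.get() + n);
        Ok(n)
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}
```

The application would keep the returned counter handle, hand the wrapper to whatever consumes an `io::Write`, and call `counter.get()` between writes to decide when to start a new file.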
   
   **Additional context**
   -
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on issue #1691: Make current position available in `FileWriter`.

Posted by GitBox <gi...@apache.org>.

tustvold commented on issue #1691:
URL: https://github.com/apache/arrow-rs/issues/1691#issuecomment-1145176699

   Sorry for confusing things, compressed size is the correct thing to use. I _think_ the crate might be writing the wrong thing for total_byte_size but that's a separate issue I'll file if/when I confirm it.



[GitHub] [arrow-rs] pacman82 commented on issue #1691: Make current position available in `FileWriter`.

Posted by GitBox <gi...@apache.org>.

pacman82 commented on issue #1691:
URL: https://github.com/apache/arrow-rs/issues/1691#issuecomment-1147263981

   Both my test cases and my users are happy. See: <https://github.com/pacman82/odbc2parquet/issues/190>. So `compressed_size` has indeed been what I was looking for. Closing this issue would be fine with me.



[GitHub] [arrow-rs] pacman82 commented on issue #1691: Make current position available in `FileWriter`.

Posted by GitBox <gi...@apache.org>.

pacman82 commented on issue #1691:
URL: https://github.com/apache/arrow-rs/issues/1691#issuecomment-1145173167

   Hello @tustvold, thanks for your help here. Now I am a little bit confused. In order to implement <https://github.com/pacman82/odbc2parquet/issues/190> (tl;dr: I want to stop writing row groups as soon as the file size surpasses a user-defined threshold, and start writing the next row group into a new file), should I add up the `compressed_size` of the row groups or use the `total_byte_size` of the flushed groups? What would be the difference between the two? It shouldn't be hard to change on my end. Unless you advise otherwise, I'll run with `compressed_size` and see if my users are happy with it.
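
The splitting rule described above can be sketched as follows. This is an illustration only: the function name `assign_files` is hypothetical, and it assumes the per-row-group sizes have already been collected (for example from the `compressed_size` of each flushed row group) into a slice of `i64`. It also ignores the footer and other file-level overhead, so the resulting files would only be roughly the target size.

```rust
/// Assigns each row group to an output file index. A new file is started with
/// the first row group that follows the running total surpassing `threshold`.
pub fn assign_files(compressed_sizes: &[i64], threshold: i64) -> Vec<usize> {
    let mut file_index = 0;
    let mut bytes_in_file = 0i64;
    let mut assignment = Vec::with_capacity(compressed_sizes.len());
    for &size in compressed_sizes {
        // The previous row group pushed the current file over the threshold,
        // so this row group goes into a fresh file.
        if bytes_in_file > threshold {
            file_index += 1;
            bytes_in_file = 0;
        }
        assignment.push(file_index);
        bytes_in_file += size;
    }
    assignment
}
```

Because the check happens after a row group has been flushed, each file can exceed the threshold by up to one row group; checking before the write instead would require an estimate of the next row group's size.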



[GitHub] [arrow-rs] tustvold commented on issue #1691: Make current position available in `FileWriter`.

Posted by GitBox <gi...@apache.org>.

tustvold commented on issue #1691:
URL: https://github.com/apache/arrow-rs/issues/1691#issuecomment-1144507811

   Thank you for your feedback, and glad to hear the API is moving in a direction that you like :smile: 
   
   I think `RowGroupMetadata::total_byte_size` is probably what you're after, as this will tell you the size of the written row groups. I'm not sure there is an API that would give you access to this whilst writing a file, though it should be fairly straightforward to add one. I will see what I can come up with.



[GitHub] [arrow-rs] tustvold closed issue #1691: Make current position available in `FileWriter`.

Posted by GitBox <gi...@apache.org>.

tustvold closed issue #1691: Make current position available in `FileWriter`.
URL: https://github.com/apache/arrow-rs/issues/1691



[GitHub] [arrow-rs] pacman82 commented on issue #1691: Make current position available in `FileWriter`.

Posted by GitBox <gi...@apache.org>.

pacman82 commented on issue #1691:
URL: https://github.com/apache/arrow-rs/issues/1691#issuecomment-1141395705

   Hello, since I opened this feature request, `parquet 15.0.0` has been released. Thanks to everybody involved in that effort! While breaking changes are of course a pain, I loved each and every one of them. The API got a lot cleaner, and as a consequence so did my use of it. Keep up the good work!
   
   While adapting to the breaking changes, I got the feeling that `RowGroupMetadata::compressed_size` may already do what I want, so feel free to close this issue. If I am not able to use it downstream to the satisfaction of my users, I'll just open a new one.
   
   Cheers, Markus

