You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/22 19:16:31 UTC

[GitHub] [arrow-rs] tustvold opened a new issue #1474: Replace Custom Buffer Implementation with Bytes in Parquet

tustvold opened a new issue #1474:
URL: https://github.com/apache/arrow-rs/issues/1474


   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   The parquet crate currently rolls its own `ByteBufferPtr`, `BufferPtr`, `Buffer`, etc... and this feels like functionality that could be provided by an upstream crate.
   
   **Describe the solution you'd like**
   
   The [bytes](https://docs.rs/bytes/latest/bytes/) crate has wide ecosystem adoption and supports most of the necessary functionality. It does not support memory tracking, but I'm not sure this is being used given the `util::memory` module was made experimental and therefore not public in https://github.com/apache/arrow-rs/pull/1134 and there hasn't been any wailing nor gnashing of teeth resulting from this.
   
   **Describe alternatives you've considered**
   
   Continue to use custom buffer abstractions.
   
   Tagging @sunchao as I believe he added this code all the way back in 2018 :sweat_smile: 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on issue #1474: Replace Custom Buffer Implementation with Bytes in Parquet

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #1474:
URL: https://github.com/apache/arrow-rs/issues/1474#issuecomment-1076738139


   I think using `Bytes` would have many advantages (such as potentially avoiding copies from data that are read from remote object stores. etc)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] sunchao commented on issue #1474: Replace Custom Buffer Implementation with Bytes in Parquet

Posted by GitBox <gi...@apache.org>.
sunchao commented on issue #1474:
URL: https://github.com/apache/arrow-rs/issues/1474#issuecomment-1076500538


   Thanks. Yes these are pretty out-dated and I think it will be good to replace them. Have you thought about just use `MutableBuffer` from Arrow side too? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] nevi-me edited a comment on issue #1474: Replace Custom Buffer Implementation with Bytes in Parquet

Posted by GitBox <gi...@apache.org>.
nevi-me edited a comment on issue #1474:
URL: https://github.com/apache/arrow-rs/issues/1474#issuecomment-1076549007


   If possible, we could use Arrow's buffer based on the `arrow` feature, then use some abstraction (I'd be fine with `bytes`) for the other cases. The perf cliff is whenever we create multiple small `ByteBuffer` instances (e.g. representing vec!["hello", "there"]` as 2 instances instead of a single `ByteBuffer` with offsets into the 2 values. I think having a single buffer per page/row group would be helpful.
   
   The upside of using Arrow's buffer is minimising/eliminating data copies. I was able to improve the Arrow side here (https://github.com/apache/arrow-rs/pull/820), and see @alamb's comment (https://github.com/apache/arrow-rs/pull/820#discussion_r724365907).
   
   > I have a long-term hope to eventually phase out MutableBuffer and replace it with a typed construction that is easier to use without unsafe. Something with a similar interface to the ScalarBuffer I added to parquet might be a candidate
   
   This would be great, as it seems that a lot of the safety (and some perf) issues lie there.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] nevi-me commented on issue #1474: Replace Custom Buffer Implementation with Bytes in Parquet

Posted by GitBox <gi...@apache.org>.
nevi-me commented on issue #1474:
URL: https://github.com/apache/arrow-rs/issues/1474#issuecomment-1076549007


   If possible, we could use Arrow's buffer based on the `arrow` feature, then use some abstraction (I'd be fine with `bytes`) for the other cases. The perf cliff is whenever we create multiple small `ByteBuffer` instances (e.g. representing vec!["hello", "there"]` as 2 instances instead of a single `ByteBuffer` with offsets into the 2 values. I think having a single buffer per page/row group would be helpful.
   
   The upside of using Arrow's buffer is minimising/eliminating data copies. I was able to improve the Arrow side here (https://github.com/apache/arrow-rs/pull/820), and see @alamb's comment (https://github.com/apache/arrow-rs/pull/820#discussion_r724365907).
   
   > I have a long-term hope to eventually phase out MutableBuffer and replace it with a typed construction that is easier to use without unsafe. Something with a similar interface to the ScalarBuffer I added to parquet might be a candidate
   
   This would be great, as it seems that a lot of the safety (and some perf) issues lie there.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on issue #1474: Replace Custom Buffer Implementation with Bytes in Parquet

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #1474:
URL: https://github.com/apache/arrow-rs/issues/1474#issuecomment-1076513222


   > Have you thought about just use MutableBuffer from Arrow side too?
   
   I hadn't, but I'm not sure it is a good fit
   
   * I would like to use the ecosystem standard abstraction if possible
   * `arrow` is currently an optional dependency of `parquet`
   * `MutableBuffer` is really designed for arrays of primitives, not raw byte arrays
   * I have a long-term hope to eventually phase out `MutableBuffer` and replace it with a typed construction that is easier to use without unsafe. Something with a similar interface to the ScalarBuffer I added to parquet might be a candidate
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org