You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/12 18:06:17 UTC

[GitHub] [arrow-rs] tustvold opened a new issue, #2054: Inconsistent Builder Constructors

tustvold opened a new issue, #2054:
URL: https://github.com/apache/arrow-rs/issues/2054

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   The various builder constructors take capacities to pre-allocate. However, they aren't consistent about whether they take a capacity in terms of elements or bytes, or what these are capacities for.
   
   * PrimitiveBuilder::new - capacity in elements of T
   * BooleanBuilder::new - capacity in elements of bool
   * DecimalBuilder::new - capacity in terms of bytes
   * FixedSizeBinaryBuilder::new - capacity in terms of bytes
   * FixedSizeListBuilder::with_capacity - capacity in terms of list elements
   * GenericStringBuilder::new - capacity in terms of bytes of string data
   * GenericStringBuilder::with_capacity - capacity in terms of number of items and bytes of string data
   * MapBuilder::with_capacity - capacity in terms of number of items
   * UnionBuilder::with_capacity - capacity in terms of number of items
   
   **Describe the solution you'd like**
   
   I would like to propose the following:
   
   * Remove `capacity` from the `new` constructors, instead using a static default capacity (e.g. 1024)
   * Make `with_capacity` take capacities in terms of elements
   
   This has a couple of advantages:
   
   * Avoids subtly changing the meaning of parameters passed to pre-existing constructors
   * Brings the Builders closer into alignment with `Vec` (#1850)
   
   The only major disadvantage being that it results in API churn.
   
   **Describe alternatives you've considered**
   
   We could not do this, but the current situation leads to hard to spot performance bugs.
   
   **Additional context**
   
   Noticed whilst reviewing #2038 
   
   Thoughts @alamb @viirya @jhorstmann 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on issue #2054: Inconsistent Builder Constructors

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #2054:
URL: https://github.com/apache/arrow-rs/issues/2054#issuecomment-1226150674

   I think we can leave the BufferBuilder constructors for now, I think all that remains now are the builders for BinaryArray and StringArray


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on issue #2054: Inconsistent Builder Constructors

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #2054:
URL: https://github.com/apache/arrow-rs/issues/2054#issuecomment-1183089415

   I agree the proposal sounds good as well


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on issue #2054: Inconsistent Builder Constructors

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #2054:
URL: https://github.com/apache/arrow-rs/issues/2054#issuecomment-1217678867

   Hi, this is definitely still an issue. I would recommend working on each builder separately in turn, perhaps starting with the more esoteric builders such as unions, lists, and then working through to strings and eventually primitives. The latter builders are likely to cause significantly more churn and so leaving them to last will help I think?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold closed issue #2054: Inconsistent Builder Constructors

Posted by GitBox <gi...@apache.org>.
tustvold closed issue #2054: Inconsistent Builder Constructors
URL: https://github.com/apache/arrow-rs/issues/2054


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] HaoYang670 commented on issue #2054: Inconsistent Builder Constructors

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on issue #2054:
URL: https://github.com/apache/arrow-rs/issues/2054#issuecomment-1190019478

   While working on #2104, I find it might be good to let users decide both the `value_capacity` and the number of elements:
   ```rust
   pub struct GenericBinaryBuilder<OffsetSize: OffsetSizeTrait> {
       value_builder: UInt8BufferBuilder,
       offsets_builder: BufferBuilder<OffsetSize>,
       bitmap_builder: BooleanBufferBuilder,
   }
   
   impl<OffsetSize: OffsetSizeTrait> GenericBinaryBuilder<OffsetSize> {
       /// Creates a new `GenericBinaryBuilder` with at least 
       /// `num_elements` binary slots in the array and
       /// `value_capacity` bytes in the values buffer.
       pub fn new(num_elements: Option<usize>, value_capacity: Option<usize>) -> Self {
           Self { 
               value_builder: UInt8BufferBuilder::new(value_capacity.unwrap_or(1024)), 
               offsets_builder: BufferBuilder::<OffsetSize>::new(num_elements.unwrap_or(1023) + 1), 
               bitmap_builder: BooleanBufferBuilder::new(num_elements.unwrap_or(1023) + 1) 
           }
       }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on issue #2054: Inconsistent Builder Constructors

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #2054:
URL: https://github.com/apache/arrow-rs/issues/2054#issuecomment-1190802252

   Thanks, I've updated the ticket to be hopefully slightly clearer. We should not remove the ability to pass a value_capacity to with_capacity, as we currently support for StringArray :+1:


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] psvri commented on issue #2054: Inconsistent Builder Constructors

Posted by GitBox <gi...@apache.org>.
psvri commented on issue #2054:
URL: https://github.com/apache/arrow-rs/issues/2054#issuecomment-1220433031

   So based on this issue the following PR's are created. Some I have kept in draft, but the rest are ready for review. The buffer builder refactoring I will take it last once the rest are approved.
   
   - [ ] #2488
   - [ ] #2515
   - [ ] #2516
   - [ ] #2517 , will remove from draft once the above three are merged
   - [ ] #2518 , will remove from draft once #2517 is merged
   - [ ]  Todo Buffer builder
   
   On a side note , I have replace all `new` calls with `with_capacity` in the API's so that the logic would remain the same.
   
   Only in test module I have used both API's to test them.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] psvri commented on issue #2054: Inconsistent Builder Constructors

Posted by GitBox <gi...@apache.org>.
psvri commented on issue #2054:
URL: https://github.com/apache/arrow-rs/issues/2054#issuecomment-1227444180

   Okay


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] psvri commented on issue #2054: Inconsistent Builder Constructors

Posted by GitBox <gi...@apache.org>.
psvri commented on issue #2054:
URL: https://github.com/apache/arrow-rs/issues/2054#issuecomment-1224441740

   Hello @tustvold ,
   
   Just confirming since it was not mentioned in the ticket description. should I also refactor buffer builder constructors. I think thats the only one left.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on issue #2054: Inconsistent Builder Constructors

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #2054:
URL: https://github.com/apache/arrow-rs/issues/2054#issuecomment-1227411819

   I think it is just a case of removing the capacity parameter from `new` and leaving `with_capacity` alone


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] jhorstmann commented on issue #2054: Inconsistent Builder Constructors

Posted by GitBox <gi...@apache.org>.
jhorstmann commented on issue #2054:
URL: https://github.com/apache/arrow-rs/issues/2054#issuecomment-1182527781

   Sounds good, especially forcing users to specify both item and nested capacities for List/String/Binary arrays. My gut feeling also says that 1024 could be a good default, I'm assuming that arrays will usually be much larger than general purpose collection types.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] psvri commented on issue #2054: Inconsistent Builder Constructors

Posted by GitBox <gi...@apache.org>.
psvri commented on issue #2054:
URL: https://github.com/apache/arrow-rs/issues/2054#issuecomment-1227236297

   Those 2 builders already have new and with capacity, But since the sizes of strings are not constant the API was taking the number of bytes to pre allocate the buffer as a parameter. Whats the best way to handle this. 
   
   Shall we remove the buffer_capacity parameters from the constructors and allocate a fixed buffer of size 1024 ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] psvri commented on issue #2054: Inconsistent Builder Constructors

Posted by GitBox <gi...@apache.org>.
psvri commented on issue #2054:
URL: https://github.com/apache/arrow-rs/issues/2054#issuecomment-1217023721

   Hello,
   
   Is this still a valid issue ? . If so I would like to pick this up.
   
   Given that this can lead to big code changes, may I request the team to split this issue into multiple smaller issues or tasks so that its easier to implement and review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org