You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/21 02:48:47 UTC

[GitHub] [arrow-rs] HaoYang670 opened a new pull request, #2117: Update `GenericBinaryBuilder` to use buffer builders directly.

HaoYang670 opened a new pull request, #2117:
URL: https://github.com/apache/arrow-rs/pull/2117

   # Which issue does this PR close?
   
   <!---
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #2104.
   
   # Rationale for this change
    Don't build binary array from list builder to avoid bugs and extra memory occupation.
   
   # What changes are included in this PR?
   1. Update the struct of `GenericBinaryBuilder`
   2. Remove the method `append_byte` and `append`. (`append_value` and `append_null` can achieve all functionalities)
   3. More tests to cover the all-null case and builder resetting.
   4. Fix some docs
   
   # Are there any user-facing changes?
   Yes, remove 2 pub methods `append_byte` and `append`.
   
   <!---
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!---
   If there are any breaking changes to public APIs, please add the `breaking change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] codecov-commenter commented on pull request #2117: Update `GenericBinaryBuilder` to use buffer builders directly.

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on PR #2117:
URL: https://github.com/apache/arrow-rs/pull/2117#issuecomment-1190987760

   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/2117?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#2117](https://codecov.io/gh/apache/arrow-rs/pull/2117?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (c3a9be3) into [master](https://codecov.io/gh/apache/arrow-rs/commit/3096591520d303eb34a432c82733e86f34999232?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (3096591) will **increase** coverage by `0.01%`.
   > The diff coverage is `100.00%`.
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #2117      +/-   ##
   ==========================================
   + Coverage   83.76%   83.78%   +0.01%     
   ==========================================
     Files         225      225              
     Lines       59457    59485      +28     
   ==========================================
   + Hits        49806    49839      +33     
   + Misses       9651     9646       -5     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/2117?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [arrow/src/array/builder/generic\_binary\_builder.rs](https://codecov.io/gh/apache/arrow-rs/pull/2117/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2J1aWxkZXIvZ2VuZXJpY19iaW5hcnlfYnVpbGRlci5ycw==) | `92.00% <100.00%> (+8.66%)` | :arrow_up: |
   | [parquet\_derive/src/parquet\_field.rs](https://codecov.io/gh/apache/arrow-rs/pull/2117/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldF9kZXJpdmUvc3JjL3BhcnF1ZXRfZmllbGQucnM=) | `65.98% <0.00%> (-0.23%)` | :arrow_down: |
   | [arrow/src/datatypes/datatype.rs](https://codecov.io/gh/apache/arrow-rs/pull/2117/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9kYXRhdHlwZS5ycw==) | `64.41% <0.00%> (+0.35%)` | :arrow_up: |
   | [...row/src/array/builder/string\_dictionary\_builder.rs](https://codecov.io/gh/apache/arrow-rs/pull/2117/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2J1aWxkZXIvc3RyaW5nX2RpY3Rpb25hcnlfYnVpbGRlci5ycw==) | `91.36% <0.00%> (+0.71%)` | :arrow_up: |
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold merged pull request #2117: Update `GenericBinaryBuilder` to use buffer builders directly.

Posted by GitBox <gi...@apache.org>.
tustvold merged PR #2117:
URL: https://github.com/apache/arrow-rs/pull/2117


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] ursabot commented on pull request #2117: Update `GenericBinaryBuilder` to use buffer builders directly.

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #2117:
URL: https://github.com/apache/arrow-rs/pull/2117#issuecomment-1191897311

   Benchmark runs are scheduled for baseline = fd38664e81f0cced38190558a37ca7ecb0155dc7 and contender = 1805b30e620b05bd050a9f2dfcf5ef41e05977ea. 1805b30e620b05bd050a9f2dfcf5ef41e05977ea is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/ff66545fa3fd49d2800622150bd4df0c...9bce63e69caa443498bb1899bb6b14d7/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/01764d53359a4abcbd77f9e321706a33...5a0ff4d9b46d4ba89256c68de6546021/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/b931ad0a9a2f45ba8a4541a061a0ed5b...3ffaa74ecb15434bba315ad8f6f6de56/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/557d80073d324d78813e1d43d82d857b...40bca7f448734d5786aad4140b1a74ce/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] HaoYang670 commented on a diff in pull request #2117: Update `GenericBinaryBuilder` to use buffer builders directly.

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on code in PR #2117:
URL: https://github.com/apache/arrow-rs/pull/2117#discussion_r927170082


##########
arrow/src/array/builder/generic_binary_builder.rs:
##########
@@ -15,63 +15,72 @@
 // specific language governing permissions and limitations
 // under the License.
 
-use crate::array::{
-    ArrayBuilder, ArrayRef, GenericBinaryArray, GenericListBuilder, OffsetSizeTrait,
-    UInt8Builder,
+use crate::{
+    array::{
+        ArrayBuilder, ArrayDataBuilder, ArrayRef, GenericBinaryArray, OffsetSizeTrait,
+        UInt8BufferBuilder,
+    },
+    datatypes::DataType,
 };
 use std::any::Any;
 use std::sync::Arc;
 
-///  Array builder for `BinaryArray`
+use super::{BooleanBufferBuilder, BufferBuilder};
+
+///  Array builder for [`GenericBinaryArray`]
 #[derive(Debug)]
 pub struct GenericBinaryBuilder<OffsetSize: OffsetSizeTrait> {
-    builder: GenericListBuilder<OffsetSize, UInt8Builder>,
+    value_builder: UInt8BufferBuilder,
+    offsets_builder: BufferBuilder<OffsetSize>,
+    null_buffer_builder: BooleanBufferBuilder,
 }
 
 impl<OffsetSize: OffsetSizeTrait> GenericBinaryBuilder<OffsetSize> {
-    /// Creates a new `GenericBinaryBuilder`, `capacity` is the number of bytes in the values
-    /// array
+    /// Creates a new [`GenericBinaryBuilder`].
+    /// `capacity` is the number of bytes in the values array.
     pub fn new(capacity: usize) -> Self {
-        let values_builder = UInt8Builder::new(capacity);
+        let mut offsets_builder = BufferBuilder::<OffsetSize>::new(1024);
+        offsets_builder.append(OffsetSize::zero());
         Self {
-            builder: GenericListBuilder::new(values_builder),
+            value_builder: UInt8BufferBuilder::new(capacity),
+            offsets_builder,
+            null_buffer_builder: BooleanBufferBuilder::new(1024),
         }
     }
 
-    /// Appends a single byte value into the builder's values array.
-    ///
-    /// Note, when appending individual byte values you must call `append` to delimit each
-    /// distinct list value.
-    #[inline]
-    pub fn append_byte(&mut self, value: u8) {
-        self.builder.values().append_value(value);
-    }
-
     /// Appends a byte slice into the builder.
-    ///
-    /// Automatically calls the `append` method to delimit the slice appended in as a
-    /// distinct array element.
     #[inline]
     pub fn append_value(&mut self, value: impl AsRef<[u8]>) {
-        self.builder.values().append_slice(value.as_ref());
-        self.builder.append(true);
-    }
-
-    /// Finish the current variable-length list array slot.
-    #[inline]
-    pub fn append(&mut self, is_valid: bool) {
-        self.builder.append(is_valid)
+        self.value_builder.append_slice(value.as_ref());
+        self.null_buffer_builder.append(true);
+        self.offsets_builder
+            .append(OffsetSize::from_usize(self.value_builder.len()).unwrap());
     }
 
     /// Append a null value to the array.
     #[inline]
     pub fn append_null(&mut self) {
-        self.append(false)
+        self.null_buffer_builder.append(false);

Review Comment:
   Currently, This optimization is only used on primitive builder and boolean builder. 
   We could do a follow-up to use it on all builders.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2117: Update `GenericBinaryBuilder` to use buffer builders directly.

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2117:
URL: https://github.com/apache/arrow-rs/pull/2117#discussion_r926879313


##########
arrow/src/array/builder/generic_binary_builder.rs:
##########
@@ -15,63 +15,72 @@
 // specific language governing permissions and limitations
 // under the License.
 
-use crate::array::{
-    ArrayBuilder, ArrayRef, GenericBinaryArray, GenericListBuilder, OffsetSizeTrait,
-    UInt8Builder,
+use crate::{
+    array::{
+        ArrayBuilder, ArrayDataBuilder, ArrayRef, GenericBinaryArray, OffsetSizeTrait,
+        UInt8BufferBuilder,
+    },
+    datatypes::DataType,
 };
 use std::any::Any;
 use std::sync::Arc;
 
-///  Array builder for `BinaryArray`
+use super::{BooleanBufferBuilder, BufferBuilder};
+
+///  Array builder for [`GenericBinaryArray`]
 #[derive(Debug)]
 pub struct GenericBinaryBuilder<OffsetSize: OffsetSizeTrait> {
-    builder: GenericListBuilder<OffsetSize, UInt8Builder>,
+    value_builder: UInt8BufferBuilder,
+    offsets_builder: BufferBuilder<OffsetSize>,
+    null_buffer_builder: BooleanBufferBuilder,
 }
 
 impl<OffsetSize: OffsetSizeTrait> GenericBinaryBuilder<OffsetSize> {
-    /// Creates a new `GenericBinaryBuilder`, `capacity` is the number of bytes in the values
-    /// array
+    /// Creates a new [`GenericBinaryBuilder`].
+    /// `capacity` is the number of bytes in the values array.
     pub fn new(capacity: usize) -> Self {
-        let values_builder = UInt8Builder::new(capacity);
+        let mut offsets_builder = BufferBuilder::<OffsetSize>::new(1024);
+        offsets_builder.append(OffsetSize::zero());
         Self {
-            builder: GenericListBuilder::new(values_builder),
+            value_builder: UInt8BufferBuilder::new(capacity),
+            offsets_builder,
+            null_buffer_builder: BooleanBufferBuilder::new(1024),
         }
     }
 
-    /// Appends a single byte value into the builder's values array.
-    ///
-    /// Note, when appending individual byte values you must call `append` to delimit each
-    /// distinct list value.
-    #[inline]
-    pub fn append_byte(&mut self, value: u8) {
-        self.builder.values().append_value(value);
-    }
-
     /// Appends a byte slice into the builder.
-    ///
-    /// Automatically calls the `append` method to delimit the slice appended in as a
-    /// distinct array element.
     #[inline]
     pub fn append_value(&mut self, value: impl AsRef<[u8]>) {
-        self.builder.values().append_slice(value.as_ref());
-        self.builder.append(true);
-    }
-
-    /// Finish the current variable-length list array slot.
-    #[inline]
-    pub fn append(&mut self, is_valid: bool) {
-        self.builder.append(is_valid)
+        self.value_builder.append_slice(value.as_ref());
+        self.null_buffer_builder.append(true);
+        self.offsets_builder
+            .append(OffsetSize::from_usize(self.value_builder.len()).unwrap());
     }
 
     /// Append a null value to the array.
     #[inline]
     pub fn append_null(&mut self) {
-        self.append(false)
+        self.null_buffer_builder.append(false);

Review Comment:
   An optimisation we use elsewhere is to lazily create the null builder on first null



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org