You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "ariesdevil (via GitHub)" <gi...@apache.org> on 2024/03/07 10:10:56 UTC

[PR] feat: initial support string_view and binary_view, supports layout and basic construction + tests [arrow-rs]

ariesdevil opened a new pull request, #5481:
URL: https://github.com/apache/arrow-rs/pull/5481

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #5469 
   
   # Rationale for this change
    
   <!--
   Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
   Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.
   -->
   Initially support `StringViewArray` and `BinaryViewArray`, mainly for adding layout and basic construction and tests for these two new types of array.
   
   # What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   Add two new types of array.
   # Are there any user-facing changes?
   Yes
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!---
   If there are any breaking changes to public APIs, please add the `breaking change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: initial support string_view and binary_view, supports layout and basic construction + tests [arrow-rs]

Posted by "XiangpengHao (via GitHub)" <gi...@apache.org>.
XiangpengHao commented on code in PR #5481:
URL: https://github.com/apache/arrow-rs/pull/5481#discussion_r1522412442


##########
arrow-data/src/equal/byte_view.rs:
##########
@@ -0,0 +1,74 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::{ArrayData, ByteView};
+
+pub(super) fn byte_view_equal(
+    lhs: &ArrayData,
+    rhs: &ArrayData,
+    lhs_start: usize,
+    rhs_start: usize,
+    len: usize,
+) -> bool {
+    let lhs_views = &lhs.buffer::<u128>(0)[lhs_start..lhs_start + len];
+    let lhs_buffers = &lhs.buffers()[1..];
+    let rhs_views = &rhs.buffer::<u128>(0)[rhs_start..rhs_start + len];
+    let rhs_buffers = &rhs.buffers()[1..];
+
+    for (idx, (l, r)) in lhs_views.iter().zip(rhs_views).enumerate() {
+        // Only checking one null mask here because by the time the control flow reaches
+        // this point, the equality of the two masks would have already been verified.
+        if lhs.is_null(idx) {
+            continue;
+        }
+
+        let l_len_prefix = *l as u64;
+        let r_len_prefix = *r as u64;
+        // short-circuit, check length and prefix
+        if l_len_prefix != r_len_prefix {

Review Comment:
   oh yes the `validate_view_impl` has checked this, we should be fine then



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: initial support string_view and binary_view, supports layout and basic construction + tests [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #5481:
URL: https://github.com/apache/arrow-rs/pull/5481#discussion_r1521070038


##########
arrow-data/src/equal/bytes_view.rs:
##########
@@ -0,0 +1,71 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::{ArrayData, BytesView};
+
+pub(super) fn bytes_view_equal(
+    lhs: &ArrayData,
+    rhs: &ArrayData,
+    lhs_start: usize,
+    rhs_start: usize,
+    len: usize,
+) -> bool {
+    let lhs_views = &lhs.buffer::<u128>(0)[lhs_start..lhs_start + len];
+    let lhs_buffers = &lhs.buffers()[1..];
+    let rhs_views = &rhs.buffer::<u128>(0)[rhs_start..rhs_start + len];
+    let rhs_buffers = &rhs.buffers()[1..];
+
+    for (idx, (l, r)) in lhs_views.iter().zip(rhs_views).enumerate() {
+        // Only checking one null mask here because by the time the control flow reaches
+        // this point, the equality of the two masks would have already been verified.
+        if lhs.is_null(idx) {
+            continue;
+        }
+
+        let l_len_prefix = *l as u64;

Review Comment:
   The test doesn't encounter this because it constructs the strings in such a way that the buffer views happen to be equal. If you re-arrange things so that this isn't the case, it will fail. 
   
   Perhaps something like (not tested).
   
   ```
       let a1 = StringViewArray::from(vec!["foo", "very long string over 12 bytes", "bar"]);
       let a2 = StringViewArray::from(vec!["a very long string over 12 bytes", "foo", "very long string over 12 bytes", "bar"]);
       test_equal(&a1, &a2.slice(1, 3), true);
   ``` 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: initial support string_view and binary_view, supports layout and basic construction + tests [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #5481:
URL: https://github.com/apache/arrow-rs/pull/5481#discussion_r1522288458


##########
arrow-array/src/builder/generic_bytes_view_builder.rs:
##########
@@ -0,0 +1,215 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::builder::ArrayBuilder;
+use crate::types::{BinaryViewType, ByteViewType, StringViewType};
+use crate::{ArrayRef, GenericByteViewArray};
+use arrow_buffer::{Buffer, BufferBuilder, NullBufferBuilder, ScalarBuffer};
+use arrow_data::ByteView;
+use std::any::Any;
+use std::marker::PhantomData;
+use std::sync::Arc;
+
+const DEFAULT_BLOCK_SIZE: u32 = 8 * 1024;
+
+/// A builder for [`GenericByteViewArray`]
+///
+/// See [`Self::append_value`] for the allocation strategy
+pub struct GenericByteViewBuilder<T: ByteViewType + ?Sized> {
+    views_builder: BufferBuilder<u128>,
+    null_buffer_builder: NullBufferBuilder,
+    completed: Vec<Buffer>,
+    in_progress: Vec<u8>,
+    block_size: u32,
+    phantom: PhantomData<T>,
+}
+
+impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> {
+    /// Creates a new [`GenericByteViewBuilder`].
+    pub fn new() -> Self {
+        Self::with_capacity(1024)
+    }
+
+    /// Creates a new [`GenericByteViewBuilder`] with space for `capacity` strings
+    pub fn with_capacity(capacity: usize) -> Self {
+        Self {
+            views_builder: BufferBuilder::new(capacity),
+            null_buffer_builder: NullBufferBuilder::new(capacity),
+            completed: vec![],
+            in_progress: vec![],
+            block_size: DEFAULT_BLOCK_SIZE,
+            phantom: Default::default(),
+        }
+    }
+
+    /// Override the minimum size of buffers to allocate for string data
+    pub fn with_block_size(self, block_size: u32) -> Self {
+        Self { block_size, ..self }
+    }
+
+    /// Appends a value into the builder
+    ///
+    /// # Panics
+    ///
+    /// Panics if
+    /// - String buffer count exceeds `u32::MAX`
+    /// - String length exceeds `u32::MAX`
+    #[inline]
+    pub fn append_value(&mut self, value: impl AsRef<T::Native>) {
+        let v: &[u8] = value.as_ref().as_ref();
+        let length: u32 = v.len().try_into().unwrap();
+        if length <= 12 {
+            let mut view_buffer = [0; 16];
+            view_buffer[0..4].copy_from_slice(&length.to_le_bytes());
+            view_buffer[4..4 + v.len()].copy_from_slice(v);
+            self.views_builder.append(u128::from_le_bytes(view_buffer));
+            self.null_buffer_builder.append_non_null();
+            return;
+        }
+
+        let required_cap = self.in_progress.len() + v.len();
+        if self.in_progress.capacity() < required_cap {

Review Comment:
   They're the same thing, the vec is initialized with block size capacity, the only circumstance they diverge is if you write a string larger than the capacity.



##########
arrow-array/src/builder/generic_bytes_view_builder.rs:
##########
@@ -0,0 +1,215 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::builder::ArrayBuilder;
+use crate::types::{BinaryViewType, ByteViewType, StringViewType};
+use crate::{ArrayRef, GenericByteViewArray};
+use arrow_buffer::{Buffer, BufferBuilder, NullBufferBuilder, ScalarBuffer};
+use arrow_data::ByteView;
+use std::any::Any;
+use std::marker::PhantomData;
+use std::sync::Arc;
+
+const DEFAULT_BLOCK_SIZE: u32 = 8 * 1024;
+
+/// A builder for [`GenericByteViewArray`]
+///
+/// See [`Self::append_value`] for the allocation strategy
+pub struct GenericByteViewBuilder<T: ByteViewType + ?Sized> {
+    views_builder: BufferBuilder<u128>,
+    null_buffer_builder: NullBufferBuilder,
+    completed: Vec<Buffer>,
+    in_progress: Vec<u8>,
+    block_size: u32,
+    phantom: PhantomData<T>,
+}
+
+impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> {
+    /// Creates a new [`GenericByteViewBuilder`].
+    pub fn new() -> Self {
+        Self::with_capacity(1024)
+    }
+
+    /// Creates a new [`GenericByteViewBuilder`] with space for `capacity` strings
+    pub fn with_capacity(capacity: usize) -> Self {
+        Self {
+            views_builder: BufferBuilder::new(capacity),
+            null_buffer_builder: NullBufferBuilder::new(capacity),
+            completed: vec![],
+            in_progress: vec![],
+            block_size: DEFAULT_BLOCK_SIZE,
+            phantom: Default::default(),
+        }
+    }
+
+    /// Override the minimum size of buffers to allocate for string data
+    pub fn with_block_size(self, block_size: u32) -> Self {
+        Self { block_size, ..self }
+    }
+
+    /// Appends a value into the builder
+    ///
+    /// # Panics
+    ///
+    /// Panics if
+    /// - String buffer count exceeds `u32::MAX`
+    /// - String length exceeds `u32::MAX`
+    #[inline]
+    pub fn append_value(&mut self, value: impl AsRef<T::Native>) {
+        let v: &[u8] = value.as_ref().as_ref();
+        let length: u32 = v.len().try_into().unwrap();
+        if length <= 12 {
+            let mut view_buffer = [0; 16];
+            view_buffer[0..4].copy_from_slice(&length.to_le_bytes());
+            view_buffer[4..4 + v.len()].copy_from_slice(v);
+            self.views_builder.append(u128::from_le_bytes(view_buffer));
+            self.null_buffer_builder.append_non_null();
+            return;
+        }
+
+        let required_cap = self.in_progress.len() + v.len();
+        if self.in_progress.capacity() < required_cap {

Review Comment:
   They're the same thing, the vec is initialized with block size capacity, the only circumstance they diverge is if you write a string larger than the capacity. See the line below



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: initial support string_view and binary_view, supports layout and basic construction + tests [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #5481:
URL: https://github.com/apache/arrow-rs/pull/5481#discussion_r1522262339


##########
arrow-data/src/equal/byte_view.rs:
##########
@@ -0,0 +1,74 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::{ArrayData, ByteView};
+
+pub(super) fn byte_view_equal(
+    lhs: &ArrayData,
+    rhs: &ArrayData,
+    lhs_start: usize,
+    rhs_start: usize,
+    len: usize,
+) -> bool {
+    let lhs_views = &lhs.buffer::<u128>(0)[lhs_start..lhs_start + len];
+    let lhs_buffers = &lhs.buffers()[1..];
+    let rhs_views = &rhs.buffer::<u128>(0)[rhs_start..rhs_start + len];
+    let rhs_buffers = &rhs.buffers()[1..];
+
+    for (idx, (l, r)) in lhs_views.iter().zip(rhs_views).enumerate() {
+        // Only checking one null mask here because by the time the control flow reaches
+        // this point, the equality of the two masks would have already been verified.
+        if lhs.is_null(idx) {
+            continue;
+        }
+
+        let l_len_prefix = *l as u64;
+        let r_len_prefix = *r as u64;
+        // short-circuit, check length and prefix
+        if l_len_prefix != r_len_prefix {

Review Comment:
   I suspect this is an oversight in the specification, as otherwise it will dramatically slow down comparisons in the small string case. We should probably clarify this



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: initial support string_view and binary_view, supports layout and basic construction + tests [arrow-rs]

Posted by "ariesdevil (via GitHub)" <gi...@apache.org>.
ariesdevil commented on code in PR #5481:
URL: https://github.com/apache/arrow-rs/pull/5481#discussion_r1519729100


##########
arrow-array/src/array/bytes_view_array.rs:
##########
@@ -0,0 +1,385 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::array::print_long_array;
+use crate::builder::GenericBytesViewBuilder;
+use crate::iterator::ArrayIter;
+use crate::types::bytes::ByteArrayNativeType;
+use crate::types::BytesViewType;
+use crate::{Array, ArrayAccessor, ArrayRef};
+use arrow_buffer::{Buffer, NullBuffer, ScalarBuffer};
+use arrow_data::{ArrayData, ArrayDataBuilder, BytesView};
+use arrow_schema::{ArrowError, DataType};
+use std::any::Any;
+use std::fmt::Debug;
+use std::marker::PhantomData;
+use std::sync::Arc;
+
+/// An array of variable length bytes view arrays
+pub struct GenericBytesViewArray<T: BytesViewType + ?Sized> {
+    data_type: DataType,
+    views: ScalarBuffer<u128>,
+    buffers: Vec<Buffer>,
+    phantom: PhantomData<T>,
+    nulls: Option<NullBuffer>,
+}
+
+impl<T: BytesViewType + ?Sized> Clone for GenericBytesViewArray<T> {
+    fn clone(&self) -> Self {
+        Self {
+            data_type: T::DATA_TYPE,
+            views: self.views.clone(),
+            buffers: self.buffers.clone(),
+            nulls: self.nulls.clone(),
+            phantom: Default::default(),
+        }
+    }
+}
+
+impl<T: BytesViewType + ?Sized> GenericBytesViewArray<T> {
+    /// Create a new [`GenericBytesViewArray`] from the provided parts, panicking on failure
+    ///
+    /// # Panics
+    ///
+    /// Panics if [`GenericBytesViewArray::try_new`] returns an error
+    pub fn new(views: ScalarBuffer<u128>, buffers: Vec<Buffer>, nulls: Option<NullBuffer>) -> Self {
+        Self::try_new(views, buffers, nulls).unwrap()
+    }
+
+    /// Create a new [`GenericBytesViewArray`] from the provided parts, returning an error on failure
+    ///
+    /// # Errors
+    ///
+    /// * `views.len() != nulls.len()`
+    /// * [BytesViewType::validate] fails
+    pub fn try_new(
+        views: ScalarBuffer<u128>,

Review Comment:
   Since the `T` in `ScalarBuffer` is constrained to be impl `ArrowNativeType`, we keep `u128` here. But we provide the `From<u128> for BytesView` to transform each other.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: initial support string_view and binary_view, supports layout and basic construction + tests [arrow-rs]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #5481:
URL: https://github.com/apache/arrow-rs/pull/5481#discussion_r1521261861


##########
arrow-array/src/array/byte_view_array.rs:
##########
@@ -0,0 +1,390 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::array::print_long_array;
+use crate::builder::GenericByteViewBuilder;
+use crate::iterator::ArrayIter;
+use crate::types::bytes::ByteArrayNativeType;
+use crate::types::{BinaryViewType, ByteViewType, StringViewType};
+use crate::{Array, ArrayAccessor, ArrayRef};
+use arrow_buffer::{Buffer, NullBuffer, ScalarBuffer};
+use arrow_data::{ArrayData, ArrayDataBuilder, ByteView};
+use arrow_schema::{ArrowError, DataType};
+use std::any::Any;
+use std::fmt::Debug;
+use std::marker::PhantomData;
+use std::sync::Arc;
+
+/// An array of variable length bytes view arrays

Review Comment:
   Maybe we can provide a link 
   
   ```suggestion
   /// [Variable-size Binary View Layout]: An array of variable length bytes view arrays.
   ///
   /// Different than [`GenericByteArray`] as it stores both an offset and length
   /// meaning that take / filter operations can be implemented without copying the underlying data.
   ///
   /// [Variable-size Binary View Layout]: https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-view-layout
   ```



##########
arrow-data/src/byte_view.rs:
##########
@@ -0,0 +1,115 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow_buffer::Buffer;
+use arrow_schema::ArrowError;
+
+#[derive(Debug, Copy, Clone, Default)]
+#[repr(C)]
+pub struct ByteView {
+    /// The length of the string/bytes.
+    pub length: u32,
+    /// First 4 bytes of string/bytes data.
+    pub prefix: u32,
+    /// The buffer index.
+    pub buffer_index: u32,
+    /// The offset into the buffer.
+    pub offset: u32,
+}
+
+impl ByteView {
+    #[inline(always)]
+    pub fn as_u128(self) -> u128 {
+        unsafe { std::mem::transmute(self) }
+    }
+}
+
+impl From<u128> for ByteView {
+    #[inline]
+    fn from(value: u128) -> Self {
+        unsafe { std::mem::transmute(value) }
+    }
+}
+
+impl From<ByteView> for u128 {
+    #[inline]
+    fn from(value: ByteView) -> Self {
+        value.as_u128()
+    }
+}
+
+/// Validates the combination of `views` and `buffers` is a valid BinaryView
+pub fn validate_binary_view(views: &[u128], buffers: &[Buffer]) -> Result<(), ArrowError> {
+    validate_view_impl(views, buffers, |_, _| Ok(()))
+}
+
+/// Validates the combination of `views` and `buffers` is a valid StringView
+pub fn validate_string_view(views: &[u128], buffers: &[Buffer]) -> Result<(), ArrowError> {
+    validate_view_impl(views, buffers, |idx, b| {
+        std::str::from_utf8(b).map_err(|e| {
+            ArrowError::InvalidArgumentError(format!(
+                "Encountered non-UTF-8 data at index {idx}: {e}"
+            ))
+        })?;
+        Ok(())
+    })
+}
+
+fn validate_view_impl<F>(views: &[u128], buffers: &[Buffer], f: F) -> Result<(), ArrowError>

Review Comment:
   As mentioned above, I think we should test this validation code to ensure soundness of the array creation



##########
arrow-array/src/array/byte_view_array.rs:
##########
@@ -0,0 +1,390 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::array::print_long_array;
+use crate::builder::GenericByteViewBuilder;
+use crate::iterator::ArrayIter;
+use crate::types::bytes::ByteArrayNativeType;
+use crate::types::{BinaryViewType, ByteViewType, StringViewType};
+use crate::{Array, ArrayAccessor, ArrayRef};
+use arrow_buffer::{Buffer, NullBuffer, ScalarBuffer};
+use arrow_data::{ArrayData, ArrayDataBuilder, ByteView};
+use arrow_schema::{ArrowError, DataType};
+use std::any::Any;
+use std::fmt::Debug;
+use std::marker::PhantomData;
+use std::sync::Arc;
+
+/// An array of variable length bytes view arrays
+pub struct GenericByteViewArray<T: ByteViewType + ?Sized> {
+    data_type: DataType,
+    views: ScalarBuffer<u128>,
+    buffers: Vec<Buffer>,
+    phantom: PhantomData<T>,
+    nulls: Option<NullBuffer>,
+}
+
+impl<T: ByteViewType + ?Sized> Clone for GenericByteViewArray<T> {
+    fn clone(&self) -> Self {
+        Self {
+            data_type: T::DATA_TYPE,
+            views: self.views.clone(),
+            buffers: self.buffers.clone(),
+            nulls: self.nulls.clone(),
+            phantom: Default::default(),
+        }
+    }
+}
+
+impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> {
+    /// Create a new [`GenericByteViewArray`] from the provided parts, panicking on failure
+    ///
+    /// # Panics
+    ///
+    /// Panics if [`GenericByteViewArray::try_new`] returns an error
+    pub fn new(views: ScalarBuffer<u128>, buffers: Vec<Buffer>, nulls: Option<NullBuffer>) -> Self {
+        Self::try_new(views, buffers, nulls).unwrap()
+    }
+
+    /// Create a new [`GenericByteViewArray`] from the provided parts, returning an error on failure
+    ///
+    /// # Errors
+    ///
+    /// * `views.len() != nulls.len()`
+    /// * [ByteViewType::validate] fails
+    pub fn try_new(
+        views: ScalarBuffer<u128>,
+        buffers: Vec<Buffer>,
+        nulls: Option<NullBuffer>,
+    ) -> Result<Self, ArrowError> {
+        T::validate(&views, &buffers)?;
+
+        if let Some(n) = nulls.as_ref() {
+            if n.len() != views.len() {
+                return Err(ArrowError::InvalidArgumentError(format!(
+                    "Incorrect length of null buffer for {}ViewArray, expected {} got {}",
+                    T::PREFIX,
+                    views.len(),
+                    n.len(),
+                )));
+            }
+        }
+
+        Ok(Self {
+            data_type: T::DATA_TYPE,
+            views,
+            buffers,
+            nulls,
+            phantom: Default::default(),
+        })
+    }
+
+    /// Create a new [`GenericByteViewArray`] from the provided parts, without validation
+    ///
+    /// # Safety
+    ///
+    /// Safe if [`Self::try_new`] would not error
+    pub unsafe fn new_unchecked(
+        views: ScalarBuffer<u128>,
+        buffers: Vec<Buffer>,
+        nulls: Option<NullBuffer>,
+    ) -> Self {
+        Self {
+            data_type: T::DATA_TYPE,
+            phantom: Default::default(),
+            views,
+            buffers,
+            nulls,
+        }
+    }
+
+    /// Create a new [`GenericByteViewArray`] of length `len` where all values are null
+    pub fn new_null(len: usize) -> Self {
+        Self {
+            data_type: T::DATA_TYPE,
+            views: vec![0; len].into(),
+            buffers: vec![],
+            nulls: Some(NullBuffer::new_null(len)),
+            phantom: Default::default(),
+        }
+    }
+
+    /// Creates a [`GenericByteViewArray`] based on an iterator of values without nulls
+    pub fn from_iter_values<Ptr, I>(iter: I) -> Self
+    where
+        Ptr: AsRef<T::Native>,
+        I: IntoIterator<Item = Ptr>,
+    {
+        let iter = iter.into_iter();
+        let mut builder = GenericByteViewBuilder::<T>::with_capacity(iter.size_hint().0);
+        for v in iter {
+            builder.append_value(v);
+        }
+        builder.finish()
+    }
+
+    /// Deconstruct this array into its constituent parts
+    pub fn into_parts(self) -> (ScalarBuffer<u128>, Vec<Buffer>, Option<NullBuffer>) {
+        (self.views, self.buffers, self.nulls)
+    }
+
+    /// Returns the views buffer
+    #[inline]
+    pub fn views(&self) -> &ScalarBuffer<u128> {
+        &self.views
+    }
+
+    /// Returns the buffers storing string data
+    #[inline]
+    pub fn data_buffers(&self) -> &[Buffer] {
+        &self.buffers
+    }
+
+    /// Returns the element at index `i`
+    /// # Panics
+    /// Panics if index `i` is out of bounds.
+    pub fn value(&self, i: usize) -> &T::Native {
+        assert!(
+            i < self.len(),
+            "Trying to access an element at index {} from a {}ViewArray of length {}",
+            i,
+            T::PREFIX,
+            self.len()
+        );
+
+        unsafe { self.value_unchecked(i) }
+    }
+
+    /// Returns the element at index `i`
+    /// # Safety
+    /// Caller is responsible for ensuring that the index is within the bounds of the array
+    pub unsafe fn value_unchecked(&self, idx: usize) -> &T::Native {
+        let v = self.views.get_unchecked(idx);
+        let len = *v as u32;
+        let b = if len <= 12 {
+            let ptr = self.views.as_ptr() as *const u8;
+            std::slice::from_raw_parts(ptr.add(idx * 16 + 4), len as usize)
+        } else {
+            let view = ByteView::from(*v);
+            let data = self.buffers.get_unchecked(view.buffer_index as usize);
+            let offset = view.offset as usize;
+            data.get_unchecked(offset..offset + len as usize)
+        };
+        T::Native::from_bytes_unchecked(b)
+    }
+
+    /// constructs a new iterator
+    pub fn iter(&self) -> ArrayIter<&Self> {
+        ArrayIter::new(self)
+    }
+
+    /// Returns a zero-copy slice of this array with the indicated offset and length.
+    pub fn slice(&self, offset: usize, length: usize) -> Self {
+        Self {
+            data_type: T::DATA_TYPE,
+            views: self.views.slice(offset, length),
+            buffers: self.buffers.clone(),
+            nulls: self.nulls.as_ref().map(|n| n.slice(offset, length)),
+            phantom: Default::default(),
+        }
+    }
+}
+
+impl<T: ByteViewType + ?Sized> Debug for GenericByteViewArray<T> {
+    fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
+        write!(f, "{}ViewArray\n[\n", T::PREFIX)?;
+        print_long_array(self, f, |array, index, f| {
+            std::fmt::Debug::fmt(&array.value(index), f)
+        })?;
+        write!(f, "]")
+    }
+}
+
+impl<T: ByteViewType + ?Sized> Array for GenericByteViewArray<T> {
+    fn as_any(&self) -> &dyn Any {
+        self
+    }
+
+    fn to_data(&self) -> ArrayData {
+        self.clone().into()
+    }
+
+    fn into_data(self) -> ArrayData {
+        self.into()
+    }
+
+    fn data_type(&self) -> &DataType {
+        &self.data_type
+    }
+
+    fn slice(&self, offset: usize, length: usize) -> ArrayRef {
+        Arc::new(self.slice(offset, length))
+    }
+
+    fn len(&self) -> usize {
+        self.views.len()
+    }
+
+    fn is_empty(&self) -> bool {
+        self.views.is_empty()
+    }
+
+    fn offset(&self) -> usize {
+        0
+    }
+
+    fn nulls(&self) -> Option<&NullBuffer> {
+        self.nulls.as_ref()
+    }
+
+    fn get_buffer_memory_size(&self) -> usize {
+        let mut sum = self.buffers.iter().map(|b| b.capacity()).sum::<usize>();
+        sum += self.views.inner().capacity();
+        if let Some(x) = &self.nulls {
+            sum += x.buffer().capacity()
+        }
+        sum
+    }
+
+    fn get_array_memory_size(&self) -> usize {
+        std::mem::size_of::<Self>() + self.get_buffer_memory_size()
+    }
+}
+
+impl<'a, T: ByteViewType + ?Sized> ArrayAccessor for &'a GenericByteViewArray<T> {
+    type Item = &'a T::Native;
+
+    fn value(&self, index: usize) -> Self::Item {
+        GenericByteViewArray::value(self, index)
+    }
+
+    unsafe fn value_unchecked(&self, index: usize) -> Self::Item {
+        GenericByteViewArray::value_unchecked(self, index)
+    }
+}
+
+impl<'a, T: ByteViewType + ?Sized> IntoIterator for &'a GenericByteViewArray<T> {
+    type Item = Option<&'a T::Native>;
+    type IntoIter = ArrayIter<Self>;
+
+    fn into_iter(self) -> Self::IntoIter {
+        ArrayIter::new(self)
+    }
+}
+
+impl<T: ByteViewType + ?Sized> From<ArrayData> for GenericByteViewArray<T> {
+    fn from(value: ArrayData) -> Self {
+        let views = value.buffers()[0].clone();
+        let views = ScalarBuffer::new(views, value.offset(), value.len());
+        let buffers = value.buffers()[1..].to_vec();
+        Self {
+            data_type: T::DATA_TYPE,
+            views,
+            buffers,
+            nulls: value.nulls().cloned(),
+            phantom: Default::default(),
+        }
+    }
+}
+
+impl<T: ByteViewType + ?Sized> From<GenericByteViewArray<T>> for ArrayData {
+    fn from(mut array: GenericByteViewArray<T>) -> Self {
+        let len = array.len();
+        array.buffers.insert(0, array.views.into_inner());
+        let builder = ArrayDataBuilder::new(T::DATA_TYPE)
+            .len(len)
+            .buffers(array.buffers)
+            .nulls(array.nulls);
+
+        unsafe { builder.build_unchecked() }
+    }
+}
+
+impl<Ptr, T: ByteViewType + ?Sized> FromIterator<Option<Ptr>> for GenericByteViewArray<T>
+where
+    Ptr: AsRef<T::Native>,
+{
+    fn from_iter<I: IntoIterator<Item = Option<Ptr>>>(iter: I) -> Self {
+        let iter = iter.into_iter();
+        let mut builder = GenericByteViewBuilder::<T>::with_capacity(iter.size_hint().0);
+        builder.extend(iter);
+        builder.finish()
+    }
+}
+
+/// A [`GenericByteViewArray`] of `[u8]`
+pub type BinaryViewArray = GenericByteViewArray<BinaryViewType>;
+
+/// A [`GenericByteViewArray`] of `str`
+///
+/// ```
+/// use arrow_array::StringViewArray;
+/// let array = StringViewArray::from_iter_values(vec!["hello", "world", "lulu", "large payload over 12 bytes"]);
+/// assert_eq!(array.value(0), "hello");
+/// assert_eq!(array.value(3), "large payload over 12 bytes");
+/// ```
+pub type StringViewArray = GenericByteViewArray<StringViewType>;
+
+impl From<Vec<&str>> for StringViewArray {
+    fn from(v: Vec<&str>) -> Self {
+        Self::from_iter_values(v)
+    }
+}
+
+#[cfg(test)]
+mod tests {
+    use crate::builder::StringViewBuilder;
+    use crate::{Array, BinaryViewArray, StringViewArray};
+
+    #[test]
+    fn try_new() {
+        let array = StringViewArray::from_iter_values(vec![
+            "hello",
+            "world",
+            "lulu",
+            "large payload over 12 bytes",
+        ]);
+        assert_eq!(array.value(0), "hello");
+        assert_eq!(array.value(3), "large payload over 12 bytes");
+
+        let array = BinaryViewArray::from_iter_values(vec![
+            b"hello".as_slice(),
+            b"world".as_slice(),
+            b"lulu".as_slice(),
+            b"large payload over 12 bytes".as_slice(),
+        ]);
+        assert_eq!(array.value(0), b"hello");
+        assert_eq!(array.value(3), b"large payload over 12 bytes");
+
+        let array = {
+            let mut builder = StringViewBuilder::new();
+            builder.finish()
+        };
+        assert!(array.is_empty());
+
+        let array = {
+            let mut builder = StringViewBuilder::new();
+            builder.append_value("hello");
+            builder.append_null();
+            builder.append_option(Some("large payload over 12 bytes"));
+            builder.finish()
+        };
+        assert_eq!(array.value(0), "hello");
+        assert!(array.is_null(1));
+        assert_eq!(array.value(2), "large payload over 12 bytes");
+    }
+}

Review Comment:
   Can you also please add negative tests for  `BinaryViewArray::new()` validation failures? Specifically when incorrect data is passed to `::new()` that it should panic with a validation failure -- so tests of the code in `validate_view_impl`:
   * Incorrect buffer counts (
   * Views that point to ranges outside the variable length portion (e.g off the end)
   * views that refer to a non existent variadic buffer
   * StringView that has non utf8 data in it
   
   Maybe we could add this as a follow on PR, but it seems important to ensure the soundness of this code



##########
arrow-array/src/record_batch.rs:
##########
@@ -646,6 +648,30 @@ mod tests {
         check_batch(record_batch, 5)
     }
 
+    #[test]
+    fn create_string_view_record_batch() {

Review Comment:
   ❤️ 



##########
arrow/tests/array_transform.rs:
##########
@@ -1027,6 +1028,44 @@ fn test_extend_nulls_panic() {
     mutable.extend_nulls(2);
 }
 
+#[test]
+fn test_string_view() {
+    let a1 =
+        StringViewArray::from(vec!["foo", "very long string over 12 bytes", "bar"]).into_data();
+    let a2 = StringViewArray::from_iter(vec![
+        Some("bar"),
+        None,
+        Some("long string also over 12 bytes"),
+    ])
+    .into_data();
+
+    a1.validate_full().unwrap();
+    a2.validate_full().unwrap();
+
+    let mut mutable = MutableArrayData::new(vec![&a1, &a2], false, 4);
+    mutable.extend(1, 0, 1);
+    mutable.extend(0, 1, 2);
+    mutable.extend(0, 0, 1);
+    mutable.extend(1, 2, 3);
+
+    let array = StringViewArray::from(mutable.freeze());
+    assert_eq!(array.data_buffers().len(), 2);
+    // Should have reused data buffers

Review Comment:
   nice



##########
arrow/tests/array_equal.rs:
##########
@@ -307,6 +307,30 @@ fn test_fixed_size_binary_array() {
     test_equal(&a, &b, true);
 }
 
+#[test]
+fn test_string_view_equal() {
+    let a1 = StringViewArray::from(vec!["foo", "very long string over 12 bytes", "bar"]);
+    let a2 = StringViewArray::from(vec![
+        "a very long string over 12 bytes",
+        "foo",
+        "very long string over 12 bytes",
+        "bar",
+    ]);
+    test_equal(&a1, &a2.slice(1, 3), true);
+
+    let a1 = StringViewArray::from(vec!["foo", "very long string over 12 bytes", "bar"]);
+    let a2 = StringViewArray::from(vec!["foo", "very long string over 12 bytes", "bar"]);
+    test_equal(&a1, &a2, true);
+
+    let a1_s = a1.slice(1, 1);
+    let a2_s = a2.slice(1, 1);
+    test_equal(&a1_s, &a2_s, true);
+
+    let a1_s = a1.slice(2, 1);
+    let a2_s = a2.slice(0, 1);
+    test_equal(&a1_s, &a2_s, false);

Review Comment:
   I think the following additional tests are important:
   1. view arrays that differ only in nullability (should not be equal)
   2. Equality / slices with nulls 



##########
arrow-array/src/builder/generic_bytes_view_builder.rs:
##########
@@ -0,0 +1,215 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::builder::ArrayBuilder;
+use crate::types::{BinaryViewType, ByteViewType, StringViewType};
+use crate::{ArrayRef, GenericByteViewArray};
+use arrow_buffer::{Buffer, BufferBuilder, NullBufferBuilder, ScalarBuffer};
+use arrow_data::ByteView;
+use std::any::Any;
+use std::marker::PhantomData;
+use std::sync::Arc;
+
+const DEFAULT_BLOCK_SIZE: u32 = 8 * 1024;
+
+/// A builder for [`GenericByteViewArray`]
+///
+/// See [`Self::append_value`] for the allocation strategy
+pub struct GenericByteViewBuilder<T: ByteViewType + ?Sized> {
+    views_builder: BufferBuilder<u128>,
+    null_buffer_builder: NullBufferBuilder,
+    completed: Vec<Buffer>,
+    in_progress: Vec<u8>,
+    block_size: u32,
+    phantom: PhantomData<T>,
+}
+
+impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> {
+    /// Creates a new [`GenericByteViewBuilder`].
+    pub fn new() -> Self {
+        Self::with_capacity(1024)
+    }
+
+    /// Creates a new [`GenericByteViewBuilder`] with space for `capacity` strings
+    pub fn with_capacity(capacity: usize) -> Self {
+        Self {
+            views_builder: BufferBuilder::new(capacity),
+            null_buffer_builder: NullBufferBuilder::new(capacity),
+            completed: vec![],
+            in_progress: vec![],
+            block_size: DEFAULT_BLOCK_SIZE,
+            phantom: Default::default(),
+        }
+    }
+
+    /// Override the minimum size of buffers to allocate for string data
+    pub fn with_block_size(self, block_size: u32) -> Self {
+        Self { block_size, ..self }
+    }
+
+    /// Appends a value into the builder
+    ///
+    /// # Panics
+    ///
+    /// Panics if
+    /// - String buffer count exceeds `u32::MAX`
+    /// - String length exceeds `u32::MAX`
+    #[inline]
+    pub fn append_value(&mut self, value: impl AsRef<T::Native>) {
+        let v: &[u8] = value.as_ref().as_ref();
+        let length: u32 = v.len().try_into().unwrap();
+        if length <= 12 {

Review Comment:
   With a views builder maybe this could look like
   
   ```rust
   self.views_builder.append_inline(v, length)
   ```



##########
arrow-array/src/array/byte_view_array.rs:
##########
@@ -0,0 +1,390 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::array::print_long_array;
+use crate::builder::GenericByteViewBuilder;
+use crate::iterator::ArrayIter;
+use crate::types::bytes::ByteArrayNativeType;
+use crate::types::{BinaryViewType, ByteViewType, StringViewType};
+use crate::{Array, ArrayAccessor, ArrayRef};
+use arrow_buffer::{Buffer, NullBuffer, ScalarBuffer};
+use arrow_data::{ArrayData, ArrayDataBuilder, ByteView};
+use arrow_schema::{ArrowError, DataType};
+use std::any::Any;
+use std::fmt::Debug;
+use std::marker::PhantomData;
+use std::sync::Arc;
+
+/// An array of variable length bytes view arrays
+pub struct GenericByteViewArray<T: ByteViewType + ?Sized> {
+    data_type: DataType,
+    views: ScalarBuffer<u128>,
+    buffers: Vec<Buffer>,
+    phantom: PhantomData<T>,
+    nulls: Option<NullBuffer>,
+}
+
+impl<T: ByteViewType + ?Sized> Clone for GenericByteViewArray<T> {
+    fn clone(&self) -> Self {
+        Self {
+            data_type: T::DATA_TYPE,
+            views: self.views.clone(),
+            buffers: self.buffers.clone(),
+            nulls: self.nulls.clone(),
+            phantom: Default::default(),
+        }
+    }
+}
+
+impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> {
+    /// Create a new [`GenericByteViewArray`] from the provided parts, panicking on failure
+    ///
+    /// # Panics
+    ///
+    /// Panics if [`GenericByteViewArray::try_new`] returns an error
+    pub fn new(views: ScalarBuffer<u128>, buffers: Vec<Buffer>, nulls: Option<NullBuffer>) -> Self {
+        Self::try_new(views, buffers, nulls).unwrap()
+    }
+
+    /// Create a new [`GenericByteViewArray`] from the provided parts, returning an error on failure
+    ///
+    /// # Errors
+    ///
+    /// * `views.len() != nulls.len()`
+    /// * [ByteViewType::validate] fails
+    pub fn try_new(
+        views: ScalarBuffer<u128>,
+        buffers: Vec<Buffer>,
+        nulls: Option<NullBuffer>,
+    ) -> Result<Self, ArrowError> {
+        T::validate(&views, &buffers)?;
+
+        if let Some(n) = nulls.as_ref() {
+            if n.len() != views.len() {
+                return Err(ArrowError::InvalidArgumentError(format!(
+                    "Incorrect length of null buffer for {}ViewArray, expected {} got {}",
+                    T::PREFIX,
+                    views.len(),
+                    n.len(),
+                )));
+            }
+        }
+
+        Ok(Self {
+            data_type: T::DATA_TYPE,
+            views,
+            buffers,
+            nulls,
+            phantom: Default::default(),
+        })
+    }
+
+    /// Create a new [`GenericByteViewArray`] from the provided parts, without validation
+    ///
+    /// # Safety
+    ///
+    /// Safe if [`Self::try_new`] would not error
+    pub unsafe fn new_unchecked(
+        views: ScalarBuffer<u128>,
+        buffers: Vec<Buffer>,
+        nulls: Option<NullBuffer>,
+    ) -> Self {
+        Self {
+            data_type: T::DATA_TYPE,
+            phantom: Default::default(),
+            views,
+            buffers,
+            nulls,
+        }
+    }
+
+    /// Create a new [`GenericByteViewArray`] of length `len` where all values are null
+    pub fn new_null(len: usize) -> Self {
+        Self {
+            data_type: T::DATA_TYPE,
+            views: vec![0; len].into(),
+            buffers: vec![],
+            nulls: Some(NullBuffer::new_null(len)),
+            phantom: Default::default(),
+        }
+    }
+
+    /// Creates a [`GenericByteViewArray`] based on an iterator of values without nulls
+    pub fn from_iter_values<Ptr, I>(iter: I) -> Self
+    where
+        Ptr: AsRef<T::Native>,
+        I: IntoIterator<Item = Ptr>,
+    {
+        let iter = iter.into_iter();
+        let mut builder = GenericByteViewBuilder::<T>::with_capacity(iter.size_hint().0);
+        for v in iter {
+            builder.append_value(v);
+        }
+        builder.finish()
+    }
+
+    /// Deconstruct this array into its constituent parts
+    pub fn into_parts(self) -> (ScalarBuffer<u128>, Vec<Buffer>, Option<NullBuffer>) {
+        (self.views, self.buffers, self.nulls)
+    }
+
+    /// Returns the views buffer
+    #[inline]
+    pub fn views(&self) -> &ScalarBuffer<u128> {

Review Comment:
   Not for this PR, but I wonder if we should consider implementing a `ByteViewBuffer` type, similarly to the `OffsetBuffer` used for `GenericBinaryView` -- https://docs.rs/arrow/latest/arrow/buffer/struct.OffsetBuffer.html
   
   I thought that the introduction of `OffsetBuffer` made working with StringArray/BinaryArray much easier. 
   
   I can imagine `ByteViewBuffer` encapsulating the 12 byte inline string calculation, as well as building such values up as well as hosting documentation explaining what types are present.
   
   If this seems like a reasonable idea, I can write up a ticket / maybe whack up a PR to show what it might look like
   



##########
arrow-array/src/builder/generic_bytes_view_builder.rs:
##########
@@ -0,0 +1,215 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::builder::ArrayBuilder;
+use crate::types::{BinaryViewType, ByteViewType, StringViewType};
+use crate::{ArrayRef, GenericByteViewArray};
+use arrow_buffer::{Buffer, BufferBuilder, NullBufferBuilder, ScalarBuffer};
+use arrow_data::ByteView;
+use std::any::Any;
+use std::marker::PhantomData;
+use std::sync::Arc;
+
+const DEFAULT_BLOCK_SIZE: u32 = 8 * 1024;
+
+/// A builder for [`GenericByteViewArray`]
+///
+/// See [`Self::append_value`] for the allocation strategy
+pub struct GenericByteViewBuilder<T: ByteViewType + ?Sized> {
+    views_builder: BufferBuilder<u128>,
+    null_buffer_builder: NullBufferBuilder,
+    completed: Vec<Buffer>,
+    in_progress: Vec<u8>,
+    block_size: u32,
+    phantom: PhantomData<T>,
+}
+
+impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> {
+    /// Creates a new [`GenericByteViewBuilder`].
+    pub fn new() -> Self {
+        Self::with_capacity(1024)
+    }
+
+    /// Creates a new [`GenericByteViewBuilder`] with space for `capacity` strings
+    pub fn with_capacity(capacity: usize) -> Self {
+        Self {
+            views_builder: BufferBuilder::new(capacity),
+            null_buffer_builder: NullBufferBuilder::new(capacity),
+            completed: vec![],
+            in_progress: vec![],
+            block_size: DEFAULT_BLOCK_SIZE,
+            phantom: Default::default(),
+        }
+    }
+
+    /// Override the minimum size of buffers to allocate for string data
+    pub fn with_block_size(self, block_size: u32) -> Self {
+        Self { block_size, ..self }
+    }
+
+    /// Appends a value into the builder
+    ///
+    /// # Panics
+    ///
+    /// Panics if
+    /// - String buffer count exceeds `u32::MAX`
+    /// - String length exceeds `u32::MAX`
+    #[inline]
+    pub fn append_value(&mut self, value: impl AsRef<T::Native>) {
+        let v: &[u8] = value.as_ref().as_ref();
+        let length: u32 = v.len().try_into().unwrap();
+        if length <= 12 {
+            let mut view_buffer = [0; 16];
+            view_buffer[0..4].copy_from_slice(&length.to_le_bytes());
+            view_buffer[4..4 + v.len()].copy_from_slice(v);
+            self.views_builder.append(u128::from_le_bytes(view_buffer));
+            self.null_buffer_builder.append_non_null();
+            return;
+        }
+
+        let required_cap = self.in_progress.len() + v.len();
+        if self.in_progress.capacity() < required_cap {

Review Comment:
   I may have missed it, but is there test coverage for this case (when the existing binary buffer fills up and we need a new variable length buffer)?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: initial support string_view and binary_view, supports layout and basic construction + tests [arrow-rs]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #5481:
URL: https://github.com/apache/arrow-rs/pull/5481#issuecomment-1998085138

   🚀  I filed a bunch of follow on tickets:
   IPC format support for StringViewArray and BinaryViewArray #5506
    https://github.com/apache/arrow-rs/issues/5507
    https://github.com/apache/arrow-rs/issues/5508
    https://github.com/apache/arrow-rs/issues/5509
    https://github.com/apache/arrow-rs/issues/5510
    https://github.com/apache/arrow-rs/issues/5511
    https://github.com/apache/arrow-rs/issues/5513


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: initial support string_view and binary_view, supports layout and basic construction + tests [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #5481:
URL: https://github.com/apache/arrow-rs/pull/5481#discussion_r1522259241


##########
arrow-array/src/builder/generic_bytes_view_builder.rs:
##########
@@ -0,0 +1,215 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::builder::ArrayBuilder;
+use crate::types::{BinaryViewType, ByteViewType, StringViewType};
+use crate::{ArrayRef, GenericByteViewArray};
+use arrow_buffer::{Buffer, BufferBuilder, NullBufferBuilder, ScalarBuffer};
+use arrow_data::ByteView;
+use std::any::Any;
+use std::marker::PhantomData;
+use std::sync::Arc;
+
+const DEFAULT_BLOCK_SIZE: u32 = 8 * 1024;
+
+/// A builder for [`GenericByteViewArray`]
+///
+/// See [`Self::append_value`] for the allocation strategy
+pub struct GenericByteViewBuilder<T: ByteViewType + ?Sized> {
+    views_builder: BufferBuilder<u128>,
+    null_buffer_builder: NullBufferBuilder,
+    completed: Vec<Buffer>,
+    in_progress: Vec<u8>,
+    block_size: u32,
+    phantom: PhantomData<T>,
+}
+
+impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> {
+    /// Creates a new [`GenericByteViewBuilder`].
+    pub fn new() -> Self {
+        Self::with_capacity(1024)
+    }
+
+    /// Creates a new [`GenericByteViewBuilder`] with space for `capacity` strings
+    pub fn with_capacity(capacity: usize) -> Self {
+        Self {
+            views_builder: BufferBuilder::new(capacity),
+            null_buffer_builder: NullBufferBuilder::new(capacity),
+            completed: vec![],
+            in_progress: vec![],
+            block_size: DEFAULT_BLOCK_SIZE,
+            phantom: Default::default(),
+        }
+    }
+
+    /// Override the minimum size of buffers to allocate for string data

Review Comment:
   It is technically a minimum, as if a single string exceeds the entire buffer capacity it will allocate a fresh buffer with that size



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: initial support string_view and binary_view, supports layout and basic construction + tests [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #5481:
URL: https://github.com/apache/arrow-rs/pull/5481#discussion_r1522288458


##########
arrow-array/src/builder/generic_bytes_view_builder.rs:
##########
@@ -0,0 +1,215 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::builder::ArrayBuilder;
+use crate::types::{BinaryViewType, ByteViewType, StringViewType};
+use crate::{ArrayRef, GenericByteViewArray};
+use arrow_buffer::{Buffer, BufferBuilder, NullBufferBuilder, ScalarBuffer};
+use arrow_data::ByteView;
+use std::any::Any;
+use std::marker::PhantomData;
+use std::sync::Arc;
+
+const DEFAULT_BLOCK_SIZE: u32 = 8 * 1024;
+
+/// A builder for [`GenericByteViewArray`]
+///
+/// See [`Self::append_value`] for the allocation strategy
+pub struct GenericByteViewBuilder<T: ByteViewType + ?Sized> {
+    views_builder: BufferBuilder<u128>,
+    null_buffer_builder: NullBufferBuilder,
+    completed: Vec<Buffer>,
+    in_progress: Vec<u8>,
+    block_size: u32,
+    phantom: PhantomData<T>,
+}
+
+impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> {
+    /// Creates a new [`GenericByteViewBuilder`].
+    pub fn new() -> Self {
+        Self::with_capacity(1024)
+    }
+
+    /// Creates a new [`GenericByteViewBuilder`] with space for `capacity` strings
+    pub fn with_capacity(capacity: usize) -> Self {
+        Self {
+            views_builder: BufferBuilder::new(capacity),
+            null_buffer_builder: NullBufferBuilder::new(capacity),
+            completed: vec![],
+            in_progress: vec![],
+            block_size: DEFAULT_BLOCK_SIZE,
+            phantom: Default::default(),
+        }
+    }
+
+    /// Override the minimum size of buffers to allocate for string data
+    pub fn with_block_size(self, block_size: u32) -> Self {
+        Self { block_size, ..self }
+    }
+
+    /// Appends a value into the builder
+    ///
+    /// # Panics
+    ///
+    /// Panics if
+    /// - String buffer count exceeds `u32::MAX`
+    /// - String length exceeds `u32::MAX`
+    #[inline]
+    pub fn append_value(&mut self, value: impl AsRef<T::Native>) {
+        let v: &[u8] = value.as_ref().as_ref();
+        let length: u32 = v.len().try_into().unwrap();
+        if length <= 12 {
+            let mut view_buffer = [0; 16];
+            view_buffer[0..4].copy_from_slice(&length.to_le_bytes());
+            view_buffer[4..4 + v.len()].copy_from_slice(v);
+            self.views_builder.append(u128::from_le_bytes(view_buffer));
+            self.null_buffer_builder.append_non_null();
+            return;
+        }
+
+        let required_cap = self.in_progress.len() + v.len();
+        if self.in_progress.capacity() < required_cap {

Review Comment:
   They're the same thing, the vec is initialized with block size capacity



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: initial support string_view and binary_view, supports layout and basic construction + tests [arrow-rs]

Posted by "amunra (via GitHub)" <gi...@apache.org>.
amunra commented on code in PR #5481:
URL: https://github.com/apache/arrow-rs/pull/5481#discussion_r1516576696


##########
arrow-array/src/array/bytes_view_array.rs:
##########
@@ -0,0 +1,385 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::array::print_long_array;
+use crate::builder::GenericBytesViewBuilder;
+use crate::iterator::ArrayIter;
+use crate::types::bytes::ByteArrayNativeType;
+use crate::types::BytesViewType;
+use crate::{Array, ArrayAccessor, ArrayRef};
+use arrow_buffer::{Buffer, NullBuffer, ScalarBuffer};
+use arrow_data::{ArrayData, ArrayDataBuilder, BytesView};
+use arrow_schema::{ArrowError, DataType};
+use std::any::Any;
+use std::fmt::Debug;
+use std::marker::PhantomData;
+use std::sync::Arc;
+
+/// An array of variable length bytes view arrays
+pub struct GenericBytesViewArray<T: BytesViewType + ?Sized> {
+    data_type: DataType,
+    views: ScalarBuffer<u128>,
+    buffers: Vec<Buffer>,
+    phantom: PhantomData<T>,
+    nulls: Option<NullBuffer>,
+}
+
+impl<T: BytesViewType + ?Sized> Clone for GenericBytesViewArray<T> {
+    fn clone(&self) -> Self {
+        Self {
+            data_type: T::DATA_TYPE,
+            views: self.views.clone(),
+            buffers: self.buffers.clone(),
+            nulls: self.nulls.clone(),
+            phantom: Default::default(),
+        }
+    }
+}
+
+impl<T: BytesViewType + ?Sized> GenericBytesViewArray<T> {
+    /// Create a new [`GenericBytesViewArray`] from the provided parts, panicking on failure
+    ///
+    /// # Panics
+    ///
+    /// Panics if [`GenericBytesViewArray::try_new`] returns an error
+    pub fn new(views: ScalarBuffer<u128>, buffers: Vec<Buffer>, nulls: Option<NullBuffer>) -> Self {
+        Self::try_new(views, buffers, nulls).unwrap()
+    }
+
+    /// Create a new [`GenericBytesViewArray`] from the provided parts, returning an error on failure
+    ///
+    /// # Errors
+    ///
+    /// * `views.len() != nulls.len()`
+    /// * [BytesViewType::validate] fails
+    pub fn try_new(
+        views: ScalarBuffer<u128>,

Review Comment:
   Just passing through, but I'm wondering if instead of a `u128` the final PR will have some actual type here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: initial support string_view and binary_view, supports layout and basic construction + tests [arrow-rs]

Posted by "ariesdevil (via GitHub)" <gi...@apache.org>.
ariesdevil commented on PR #5481:
URL: https://github.com/apache/arrow-rs/pull/5481#issuecomment-1991673695

   Hi @alamb , I added more tests for your kindly comments, PTAL again.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: initial support string_view and binary_view, supports layout and basic construction + tests [arrow-rs]

Posted by "ariesdevil (via GitHub)" <gi...@apache.org>.
ariesdevil commented on code in PR #5481:
URL: https://github.com/apache/arrow-rs/pull/5481#discussion_r1520961985


##########
arrow-data/src/equal/bytes_view.rs:
##########
@@ -0,0 +1,71 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::{ArrayData, BytesView};
+
+pub(super) fn bytes_view_equal(
+    lhs: &ArrayData,
+    rhs: &ArrayData,
+    lhs_start: usize,
+    rhs_start: usize,
+    len: usize,
+) -> bool {
+    let lhs_views = &lhs.buffer::<u128>(0)[lhs_start..lhs_start + len];
+    let lhs_buffers = &lhs.buffers()[1..];
+    let rhs_views = &rhs.buffer::<u128>(0)[rhs_start..rhs_start + len];
+    let rhs_buffers = &rhs.buffers()[1..];
+
+    for (idx, (l, r)) in lhs_views.iter().zip(rhs_views).enumerate() {
+        // Only checking one null mask here because by the time the control flow reaches
+        // this point, the equality of the two masks would have already been verified.
+        if lhs.is_null(idx) {
+            continue;
+        }
+
+        let l_len_prefix = *l as u64;

Review Comment:
   I think this is correct, I added some tests for this in `arrow/tests/array_equal.rs`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: initial support string_view and binary_view, supports layout and basic construction + tests [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #5481:
URL: https://github.com/apache/arrow-rs/pull/5481#discussion_r1520462566


##########
arrow-array/src/types.rs:
##########
@@ -1544,6 +1546,101 @@ pub type BinaryType = GenericBinaryType<i32>;
 /// An arrow binary array with i64 offsets
 pub type LargeBinaryType = GenericBinaryType<i64>;
 
+mod bytes_view {
+    pub trait Sealed: Send + Sync {}
+    impl Sealed for str {}
+    impl Sealed for [u8] {}
+}
+
+/// A trait over the variable length bytes view array types
+pub trait BytesViewType: bytes_view::Sealed + 'static + PartialEq + AsRef<Self> {
+    /// If element in array is utf8 encoded string.
+    const IS_UTF8: bool;
+
+    /// Datatype of array elements
+    const DATA_TYPE: DataType = if Self::IS_UTF8 {
+        DataType::Utf8View
+    } else {
+        DataType::BinaryView
+    };
+
+    /// "Binary" or "String", for use in displayed or error messages
+    const PREFIX: &'static str;
+
+    /// Type for representing its equivalent rust type i.e
+    /// Utf8Array will have native type has &str
+    /// BinaryArray will have type as [u8]
+    type Native: bytes::ByteArrayNativeType + AsRef<Self::Native> + AsRef<[u8]> + ?Sized;
+
+    /// Type for owned corresponding to `Native`
+    type Owned: Debug + Clone + Sync + Send + AsRef<Self>;
+
+    /// # Safety
+    /// The caller must ensure `index < self.len()`.
+    unsafe fn from_bytes_unchecked(slice: &[u8]) -> &Self;
+
+    /// To bytes slice.
+    fn to_bytes(&self) -> &[u8];
+
+    /// To owned type
+    #[allow(clippy::wrong_self_convention)]
+    fn into_owned(&self) -> Self::Owned;
+
+    /// Verifies that the provided buffers are valid for this array type
+    fn validate(views: &[u128], buffers: &[Buffer]) -> Result<(), ArrowError>;
+}
+
+impl BytesViewType for str {

Review Comment:
   I think it would be more consistent to define a `BinaryViewType` and `StringViewType`. Whilst I can see the appeal of defining traits using the underlying representation, there are advantages to adding a layer of indirection via type structs, in particular these traits were significantly simpler in #4585 



##########
arrow-data/Cargo.toml:
##########
@@ -51,6 +51,7 @@ arrow-schema = { workspace = true }
 
 num = { version = "0.4", default-features = false, features = ["std"] }
 half = { version = "2.1", default-features = false }
+simdutf8 = { version = "0.1.4", default-features = false, features = ["std", "aarch64_neon"] }

Review Comment:
   As this is a new dependency I think I would prefer to split this out into a separate "performance enhancement" where we can asses the improvement/regression



##########
arrow-data/src/data.rs:
##########
@@ -1590,6 +1580,11 @@ pub struct DataTypeLayout {
 
     /// Can contain a null bitmask
     pub can_contain_null_mask: bool,
+
+    /// This field only applies to the view type,[`DataType::BinaryView`] and [`DataType::Utf8View`]

Review Comment:
   ```suggestion
       /// This field only applies to the view type [`DataType::BinaryView`] and [`DataType::Utf8View`]
   ```



##########
arrow-data/src/equal/bytes_view.rs:
##########
@@ -0,0 +1,71 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::{ArrayData, BytesView};
+
+pub(super) fn bytes_view_equal(
+    lhs: &ArrayData,
+    rhs: &ArrayData,
+    lhs_start: usize,
+    rhs_start: usize,
+    len: usize,
+) -> bool {
+    let lhs_views = &lhs.buffer::<u128>(0)[lhs_start..lhs_start + len];
+    let lhs_buffers = &lhs.buffers()[1..];
+    let rhs_views = &rhs.buffer::<u128>(0)[rhs_start..rhs_start + len];
+    let rhs_buffers = &rhs.buffers()[1..];
+
+    for (idx, (l, r)) in lhs_views.iter().zip(rhs_views).enumerate() {
+        // Only checking one null mask here because by the time the control flow reaches
+        // this point, the equality of the two masks would have already been verified.
+        if lhs.is_null(idx) {
+            continue;
+        }
+
+        let l_len_prefix = *l as u64;

Review Comment:
   This has been changed from #4585 and I believe is incorrect. In particular it will potentially incorrectly return not equal for non-inline strings



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: initial support string_view and binary_view, supports layout and basic construction + tests [arrow-rs]

Posted by "ariesdevil (via GitHub)" <gi...@apache.org>.
ariesdevil commented on code in PR #5481:
URL: https://github.com/apache/arrow-rs/pull/5481#discussion_r1521173351


##########
arrow-data/src/equal/bytes_view.rs:
##########
@@ -0,0 +1,71 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::{ArrayData, BytesView};
+
+pub(super) fn bytes_view_equal(
+    lhs: &ArrayData,
+    rhs: &ArrayData,
+    lhs_start: usize,
+    rhs_start: usize,
+    len: usize,
+) -> bool {
+    let lhs_views = &lhs.buffer::<u128>(0)[lhs_start..lhs_start + len];
+    let lhs_buffers = &lhs.buffers()[1..];
+    let rhs_views = &rhs.buffer::<u128>(0)[rhs_start..rhs_start + len];
+    let rhs_buffers = &rhs.buffers()[1..];
+
+    for (idx, (l, r)) in lhs_views.iter().zip(rhs_views).enumerate() {
+        // Only checking one null mask here because by the time the control flow reaches
+        // this point, the equality of the two masks would have already been verified.
+        if lhs.is_null(idx) {
+            continue;
+        }
+
+        let l_len_prefix = *l as u64;

Review Comment:
   I added this case and it is still correct.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: initial support string_view and binary_view, supports layout and basic construction + tests [arrow-rs]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #5481:
URL: https://github.com/apache/arrow-rs/pull/5481#discussion_r1522285749


##########
arrow-array/src/builder/generic_bytes_view_builder.rs:
##########
@@ -0,0 +1,215 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::builder::ArrayBuilder;
+use crate::types::{BinaryViewType, ByteViewType, StringViewType};
+use crate::{ArrayRef, GenericByteViewArray};
+use arrow_buffer::{Buffer, BufferBuilder, NullBufferBuilder, ScalarBuffer};
+use arrow_data::ByteView;
+use std::any::Any;
+use std::marker::PhantomData;
+use std::sync::Arc;
+
+const DEFAULT_BLOCK_SIZE: u32 = 8 * 1024;
+
+/// A builder for [`GenericByteViewArray`]
+///
+/// See [`Self::append_value`] for the allocation strategy
+pub struct GenericByteViewBuilder<T: ByteViewType + ?Sized> {
+    views_builder: BufferBuilder<u128>,
+    null_buffer_builder: NullBufferBuilder,
+    completed: Vec<Buffer>,
+    in_progress: Vec<u8>,
+    block_size: u32,
+    phantom: PhantomData<T>,
+}
+
+impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> {
+    /// Creates a new [`GenericByteViewBuilder`].
+    pub fn new() -> Self {
+        Self::with_capacity(1024)
+    }
+
+    /// Creates a new [`GenericByteViewBuilder`] with space for `capacity` strings
+    pub fn with_capacity(capacity: usize) -> Self {
+        Self {
+            views_builder: BufferBuilder::new(capacity),
+            null_buffer_builder: NullBufferBuilder::new(capacity),
+            completed: vec![],
+            in_progress: vec![],
+            block_size: DEFAULT_BLOCK_SIZE,
+            phantom: Default::default(),
+        }
+    }
+
+    /// Override the minimum size of buffers to allocate for string data
+    pub fn with_block_size(self, block_size: u32) -> Self {
+        Self { block_size, ..self }
+    }
+
+    /// Appends a value into the builder
+    ///
+    /// # Panics
+    ///
+    /// Panics if
+    /// - String buffer count exceeds `u32::MAX`
+    /// - String length exceeds `u32::MAX`
+    #[inline]
+    pub fn append_value(&mut self, value: impl AsRef<T::Native>) {
+        let v: &[u8] = value.as_ref().as_ref();
+        let length: u32 = v.len().try_into().unwrap();
+        if length <= 12 {
+            let mut view_buffer = [0; 16];
+            view_buffer[0..4].copy_from_slice(&length.to_le_bytes());
+            view_buffer[4..4 + v.len()].copy_from_slice(v);
+            self.views_builder.append(u128::from_le_bytes(view_buffer));
+            self.null_buffer_builder.append_non_null();
+            return;
+        }
+
+        let required_cap = self.in_progress.len() + v.len();
+        if self.in_progress.capacity() < required_cap {

Review Comment:
   Then I don't understand the difference between capacity and block_size, though maybe I lost some subtely
   
   At the very least I think  we should document the behavior of the two settings
   
   I would expect:
   1. `capacity` to act like `Vec::with_capacity` that gives me a way to pre-declare how much buffer I want to pre-allocate, but if I put more data in then the internal allocation will grow
   2. `block_size` to act as the target size for the variable length buffer -- when the buffer would grow larger than `block_size` a new buffer is allocated
   
   I don't think this is what the code does -- so I think we should clarify what it does do



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: initial support string_view and binary_view, supports layout and basic construction + tests [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #5481:
URL: https://github.com/apache/arrow-rs/pull/5481#discussion_r1522257487


##########
arrow-array/src/builder/generic_bytes_view_builder.rs:
##########
@@ -0,0 +1,215 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::builder::ArrayBuilder;
+use crate::types::{BinaryViewType, ByteViewType, StringViewType};
+use crate::{ArrayRef, GenericByteViewArray};
+use arrow_buffer::{Buffer, BufferBuilder, NullBufferBuilder, ScalarBuffer};
+use arrow_data::ByteView;
+use std::any::Any;
+use std::marker::PhantomData;
+use std::sync::Arc;
+
+const DEFAULT_BLOCK_SIZE: u32 = 8 * 1024;
+
+/// A builder for [`GenericByteViewArray`]
+///
+/// See [`Self::append_value`] for the allocation strategy
+pub struct GenericByteViewBuilder<T: ByteViewType + ?Sized> {
+    views_builder: BufferBuilder<u128>,
+    null_buffer_builder: NullBufferBuilder,
+    completed: Vec<Buffer>,
+    in_progress: Vec<u8>,
+    block_size: u32,
+    phantom: PhantomData<T>,
+}
+
+impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> {
+    /// Creates a new [`GenericByteViewBuilder`].
+    pub fn new() -> Self {
+        Self::with_capacity(1024)
+    }
+
+    /// Creates a new [`GenericByteViewBuilder`] with space for `capacity` strings
+    pub fn with_capacity(capacity: usize) -> Self {
+        Self {
+            views_builder: BufferBuilder::new(capacity),
+            null_buffer_builder: NullBufferBuilder::new(capacity),
+            completed: vec![],
+            in_progress: vec![],
+            block_size: DEFAULT_BLOCK_SIZE,
+            phantom: Default::default(),
+        }
+    }
+
+    /// Override the minimum size of buffers to allocate for string data
+    pub fn with_block_size(self, block_size: u32) -> Self {
+        Self { block_size, ..self }
+    }
+
+    /// Appends a value into the builder
+    ///
+    /// # Panics
+    ///
+    /// Panics if
+    /// - String buffer count exceeds `u32::MAX`
+    /// - String length exceeds `u32::MAX`
+    #[inline]
+    pub fn append_value(&mut self, value: impl AsRef<T::Native>) {
+        let v: &[u8] = value.as_ref().as_ref();
+        let length: u32 = v.len().try_into().unwrap();
+        if length <= 12 {
+            let mut view_buffer = [0; 16];
+            view_buffer[0..4].copy_from_slice(&length.to_le_bytes());
+            view_buffer[4..4 + v.len()].copy_from_slice(v);
+            self.views_builder.append(u128::from_le_bytes(view_buffer));
+            self.null_buffer_builder.append_non_null();
+            return;
+        }
+
+        let required_cap = self.in_progress.len() + v.len();
+        if self.in_progress.capacity() < required_cap {

Review Comment:
   This is intentional to avoid bump allocating



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: initial support string_view and binary_view, supports layout and basic construction + tests [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #5481:
URL: https://github.com/apache/arrow-rs/pull/5481#discussion_r1522258477


##########
arrow-array/src/builder/generic_bytes_view_builder.rs:
##########
@@ -0,0 +1,215 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::builder::ArrayBuilder;
+use crate::types::{BinaryViewType, ByteViewType, StringViewType};
+use crate::{ArrayRef, GenericByteViewArray};
+use arrow_buffer::{Buffer, BufferBuilder, NullBufferBuilder, ScalarBuffer};
+use arrow_data::ByteView;
+use std::any::Any;
+use std::marker::PhantomData;
+use std::sync::Arc;
+
+const DEFAULT_BLOCK_SIZE: u32 = 8 * 1024;
+
+/// A builder for [`GenericByteViewArray`]
+///
+/// See [`Self::append_value`] for the allocation strategy
+pub struct GenericByteViewBuilder<T: ByteViewType + ?Sized> {
+    views_builder: BufferBuilder<u128>,
+    null_buffer_builder: NullBufferBuilder,
+    completed: Vec<Buffer>,
+    in_progress: Vec<u8>,
+    block_size: u32,
+    phantom: PhantomData<T>,
+}
+
+impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> {
+    /// Creates a new [`GenericByteViewBuilder`].
+    pub fn new() -> Self {
+        Self::with_capacity(1024)
+    }
+
+    /// Creates a new [`GenericByteViewBuilder`] with space for `capacity` strings
+    pub fn with_capacity(capacity: usize) -> Self {
+        Self {
+            views_builder: BufferBuilder::new(capacity),
+            null_buffer_builder: NullBufferBuilder::new(capacity),
+            completed: vec![],
+            in_progress: vec![],
+            block_size: DEFAULT_BLOCK_SIZE,
+            phantom: Default::default(),
+        }
+    }
+
+    /// Override the minimum size of buffers to allocate for string data

Review Comment:
   ```suggestion
       /// Override the size of buffers to allocate for holding string data
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: initial support string_view and binary_view, supports layout and basic construction + tests [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #5481:
URL: https://github.com/apache/arrow-rs/pull/5481#discussion_r1522288458


##########
arrow-array/src/builder/generic_bytes_view_builder.rs:
##########
@@ -0,0 +1,215 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::builder::ArrayBuilder;
+use crate::types::{BinaryViewType, ByteViewType, StringViewType};
+use crate::{ArrayRef, GenericByteViewArray};
+use arrow_buffer::{Buffer, BufferBuilder, NullBufferBuilder, ScalarBuffer};
+use arrow_data::ByteView;
+use std::any::Any;
+use std::marker::PhantomData;
+use std::sync::Arc;
+
+const DEFAULT_BLOCK_SIZE: u32 = 8 * 1024;
+
+/// A builder for [`GenericByteViewArray`]
+///
+/// See [`Self::append_value`] for the allocation strategy
+pub struct GenericByteViewBuilder<T: ByteViewType + ?Sized> {
+    views_builder: BufferBuilder<u128>,
+    null_buffer_builder: NullBufferBuilder,
+    completed: Vec<Buffer>,
+    in_progress: Vec<u8>,
+    block_size: u32,
+    phantom: PhantomData<T>,
+}
+
+impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> {
+    /// Creates a new [`GenericByteViewBuilder`].
+    pub fn new() -> Self {
+        Self::with_capacity(1024)
+    }
+
+    /// Creates a new [`GenericByteViewBuilder`] with space for `capacity` strings
+    pub fn with_capacity(capacity: usize) -> Self {
+        Self {
+            views_builder: BufferBuilder::new(capacity),
+            null_buffer_builder: NullBufferBuilder::new(capacity),
+            completed: vec![],
+            in_progress: vec![],
+            block_size: DEFAULT_BLOCK_SIZE,
+            phantom: Default::default(),
+        }
+    }
+
+    /// Override the minimum size of buffers to allocate for string data
+    pub fn with_block_size(self, block_size: u32) -> Self {
+        Self { block_size, ..self }
+    }
+
+    /// Appends a value into the builder
+    ///
+    /// # Panics
+    ///
+    /// Panics if
+    /// - String buffer count exceeds `u32::MAX`
+    /// - String length exceeds `u32::MAX`
+    #[inline]
+    pub fn append_value(&mut self, value: impl AsRef<T::Native>) {
+        let v: &[u8] = value.as_ref().as_ref();
+        let length: u32 = v.len().try_into().unwrap();
+        if length <= 12 {
+            let mut view_buffer = [0; 16];
+            view_buffer[0..4].copy_from_slice(&length.to_le_bytes());
+            view_buffer[4..4 + v.len()].copy_from_slice(v);
+            self.views_builder.append(u128::from_le_bytes(view_buffer));
+            self.null_buffer_builder.append_non_null();
+            return;
+        }
+
+        let required_cap = self.in_progress.len() + v.len();
+        if self.in_progress.capacity() < required_cap {

Review Comment:
   They're the same thing



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: initial support string_view and binary_view, supports layout and basic construction + tests [arrow-rs]

Posted by "bkietz (via GitHub)" <gi...@apache.org>.
bkietz commented on code in PR #5481:
URL: https://github.com/apache/arrow-rs/pull/5481#discussion_r1523656743


##########
arrow-data/src/equal/byte_view.rs:
##########
@@ -0,0 +1,74 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::{ArrayData, ByteView};
+
+pub(super) fn byte_view_equal(
+    lhs: &ArrayData,
+    rhs: &ArrayData,
+    lhs_start: usize,
+    rhs_start: usize,
+    len: usize,
+) -> bool {
+    let lhs_views = &lhs.buffer::<u128>(0)[lhs_start..lhs_start + len];
+    let lhs_buffers = &lhs.buffers()[1..];
+    let rhs_views = &rhs.buffer::<u128>(0)[rhs_start..rhs_start + len];
+    let rhs_buffers = &rhs.buffers()[1..];
+
+    for (idx, (l, r)) in lhs_views.iter().zip(rhs_views).enumerate() {
+        // Only checking one null mask here because by the time the control flow reaches
+        // this point, the equality of the two masks would have already been verified.
+        if lhs.is_null(idx) {
+            continue;
+        }
+
+        let l_len_prefix = *l as u64;
+        let r_len_prefix = *r as u64;
+        // short-circuit, check length and prefix
+        if l_len_prefix != r_len_prefix {

Review Comment:
   The spec does say "padded with zero" https://github.com/apache/arrow/blob/main/docs/source/format/Columnar.rst?plain=1#L384 but it could be repeated in the surrounding paragraph. In any case, padded with zero is definitely the intent



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: initial support string_view and binary_view, supports layout and basic construction + tests [arrow-rs]

Posted by "thinkharderdev (via GitHub)" <gi...@apache.org>.
thinkharderdev commented on code in PR #5481:
URL: https://github.com/apache/arrow-rs/pull/5481#discussion_r1522019986


##########
arrow-data/src/byte_view.rs:
##########
@@ -0,0 +1,115 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow_buffer::Buffer;
+use arrow_schema::ArrowError;
+
+#[derive(Debug, Copy, Clone, Default)]
+#[repr(C)]
+pub struct ByteView {
+    /// The length of the string/bytes.
+    pub length: u32,
+    /// First 4 bytes of string/bytes data.
+    pub prefix: u32,
+    /// The buffer index.
+    pub buffer_index: u32,
+    /// The offset into the buffer.
+    pub offset: u32,
+}
+
+impl ByteView {
+    #[inline(always)]
+    pub fn as_u128(self) -> u128 {
+        unsafe { std::mem::transmute(self) }

Review Comment:
   isn't this invalid on big-endian platforms?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: initial support string_view and binary_view, supports layout and basic construction + tests [arrow-rs]

Posted by "ariesdevil (via GitHub)" <gi...@apache.org>.
ariesdevil commented on PR #5481:
URL: https://github.com/apache/arrow-rs/pull/5481#issuecomment-1993288922

   Hi @alamb @tustvold , I modified the code as you guys suggested and added more tests, PTAL again.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: initial support string_view and binary_view, supports layout and basic construction + tests [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #5481:
URL: https://github.com/apache/arrow-rs/pull/5481#discussion_r1522258477


##########
arrow-array/src/builder/generic_bytes_view_builder.rs:
##########
@@ -0,0 +1,215 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::builder::ArrayBuilder;
+use crate::types::{BinaryViewType, ByteViewType, StringViewType};
+use crate::{ArrayRef, GenericByteViewArray};
+use arrow_buffer::{Buffer, BufferBuilder, NullBufferBuilder, ScalarBuffer};
+use arrow_data::ByteView;
+use std::any::Any;
+use std::marker::PhantomData;
+use std::sync::Arc;
+
+const DEFAULT_BLOCK_SIZE: u32 = 8 * 1024;
+
+/// A builder for [`GenericByteViewArray`]
+///
+/// See [`Self::append_value`] for the allocation strategy
+pub struct GenericByteViewBuilder<T: ByteViewType + ?Sized> {
+    views_builder: BufferBuilder<u128>,
+    null_buffer_builder: NullBufferBuilder,
+    completed: Vec<Buffer>,
+    in_progress: Vec<u8>,
+    block_size: u32,
+    phantom: PhantomData<T>,
+}
+
+impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> {
+    /// Creates a new [`GenericByteViewBuilder`].
+    pub fn new() -> Self {
+        Self::with_capacity(1024)
+    }
+
+    /// Creates a new [`GenericByteViewBuilder`] with space for `capacity` strings
+    pub fn with_capacity(capacity: usize) -> Self {
+        Self {
+            views_builder: BufferBuilder::new(capacity),
+            null_buffer_builder: NullBufferBuilder::new(capacity),
+            completed: vec![],
+            in_progress: vec![],
+            block_size: DEFAULT_BLOCK_SIZE,
+            phantom: Default::default(),
+        }
+    }
+
+    /// Override the minimum size of buffers to allocate for string data

Review Comment:
   ```suggestion
       /// Override the size of buffers to allocate for holding string data
       ///
       /// Note: if a single string value exceeds this, a buffer will be allocated for just this string value
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: initial support string_view and binary_view, supports layout and basic construction + tests [arrow-rs]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb merged PR #5481:
URL: https://github.com/apache/arrow-rs/pull/5481


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: initial support string_view and binary_view, supports layout and basic construction + tests [arrow-rs]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #5481:
URL: https://github.com/apache/arrow-rs/pull/5481#discussion_r1523766152


##########
arrow-data/src/equal/byte_view.rs:
##########
@@ -0,0 +1,74 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::{ArrayData, ByteView};
+
+pub(super) fn byte_view_equal(
+    lhs: &ArrayData,
+    rhs: &ArrayData,
+    lhs_start: usize,
+    rhs_start: usize,
+    len: usize,
+) -> bool {
+    let lhs_views = &lhs.buffer::<u128>(0)[lhs_start..lhs_start + len];
+    let lhs_buffers = &lhs.buffers()[1..];
+    let rhs_views = &rhs.buffer::<u128>(0)[rhs_start..rhs_start + len];
+    let rhs_buffers = &rhs.buffers()[1..];
+
+    for (idx, (l, r)) in lhs_views.iter().zip(rhs_views).enumerate() {
+        // Only checking one null mask here because by the time the control flow reaches
+        // this point, the equality of the two masks would have already been verified.
+        if lhs.is_null(idx) {
+            continue;
+        }
+
+        let l_len_prefix = *l as u64;
+        let r_len_prefix = *r as u64;
+        // short-circuit, check length and prefix
+        if l_len_prefix != r_len_prefix {

Review Comment:
   
   > The spec does say "padded with zero" https://github.com/apache/arrow/blob/main/docs/source/format/Columnar.rst?plain=1#L384 but it could be repeated in the surrounding paragraph. In any case, padded with zero is definitely the intent
   
   Good idea -- filed https://github.com/apache/arrow/pull/40512 upstream in arrow to clarify this point
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: initial support string_view and binary_view, supports layout and basic construction + tests [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #5481:
URL: https://github.com/apache/arrow-rs/pull/5481#discussion_r1522068729


##########
arrow-data/src/byte_view.rs:
##########
@@ -0,0 +1,115 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow_buffer::Buffer;
+use arrow_schema::ArrowError;
+
+#[derive(Debug, Copy, Clone, Default)]
+#[repr(C)]
+pub struct ByteView {
+    /// The length of the string/bytes.
+    pub length: u32,
+    /// First 4 bytes of string/bytes data.
+    pub prefix: u32,
+    /// The buffer index.
+    pub buffer_index: u32,
+    /// The offset into the buffer.
+    pub offset: u32,
+}
+
+impl ByteView {
+    #[inline(always)]
+    pub fn as_u128(self) -> u128 {
+        unsafe { std::mem::transmute(self) }

Review Comment:
   Quite possibly, the original PR wrote the shifts out and relied on the compiler to be smart enough (which it is)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: initial support string_view and binary_view, supports layout and basic construction + tests [arrow-rs]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #5481:
URL: https://github.com/apache/arrow-rs/pull/5481#discussion_r1522306497


##########
arrow-array/src/builder/generic_bytes_view_builder.rs:
##########
@@ -0,0 +1,215 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::builder::ArrayBuilder;
+use crate::types::{BinaryViewType, ByteViewType, StringViewType};
+use crate::{ArrayRef, GenericByteViewArray};
+use arrow_buffer::{Buffer, BufferBuilder, NullBufferBuilder, ScalarBuffer};
+use arrow_data::ByteView;
+use std::any::Any;
+use std::marker::PhantomData;
+use std::sync::Arc;
+
+const DEFAULT_BLOCK_SIZE: u32 = 8 * 1024;
+
+/// A builder for [`GenericByteViewArray`]
+///
+/// See [`Self::append_value`] for the allocation strategy
+pub struct GenericByteViewBuilder<T: ByteViewType + ?Sized> {
+    views_builder: BufferBuilder<u128>,
+    null_buffer_builder: NullBufferBuilder,
+    completed: Vec<Buffer>,
+    in_progress: Vec<u8>,
+    block_size: u32,
+    phantom: PhantomData<T>,
+}
+
+impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> {
+    /// Creates a new [`GenericByteViewBuilder`].
+    pub fn new() -> Self {
+        Self::with_capacity(1024)
+    }
+
+    /// Creates a new [`GenericByteViewBuilder`] with space for `capacity` strings
+    pub fn with_capacity(capacity: usize) -> Self {
+        Self {
+            views_builder: BufferBuilder::new(capacity),
+            null_buffer_builder: NullBufferBuilder::new(capacity),
+            completed: vec![],
+            in_progress: vec![],
+            block_size: DEFAULT_BLOCK_SIZE,
+            phantom: Default::default(),
+        }
+    }
+
+    /// Override the minimum size of buffers to allocate for string data
+    pub fn with_block_size(self, block_size: u32) -> Self {
+        Self { block_size, ..self }
+    }
+
+    /// Appends a value into the builder
+    ///
+    /// # Panics
+    ///
+    /// Panics if
+    /// - String buffer count exceeds `u32::MAX`
+    /// - String length exceeds `u32::MAX`
+    #[inline]
+    pub fn append_value(&mut self, value: impl AsRef<T::Native>) {
+        let v: &[u8] = value.as_ref().as_ref();
+        let length: u32 = v.len().try_into().unwrap();
+        if length <= 12 {
+            let mut view_buffer = [0; 16];
+            view_buffer[0..4].copy_from_slice(&length.to_le_bytes());
+            view_buffer[4..4 + v.len()].copy_from_slice(v);
+            self.views_builder.append(u128::from_le_bytes(view_buffer));
+            self.null_buffer_builder.append_non_null();
+            return;
+        }
+
+        let required_cap = self.in_progress.len() + v.len();
+        if self.in_progress.capacity() < required_cap {

Review Comment:
   As a user I don't want to have to read to code to understand what the settings mean. If we can make this simpler that would be good -- we can do this as a follow on PR



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: initial support string_view and binary_view, supports layout and basic construction + tests [arrow-rs]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #5481:
URL: https://github.com/apache/arrow-rs/pull/5481#discussion_r1522898796


##########
arrow-array/src/builder/generic_bytes_view_builder.rs:
##########
@@ -0,0 +1,215 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::builder::ArrayBuilder;
+use crate::types::{BinaryViewType, ByteViewType, StringViewType};
+use crate::{ArrayRef, GenericByteViewArray};
+use arrow_buffer::{Buffer, BufferBuilder, NullBufferBuilder, ScalarBuffer};
+use arrow_data::ByteView;
+use std::any::Any;
+use std::marker::PhantomData;
+use std::sync::Arc;
+
+const DEFAULT_BLOCK_SIZE: u32 = 8 * 1024;
+
+/// A builder for [`GenericByteViewArray`]
+///
+/// See [`Self::append_value`] for the allocation strategy
+pub struct GenericByteViewBuilder<T: ByteViewType + ?Sized> {
+    views_builder: BufferBuilder<u128>,
+    null_buffer_builder: NullBufferBuilder,
+    completed: Vec<Buffer>,
+    in_progress: Vec<u8>,
+    block_size: u32,
+    phantom: PhantomData<T>,
+}
+
+impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> {
+    /// Creates a new [`GenericByteViewBuilder`].
+    pub fn new() -> Self {
+        Self::with_capacity(1024)
+    }
+
+    /// Creates a new [`GenericByteViewBuilder`] with space for `capacity` strings
+    pub fn with_capacity(capacity: usize) -> Self {
+        Self {
+            views_builder: BufferBuilder::new(capacity),
+            null_buffer_builder: NullBufferBuilder::new(capacity),
+            completed: vec![],
+            in_progress: vec![],
+            block_size: DEFAULT_BLOCK_SIZE,
+            phantom: Default::default(),
+        }
+    }
+
+    /// Override the minimum size of buffers to allocate for string data
+    pub fn with_block_size(self, block_size: u32) -> Self {
+        Self { block_size, ..self }
+    }
+
+    /// Appends a value into the builder
+    ///
+    /// # Panics
+    ///
+    /// Panics if
+    /// - String buffer count exceeds `u32::MAX`
+    /// - String length exceeds `u32::MAX`
+    #[inline]
+    pub fn append_value(&mut self, value: impl AsRef<T::Native>) {
+        let v: &[u8] = value.as_ref().as_ref();
+        let length: u32 = v.len().try_into().unwrap();
+        if length <= 12 {
+            let mut view_buffer = [0; 16];
+            view_buffer[0..4].copy_from_slice(&length.to_le_bytes());
+            view_buffer[4..4 + v.len()].copy_from_slice(v);
+            self.views_builder.append(u128::from_le_bytes(view_buffer));
+            self.null_buffer_builder.append_non_null();
+            return;
+        }
+
+        let required_cap = self.in_progress.len() + v.len();
+        if self.in_progress.capacity() < required_cap {

Review Comment:
   I was referring to these APIs
   
   ```rust
   impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> {
   ...
       /// Creates a new [`GenericByteViewBuilder`] with space for `capacity` string values.
       pub fn with_capacity(capacity: usize) -> Self {
   ...
       }
   
       /// Override the size of buffers to allocate for holding string data
       pub fn with_block_size(self, block_size: u32) -> Self {
   ...
       }
   ...
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: initial support string_view and binary_view, supports layout and basic construction + tests [arrow-rs]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #5481:
URL: https://github.com/apache/arrow-rs/pull/5481#discussion_r1522151584


##########
arrow-array/src/builder/generic_bytes_view_builder.rs:
##########
@@ -0,0 +1,215 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::builder::ArrayBuilder;
+use crate::types::{BinaryViewType, ByteViewType, StringViewType};
+use crate::{ArrayRef, GenericByteViewArray};
+use arrow_buffer::{Buffer, BufferBuilder, NullBufferBuilder, ScalarBuffer};
+use arrow_data::ByteView;
+use std::any::Any;
+use std::marker::PhantomData;
+use std::sync::Arc;
+
+const DEFAULT_BLOCK_SIZE: u32 = 8 * 1024;
+
+/// A builder for [`GenericByteViewArray`]
+///
+/// See [`Self::append_value`] for the allocation strategy
+pub struct GenericByteViewBuilder<T: ByteViewType + ?Sized> {
+    views_builder: BufferBuilder<u128>,
+    null_buffer_builder: NullBufferBuilder,
+    completed: Vec<Buffer>,
+    in_progress: Vec<u8>,
+    block_size: u32,
+    phantom: PhantomData<T>,
+}
+
+impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> {
+    /// Creates a new [`GenericByteViewBuilder`].
+    pub fn new() -> Self {
+        Self::with_capacity(1024)
+    }
+
+    /// Creates a new [`GenericByteViewBuilder`] with space for `capacity` strings
+    pub fn with_capacity(capacity: usize) -> Self {
+        Self {
+            views_builder: BufferBuilder::new(capacity),
+            null_buffer_builder: NullBufferBuilder::new(capacity),
+            completed: vec![],
+            in_progress: vec![],
+            block_size: DEFAULT_BLOCK_SIZE,
+            phantom: Default::default(),
+        }
+    }
+
+    /// Override the minimum size of buffers to allocate for string data
+    pub fn with_block_size(self, block_size: u32) -> Self {
+        Self { block_size, ..self }
+    }
+
+    /// Appends a value into the builder
+    ///
+    /// # Panics
+    ///
+    /// Panics if
+    /// - String buffer count exceeds `u32::MAX`
+    /// - String length exceeds `u32::MAX`
+    #[inline]
+    pub fn append_value(&mut self, value: impl AsRef<T::Native>) {
+        let v: &[u8] = value.as_ref().as_ref();
+        let length: u32 = v.len().try_into().unwrap();
+        if length <= 12 {
+            let mut view_buffer = [0; 16];
+            view_buffer[0..4].copy_from_slice(&length.to_le_bytes());
+            view_buffer[4..4 + v.len()].copy_from_slice(v);
+            self.views_builder.append(u128::from_le_bytes(view_buffer));
+            self.null_buffer_builder.append_non_null();
+            return;
+        }
+
+        let required_cap = self.in_progress.len() + v.len();
+        if self.in_progress.capacity() < required_cap {

Review Comment:
   It seems like this will grow when the capacity is exceeded -- I would expect that it only creates a new in_progress buffer when the block_size is exceeded, because the capacity is a size hint.



##########
arrow-array/src/builder/generic_bytes_view_builder.rs:
##########
@@ -0,0 +1,215 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::builder::ArrayBuilder;
+use crate::types::{BinaryViewType, ByteViewType, StringViewType};
+use crate::{ArrayRef, GenericByteViewArray};
+use arrow_buffer::{Buffer, BufferBuilder, NullBufferBuilder, ScalarBuffer};
+use arrow_data::ByteView;
+use std::any::Any;
+use std::marker::PhantomData;
+use std::sync::Arc;
+
+const DEFAULT_BLOCK_SIZE: u32 = 8 * 1024;
+
+/// A builder for [`GenericByteViewArray`]
+///
+/// See [`Self::append_value`] for the allocation strategy
+pub struct GenericByteViewBuilder<T: ByteViewType + ?Sized> {
+    views_builder: BufferBuilder<u128>,
+    null_buffer_builder: NullBufferBuilder,
+    completed: Vec<Buffer>,
+    in_progress: Vec<u8>,
+    block_size: u32,
+    phantom: PhantomData<T>,
+}
+
+impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> {
+    /// Creates a new [`GenericByteViewBuilder`].
+    pub fn new() -> Self {
+        Self::with_capacity(1024)
+    }
+
+    /// Creates a new [`GenericByteViewBuilder`] with space for `capacity` strings
+    pub fn with_capacity(capacity: usize) -> Self {
+        Self {
+            views_builder: BufferBuilder::new(capacity),
+            null_buffer_builder: NullBufferBuilder::new(capacity),
+            completed: vec![],
+            in_progress: vec![],
+            block_size: DEFAULT_BLOCK_SIZE,
+            phantom: Default::default(),
+        }
+    }
+
+    /// Override the minimum size of buffers to allocate for string data

Review Comment:
   Is block size more like the "maximum size of buffers to allocate"? Maybe I am mis understanding the intent



##########
arrow-data/src/equal/byte_view.rs:
##########
@@ -0,0 +1,74 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::{ArrayData, ByteView};
+
+pub(super) fn byte_view_equal(
+    lhs: &ArrayData,
+    rhs: &ArrayData,
+    lhs_start: usize,
+    rhs_start: usize,
+    len: usize,
+) -> bool {
+    let lhs_views = &lhs.buffer::<u128>(0)[lhs_start..lhs_start + len];
+    let lhs_buffers = &lhs.buffers()[1..];
+    let rhs_views = &rhs.buffer::<u128>(0)[rhs_start..rhs_start + len];
+    let rhs_buffers = &rhs.buffers()[1..];
+
+    for (idx, (l, r)) in lhs_views.iter().zip(rhs_views).enumerate() {
+        // Only checking one null mask here because by the time the control flow reaches
+        // this point, the equality of the two masks would have already been verified.
+        if lhs.is_null(idx) {
+            continue;
+        }
+
+        let l_len_prefix = *l as u64;
+        let r_len_prefix = *r as u64;
+        // short-circuit, check length and prefix
+        if l_len_prefix != r_len_prefix {

Review Comment:
   The spec does not specify what the padding byte are from what I can tell: https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-view-layout
   
   However, the `validate_view_impl` routines check for zero padding.
   
   Maybe we we can move this check after the check for `len <= 12` 



##########
arrow-data/src/byte_view.rs:
##########
@@ -0,0 +1,115 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow_buffer::Buffer;
+use arrow_schema::ArrowError;
+
+#[derive(Debug, Copy, Clone, Default)]
+#[repr(C)]
+pub struct ByteView {
+    /// The length of the string/bytes.
+    pub length: u32,
+    /// First 4 bytes of string/bytes data.
+    pub prefix: u32,
+    /// The buffer index.
+    pub buffer_index: u32,
+    /// The offset into the buffer.
+    pub offset: u32,
+}
+
+impl ByteView {
+    #[inline(always)]
+    pub fn as_u128(self) -> u128 {
+        unsafe { std::mem::transmute(self) }

Review Comment:
   So the recommendation is to change this back to assembling the `u128` via shifts from the individual fields?



##########
arrow-array/src/array/byte_view_array.rs:
##########
@@ -0,0 +1,442 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::array::print_long_array;
+use crate::builder::GenericByteViewBuilder;
+use crate::iterator::ArrayIter;
+use crate::types::bytes::ByteArrayNativeType;
+use crate::types::{BinaryViewType, ByteViewType, StringViewType};
+use crate::{Array, ArrayAccessor, ArrayRef};
+use arrow_buffer::{Buffer, NullBuffer, ScalarBuffer};
+use arrow_data::{ArrayData, ArrayDataBuilder, ByteView};
+use arrow_schema::{ArrowError, DataType};
+use std::any::Any;
+use std::fmt::Debug;
+use std::marker::PhantomData;
+use std::sync::Arc;
+
+/// [Variable-size Binary View Layout]: An array of variable length bytes view arrays.
+///
+/// Different than [`GenericByteArray`] as it stores both an offset and length
+/// meaning that take / filter operations can be implemented without copying the underlying data.
+///
+/// [Variable-size Binary View Layout]: https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-view-layout
+pub struct GenericByteViewArray<T: ByteViewType + ?Sized> {
+    data_type: DataType,
+    views: ScalarBuffer<u128>,
+    buffers: Vec<Buffer>,
+    phantom: PhantomData<T>,
+    nulls: Option<NullBuffer>,
+}
+
+impl<T: ByteViewType + ?Sized> Clone for GenericByteViewArray<T> {
+    fn clone(&self) -> Self {
+        Self {
+            data_type: T::DATA_TYPE,
+            views: self.views.clone(),
+            buffers: self.buffers.clone(),
+            nulls: self.nulls.clone(),
+            phantom: Default::default(),
+        }
+    }
+}
+
+impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> {
+    /// Create a new [`GenericByteViewArray`] from the provided parts, panicking on failure
+    ///
+    /// # Panics
+    ///
+    /// Panics if [`GenericByteViewArray::try_new`] returns an error
+    pub fn new(views: ScalarBuffer<u128>, buffers: Vec<Buffer>, nulls: Option<NullBuffer>) -> Self {
+        Self::try_new(views, buffers, nulls).unwrap()
+    }
+
+    /// Create a new [`GenericByteViewArray`] from the provided parts, returning an error on failure
+    ///
+    /// # Errors
+    ///
+    /// * `views.len() != nulls.len()`
+    /// * [ByteViewType::validate] fails
+    pub fn try_new(
+        views: ScalarBuffer<u128>,
+        buffers: Vec<Buffer>,
+        nulls: Option<NullBuffer>,
+    ) -> Result<Self, ArrowError> {
+        T::validate(&views, &buffers)?;
+
+        if let Some(n) = nulls.as_ref() {
+            if n.len() != views.len() {
+                return Err(ArrowError::InvalidArgumentError(format!(
+                    "Incorrect length of null buffer for {}ViewArray, expected {} got {}",
+                    T::PREFIX,
+                    views.len(),
+                    n.len(),
+                )));
+            }
+        }
+
+        Ok(Self {
+            data_type: T::DATA_TYPE,
+            views,
+            buffers,
+            nulls,
+            phantom: Default::default(),
+        })
+    }
+
+    /// Create a new [`GenericByteViewArray`] from the provided parts, without validation
+    ///
+    /// # Safety
+    ///
+    /// Safe if [`Self::try_new`] would not error
+    pub unsafe fn new_unchecked(
+        views: ScalarBuffer<u128>,
+        buffers: Vec<Buffer>,
+        nulls: Option<NullBuffer>,
+    ) -> Self {
+        Self {
+            data_type: T::DATA_TYPE,
+            phantom: Default::default(),
+            views,
+            buffers,
+            nulls,
+        }
+    }
+
+    /// Create a new [`GenericByteViewArray`] of length `len` where all values are null
+    pub fn new_null(len: usize) -> Self {
+        Self {
+            data_type: T::DATA_TYPE,
+            views: vec![0; len].into(),
+            buffers: vec![],
+            nulls: Some(NullBuffer::new_null(len)),
+            phantom: Default::default(),
+        }
+    }
+
+    /// Creates a [`GenericByteViewArray`] based on an iterator of values without nulls
+    pub fn from_iter_values<Ptr, I>(iter: I) -> Self
+    where
+        Ptr: AsRef<T::Native>,
+        I: IntoIterator<Item = Ptr>,
+    {
+        let iter = iter.into_iter();
+        let mut builder = GenericByteViewBuilder::<T>::with_capacity(iter.size_hint().0);
+        for v in iter {
+            builder.append_value(v);
+        }
+        builder.finish()
+    }
+
+    /// Deconstruct this array into its constituent parts
+    pub fn into_parts(self) -> (ScalarBuffer<u128>, Vec<Buffer>, Option<NullBuffer>) {
+        (self.views, self.buffers, self.nulls)
+    }
+
+    /// Returns the views buffer
+    #[inline]
+    pub fn views(&self) -> &ScalarBuffer<u128> {
+        &self.views
+    }
+
+    /// Returns the buffers storing string data
+    #[inline]
+    pub fn data_buffers(&self) -> &[Buffer] {
+        &self.buffers
+    }
+
+    /// Returns the element at index `i`
+    /// # Panics
+    /// Panics if index `i` is out of bounds.
+    pub fn value(&self, i: usize) -> &T::Native {
+        assert!(
+            i < self.len(),
+            "Trying to access an element at index {} from a {}ViewArray of length {}",
+            i,
+            T::PREFIX,
+            self.len()
+        );
+
+        unsafe { self.value_unchecked(i) }
+    }
+
+    /// Returns the element at index `i`
+    /// # Safety
+    /// Caller is responsible for ensuring that the index is within the bounds of the array
+    pub unsafe fn value_unchecked(&self, idx: usize) -> &T::Native {
+        let v = self.views.get_unchecked(idx);
+        let len = *v as u32;
+        let b = if len <= 12 {
+            let ptr = self.views.as_ptr() as *const u8;
+            std::slice::from_raw_parts(ptr.add(idx * 16 + 4), len as usize)
+        } else {
+            let view = ByteView::from(*v);
+            let data = self.buffers.get_unchecked(view.buffer_index as usize);
+            let offset = view.offset as usize;
+            data.get_unchecked(offset..offset + len as usize)
+        };
+        T::Native::from_bytes_unchecked(b)
+    }
+
+    /// constructs a new iterator
+    pub fn iter(&self) -> ArrayIter<&Self> {
+        ArrayIter::new(self)
+    }
+
+    /// Returns a zero-copy slice of this array with the indicated offset and length.
+    pub fn slice(&self, offset: usize, length: usize) -> Self {
+        Self {
+            data_type: T::DATA_TYPE,
+            views: self.views.slice(offset, length),
+            buffers: self.buffers.clone(),
+            nulls: self.nulls.as_ref().map(|n| n.slice(offset, length)),
+            phantom: Default::default(),
+        }
+    }
+}
+
+impl<T: ByteViewType + ?Sized> Debug for GenericByteViewArray<T> {
+    fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
+        write!(f, "{}ViewArray\n[\n", T::PREFIX)?;
+        print_long_array(self, f, |array, index, f| {
+            std::fmt::Debug::fmt(&array.value(index), f)
+        })?;
+        write!(f, "]")
+    }
+}
+
+impl<T: ByteViewType + ?Sized> Array for GenericByteViewArray<T> {
+    fn as_any(&self) -> &dyn Any {
+        self
+    }
+
+    fn to_data(&self) -> ArrayData {
+        self.clone().into()
+    }
+
+    fn into_data(self) -> ArrayData {
+        self.into()
+    }
+
+    fn data_type(&self) -> &DataType {
+        &self.data_type
+    }
+
+    fn slice(&self, offset: usize, length: usize) -> ArrayRef {
+        Arc::new(self.slice(offset, length))
+    }
+
+    fn len(&self) -> usize {
+        self.views.len()
+    }
+
+    fn is_empty(&self) -> bool {
+        self.views.is_empty()
+    }
+
+    fn offset(&self) -> usize {
+        0
+    }
+
+    fn nulls(&self) -> Option<&NullBuffer> {
+        self.nulls.as_ref()
+    }
+
+    fn get_buffer_memory_size(&self) -> usize {
+        let mut sum = self.buffers.iter().map(|b| b.capacity()).sum::<usize>();
+        sum += self.views.inner().capacity();
+        if let Some(x) = &self.nulls {
+            sum += x.buffer().capacity()
+        }
+        sum
+    }
+
+    fn get_array_memory_size(&self) -> usize {
+        std::mem::size_of::<Self>() + self.get_buffer_memory_size()
+    }
+}
+
+impl<'a, T: ByteViewType + ?Sized> ArrayAccessor for &'a GenericByteViewArray<T> {
+    type Item = &'a T::Native;
+
+    fn value(&self, index: usize) -> Self::Item {
+        GenericByteViewArray::value(self, index)
+    }
+
+    unsafe fn value_unchecked(&self, index: usize) -> Self::Item {
+        GenericByteViewArray::value_unchecked(self, index)
+    }
+}
+
+impl<'a, T: ByteViewType + ?Sized> IntoIterator for &'a GenericByteViewArray<T> {
+    type Item = Option<&'a T::Native>;
+    type IntoIter = ArrayIter<Self>;
+
+    fn into_iter(self) -> Self::IntoIter {
+        ArrayIter::new(self)
+    }
+}
+
+impl<T: ByteViewType + ?Sized> From<ArrayData> for GenericByteViewArray<T> {
+    fn from(value: ArrayData) -> Self {
+        let views = value.buffers()[0].clone();
+        let views = ScalarBuffer::new(views, value.offset(), value.len());
+        let buffers = value.buffers()[1..].to_vec();
+        Self {
+            data_type: T::DATA_TYPE,
+            views,
+            buffers,
+            nulls: value.nulls().cloned(),
+            phantom: Default::default(),
+        }
+    }
+}
+
+impl<T: ByteViewType + ?Sized> From<GenericByteViewArray<T>> for ArrayData {
+    fn from(mut array: GenericByteViewArray<T>) -> Self {
+        let len = array.len();
+        array.buffers.insert(0, array.views.into_inner());
+        let builder = ArrayDataBuilder::new(T::DATA_TYPE)
+            .len(len)
+            .buffers(array.buffers)
+            .nulls(array.nulls);
+
+        unsafe { builder.build_unchecked() }
+    }
+}
+
+impl<Ptr, T: ByteViewType + ?Sized> FromIterator<Option<Ptr>> for GenericByteViewArray<T>
+where
+    Ptr: AsRef<T::Native>,
+{
+    fn from_iter<I: IntoIterator<Item = Option<Ptr>>>(iter: I) -> Self {
+        let iter = iter.into_iter();
+        let mut builder = GenericByteViewBuilder::<T>::with_capacity(iter.size_hint().0);
+        builder.extend(iter);
+        builder.finish()
+    }
+}
+
+/// A [`GenericByteViewArray`] of `[u8]`
+pub type BinaryViewArray = GenericByteViewArray<BinaryViewType>;
+
+/// A [`GenericByteViewArray`] of `str`
+///
+/// ```
+/// use arrow_array::StringViewArray;
+/// let array = StringViewArray::from_iter_values(vec!["hello", "world", "lulu", "large payload over 12 bytes"]);
+/// assert_eq!(array.value(0), "hello");
+/// assert_eq!(array.value(3), "large payload over 12 bytes");
+/// ```
+pub type StringViewArray = GenericByteViewArray<StringViewType>;
+
+impl From<Vec<&str>> for StringViewArray {
+    fn from(v: Vec<&str>) -> Self {
+        Self::from_iter_values(v)
+    }
+}
+
+#[cfg(test)]
+mod tests {
+    use crate::builder::StringViewBuilder;
+    use crate::{Array, BinaryViewArray, StringViewArray};
+    use arrow_buffer::{Buffer, ScalarBuffer};
+    use arrow_data::ByteView;
+
+    #[test]
+    fn try_new() {
+        let array = StringViewArray::from_iter_values(vec![
+            "hello",
+            "world",
+            "lulu",
+            "large payload over 12 bytes",
+        ]);
+        assert_eq!(array.value(0), "hello");
+        assert_eq!(array.value(3), "large payload over 12 bytes");
+
+        let array = BinaryViewArray::from_iter_values(vec![
+            b"hello".as_slice(),
+            b"world".as_slice(),
+            b"lulu".as_slice(),
+            b"large payload over 12 bytes".as_slice(),
+        ]);
+        assert_eq!(array.value(0), b"hello");
+        assert_eq!(array.value(3), b"large payload over 12 bytes");
+
+        // test empty array
+        let array = {
+            let mut builder = StringViewBuilder::new();
+            builder.finish()
+        };
+        assert!(array.is_empty());
+
+        // test builder append
+        let array = {
+            let mut builder = StringViewBuilder::new();
+            builder.append_value("hello");
+            builder.append_null();
+            builder.append_option(Some("large payload over 12 bytes"));
+            builder.finish()
+        };
+        assert_eq!(array.value(0), "hello");
+        assert!(array.is_null(1));
+        assert_eq!(array.value(2), "large payload over 12 bytes");
+
+        // test builder's in_progress re-created
+        let array = {
+            // make a builder with small block size.
+            let mut builder = StringViewBuilder::new().with_block_size(14);
+            builder.append_value("large payload over 12 bytes");
+            builder.append_option(Some("another large payload over 12 bytes that double than the first one, so that we can trigger the in_progress in builder re-created"));
+            builder.finish()
+        };
+        assert_eq!(array.value(0), "large payload over 12 bytes");
+        assert_eq!(array.value(1), "another large payload over 12 bytes that double than the first one, so that we can trigger the in_progress in builder re-created");

Review Comment:
   Can you also please assert in this test that the array actually has 2 data buffers (not one)?



##########
arrow-array/src/array/byte_view_array.rs:
##########
@@ -0,0 +1,442 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::array::print_long_array;
+use crate::builder::GenericByteViewBuilder;
+use crate::iterator::ArrayIter;
+use crate::types::bytes::ByteArrayNativeType;
+use crate::types::{BinaryViewType, ByteViewType, StringViewType};
+use crate::{Array, ArrayAccessor, ArrayRef};
+use arrow_buffer::{Buffer, NullBuffer, ScalarBuffer};
+use arrow_data::{ArrayData, ArrayDataBuilder, ByteView};
+use arrow_schema::{ArrowError, DataType};
+use std::any::Any;
+use std::fmt::Debug;
+use std::marker::PhantomData;
+use std::sync::Arc;
+
+/// [Variable-size Binary View Layout]: An array of variable length bytes view arrays.
+///
+/// Different than [`GenericByteArray`] as it stores both an offset and length
+/// meaning that take / filter operations can be implemented without copying the underlying data.
+///
+/// [Variable-size Binary View Layout]: https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-view-layout
+pub struct GenericByteViewArray<T: ByteViewType + ?Sized> {
+    data_type: DataType,
+    views: ScalarBuffer<u128>,
+    buffers: Vec<Buffer>,
+    phantom: PhantomData<T>,
+    nulls: Option<NullBuffer>,
+}
+
+impl<T: ByteViewType + ?Sized> Clone for GenericByteViewArray<T> {
+    fn clone(&self) -> Self {
+        Self {
+            data_type: T::DATA_TYPE,
+            views: self.views.clone(),
+            buffers: self.buffers.clone(),
+            nulls: self.nulls.clone(),
+            phantom: Default::default(),
+        }
+    }
+}
+
+impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> {
+    /// Create a new [`GenericByteViewArray`] from the provided parts, panicking on failure
+    ///
+    /// # Panics
+    ///
+    /// Panics if [`GenericByteViewArray::try_new`] returns an error
+    pub fn new(views: ScalarBuffer<u128>, buffers: Vec<Buffer>, nulls: Option<NullBuffer>) -> Self {
+        Self::try_new(views, buffers, nulls).unwrap()
+    }
+
+    /// Create a new [`GenericByteViewArray`] from the provided parts, returning an error on failure
+    ///
+    /// # Errors
+    ///
+    /// * `views.len() != nulls.len()`
+    /// * [ByteViewType::validate] fails
+    pub fn try_new(
+        views: ScalarBuffer<u128>,
+        buffers: Vec<Buffer>,
+        nulls: Option<NullBuffer>,
+    ) -> Result<Self, ArrowError> {
+        T::validate(&views, &buffers)?;
+
+        if let Some(n) = nulls.as_ref() {
+            if n.len() != views.len() {
+                return Err(ArrowError::InvalidArgumentError(format!(
+                    "Incorrect length of null buffer for {}ViewArray, expected {} got {}",
+                    T::PREFIX,
+                    views.len(),
+                    n.len(),
+                )));
+            }
+        }
+
+        Ok(Self {
+            data_type: T::DATA_TYPE,
+            views,
+            buffers,
+            nulls,
+            phantom: Default::default(),
+        })
+    }
+
+    /// Create a new [`GenericByteViewArray`] from the provided parts, without validation
+    ///
+    /// # Safety
+    ///
+    /// Safe if [`Self::try_new`] would not error
+    pub unsafe fn new_unchecked(
+        views: ScalarBuffer<u128>,
+        buffers: Vec<Buffer>,
+        nulls: Option<NullBuffer>,
+    ) -> Self {
+        Self {
+            data_type: T::DATA_TYPE,
+            phantom: Default::default(),
+            views,
+            buffers,
+            nulls,
+        }
+    }
+
+    /// Create a new [`GenericByteViewArray`] of length `len` where all values are null
+    pub fn new_null(len: usize) -> Self {
+        Self {
+            data_type: T::DATA_TYPE,
+            views: vec![0; len].into(),
+            buffers: vec![],
+            nulls: Some(NullBuffer::new_null(len)),
+            phantom: Default::default(),
+        }
+    }
+
+    /// Creates a [`GenericByteViewArray`] based on an iterator of values without nulls
+    pub fn from_iter_values<Ptr, I>(iter: I) -> Self
+    where
+        Ptr: AsRef<T::Native>,
+        I: IntoIterator<Item = Ptr>,
+    {
+        let iter = iter.into_iter();
+        let mut builder = GenericByteViewBuilder::<T>::with_capacity(iter.size_hint().0);
+        for v in iter {
+            builder.append_value(v);
+        }
+        builder.finish()
+    }
+
+    /// Deconstruct this array into its constituent parts
+    pub fn into_parts(self) -> (ScalarBuffer<u128>, Vec<Buffer>, Option<NullBuffer>) {
+        (self.views, self.buffers, self.nulls)
+    }
+
+    /// Returns the views buffer
+    #[inline]
+    pub fn views(&self) -> &ScalarBuffer<u128> {
+        &self.views
+    }
+
+    /// Returns the buffers storing string data
+    #[inline]
+    pub fn data_buffers(&self) -> &[Buffer] {
+        &self.buffers
+    }
+
+    /// Returns the element at index `i`
+    /// # Panics
+    /// Panics if index `i` is out of bounds.
+    pub fn value(&self, i: usize) -> &T::Native {
+        assert!(
+            i < self.len(),
+            "Trying to access an element at index {} from a {}ViewArray of length {}",
+            i,
+            T::PREFIX,
+            self.len()
+        );
+
+        unsafe { self.value_unchecked(i) }
+    }
+
+    /// Returns the element at index `i`
+    /// # Safety
+    /// Caller is responsible for ensuring that the index is within the bounds of the array
+    pub unsafe fn value_unchecked(&self, idx: usize) -> &T::Native {
+        let v = self.views.get_unchecked(idx);
+        let len = *v as u32;
+        let b = if len <= 12 {
+            let ptr = self.views.as_ptr() as *const u8;
+            std::slice::from_raw_parts(ptr.add(idx * 16 + 4), len as usize)
+        } else {
+            let view = ByteView::from(*v);
+            let data = self.buffers.get_unchecked(view.buffer_index as usize);
+            let offset = view.offset as usize;
+            data.get_unchecked(offset..offset + len as usize)
+        };
+        T::Native::from_bytes_unchecked(b)
+    }
+
+    /// constructs a new iterator
+    pub fn iter(&self) -> ArrayIter<&Self> {
+        ArrayIter::new(self)
+    }
+
+    /// Returns a zero-copy slice of this array with the indicated offset and length.
+    pub fn slice(&self, offset: usize, length: usize) -> Self {
+        Self {
+            data_type: T::DATA_TYPE,
+            views: self.views.slice(offset, length),
+            buffers: self.buffers.clone(),
+            nulls: self.nulls.as_ref().map(|n| n.slice(offset, length)),
+            phantom: Default::default(),
+        }
+    }
+}
+
+impl<T: ByteViewType + ?Sized> Debug for GenericByteViewArray<T> {
+    fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
+        write!(f, "{}ViewArray\n[\n", T::PREFIX)?;
+        print_long_array(self, f, |array, index, f| {
+            std::fmt::Debug::fmt(&array.value(index), f)
+        })?;
+        write!(f, "]")
+    }
+}
+
+impl<T: ByteViewType + ?Sized> Array for GenericByteViewArray<T> {
+    fn as_any(&self) -> &dyn Any {
+        self
+    }
+
+    fn to_data(&self) -> ArrayData {
+        self.clone().into()
+    }
+
+    fn into_data(self) -> ArrayData {
+        self.into()
+    }
+
+    fn data_type(&self) -> &DataType {
+        &self.data_type
+    }
+
+    fn slice(&self, offset: usize, length: usize) -> ArrayRef {
+        Arc::new(self.slice(offset, length))
+    }
+
+    fn len(&self) -> usize {
+        self.views.len()
+    }
+
+    fn is_empty(&self) -> bool {
+        self.views.is_empty()
+    }
+
+    fn offset(&self) -> usize {
+        0
+    }
+
+    fn nulls(&self) -> Option<&NullBuffer> {
+        self.nulls.as_ref()
+    }
+
+    fn get_buffer_memory_size(&self) -> usize {
+        let mut sum = self.buffers.iter().map(|b| b.capacity()).sum::<usize>();
+        sum += self.views.inner().capacity();
+        if let Some(x) = &self.nulls {
+            sum += x.buffer().capacity()
+        }
+        sum
+    }
+
+    fn get_array_memory_size(&self) -> usize {
+        std::mem::size_of::<Self>() + self.get_buffer_memory_size()
+    }
+}
+
+impl<'a, T: ByteViewType + ?Sized> ArrayAccessor for &'a GenericByteViewArray<T> {
+    type Item = &'a T::Native;
+
+    fn value(&self, index: usize) -> Self::Item {
+        GenericByteViewArray::value(self, index)
+    }
+
+    unsafe fn value_unchecked(&self, index: usize) -> Self::Item {
+        GenericByteViewArray::value_unchecked(self, index)
+    }
+}
+
+impl<'a, T: ByteViewType + ?Sized> IntoIterator for &'a GenericByteViewArray<T> {
+    type Item = Option<&'a T::Native>;
+    type IntoIter = ArrayIter<Self>;
+
+    fn into_iter(self) -> Self::IntoIter {
+        ArrayIter::new(self)
+    }
+}
+
+impl<T: ByteViewType + ?Sized> From<ArrayData> for GenericByteViewArray<T> {
+    fn from(value: ArrayData) -> Self {
+        let views = value.buffers()[0].clone();
+        let views = ScalarBuffer::new(views, value.offset(), value.len());
+        let buffers = value.buffers()[1..].to_vec();
+        Self {
+            data_type: T::DATA_TYPE,
+            views,
+            buffers,
+            nulls: value.nulls().cloned(),
+            phantom: Default::default(),
+        }
+    }
+}
+
+impl<T: ByteViewType + ?Sized> From<GenericByteViewArray<T>> for ArrayData {
+    fn from(mut array: GenericByteViewArray<T>) -> Self {
+        let len = array.len();
+        array.buffers.insert(0, array.views.into_inner());
+        let builder = ArrayDataBuilder::new(T::DATA_TYPE)
+            .len(len)
+            .buffers(array.buffers)
+            .nulls(array.nulls);
+
+        unsafe { builder.build_unchecked() }
+    }
+}
+
+impl<Ptr, T: ByteViewType + ?Sized> FromIterator<Option<Ptr>> for GenericByteViewArray<T>
+where
+    Ptr: AsRef<T::Native>,
+{
+    fn from_iter<I: IntoIterator<Item = Option<Ptr>>>(iter: I) -> Self {
+        let iter = iter.into_iter();
+        let mut builder = GenericByteViewBuilder::<T>::with_capacity(iter.size_hint().0);
+        builder.extend(iter);
+        builder.finish()
+    }
+}
+
+/// A [`GenericByteViewArray`] of `[u8]`
+pub type BinaryViewArray = GenericByteViewArray<BinaryViewType>;
+
+/// A [`GenericByteViewArray`] of `str`
+///
+/// ```
+/// use arrow_array::StringViewArray;
+/// let array = StringViewArray::from_iter_values(vec!["hello", "world", "lulu", "large payload over 12 bytes"]);
+/// assert_eq!(array.value(0), "hello");
+/// assert_eq!(array.value(3), "large payload over 12 bytes");
+/// ```
+pub type StringViewArray = GenericByteViewArray<StringViewType>;
+
+impl From<Vec<&str>> for StringViewArray {
+    fn from(v: Vec<&str>) -> Self {
+        Self::from_iter_values(v)
+    }
+}
+
+#[cfg(test)]
+mod tests {
+    use crate::builder::StringViewBuilder;
+    use crate::{Array, BinaryViewArray, StringViewArray};
+    use arrow_buffer::{Buffer, ScalarBuffer};
+    use arrow_data::ByteView;
+
+    #[test]
+    fn try_new() {
+        let array = StringViewArray::from_iter_values(vec![
+            "hello",
+            "world",
+            "lulu",
+            "large payload over 12 bytes",
+        ]);
+        assert_eq!(array.value(0), "hello");
+        assert_eq!(array.value(3), "large payload over 12 bytes");
+
+        let array = BinaryViewArray::from_iter_values(vec![
+            b"hello".as_slice(),
+            b"world".as_slice(),
+            b"lulu".as_slice(),
+            b"large payload over 12 bytes".as_slice(),
+        ]);
+        assert_eq!(array.value(0), b"hello");
+        assert_eq!(array.value(3), b"large payload over 12 bytes");
+
+        // test empty array
+        let array = {
+            let mut builder = StringViewBuilder::new();
+            builder.finish()
+        };
+        assert!(array.is_empty());
+
+        // test builder append
+        let array = {
+            let mut builder = StringViewBuilder::new();
+            builder.append_value("hello");
+            builder.append_null();
+            builder.append_option(Some("large payload over 12 bytes"));
+            builder.finish()
+        };
+        assert_eq!(array.value(0), "hello");
+        assert!(array.is_null(1));
+        assert_eq!(array.value(2), "large payload over 12 bytes");
+
+        // test builder's in_progress re-created
+        let array = {
+            // make a builder with small block size.
+            let mut builder = StringViewBuilder::new().with_block_size(14);
+            builder.append_value("large payload over 12 bytes");
+            builder.append_option(Some("another large payload over 12 bytes that double than the first one, so that we can trigger the in_progress in builder re-created"));
+            builder.finish()
+        };
+        assert_eq!(array.value(0), "large payload over 12 bytes");
+        assert_eq!(array.value(1), "another large payload over 12 bytes that double than the first one, so that we can trigger the in_progress in builder re-created");
+    }
+
+    #[test]
+    #[should_panic(expected = "Invalid buffer index at 0: got index 3 but only has 1 buffers")]

Review Comment:
   I think there are some more tests needed to cover `validate_view_impl`:
   1. non zero padding "View at index {idx} contained non-zero padding for string of length {len}"
   2. Mismatch between embedded prefix and data
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: initial support string_view and binary_view, supports layout and basic construction + tests [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #5481:
URL: https://github.com/apache/arrow-rs/pull/5481#discussion_r1522307370


##########
arrow-array/src/builder/generic_bytes_view_builder.rs:
##########
@@ -0,0 +1,215 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::builder::ArrayBuilder;
+use crate::types::{BinaryViewType, ByteViewType, StringViewType};
+use crate::{ArrayRef, GenericByteViewArray};
+use arrow_buffer::{Buffer, BufferBuilder, NullBufferBuilder, ScalarBuffer};
+use arrow_data::ByteView;
+use std::any::Any;
+use std::marker::PhantomData;
+use std::sync::Arc;
+
+const DEFAULT_BLOCK_SIZE: u32 = 8 * 1024;
+
+/// A builder for [`GenericByteViewArray`]
+///
+/// See [`Self::append_value`] for the allocation strategy
+pub struct GenericByteViewBuilder<T: ByteViewType + ?Sized> {
+    views_builder: BufferBuilder<u128>,
+    null_buffer_builder: NullBufferBuilder,
+    completed: Vec<Buffer>,
+    in_progress: Vec<u8>,
+    block_size: u32,
+    phantom: PhantomData<T>,
+}
+
+impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> {
+    /// Creates a new [`GenericByteViewBuilder`].
+    pub fn new() -> Self {
+        Self::with_capacity(1024)
+    }
+
+    /// Creates a new [`GenericByteViewBuilder`] with space for `capacity` strings
+    pub fn with_capacity(capacity: usize) -> Self {
+        Self {
+            views_builder: BufferBuilder::new(capacity),
+            null_buffer_builder: NullBufferBuilder::new(capacity),
+            completed: vec![],
+            in_progress: vec![],
+            block_size: DEFAULT_BLOCK_SIZE,
+            phantom: Default::default(),
+        }
+    }
+
+    /// Override the minimum size of buffers to allocate for string data
+    pub fn with_block_size(self, block_size: u32) -> Self {
+        Self { block_size, ..self }
+    }
+
+    /// Appends a value into the builder
+    ///
+    /// # Panics
+    ///
+    /// Panics if
+    /// - String buffer count exceeds `u32::MAX`
+    /// - String length exceeds `u32::MAX`
+    #[inline]
+    pub fn append_value(&mut self, value: impl AsRef<T::Native>) {
+        let v: &[u8] = value.as_ref().as_ref();
+        let length: u32 = v.len().try_into().unwrap();
+        if length <= 12 {
+            let mut view_buffer = [0; 16];
+            view_buffer[0..4].copy_from_slice(&length.to_le_bytes());
+            view_buffer[4..4 + v.len()].copy_from_slice(v);
+            self.views_builder.append(u128::from_le_bytes(view_buffer));
+            self.null_buffer_builder.append_non_null();
+            return;
+        }
+
+        let required_cap = self.in_progress.len() + v.len();
+        if self.in_progress.capacity() < required_cap {

Review Comment:
   Can you point me to where we define capacity in the public API with regards to string data?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: initial support string_view and binary_view, supports layout and basic construction + tests [arrow-rs]

Posted by "ariesdevil (via GitHub)" <gi...@apache.org>.
ariesdevil commented on code in PR #5481:
URL: https://github.com/apache/arrow-rs/pull/5481#discussion_r1517108988


##########
arrow-array/src/array/bytes_view_array.rs:
##########
@@ -0,0 +1,385 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::array::print_long_array;
+use crate::builder::GenericBytesViewBuilder;
+use crate::iterator::ArrayIter;
+use crate::types::bytes::ByteArrayNativeType;
+use crate::types::BytesViewType;
+use crate::{Array, ArrayAccessor, ArrayRef};
+use arrow_buffer::{Buffer, NullBuffer, ScalarBuffer};
+use arrow_data::{ArrayData, ArrayDataBuilder, BytesView};
+use arrow_schema::{ArrowError, DataType};
+use std::any::Any;
+use std::fmt::Debug;
+use std::marker::PhantomData;
+use std::sync::Arc;
+
+/// An array of variable length bytes view arrays
+pub struct GenericBytesViewArray<T: BytesViewType + ?Sized> {
+    data_type: DataType,
+    views: ScalarBuffer<u128>,
+    buffers: Vec<Buffer>,
+    phantom: PhantomData<T>,
+    nulls: Option<NullBuffer>,
+}
+
+impl<T: BytesViewType + ?Sized> Clone for GenericBytesViewArray<T> {
+    fn clone(&self) -> Self {
+        Self {
+            data_type: T::DATA_TYPE,
+            views: self.views.clone(),
+            buffers: self.buffers.clone(),
+            nulls: self.nulls.clone(),
+            phantom: Default::default(),
+        }
+    }
+}
+
+impl<T: BytesViewType + ?Sized> GenericBytesViewArray<T> {
+    /// Create a new [`GenericBytesViewArray`] from the provided parts, panicking on failure
+    ///
+    /// # Panics
+    ///
+    /// Panics if [`GenericBytesViewArray::try_new`] returns an error
+    pub fn new(views: ScalarBuffer<u128>, buffers: Vec<Buffer>, nulls: Option<NullBuffer>) -> Self {
+        Self::try_new(views, buffers, nulls).unwrap()
+    }
+
+    /// Create a new [`GenericBytesViewArray`] from the provided parts, returning an error on failure
+    ///
+    /// # Errors
+    ///
+    /// * `views.len() != nulls.len()`
+    /// * [BytesViewType::validate] fails
+    pub fn try_new(
+        views: ScalarBuffer<u128>,

Review Comment:
   Yup, a `View` type seems better



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: initial support string_view and binary_view, supports layout and basic construction + tests [arrow-rs]

Posted by "ariesdevil (via GitHub)" <gi...@apache.org>.
ariesdevil commented on code in PR #5481:
URL: https://github.com/apache/arrow-rs/pull/5481#discussion_r1521480305


##########
arrow-array/src/array/byte_view_array.rs:
##########
@@ -0,0 +1,390 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::array::print_long_array;
+use crate::builder::GenericByteViewBuilder;
+use crate::iterator::ArrayIter;
+use crate::types::bytes::ByteArrayNativeType;
+use crate::types::{BinaryViewType, ByteViewType, StringViewType};
+use crate::{Array, ArrayAccessor, ArrayRef};
+use arrow_buffer::{Buffer, NullBuffer, ScalarBuffer};
+use arrow_data::{ArrayData, ArrayDataBuilder, ByteView};
+use arrow_schema::{ArrowError, DataType};
+use std::any::Any;
+use std::fmt::Debug;
+use std::marker::PhantomData;
+use std::sync::Arc;
+
+/// An array of variable length bytes view arrays
+pub struct GenericByteViewArray<T: ByteViewType + ?Sized> {
+    data_type: DataType,
+    views: ScalarBuffer<u128>,
+    buffers: Vec<Buffer>,
+    phantom: PhantomData<T>,
+    nulls: Option<NullBuffer>,
+}
+
+impl<T: ByteViewType + ?Sized> Clone for GenericByteViewArray<T> {
+    fn clone(&self) -> Self {
+        Self {
+            data_type: T::DATA_TYPE,
+            views: self.views.clone(),
+            buffers: self.buffers.clone(),
+            nulls: self.nulls.clone(),
+            phantom: Default::default(),
+        }
+    }
+}
+
+impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> {
+    /// Create a new [`GenericByteViewArray`] from the provided parts, panicking on failure
+    ///
+    /// # Panics
+    ///
+    /// Panics if [`GenericByteViewArray::try_new`] returns an error
+    pub fn new(views: ScalarBuffer<u128>, buffers: Vec<Buffer>, nulls: Option<NullBuffer>) -> Self {
+        Self::try_new(views, buffers, nulls).unwrap()
+    }
+
+    /// Create a new [`GenericByteViewArray`] from the provided parts, returning an error on failure
+    ///
+    /// # Errors
+    ///
+    /// * `views.len() != nulls.len()`
+    /// * [ByteViewType::validate] fails
+    pub fn try_new(
+        views: ScalarBuffer<u128>,
+        buffers: Vec<Buffer>,
+        nulls: Option<NullBuffer>,
+    ) -> Result<Self, ArrowError> {
+        T::validate(&views, &buffers)?;
+
+        if let Some(n) = nulls.as_ref() {
+            if n.len() != views.len() {
+                return Err(ArrowError::InvalidArgumentError(format!(
+                    "Incorrect length of null buffer for {}ViewArray, expected {} got {}",
+                    T::PREFIX,
+                    views.len(),
+                    n.len(),
+                )));
+            }
+        }
+
+        Ok(Self {
+            data_type: T::DATA_TYPE,
+            views,
+            buffers,
+            nulls,
+            phantom: Default::default(),
+        })
+    }
+
+    /// Create a new [`GenericByteViewArray`] from the provided parts, without validation
+    ///
+    /// # Safety
+    ///
+    /// Safe if [`Self::try_new`] would not error
+    pub unsafe fn new_unchecked(
+        views: ScalarBuffer<u128>,
+        buffers: Vec<Buffer>,
+        nulls: Option<NullBuffer>,
+    ) -> Self {
+        Self {
+            data_type: T::DATA_TYPE,
+            phantom: Default::default(),
+            views,
+            buffers,
+            nulls,
+        }
+    }
+
+    /// Create a new [`GenericByteViewArray`] of length `len` where all values are null
+    pub fn new_null(len: usize) -> Self {
+        Self {
+            data_type: T::DATA_TYPE,
+            views: vec![0; len].into(),
+            buffers: vec![],
+            nulls: Some(NullBuffer::new_null(len)),
+            phantom: Default::default(),
+        }
+    }
+
+    /// Creates a [`GenericByteViewArray`] based on an iterator of values without nulls
+    pub fn from_iter_values<Ptr, I>(iter: I) -> Self
+    where
+        Ptr: AsRef<T::Native>,
+        I: IntoIterator<Item = Ptr>,
+    {
+        let iter = iter.into_iter();
+        let mut builder = GenericByteViewBuilder::<T>::with_capacity(iter.size_hint().0);
+        for v in iter {
+            builder.append_value(v);
+        }
+        builder.finish()
+    }
+
+    /// Deconstruct this array into its constituent parts
+    pub fn into_parts(self) -> (ScalarBuffer<u128>, Vec<Buffer>, Option<NullBuffer>) {
+        (self.views, self.buffers, self.nulls)
+    }
+
+    /// Returns the views buffer
+    #[inline]
+    pub fn views(&self) -> &ScalarBuffer<u128> {

Review Comment:
   That's a good idea, I'll do it in the next PR.



##########
arrow-array/src/builder/generic_bytes_view_builder.rs:
##########
@@ -0,0 +1,215 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::builder::ArrayBuilder;
+use crate::types::{BinaryViewType, ByteViewType, StringViewType};
+use crate::{ArrayRef, GenericByteViewArray};
+use arrow_buffer::{Buffer, BufferBuilder, NullBufferBuilder, ScalarBuffer};
+use arrow_data::ByteView;
+use std::any::Any;
+use std::marker::PhantomData;
+use std::sync::Arc;
+
+const DEFAULT_BLOCK_SIZE: u32 = 8 * 1024;
+
+/// A builder for [`GenericByteViewArray`]
+///
+/// See [`Self::append_value`] for the allocation strategy
+pub struct GenericByteViewBuilder<T: ByteViewType + ?Sized> {
+    views_builder: BufferBuilder<u128>,
+    null_buffer_builder: NullBufferBuilder,
+    completed: Vec<Buffer>,
+    in_progress: Vec<u8>,
+    block_size: u32,
+    phantom: PhantomData<T>,
+}
+
+impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> {
+    /// Creates a new [`GenericByteViewBuilder`].
+    pub fn new() -> Self {
+        Self::with_capacity(1024)
+    }
+
+    /// Creates a new [`GenericByteViewBuilder`] with space for `capacity` strings
+    pub fn with_capacity(capacity: usize) -> Self {
+        Self {
+            views_builder: BufferBuilder::new(capacity),
+            null_buffer_builder: NullBufferBuilder::new(capacity),
+            completed: vec![],
+            in_progress: vec![],
+            block_size: DEFAULT_BLOCK_SIZE,
+            phantom: Default::default(),
+        }
+    }
+
+    /// Override the minimum size of buffers to allocate for string data
+    pub fn with_block_size(self, block_size: u32) -> Self {
+        Self { block_size, ..self }
+    }
+
+    /// Appends a value into the builder
+    ///
+    /// # Panics
+    ///
+    /// Panics if
+    /// - String buffer count exceeds `u32::MAX`
+    /// - String length exceeds `u32::MAX`
+    #[inline]
+    pub fn append_value(&mut self, value: impl AsRef<T::Native>) {
+        let v: &[u8] = value.as_ref().as_ref();
+        let length: u32 = v.len().try_into().unwrap();
+        if length <= 12 {

Review Comment:
   ditto



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: initial support string_view and binary_view, supports layout and basic construction + tests [arrow-rs]

Posted by "XiangpengHao (via GitHub)" <gi...@apache.org>.
XiangpengHao commented on code in PR #5481:
URL: https://github.com/apache/arrow-rs/pull/5481#discussion_r1521832106


##########
arrow-data/src/equal/byte_view.rs:
##########
@@ -0,0 +1,74 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::{ArrayData, ByteView};
+
+pub(super) fn byte_view_equal(
+    lhs: &ArrayData,
+    rhs: &ArrayData,
+    lhs_start: usize,
+    rhs_start: usize,
+    len: usize,
+) -> bool {
+    let lhs_views = &lhs.buffer::<u128>(0)[lhs_start..lhs_start + len];
+    let lhs_buffers = &lhs.buffers()[1..];
+    let rhs_views = &rhs.buffer::<u128>(0)[rhs_start..rhs_start + len];
+    let rhs_buffers = &rhs.buffers()[1..];
+
+    for (idx, (l, r)) in lhs_views.iter().zip(rhs_views).enumerate() {
+        // Only checking one null mask here because by the time the control flow reaches
+        // this point, the equality of the two masks would have already been verified.
+        if lhs.is_null(idx) {
+            continue;
+        }
+
+        let l_len_prefix = *l as u64;
+        let r_len_prefix = *r as u64;
+        // short-circuit, check length and prefix
+        if l_len_prefix != r_len_prefix {

Review Comment:
   For strings smaller than 4 bytes, this line compares the string bytes along with its paddings. For this to be correct, we need to ensure the paddings are 0, is this specified?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: initial support string_view and binary_view, supports layout and basic construction + tests [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #5481:
URL: https://github.com/apache/arrow-rs/pull/5481#discussion_r1522259241


##########
arrow-array/src/builder/generic_bytes_view_builder.rs:
##########
@@ -0,0 +1,215 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::builder::ArrayBuilder;
+use crate::types::{BinaryViewType, ByteViewType, StringViewType};
+use crate::{ArrayRef, GenericByteViewArray};
+use arrow_buffer::{Buffer, BufferBuilder, NullBufferBuilder, ScalarBuffer};
+use arrow_data::ByteView;
+use std::any::Any;
+use std::marker::PhantomData;
+use std::sync::Arc;
+
+const DEFAULT_BLOCK_SIZE: u32 = 8 * 1024;
+
+/// A builder for [`GenericByteViewArray`]
+///
+/// See [`Self::append_value`] for the allocation strategy
+pub struct GenericByteViewBuilder<T: ByteViewType + ?Sized> {
+    views_builder: BufferBuilder<u128>,
+    null_buffer_builder: NullBufferBuilder,
+    completed: Vec<Buffer>,
+    in_progress: Vec<u8>,
+    block_size: u32,
+    phantom: PhantomData<T>,
+}
+
+impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> {
+    /// Creates a new [`GenericByteViewBuilder`].
+    pub fn new() -> Self {
+        Self::with_capacity(1024)
+    }
+
+    /// Creates a new [`GenericByteViewBuilder`] with space for `capacity` strings
+    pub fn with_capacity(capacity: usize) -> Self {
+        Self {
+            views_builder: BufferBuilder::new(capacity),
+            null_buffer_builder: NullBufferBuilder::new(capacity),
+            completed: vec![],
+            in_progress: vec![],
+            block_size: DEFAULT_BLOCK_SIZE,
+            phantom: Default::default(),
+        }
+    }
+
+    /// Override the minimum size of buffers to allocate for string data

Review Comment:
   It is technically a minimum, as if a single string exceeds the entire buffer capacity it will allocate a fresh buffer for it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: initial support string_view and binary_view, supports layout and basic construction + tests [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #5481:
URL: https://github.com/apache/arrow-rs/pull/5481#discussion_r1522307370


##########
arrow-array/src/builder/generic_bytes_view_builder.rs:
##########
@@ -0,0 +1,215 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::builder::ArrayBuilder;
+use crate::types::{BinaryViewType, ByteViewType, StringViewType};
+use crate::{ArrayRef, GenericByteViewArray};
+use arrow_buffer::{Buffer, BufferBuilder, NullBufferBuilder, ScalarBuffer};
+use arrow_data::ByteView;
+use std::any::Any;
+use std::marker::PhantomData;
+use std::sync::Arc;
+
+const DEFAULT_BLOCK_SIZE: u32 = 8 * 1024;
+
+/// A builder for [`GenericByteViewArray`]
+///
+/// See [`Self::append_value`] for the allocation strategy
+pub struct GenericByteViewBuilder<T: ByteViewType + ?Sized> {
+    views_builder: BufferBuilder<u128>,
+    null_buffer_builder: NullBufferBuilder,
+    completed: Vec<Buffer>,
+    in_progress: Vec<u8>,
+    block_size: u32,
+    phantom: PhantomData<T>,
+}
+
+impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> {
+    /// Creates a new [`GenericByteViewBuilder`].
+    pub fn new() -> Self {
+        Self::with_capacity(1024)
+    }
+
+    /// Creates a new [`GenericByteViewBuilder`] with space for `capacity` strings
+    pub fn with_capacity(capacity: usize) -> Self {
+        Self {
+            views_builder: BufferBuilder::new(capacity),
+            null_buffer_builder: NullBufferBuilder::new(capacity),
+            completed: vec![],
+            in_progress: vec![],
+            block_size: DEFAULT_BLOCK_SIZE,
+            phantom: Default::default(),
+        }
+    }
+
+    /// Override the minimum size of buffers to allocate for string data
+    pub fn with_block_size(self, block_size: u32) -> Self {
+        Self { block_size, ..self }
+    }
+
+    /// Appends a value into the builder
+    ///
+    /// # Panics
+    ///
+    /// Panics if
+    /// - String buffer count exceeds `u32::MAX`
+    /// - String length exceeds `u32::MAX`
+    #[inline]
+    pub fn append_value(&mut self, value: impl AsRef<T::Native>) {
+        let v: &[u8] = value.as_ref().as_ref();
+        let length: u32 = v.len().try_into().unwrap();
+        if length <= 12 {
+            let mut view_buffer = [0; 16];
+            view_buffer[0..4].copy_from_slice(&length.to_le_bytes());
+            view_buffer[4..4 + v.len()].copy_from_slice(v);
+            self.views_builder.append(u128::from_le_bytes(view_buffer));
+            self.null_buffer_builder.append_non_null();
+            return;
+        }
+
+        let required_cap = self.in_progress.len() + v.len();
+        if self.in_progress.capacity() < required_cap {

Review Comment:
   Can you point me to where we define capacity in the public API?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] feat: initial support string_view and binary_view, supports layout and basic construction + tests [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on code in PR #5481:
URL: https://github.com/apache/arrow-rs/pull/5481#discussion_r1522308464


##########
arrow-array/src/builder/generic_bytes_view_builder.rs:
##########
@@ -0,0 +1,215 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::builder::ArrayBuilder;
+use crate::types::{BinaryViewType, ByteViewType, StringViewType};
+use crate::{ArrayRef, GenericByteViewArray};
+use arrow_buffer::{Buffer, BufferBuilder, NullBufferBuilder, ScalarBuffer};
+use arrow_data::ByteView;
+use std::any::Any;
+use std::marker::PhantomData;
+use std::sync::Arc;
+
+const DEFAULT_BLOCK_SIZE: u32 = 8 * 1024;
+
+/// A builder for [`GenericByteViewArray`]
+///
+/// See [`Self::append_value`] for the allocation strategy
+pub struct GenericByteViewBuilder<T: ByteViewType + ?Sized> {
+    views_builder: BufferBuilder<u128>,
+    null_buffer_builder: NullBufferBuilder,
+    completed: Vec<Buffer>,
+    in_progress: Vec<u8>,
+    block_size: u32,
+    phantom: PhantomData<T>,
+}
+
+impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> {
+    /// Creates a new [`GenericByteViewBuilder`].
+    pub fn new() -> Self {
+        Self::with_capacity(1024)
+    }
+
+    /// Creates a new [`GenericByteViewBuilder`] with space for `capacity` strings

Review Comment:
   ```suggestion
       /// Creates a new [`GenericByteViewBuilder`] with space for `capacity` string values
   ```
   
   Perhaps @alamb this is what is causing you confusion?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org