You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/03/09 16:30:33 UTC

[GitHub] [arrow-rs] alamb commented on a diff in pull request #3818: Add ArrayDataLayout, port validation (#1799)

alamb commented on code in PR #3818:
URL: https://github.com/apache/arrow-rs/pull/3818#discussion_r1131229965


##########
arrow-buffer/src/buffer/offset.rs:
##########
@@ -39,6 +39,21 @@ impl<O: ArrowNativeType> OffsetBuffer<O> {
         let buffer = MutableBuffer::from_len_zeroed(std::mem::size_of::<O>());
         Self(buffer.into_buffer().into())
     }
+
+    /// Returns the inner [`ScalarBuffer`]
+    pub fn inner(&self) -> &ScalarBuffer<O> {
+        &self.0
+    }
+
+    /// Returns the inner [`ScalarBuffer`]

Review Comment:
   ```suggestion
       /// Returns the inner [`ScalarBuffer`], consuming self
   ```



##########
arrow-data/src/data/list.rs:
##########
@@ -157,18 +198,56 @@ impl<O: ListOffset> ListArrayData<O> {
     /// - `data_type` must be valid for this layout

Review Comment:
   shouldn't there be a note here about values being of sufficient size 🤔 



##########
arrow-data/src/data/struct.rs:
##########
@@ -78,4 +99,31 @@ impl StructArrayData {
     pub fn data_type(&self) -> &DataType {
         &self.data_type
     }
+
+    /// Returns the underlying parts of this [`StructArrayData`]
+    pub fn into_parts(self) -> (DataType, Option<NullBuffer>, Vec<ArrayData>) {
+        (self.data_type, self.nulls, self.children)
+    }
+
+    /// Returns a zero-copy slice of this array
+    pub fn slice(&self, offset: usize, len: usize) -> Self {
+        Self {
+            len,
+            data_type: self.data_type.clone(),
+            nulls: self.nulls.as_ref().map(|x| x.slice(offset, len)),
+            children: self.children.iter().map(|c| c.slice(offset, len)).collect(),
+        }
+    }
+
+    /// Returns an [`ArrayDataLayout`] representation of this

Review Comment:
   ```suggestion
       /// Returns an [`ArrayDataLayout`] representation of this [`StructArrayData`]
   ```



##########
arrow-data/src/data/mod.rs:
##########
@@ -740,11 +730,101 @@ impl ArrayData {
     /// See [ArrayData::validate_data] to validate fully the offset content
     /// and the validity of utf8 data
     pub fn validate(&self) -> Result<(), ArrowError> {
+        ArrayDataLayout::new(self).validate()
+    }
+
+    /// Validate that the data contained within this [`ArrayData`] is valid
+    ///
+    /// 1. Null count is correct
+    /// 2. All offsets are valid
+    /// 3. All String data is valid UTF-8
+    /// 4. All dictionary offsets are valid
+    ///
+    /// Internally this calls:
+    ///
+    /// * [`Self::validate`]
+    /// * [`Self::validate_nulls`]
+    /// * [`Self::validate_values`]
+    ///
+    /// Note: this does not recurse into children, for a recursive variant
+    /// see [`Self::validate_full`]
+    pub fn validate_data(&self) -> Result<(), ArrowError> {
+        ArrayDataLayout::new(self).validate_data()
+    }
+
+    /// Performs a full recursive validation of this [`ArrayData`] and all its children
+    ///
+    /// This is equivalent to calling [`Self::validate_data`] on this [`ArrayData`]
+    /// and all its children recursively
+    pub fn validate_full(&self) -> Result<(), ArrowError> {
+        ArrayDataLayout::new(self).validate_full()
+    }
+
+    /// Validates the values stored within this [`ArrayData`] are valid

Review Comment:
   What is the usecase for such fine grained validation checking? I undertsand the usecase for `validate()` and `validate_full` but I don't understand when it would be useful to do `validate_null`, etc 
   
   If there is not a good usecase for such fine grained checking, perhaps we can leave these methods off of `ArrayData` ? Adding them to the public API seems unnecessary, but I may be missing something



##########
arrow-data/src/data/mod.rs:
##########
@@ -598,14 +594,8 @@ impl ArrayData {
     /// This function panics if:
     /// * the buffer is not byte-aligned with type T, or
     /// * the datatype is `Boolean` (it corresponds to a bit-packed buffer where the offset is not applicable)
-    #[inline]

Review Comment:
   why remove `inline`? 



##########
arrow-buffer/src/buffer/scalar.rs:
##########
@@ -50,6 +50,21 @@ impl<T: ArrowNativeType> ScalarBuffer<T> {
         let byte_len = len.checked_mul(size).expect("length overflow");
         buffer.slice_with_length(byte_offset, byte_len).into()
     }
+
+    /// Returns a zero-copy slice of this buffer with length `len` and starting at `offset`
+    pub fn slice(&self, offset: usize, len: usize) -> Self {
+        Self::new(self.buffer.clone(), offset, len)
+    }
+
+    /// Returns the inner [`Buffer`]

Review Comment:
   ```suggestion
       /// Returns a reference to the inner [`Buffer`]
   ```



##########
arrow-data/src/data/bytes.rs:
##########
@@ -309,14 +409,44 @@ impl<O: BytesOffset, B: Bytes + ?Sized> BytesArrayData<O, B> {
     pub fn data_type(&self) -> &DataType {
         &self.data_type
     }
+
+    /// Returns the underlying parts of this [`BytesArrayData`]
+    pub fn into_parts(self) -> (DataType, OffsetBuffer<O>, Buffer, Option<NullBuffer>) {
+        (self.data_type, self.offsets, self.values, self.nulls)
+    }
+
+    /// Returns a zero-copy slice of this array
+    pub fn slice(&self, offset: usize, len: usize) -> Self {
+        Self {
+            values: self.values.clone(),
+            offsets: self.offsets.slice(offset, len),
+            data_type: self.data_type.clone(),
+            nulls: self.nulls().as_ref().map(|x| x.slice(offset, len)),
+            phantom: Default::default(),
+        }
+    }
+
+    /// Returns an [`ArrayDataLayout`] representation of this
+    pub(crate) fn layout(&self) -> ArrayDataLayout<'_> {
+        ArrayDataLayout {
+            data_type: &self.data_type,
+            len: self.offsets.len().wrapping_sub(1),
+            offset: 0,
+            nulls: self.nulls.as_ref(),
+            buffers: Buffers::two(self.offsets.inner().inner(), &self.values),
+            child_data: &[],
+        }
+    }
 }
 
 /// ArrayData for [fixed-size arrays](https://arrow.apache.org/docs/format/Columnar.html#fixed-size-primitive-layout) of bytes

Review Comment:
   is this the right link?



##########
arrow-buffer/src/buffer/scalar.rs:
##########
@@ -50,6 +50,21 @@ impl<T: ArrowNativeType> ScalarBuffer<T> {
         let byte_len = len.checked_mul(size).expect("length overflow");
         buffer.slice_with_length(byte_offset, byte_len).into()
     }
+
+    /// Returns a zero-copy slice of this buffer with length `len` and starting at `offset`
+    pub fn slice(&self, offset: usize, len: usize) -> Self {
+        Self::new(self.buffer.clone(), offset, len)
+    }
+
+    /// Returns the inner [`Buffer`]
+    pub fn inner(&self) -> &Buffer {
+        &self.buffer
+    }
+
+    /// Returns the inner [`Buffer`]

Review Comment:
   ```suggestion
       /// Returns the inner [`Buffer`], consuming self
   ```



##########
arrow-data/src/data/null.rs:
##########
@@ -0,0 +1,104 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::data::types::PhysicalType;
+use crate::data::ArrayDataLayout;
+use crate::{ArrayDataBuilder, Buffers};
+use arrow_schema::DataType;
+
+/// ArrayData for [null arrays](https://arrow.apache.org/docs/format/Columnar.html#null-layout)
+#[derive(Debug, Clone)]
+pub struct NullArrayData {
+    data_type: DataType,
+    len: usize,
+}
+
+impl NullArrayData {
+    /// Create a new [`NullArrayData`]
+    ///
+    /// # Panic
+    ///
+    /// - `data_type` is not compatible with `T`

Review Comment:
   I am not sure what `T` means here (maybe copy/paste error?) This is basically the same as @viirya 's earlier comment



##########
arrow-data/src/data/null.rs:
##########
@@ -0,0 +1,104 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::data::types::PhysicalType;
+use crate::data::ArrayDataLayout;
+use crate::{ArrayDataBuilder, Buffers};
+use arrow_schema::DataType;
+
+/// ArrayData for [null arrays](https://arrow.apache.org/docs/format/Columnar.html#null-layout)
+#[derive(Debug, Clone)]
+pub struct NullArrayData {
+    data_type: DataType,
+    len: usize,
+}
+
+impl NullArrayData {
+    /// Create a new [`NullArrayData`]
+    ///
+    /// # Panic
+    ///
+    /// - `data_type` is not compatible with `T`
+    pub fn new(data_type: DataType, len: usize) -> Self {
+        assert_eq!(
+            PhysicalType::from(&data_type),
+            PhysicalType::Null,
+            "Illegal physical type for NullArrayData of datatype {data_type:?}",
+        );
+        Self { data_type, len }
+    }
+
+    /// Create a new [`NullArrayData`]
+    ///
+    /// # Safety
+    ///
+    /// - `PhysicalType::from(&data_type) == PhysicalType::Null`
+    pub unsafe fn new_unchecked(data_type: DataType, len: usize) -> Self {
+        Self { data_type, len }
+    }
+
+    /// Creates a new [`NullArrayData`] from raw buffers
+    ///
+    /// # Safety
+    ///
+    /// See [`NullArrayData::new_unchecked`]
+    pub(crate) unsafe fn from_raw(builder: ArrayDataBuilder) -> Self {
+        Self {
+            data_type: builder.data_type,
+            len: builder.len,
+        }
+    }
+
+    /// Returns the data type of this array
+    #[inline]
+    pub fn data_type(&self) -> &DataType {
+        &self.data_type
+    }
+
+    /// Returns the length of this array
+    #[inline]
+    pub fn len(&self) -> usize {
+        self.len
+    }
+
+    /// Returns the [`DataType`] and length of this [`NullArrayData`]
+    pub fn into_parts(self) -> (DataType, usize) {
+        (self.data_type, self.len)
+    }
+
+    /// Returns a zero-copy slice of this array
+    pub fn slice(&self, offset: usize, len: usize) -> Self {
+        let new_len = offset.saturating_add(len);
+        assert!(new_len <= self.len,);

Review Comment:
   ```suggestion
           assert!(new_len <= self.len);
   ```



##########
arrow-data/src/data/primitive.rs:
##########
@@ -145,13 +208,10 @@ impl<T: Primitive> PrimitiveArrayData<T> {
         values: ScalarBuffer<T>,
         nulls: Option<NullBuffer>,
     ) -> Self {
-        let physical = PhysicalType::from(&data_type);
-        assert!(
-            matches!(physical, PhysicalType::Primitive(p) if p == T::TYPE),
-            "Illegal physical type for PrimitiveArrayData of datatype {:?}, expected {:?} got {:?}",
-            data_type,
-            T::TYPE,
-            physical
+        assert_eq!(
+            PhysicalType::from(&data_type),
+            PhysicalType::Primitive(T::TYPE),
+            "Illegal physical type for PrimitiveArrayData of datatype {data_type:?}",

Review Comment:
   It seems like the previous message was more specific -- is there any reason for this change?



##########
arrow-buffer/src/buffer/boolean.rs:
##########
@@ -139,4 +139,9 @@ impl BooleanBuffer {
     pub fn inner(&self) -> &Buffer {
         &self.buffer
     }
+
+    /// Returns the inner [`Buffer`]

Review Comment:
   ```suggestion
       /// Returns the inner [`Buffer`], consuming self
   ```



##########
arrow-data/src/data/mod.rs:
##########
@@ -740,11 +730,101 @@ impl ArrayData {
     /// See [ArrayData::validate_data] to validate fully the offset content
     /// and the validity of utf8 data
     pub fn validate(&self) -> Result<(), ArrowError> {
+        ArrayDataLayout::new(self).validate()
+    }
+
+    /// Validate that the data contained within this [`ArrayData`] is valid
+    ///
+    /// 1. Null count is correct
+    /// 2. All offsets are valid
+    /// 3. All String data is valid UTF-8
+    /// 4. All dictionary offsets are valid
+    ///
+    /// Internally this calls:
+    ///
+    /// * [`Self::validate`]
+    /// * [`Self::validate_nulls`]
+    /// * [`Self::validate_values`]
+    ///
+    /// Note: this does not recurse into children, for a recursive variant
+    /// see [`Self::validate_full`]
+    pub fn validate_data(&self) -> Result<(), ArrowError> {
+        ArrayDataLayout::new(self).validate_data()
+    }
+
+    /// Performs a full recursive validation of this [`ArrayData`] and all its children
+    ///
+    /// This is equivalent to calling [`Self::validate_data`] on this [`ArrayData`]
+    /// and all its children recursively
+    pub fn validate_full(&self) -> Result<(), ArrowError> {
+        ArrayDataLayout::new(self).validate_full()
+    }
+
+    /// Validates the values stored within this [`ArrayData`] are valid

Review Comment:
   ```suggestion
       /// Validates the null values stored within this [`ArrayData`] are valid
   ```



##########
arrow-data/src/data/run.rs:
##########
@@ -88,26 +105,63 @@ pub enum ArrayDataRun {
 impl ArrayDataRun {
     /// Downcast this [`ArrayDataRun`] to the corresponding [`RunArrayData`]
     pub fn downcast_ref<E: RunEnd>(&self) -> Option<&RunArrayData<E>> {
-        E::downcast_ref(self)
+        <E as private::RunEndSealed>::downcast_ref(self)
     }
 
     /// Downcast this [`ArrayDataRun`] to the corresponding [`RunArrayData`]
     pub fn downcast<E: RunEnd>(self) -> Option<RunArrayData<E>> {
-        E::downcast(self)
+        <E as private::RunEndSealed>::downcast(self)
+    }
+
+    /// Returns the values of this [`ArrayDataRun`]
+    #[inline]
+    pub fn values(&self) -> &ArrayData {
+        let s = self;
+        run_op!(s, { s.values() })
+    }
+
+    /// Returns a zero-copy slice of this array
+    pub fn slice(&self, offset: usize, len: usize) -> Self {
+        let s = self;
+        run_op!(s, { s.slice(offset, len).into() })
+    }
+
+    /// Returns an [`ArrayDataLayout`] representation of this
+    pub(crate) fn layout(&self) -> ArrayDataLayout<'_> {
+        let s = self;
+        run_op!(s, { s.layout() })
+    }
+
+    /// Creates a new [`ArrayDataRun`] from raw buffers
+    ///
+    /// # Safety
+    ///
+    /// See [`RunArrayData::new_unchecked`]
+    pub(crate) unsafe fn from_raw(builder: ArrayDataBuilder, run: RunEndType) -> Self {
+        use RunEndType::*;
+        match run {
+            Int16 => Self::Int16(RunArrayData::from_raw(builder)),
+            Int32 => Self::Int32(RunArrayData::from_raw(builder)),
+            Int64 => Self::Int64(RunArrayData::from_raw(builder)),
+        }
     }
 }
 
 impl<E: RunEnd> From<RunArrayData<E>> for ArrayDataRun {
     fn from(value: RunArrayData<E>) -> Self {
-        E::upcast(value)
+        <E as private::RunEndSealed>::upcast(value)
     }
 }
 
 /// ArrayData for [run-end encoded arrays](https://arrow.apache.org/docs/format/Columnar.html#run-end-encoded-layout)
+#[derive(Debug, Clone)]
 pub struct RunArrayData<E: RunEnd> {
     data_type: DataType,
-    run_ends: ScalarBuffer<E>,
-    child: Box<ArrayData>,
+    run_ends: RunEndBuffer<E>,
+    /// The children of this RunArrayData
+    /// 1: the run ends
+    /// 2: the values
+    children: Box<[ArrayData; 2]>,

Review Comment:
   What is there a reason to store the run ends in `children[0]` as well as in `run_ends`? Aka why the redundancy.



##########
arrow-data/src/data/mod.rs:
##########
@@ -33,13 +33,17 @@ use crate::equal;
 mod buffers;
 pub use buffers::*;
 
+#[allow(unused)] // Private until ready (#1176)

Review Comment:
   I wonder if https://github.com/apache/arrow-rs/issues/1799 would be a better link?
   
   https://github.com/apache/arrow-rs/issues/1176 was an old discussion ticket
   
   



##########
arrow-data/src/data/primitive.rs:
##########
@@ -90,6 +113,7 @@ primitive!(f32, Float32);
 primitive!(f64, Float64);
 
 /// An enumeration of the types of [`PrimitiveArrayData`]
+#[derive(Debug, Clone)]

Review Comment:
   would it be valuable to also implement `PartialEq` on these types as well?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org