You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/04/12 17:47:15 UTC

[GitHub] [arrow-rs] alamb commented on a diff in pull request #4061: Remove ArrayData from Array (#3880)

alamb commented on code in PR #4061:
URL: https://github.com/apache/arrow-rs/pull/4061#discussion_r1164430135


##########
arrow-array/src/array/dictionary_array.rs:
##########
@@ -697,13 +724,37 @@ impl<'a, K: ArrowDictionaryKeyType, V: Sync> Array for TypedDictionaryArray<'a,
         self.dictionary.into_data()
     }
 
+    fn data_type(&self) -> &DataType {
+        self.dictionary.data_type()
+    }
+
     fn slice(&self, offset: usize, length: usize) -> ArrayRef {
         Arc::new(self.dictionary.slice(offset, length))
     }
 
+    fn len(&self) -> usize {
+        self.dictionary.len()
+    }
+
+    fn is_empty(&self) -> bool {
+        self.dictionary.is_empty()
+    }
+
+    fn offset(&self) -> usize {
+        self.dictionary.offset()
+    }
+
     fn nulls(&self) -> Option<&NullBuffer> {
         self.dictionary.nulls()
     }
+
+    fn get_buffer_memory_size(&self) -> usize {
+        self.dictionary.get_buffer_memory_size()

Review Comment:
   should this function and the one below it also account for the `self.values`?



##########
arrow-array/src/array/boolean_array.rs:
##########
@@ -90,27 +90,25 @@ impl BooleanArray {
         if let Some(n) = nulls.as_ref() {
             assert_eq!(values.len(), n.len());
         }
-
-        // TODO: Don't store ArrayData inside arrays (#3880)
-        let data = unsafe {
-            ArrayData::builder(DataType::Boolean)
-                .len(values.len())
-                .offset(values.offset())
-                .nulls(nulls)
-                .buffers(vec![values.inner().clone()])
-                .build_unchecked()
-        };
-        Self { data, values }
+        Self { values, nulls }
     }
 
     /// Returns the length of this array.
     pub fn len(&self) -> usize {
-        self.data.len()
+        self.values.len()
     }
 
     /// Returns whether this array is empty.
     pub fn is_empty(&self) -> bool {
-        self.data.is_empty()
+        self.values.is_empty()
+    }
+
+    /// Returns a zero-copy slice of this array with the indicated offset and length.

Review Comment:
   ❤️ 



##########
arrow-array/src/array/fixed_size_list_array.rs:
##########
@@ -210,24 +213,52 @@ impl Array for FixedSizeListArray {
         self
     }
 
-    fn data(&self) -> &ArrayData {
-        &self.data
-    }
-
     fn to_data(&self) -> ArrayData {
-        self.data.clone()
+        self.clone().into()
     }
 
     fn into_data(self) -> ArrayData {
         self.into()
     }
 
+    fn data_type(&self) -> &DataType {
+        &self.data_type
+    }
+
     fn slice(&self, offset: usize, length: usize) -> ArrayRef {
         Arc::new(self.slice(offset, length))
     }
 
+    fn len(&self) -> usize {
+        self.len
+    }
+
+    fn is_empty(&self) -> bool {
+        self.len == 0
+    }
+
+    fn offset(&self) -> usize {

Review Comment:
   what is the meaning of `offset` in this new world where the offset is stored on the underlying typed array? Maybe we should remove it from the Array trait 🤔 



##########
arrow-array/src/array/map_array.rs:
##########
@@ -229,34 +244,54 @@ impl Array for MapArray {
         self
     }
 
-    fn data(&self) -> &ArrayData {
-        &self.data
-    }
-
     fn to_data(&self) -> ArrayData {
-        self.data.clone()
+        self.clone().into_data()
     }
 
     fn into_data(self) -> ArrayData {
         self.into()
     }
 
+    fn data_type(&self) -> &DataType {
+        &self.data_type
+    }
+
     fn slice(&self, offset: usize, length: usize) -> ArrayRef {
         Arc::new(self.slice(offset, length))
     }
 
+    fn len(&self) -> usize {
+        self.value_offsets.len() - 1
+    }
+
+    fn is_empty(&self) -> bool {
+        self.value_offsets.len() <= 1
+    }
+
+    fn offset(&self) -> usize {
+        0
+    }
+
     fn nulls(&self) -> Option<&NullBuffer> {
-        self.data.nulls()
+        self.nulls.as_ref()
     }
 
-    /// Returns the total number of bytes of memory occupied by the buffers owned by this [MapArray].
     fn get_buffer_memory_size(&self) -> usize {
-        self.data.get_buffer_memory_size()
+        let mut size = self.entries.get_buffer_memory_size();

Review Comment:
   What about the other fields like `keys`, `values`, etc?



##########
arrow-array/src/array/mod.rs:
##########
@@ -184,19 +164,15 @@ pub trait Array: std::fmt::Debug + Send + Sync {
     /// # Example:
     ///
     /// ```
-    /// use arrow_array::{Array, Int32Array};
+    /// use arrow_array::{Array, BooleanArray};

Review Comment:
   I think we should deprecate this API for the reasons explained abve



##########
arrow-array/src/array/dictionary_array.rs:
##########
@@ -325,8 +323,12 @@ impl<K: ArrowDictionaryKeyType> DictionaryArray<K> {
 
     /// Returns a zero-copy slice of this array with the indicated offset and length.
     pub fn slice(&self, offset: usize, length: usize) -> Self {
-        // TODO: Slice buffers directly (#3880)
-        self.data.slice(offset, length).into()
+        Self {
+            data_type: self.data_type.clone(),
+            keys: self.keys.slice(offset, length),
+            values: self.values.clone(),

Review Comment:
   the values aren't sliced because the keys may still refer to any logical value in the dictionary



##########
arrow-ipc/src/writer.rs:
##########
@@ -1107,94 +1107,29 @@ fn get_buffer_element_width(spec: &BufferSpec) -> usize {
     }
 }
 
-/// Returns byte width for binary value_offset buffer spec.
-#[inline]
-fn get_value_offset_byte_width(data_type: &DataType) -> usize {
-    match data_type {
-        DataType::Binary | DataType::Utf8 => 4,
-        DataType::LargeBinary | DataType::LargeUtf8 => 8,
-        _ => unreachable!(),
+/// Returns the values and offsets [`Buffer`] for a ByteArray with offset type `O`
+fn get_byte_array_buffers<O: OffsetSizeTrait>(data: &ArrayData) -> (Buffer, Buffer) {
+    if data.is_empty() {
+        return (MutableBuffer::new(0).into(), MutableBuffer::new(0).into());
     }
-}
 
-/// Returns the number of total bytes in base binary arrays.
-fn get_binary_buffer_len(array_data: &ArrayData) -> usize {
-    if array_data.is_empty() {
-        return 0;
-    }
-    match array_data.data_type() {
-        DataType::Binary => {
-            let array: BinaryArray = array_data.clone().into();
-            let offsets = array.value_offsets();
-            (offsets[array_data.len()] - offsets[0]) as usize
-        }
-        DataType::LargeBinary => {
-            let array: LargeBinaryArray = array_data.clone().into();
-            let offsets = array.value_offsets();
-            (offsets[array_data.len()] - offsets[0]) as usize
-        }
-        DataType::Utf8 => {
-            let array: StringArray = array_data.clone().into();
-            let offsets = array.value_offsets();
-            (offsets[array_data.len()] - offsets[0]) as usize
-        }
-        DataType::LargeUtf8 => {
-            let array: LargeStringArray = array_data.clone().into();
-            let offsets = array.value_offsets();
-            (offsets[array_data.len()] - offsets[0]) as usize
-        }
-        _ => unreachable!(),
-    }
-}
+    let buffers = data.buffers();
+    let offsets: &[O] = buffers[0].typed_data::<O>();
+    let offset_slice = &offsets[data.offset()..data.offset() + data.len() + 1];
 
-/// Rebase value offsets for given ArrayData to zero-based.
-fn get_zero_based_value_offsets<OffsetSize: OffsetSizeTrait>(
-    array_data: &ArrayData,
-) -> Buffer {
-    match array_data.data_type() {
-        DataType::Binary | DataType::LargeBinary => {
-            let array: GenericBinaryArray<OffsetSize> = array_data.clone().into();
-            let offsets = array.value_offsets();
-            let start_offset = offsets[0];
-
-            let mut builder = BufferBuilder::<OffsetSize>::new(array_data.len() + 1);
-            for x in offsets {
-                builder.append(*x - start_offset);
-            }
-
-            builder.finish()
-        }
-        DataType::Utf8 | DataType::LargeUtf8 => {
-            let array: GenericStringArray<OffsetSize> = array_data.clone().into();
-            let offsets = array.value_offsets();
-            let start_offset = offsets[0];
-
-            let mut builder = BufferBuilder::<OffsetSize>::new(array_data.len() + 1);
-            for x in offsets {
-                builder.append(*x - start_offset);
-            }
+    let start_offset = offset_slice.first().unwrap();
+    let end_offset = offset_slice.last().unwrap();
 
-            builder.finish()
-        }
-        _ => unreachable!(),
-    }
-}
+    let offsets = match start_offset.as_usize() {

Review Comment:
   could you please add some comments about what this is doing (namely that it is "rebasing" the offsets to all be zero based)? That would be easy to overlook and was somewhat encoded in the method names previously



##########
arrow-array/src/array/list_array.rs:
##########
@@ -649,11 +693,10 @@ mod tests {
 
         let sliced_array = list_array.slice(1, 6);
         assert_eq!(6, sliced_array.len());
-        assert_eq!(1, sliced_array.offset());
         assert_eq!(3, sliced_array.null_count());
 
         for i in 0..sliced_array.len() {
-            if bit_util::get_bit(&null_bits, sliced_array.offset() + i) {
+            if bit_util::get_bit(&null_bits, 1 + i) {

Review Comment:
   As above I think maybe it would make sense to at least deprecate `Array::offset` as it is always 0 now (and as part of the deprecation message explain why)



##########
arrow-array/src/array/union_array.rs:
##########
@@ -290,16 +290,32 @@ impl UnionArray {
 
     /// Returns whether the `UnionArray` is dense (or sparse if `false`).
     fn is_dense(&self) -> bool {
-        match self.data.data_type() {
+        match self.data_type() {
             DataType::Union(_, mode) => mode == &UnionMode::Dense,
             _ => unreachable!("Union array's data type is not a union!"),
         }
     }
 
     /// Returns a zero-copy slice of this array with the indicated offset and length.
     pub fn slice(&self, offset: usize, length: usize) -> Self {
-        // TODO: Slice buffers directly (#3880)
-        self.data.slice(offset, length).into()
+        let (offsets, fields) = match self.offsets.as_ref() {
+            Some(offsets) => (Some(offsets.slice(offset, length)), self.fields.clone()),

Review Comment:
   I was confused by this at first -- some comments here (and/or in the `self.offsets` I think would help
   
   Specifically I think  if `self.offsets` is `SOme(..)` that means `DenseUnion` and the offsets point to the correct children so slicing the offsets but not the actual children makes sense
   
   
   if `self.offsets` is None, then it is a sparse union, so all the children line up and we need to slice them individually.
   
   I had to re-read https://arrow.apache.org/docs/format/Columnar.html#union-layout to make sure



##########
arrow-array/src/array/fixed_size_binary_array.rs:
##########
@@ -94,26 +96,38 @@ impl FixedSizeBinaryArray {
     /// Note this doesn't do any bound checking, for performance reason.
     #[inline]
     pub fn value_offset(&self, i: usize) -> i32 {
-        self.value_offset_at(self.data.offset() + i)
+        self.value_offset_at(self.offset() + i)
     }
 
     /// Returns the length for an element.
     ///
     /// All elements have the same length as the array is a fixed size.
     #[inline]
     pub fn value_length(&self) -> i32 {
-        self.length

Review Comment:
   this is a bug, right?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org