You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/25 20:42:47 UTC

[GitHub] [arrow-rs] alamb commented on a change in pull request #1225: Improve MutableArrayData Null Handling (#1224) (#1230)

alamb commented on a change in pull request #1225:
URL: https://github.com/apache/arrow-rs/pull/1225#discussion_r792097734



##########
File path: arrow/src/array/transform/mod.rs
##########
@@ -377,9 +375,12 @@ impl<'a> MutableArrayData<'a> {
     /// returns a new [MutableArrayData] with capacity to `capacity` slots and specialized to create an
     /// [ArrayData] from multiple `arrays`.
     ///
-    /// `use_nulls` is a flag used to optimize insertions. It should be `false` if the only source of nulls
-    /// are the arrays themselves and `true` if the user plans to call [MutableArrayData::extend_nulls].
-    /// In other words, if `use_nulls` is `false`, calling [MutableArrayData::extend_nulls] should not be used.
+    /// `use_nulls` is a flag used to optimize insertions, if `use_nulls` is `true` a null bitmap
+    /// will be created regardless of the contents of `arrays`, otherwise a null bitmap will
+    /// be computed only if `arrays` contains nulls.
+    ///
+    /// Code that plans to call [MutableArrayData::extend_nulls] MUST set `use_nulls` to `true`,
+    /// in order to ensure that a null bitmap is computed.

Review comment:
       ```suggestion
       /// in order to ensure that a null bitmap is computed, otherwise a panic will result.
   ```

##########
File path: arrow/src/array/transform/mod.rs
##########
@@ -377,9 +375,12 @@ impl<'a> MutableArrayData<'a> {
     /// returns a new [MutableArrayData] with capacity to `capacity` slots and specialized to create an
     /// [ArrayData] from multiple `arrays`.
     ///
-    /// `use_nulls` is a flag used to optimize insertions. It should be `false` if the only source of nulls
-    /// are the arrays themselves and `true` if the user plans to call [MutableArrayData::extend_nulls].
-    /// In other words, if `use_nulls` is `false`, calling [MutableArrayData::extend_nulls] should not be used.
+    /// `use_nulls` is a flag used to optimize insertions, if `use_nulls` is `true` a null bitmap
+    /// will be created regardless of the contents of `arrays`, otherwise a null bitmap will
+    /// be computed only if `arrays` contains nulls.
+    ///
+    /// Code that plans to call [MutableArrayData::extend_nulls] MUST set `use_nulls` to `true`,
+    /// in order to ensure that a null bitmap is computed.

Review comment:
       I think it will panic otherwise, right? Rather than get undefined behavior?

##########
File path: arrow/src/array/transform/mod.rs
##########
@@ -556,19 +557,20 @@ impl<'a> MutableArrayData<'a> {
             _ => (None, false),
         };
 
-        let extend_nulls = build_extend_nulls(data_type);
-
-        let extend_null_bits = arrays
-            .iter()
-            .map(|array| build_extend_null_bits(array, use_nulls))
-            .collect();
-
-        let null_buffer = if use_nulls {
-            let null_bytes = bit_util::ceil(array_capacity, 8);
-            MutableBuffer::from_len_zeroed(null_bytes)
+        let (null_buffer, extend_nulls, extend_null_bits) = if use_nulls {
+            let extend_null_bits = arrays
+                .iter()
+                .map(|array| build_extend_null_bits(array))
+                .collect();
+
+            (
+                Some(BooleanBufferBuilder::new(array_capacity)),
+                Some(build_extend_nulls(data_type)),
+                extend_null_bits,
+            )
         } else {
-            // create 0 capacity mutable buffer with the intention that it won't be used
-            MutableBuffer::with_capacity(0)
+            // create no null buffer and no extend_null_bits

Review comment:
       nice 👍 

##########
File path: arrow/src/array/transform/mod.rs
##########
@@ -1452,6 +1465,30 @@ mod tests {
         assert_eq!(&result, expected.data());
     }
 
+    #[test]
+    #[should_panic(
+        expected = "Cannot append nulls to MutableArrayData created with nulls disabled"
+    )]
+    fn test_disabled_nulls() {
+        let ints: ArrayRef =
+            Arc::new(Int32Array::from(vec![Some(1), Some(2), Some(4), Some(5)]));
+
+        let mut data = MutableArrayData::new(vec![ints.data()], false, 3);
+        data.extend(0, 1, 2);
+        data.extend_nulls(1);
+    }
+
+    #[test]
+    fn test_nulls() {
+        let ints: ArrayRef = Arc::new(Int32Array::from(vec![1]));
+        let mut data = MutableArrayData::new(vec![ints.data()], true, 3);
+        data.extend_nulls(9);
+        let data = data.freeze();
+
+        assert_eq!(data.len(), 9);
+        assert_eq!(data.null_buffer().unwrap().len(), 2);

Review comment:
       2 because there are two bytes needed to hold the bitmap?

##########
File path: arrow/src/array/transform/mod.rs
##########
@@ -608,18 +614,28 @@ impl<'a> MutableArrayData<'a> {
     /// This function panics if the range is out of bounds, i.e. if `start + len >= array.len()`.
     pub fn extend(&mut self, index: usize, start: usize, end: usize) {
         let len = end - start;
-        (self.extend_null_bits[index])(&mut self.data, start, len);
+        if !self.extend_null_bits.is_empty() {
+            (self.extend_null_bits[index])(&mut self.data, start, len);

Review comment:
       I believe @tustvold has resolved this in the latest version of this PR -- by making it clear that `use_nulls` must be set to `true` otherwise a `panic` will result if the buffer is extended with nulls




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org