You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/06/02 17:26:54 UTC

[GitHub] [arrow-rs] jhorstmann commented on a change in pull request #382: make sure that only concat preallocates buffers

jhorstmann commented on a change in pull request #382:
URL: https://github.com/apache/arrow-rs/pull/382#discussion_r644175597



##########
File path: arrow/src/array/transform/mod.rs
##########
@@ -338,33 +338,61 @@ fn preallocate_str_buffer<Offset: StringOffsetSizeTrait>(
     } else {
         buffer.push(0i32)
     }
-    let str_values_size = arrays
-        .iter()
-        .map(|data| {
-            // get the length of the value buffer
-            let buf_len = data.buffers()[1].len();
-            // find the offset of the buffer
-            // this returns a slice of offsets, starting from the offset of the array
-            // so we can take the first value
-            let offset = data.buffer::<Offset>(0)[0];
-            buf_len - offset.to_usize().unwrap()
-        })
-        .sum::<usize>();
 
     [
         buffer,
-        MutableBuffer::new(str_values_size * mem::size_of::<u8>()),
+        MutableBuffer::new(binary_size * mem::size_of::<u8>()),
     ]
 }
 
+/// Define capacities of child data or data buffers.
+#[derive(Debug, Clone)]
+pub enum Capacities {
+    /// Binary, Utf8 and LargeUtf8 data types
+    /// Define
+    /// * the capacity of the array offsets
+    /// * the capacity of the binary/ str buffer
+    Binary(usize, Option<usize>),
+    /// List and LargeList data types
+    /// Define
+    /// * the capacity of the array offsets
+    /// * the capacity of the child data
+    List(usize, Option<Box<Capacities>>),
+    /// Struct type
+    /// * the capacity of the array
+    /// * the capacities of the fields
+    Struct(usize, Option<Vec<Capacities>>),
+    /// Dictionary type
+    /// * the capacity of the array

Review comment:
       What is the difference here between the "capacity of the array" and "the capacity of the keys"?
   
   I have a rather special usage of concat for dictionary arrays, where I can guarantee that the dictionary values of all arrays are the same, or at least that the dictionary of the last array includes all previous dictionaries in the same order. I wonder whether that could be supported here and also fix the current concatenation logic that could lead to duplicated dictionary values. The rough idea would be to have another field here with an
   ```
   enum DictionaryMergeStrategy {
       Append(Option<usize>), // current logic, just append dictionary values with an optional capacity
       Merge(Option<usize>), // actually merge the dictionaries using a hashmap, to be imlemented in a followup
       UseLastValues, // use the values of the last dictionary array, might want to validate that this has a size at least as large as all other dictionaries so all keys are guaranteed to be in range
   }
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org