Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/09/25 13:49:52 UTC

[GitHub] [arrow] jhorstmann commented on a change in pull request #8260: ARROW-10084: [Rust] [DataFusion] Added length of LargeStringArray and fixed undefined behavior.

jhorstmann commented on a change in pull request #8260:
URL: https://github.com/apache/arrow/pull/8260#discussion_r494974111



##########
File path: rust/arrow/src/compute/kernels/length.rs
##########
@@ -17,52 +17,56 @@
 
 //! Defines kernel for length of a string array
 
-use crate::array::*;
+use crate::datatypes::ToByteSlice;
+use crate::{array::*, buffer::Buffer};
 use crate::{
     datatypes::DataType,
-    datatypes::UInt32Type,
     error::{ArrowError, Result},
 };
 use std::sync::Arc;
 
-/// Returns an array of UInt32 denoting the number of characters in each string in the array.
+fn length_string<OffsetSize>(array: &Array, data_type: DataType) -> Result<ArrayRef>
+where
+    OffsetSize: OffsetSizeTrait,
+{
+    // note: offsets are stored as u8, but they can be interpreted as OffsetSize
+    let offsets = array.data_ref().clone().buffers()[0].clone();
+    // this is a 30% improvement over iterating over u8s and building OffsetSize, which
+    // justifies the usage of `unsafe`.
+    let slice: &[OffsetSize] = unsafe { offsets.typed_data::<OffsetSize>() };

Review comment:
       To support sliced arrays, this needs to take the array's offset into account. The following should work, but a test case would be nice:
   
   ```suggestion
       let slice: &[OffsetSize] = &unsafe { offsets.typed_data::<OffsetSize>() }[array.offset()..];
   ```
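
To illustrate why the offset matters, here is a minimal sketch that works on plain slices rather than Arrow buffers. The helper `lengths_from_offsets` and its parameters are hypothetical names for illustration only, not part of the arrow crate, and `i32` stands in for `OffsetSize`:

```rust
/// Hypothetical helper (not part of the arrow crate): computes the length of
/// each string in a logical slice of a string array, given the full offsets
/// buffer, the array's logical offset, and its logical length.
fn lengths_from_offsets(offsets: &[i32], array_offset: usize, len: usize) -> Vec<i32> {
    // Skip the offsets that belong to elements before the start of the slice.
    let slice = &offsets[array_offset..];
    // Element i spans slice[i]..slice[i + 1], so its length is the difference.
    slice
        .windows(2)
        .take(len)
        .map(|w| w[1] - w[0])
        .collect()
}
```

For the values `["hello", " ", "world", "!"]` the offsets buffer is `[0, 5, 6, 11, 12]`. A slice with offset 1 and length 2 covers `" "` and `"world"`, so `lengths_from_offsets(&[0, 5, 6, 11, 12], 1, 2)` gives `[1, 5]`. Ignoring the offset (passing 0) would give `[5, 1]`, which are the lengths of the wrong elements.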



