You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/17 10:45:43 UTC

[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1577: Add utf-8 validation checking for `substring`

tustvold commented on code in PR #1577:
URL: https://github.com/apache/arrow-rs/pull/1577#discussion_r851734570


##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -35,21 +36,39 @@ fn generic_substring<OffsetSize: StringOffsetSizeTrait>(
     let data = values.as_slice();
     let zero = OffsetSize::zero();
 
-    let cal_new_start: Box<dyn Fn(OffsetSize, OffsetSize) -> OffsetSize> = if start
-        >= zero
-    {
-        // count from the start of string
-        Box::new(|old_start: OffsetSize, end: OffsetSize| (old_start + start).min(end))
-    } else {
-        // count from the end of string
-        Box::new(|old_start: OffsetSize, end: OffsetSize| (end + start).max(old_start))
+    // check the `idx` is a valid char boundary or not
+    // If yes, return `idx`, else return error
+    let check_char_boundary = |idx: OffsetSize| {

Review Comment:
   I wonder if we could instead interpret `data` as `&str` with `std::str::from_utf8_unchecked`, which should be valid if this is a valid string array, and then use the standard library [is_char_boundary](https://doc.rust-lang.org/std/primitive.str.html#method.is_char_boundary) instead of implementing our own



##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -59,13 +78,14 @@ fn generic_substring<OffsetSize: StringOffsetSizeTrait>(
     let mut len_so_far = zero;
     new_offsets.push(zero);
 
-    offsets.windows(2).for_each(|pair| {
-        let new_start = cal_new_start(pair[0], pair[1]);
-        let new_length = cal_new_length(new_start, pair[1]);
-        len_so_far += new_length;
-        new_starts_ends.push((new_start, new_start + new_length));
+    offsets.windows(2).try_for_each(|pair| -> Result<()> {

Review Comment:
   I'm curious if the dyn dispatch functions perform better than letting LLVM perform loop unswitching, I would have thought this would be both clearer and faster



##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -35,21 +36,39 @@ fn generic_substring<OffsetSize: StringOffsetSizeTrait>(
     let data = values.as_slice();
     let zero = OffsetSize::zero();
 
-    let cal_new_start: Box<dyn Fn(OffsetSize, OffsetSize) -> OffsetSize> = if start
-        >= zero
-    {
-        // count from the start of string
-        Box::new(|old_start: OffsetSize, end: OffsetSize| (old_start + start).min(end))
-    } else {
-        // count from the end of string
-        Box::new(|old_start: OffsetSize, end: OffsetSize| (end + start).max(old_start))
+    // check the `idx` is a valid char boundary or not
+    // If yes, return `idx`, else return error
+    let check_char_boundary = |idx: OffsetSize| {
+        let idx_usize = idx.to_usize().unwrap();
+        if idx_usize == data.len() || data[idx_usize] >> 6 != 0b10_u8 {

Review Comment:
   ```suggestion
           // A valid code-point iff it does not start with 0b10xxxxxx
           // Bit-magic taken from `std::str::is_char_boundary`
           if idx_usize == data.len() || (data[idx_usize] as i8) < -0x40 {
   ```
   
   There is a bit-magic trick to save an operator, although it is possible LLVM is smart enough to notice this. I also think it would be better to just use the standard library version directly if possible.



##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -35,21 +36,39 @@ fn generic_substring<OffsetSize: StringOffsetSizeTrait>(
     let data = values.as_slice();
     let zero = OffsetSize::zero();
 
-    let cal_new_start: Box<dyn Fn(OffsetSize, OffsetSize) -> OffsetSize> = if start
-        >= zero
-    {
-        // count from the start of string
-        Box::new(|old_start: OffsetSize, end: OffsetSize| (old_start + start).min(end))
-    } else {
-        // count from the end of string
-        Box::new(|old_start: OffsetSize, end: OffsetSize| (end + start).max(old_start))
+    // check the `idx` is a valid char boundary or not
+    // If yes, return `idx`, else return error
+    let check_char_boundary = |idx: OffsetSize| {
+        let idx_usize = idx.to_usize().unwrap();
+        if idx_usize == data.len() || data[idx_usize] >> 6 != 0b10_u8 {
+            Ok(idx)
+        } else {
+            Err(ArrowError::ComputeError(format!("The index {} is invalid because {:#010b} is not the head byte of a char.", idx_usize, data[idx_usize])))
+        }
     };
 
-    let cal_new_length: Box<dyn Fn(OffsetSize, OffsetSize) -> OffsetSize> =
-        if let Some(length) = length {
-            Box::new(|start: OffsetSize, end: OffsetSize| (*length).min(end - start))
-        } else {
-            Box::new(|start: OffsetSize, end: OffsetSize| end - start)
+    // function for calculating the start index of each string
+    let cal_new_start: Box<dyn Fn(OffsetSize, OffsetSize) -> Result<OffsetSize>> =
+        match start.cmp(&zero) {
+            // count from the start of string
+            // and check it is a valid char boundary
+            Ordering::Greater => Box::new(|old_start, end| {
+                check_char_boundary((old_start + start).min(end))
+            }),
+            Ordering::Equal => Box::new(|old_start, _| Ok(old_start)),

Review Comment:
   :ok_hand: 



##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -308,4 +325,25 @@ mod tests {
     fn without_nulls_large_string() -> Result<()> {
         without_nulls::<LargeStringArray>()
     }
+
+    #[test]
+    fn check_invalid_array_type() {
+        let array = Int32Array::from(vec![Some(1), Some(2), Some(3)]);
+        assert_eq!(substring(&array, 0, None).is_err(), true);

Review Comment:
   I think it would make these tests easier to read, and also potentially less fragile if they checked the error message, e.g. something like
   
   ```
   let err = substring(&array, 0, None).unwrap_err().to_string();
   assert!(err.contains("foo"), "{}", err);
   ```



##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -35,21 +36,39 @@ fn generic_substring<OffsetSize: StringOffsetSizeTrait>(
     let data = values.as_slice();
     let zero = OffsetSize::zero();
 
-    let cal_new_start: Box<dyn Fn(OffsetSize, OffsetSize) -> OffsetSize> = if start
-        >= zero
-    {
-        // count from the start of string
-        Box::new(|old_start: OffsetSize, end: OffsetSize| (old_start + start).min(end))
-    } else {
-        // count from the end of string
-        Box::new(|old_start: OffsetSize, end: OffsetSize| (end + start).max(old_start))
+    // check the `idx` is a valid char boundary or not
+    // If yes, return `idx`, else return error
+    let check_char_boundary = |idx: OffsetSize| {
+        let idx_usize = idx.to_usize().unwrap();
+        if idx_usize == data.len() || data[idx_usize] >> 6 != 0b10_u8 {
+            Ok(idx)
+        } else {
+            Err(ArrowError::ComputeError(format!("The index {} is invalid because {:#010b} is not the head byte of a char.", idx_usize, data[idx_usize])))
+        }
     };
 
-    let cal_new_length: Box<dyn Fn(OffsetSize, OffsetSize) -> OffsetSize> =
-        if let Some(length) = length {
-            Box::new(|start: OffsetSize, end: OffsetSize| (*length).min(end - start))
-        } else {
-            Box::new(|start: OffsetSize, end: OffsetSize| end - start)
+    // function for calculating the start index of each string
+    let cal_new_start: Box<dyn Fn(OffsetSize, OffsetSize) -> Result<OffsetSize>> =
+        match start.cmp(&zero) {
+            // count from the start of string
+            // and check it is a valid char boundary
+            Ordering::Greater => Box::new(|old_start, end| {
+                check_char_boundary((old_start + start).min(end))
+            }),
+            Ordering::Equal => Box::new(|old_start, _| Ok(old_start)),
+            // count from the end of string
+            Ordering::Less => Box::new(|old_start, end| {
+                check_char_boundary((end + start).max(old_start))
+            }),
+        };
+
+    // function for calculating the end index of each string

Review Comment:
   Some documentation of this functions inputs and output would be helpful I think



##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -35,21 +36,39 @@ fn generic_substring<OffsetSize: StringOffsetSizeTrait>(
     let data = values.as_slice();
     let zero = OffsetSize::zero();
 
-    let cal_new_start: Box<dyn Fn(OffsetSize, OffsetSize) -> OffsetSize> = if start
-        >= zero
-    {
-        // count from the start of string
-        Box::new(|old_start: OffsetSize, end: OffsetSize| (old_start + start).min(end))
-    } else {
-        // count from the end of string
-        Box::new(|old_start: OffsetSize, end: OffsetSize| (end + start).max(old_start))
+    // check the `idx` is a valid char boundary or not
+    // If yes, return `idx`, else return error
+    let check_char_boundary = |idx: OffsetSize| {
+        let idx_usize = idx.to_usize().unwrap();
+        if idx_usize == data.len() || data[idx_usize] >> 6 != 0b10_u8 {
+            Ok(idx)
+        } else {
+            Err(ArrowError::ComputeError(format!("The index {} is invalid because {:#010b} is not the head byte of a char.", idx_usize, data[idx_usize])))
+        }
     };
 
-    let cal_new_length: Box<dyn Fn(OffsetSize, OffsetSize) -> OffsetSize> =
-        if let Some(length) = length {
-            Box::new(|start: OffsetSize, end: OffsetSize| (*length).min(end - start))
-        } else {
-            Box::new(|start: OffsetSize, end: OffsetSize| end - start)
+    // function for calculating the start index of each string

Review Comment:
   ```suggestion
       // Function for calculating the start index of each string
       // 
       // Takes the start and end offsets of the original string, and returns the byte offset
       // to start copying data from the source values array
       //
       // Returns an error if this offset is not at a valid UTF-8 char boundary
   ```
   
   Or something to that effect, it wasn't clear to me whether this was the offset in the new array or the old



##########
arrow/benches/string_kernels.rs:
##########
@@ -39,10 +39,10 @@ fn add_benchmark(c: &mut Criterion) {
     let str_len = 1000;
 
     let arr_string = create_string_array_with_len::<i32>(size, 0.0, str_len);
-    let start = 0;
+    let start = 1;

Review Comment:
   Perhaps we could benchmark the various permutations of no start offset and no length?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org