You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/06/22 14:23:24 UTC

[GitHub] [arrow-rs] alamb commented on a diff in pull request #4442: unify substring for binary&utf8

alamb commented on code in PR #4442:
URL: https://github.com/apache/arrow-rs/pull/4442#discussion_r1238589605


##########
arrow-string/src/substring.rs:
##########
@@ -246,36 +247,59 @@ fn get_start_end_offset(
     (start_offset, end_offset)
 }
 
-fn binary_substring<OffsetSize: OffsetSizeTrait>(
-    array: &GenericBinaryArray<OffsetSize>,
-    start: OffsetSize,
-    length: Option<OffsetSize>,
-) -> Result<ArrayRef, ArrowError> {
+fn byte_substring<T: ByteArrayType>(
+    array: &GenericByteArray<T>,
+    start: T::Offset,
+    length: Option<T::Offset>,
+) -> Result<ArrayRef, ArrowError>
+where
+    <T as ByteArrayType>::Native: PartialEq,
+{
     let offsets = array.value_offsets();
     let data = array.value_data();
-    let zero = OffsetSize::zero();
+    let zero = <T::Offset as Zero>::zero();
+
+    // Check if `offset` is at a valid char boundary.
+    // If yes, return `offset`, else return error
+    let check_char_boundary = {
+        // Safety: a StringArray must contain valid UTF8 data
+        let data_str = unsafe { std::str::from_utf8_unchecked(data) };
+        |offset: T::Offset| {
+            let offset_usize = offset.as_usize();
+            if data_str.is_char_boundary(offset_usize) {
+                Ok(offset)
+            } else {
+                Err(ArrowError::ComputeError(format!(
+                    "The offset {offset_usize} is at an invalid boundary."

Review Comment:
   I think we should continue to include 'utf-8' in the error message 
   
   ```suggestion
                       "The offset {offset_usize} is at an invalid utf-8 boundary."
   ```



##########
arrow-string/src/substring.rs:
##########
@@ -246,36 +247,59 @@ fn get_start_end_offset(
     (start_offset, end_offset)
 }
 
-fn binary_substring<OffsetSize: OffsetSizeTrait>(
-    array: &GenericBinaryArray<OffsetSize>,
-    start: OffsetSize,
-    length: Option<OffsetSize>,
-) -> Result<ArrayRef, ArrowError> {
+fn byte_substring<T: ByteArrayType>(
+    array: &GenericByteArray<T>,
+    start: T::Offset,
+    length: Option<T::Offset>,
+) -> Result<ArrayRef, ArrowError>
+where
+    <T as ByteArrayType>::Native: PartialEq,
+{
     let offsets = array.value_offsets();
     let data = array.value_data();
-    let zero = OffsetSize::zero();
+    let zero = <T::Offset as Zero>::zero();
+
+    // Check if `offset` is at a valid char boundary.

Review Comment:
   I may be misunderstanding something, but I think utf8 checking should only be done for `StringArray` and `LargeStringArray`, not `BinaryArray` and `LargeBinaryArray` which can each contain non utf-8 data
   
   From my reading of this code it will always check that we are not splitting across utf8 boundaries
   
   However, given there are no tests that fail, perhaps we do not have coverage for this case 🤔 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org