You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "jackwener (via GitHub)" <gi...@apache.org> on 2023/06/22 06:30:33 UTC

[GitHub] [arrow-rs] jackwener opened a new pull request, #4442: unify substring for binary&utf8

jackwener opened a new pull request, #4442:
URL: https://github.com/apache/arrow-rs/pull/4442

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #.
   
   # Rationale for this change
    
   <!--
   Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
   Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.
   -->
   
   # What changes are included in this PR?
   
   After arrow-rs own `GenericByteArray`, we can unify `substring` of `binary` & `utf8`.
   
   
   # Are there any user-facing changes?
   
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!---
   If there are any breaking changes to public APIs, please add the `breaking change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] jackwener commented on pull request #4442: unify substring for binary&utf8

Posted by "jackwener (via GitHub)" <gi...@apache.org>.
jackwener commented on PR #4442:
URL: https://github.com/apache/arrow-rs/pull/4442#issuecomment-1603119097

   > My only concern is about utf8 boundary checking
   
   Thanks for @alamb review, I have added code to check just when array is StringArray/LargeStringArray


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on a diff in pull request #4442: unify substring for binary&utf8

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #4442:
URL: https://github.com/apache/arrow-rs/pull/4442#discussion_r1240008635


##########
arrow-string/src/substring.rs:
##########
@@ -1020,4 +968,17 @@ mod tests {
         let err = substring(&array, 0, Some(5)).unwrap_err().to_string();
         assert!(err.contains("invalid utf-8 boundary"));
     }
+
+    #[test]
+    fn non_utf8_bytes() {

Review Comment:
   ❤️ 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold merged pull request #4442: unify substring for binary&utf8

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold merged PR #4442:
URL: https://github.com/apache/arrow-rs/pull/4442


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on a diff in pull request #4442: unify substring for binary&utf8

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #4442:
URL: https://github.com/apache/arrow-rs/pull/4442#discussion_r1238589605


##########
arrow-string/src/substring.rs:
##########
@@ -246,36 +247,59 @@ fn get_start_end_offset(
     (start_offset, end_offset)
 }
 
-fn binary_substring<OffsetSize: OffsetSizeTrait>(
-    array: &GenericBinaryArray<OffsetSize>,
-    start: OffsetSize,
-    length: Option<OffsetSize>,
-) -> Result<ArrayRef, ArrowError> {
+fn byte_substring<T: ByteArrayType>(
+    array: &GenericByteArray<T>,
+    start: T::Offset,
+    length: Option<T::Offset>,
+) -> Result<ArrayRef, ArrowError>
+where
+    <T as ByteArrayType>::Native: PartialEq,
+{
     let offsets = array.value_offsets();
     let data = array.value_data();
-    let zero = OffsetSize::zero();
+    let zero = <T::Offset as Zero>::zero();
+
+    // Check if `offset` is at a valid char boundary.
+    // If yes, return `offset`, else return error
+    let check_char_boundary = {
+        // Safety: a StringArray must contain valid UTF8 data
+        let data_str = unsafe { std::str::from_utf8_unchecked(data) };
+        |offset: T::Offset| {
+            let offset_usize = offset.as_usize();
+            if data_str.is_char_boundary(offset_usize) {
+                Ok(offset)
+            } else {
+                Err(ArrowError::ComputeError(format!(
+                    "The offset {offset_usize} is at an invalid boundary."

Review Comment:
   I think we should continue to include 'utf-8' in the error message 
   
   ```suggestion
                       "The offset {offset_usize} is at an invalid utf-8 boundary."
   ```



##########
arrow-string/src/substring.rs:
##########
@@ -246,36 +247,59 @@ fn get_start_end_offset(
     (start_offset, end_offset)
 }
 
-fn binary_substring<OffsetSize: OffsetSizeTrait>(
-    array: &GenericBinaryArray<OffsetSize>,
-    start: OffsetSize,
-    length: Option<OffsetSize>,
-) -> Result<ArrayRef, ArrowError> {
+fn byte_substring<T: ByteArrayType>(
+    array: &GenericByteArray<T>,
+    start: T::Offset,
+    length: Option<T::Offset>,
+) -> Result<ArrayRef, ArrowError>
+where
+    <T as ByteArrayType>::Native: PartialEq,
+{
     let offsets = array.value_offsets();
     let data = array.value_data();
-    let zero = OffsetSize::zero();
+    let zero = <T::Offset as Zero>::zero();
+
+    // Check if `offset` is at a valid char boundary.

Review Comment:
   I may be misunderstanding something, but I think utf8 checking should only be done for `StringArray` and `LargeStringArray`, not `BinaryArray` and `LargeBinaryArray` which can each contain non utf-8 data
   
   From my reading of this code it will always check that we are not splitting across utf8 boundaries
   
   However, given there are no tests that fail, perhaps we do not have coverage for this case 🤔 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] jackwener commented on a diff in pull request #4442: unify substring for binary&utf8

Posted by "jackwener (via GitHub)" <gi...@apache.org>.
jackwener commented on code in PR #4442:
URL: https://github.com/apache/arrow-rs/pull/4442#discussion_r1238880350


##########
arrow-string/src/substring.rs:
##########
@@ -246,36 +247,59 @@ fn get_start_end_offset(
     (start_offset, end_offset)
 }
 
-fn binary_substring<OffsetSize: OffsetSizeTrait>(
-    array: &GenericBinaryArray<OffsetSize>,
-    start: OffsetSize,
-    length: Option<OffsetSize>,
-) -> Result<ArrayRef, ArrowError> {
+fn byte_substring<T: ByteArrayType>(
+    array: &GenericByteArray<T>,
+    start: T::Offset,
+    length: Option<T::Offset>,
+) -> Result<ArrayRef, ArrowError>
+where
+    <T as ByteArrayType>::Native: PartialEq,
+{
     let offsets = array.value_offsets();
     let data = array.value_data();
-    let zero = OffsetSize::zero();
+    let zero = <T::Offset as Zero>::zero();
+
+    // Check if `offset` is at a valid char boundary.

Review Comment:
   Thanks for @alamb review



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org