You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/17 09:39:32 UTC

[GitHub] [arrow-rs] HaoYang670 opened a new pull request, #1577: Add utf-8 validation checking for `substring`

HaoYang670 opened a new pull request, #1577:
URL: https://github.com/apache/arrow-rs/pull/1577

   # Which issue does this PR close?
   
   <!---
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #1575 
   
    If a substring is in the middle of a multibyte char, it will b
   
   # What changes are included in this PR?
   
   Return an error if the boundary of a substring is in the middle of a multibyte char.
   
   # Are there any user-facing changes?
   No.
   
   # Benchmark
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] HaoYang670 commented on a diff in pull request #1577: Add utf-8 validation checking for `substring`

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on code in PR #1577:
URL: https://github.com/apache/arrow-rs/pull/1577#discussion_r852011677


##########
arrow/benches/string_kernels.rs:
##########
@@ -39,10 +39,10 @@ fn add_benchmark(c: &mut Criterion) {
     let str_len = 1000;
 
     let arr_string = create_string_array_with_len::<i32>(size, 0.0, str_len);
-    let start = 0;
+    let start = 1;

Review Comment:
   The validation checking is only enabled when `start !=0` or `length != None`. So there is no performance cost for the first micro-bench



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] HaoYang670 commented on a diff in pull request #1577: Add utf-8 validation checking for `substring`

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on code in PR #1577:
URL: https://github.com/apache/arrow-rs/pull/1577#discussion_r852093474


##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -59,13 +78,14 @@ fn generic_substring<OffsetSize: StringOffsetSizeTrait>(
     let mut len_so_far = zero;
     new_offsets.push(zero);
 
-    offsets.windows(2).for_each(|pair| {
-        let new_start = cal_new_start(pair[0], pair[1]);
-        let new_length = cal_new_length(new_start, pair[1]);
-        len_so_far += new_length;
-        new_starts_ends.push((new_start, new_start + new_length));
+    offsets.windows(2).try_for_each(|pair| -> Result<()> {

Review Comment:
   Thank you for your explanation. Let me have a try!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] HaoYang670 commented on a diff in pull request #1577: Add utf-8 validation checking for `substring`

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on code in PR #1577:
URL: https://github.com/apache/arrow-rs/pull/1577#discussion_r851843714


##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -35,21 +36,39 @@ fn generic_substring<OffsetSize: StringOffsetSizeTrait>(
     let data = values.as_slice();
     let zero = OffsetSize::zero();
 
-    let cal_new_start: Box<dyn Fn(OffsetSize, OffsetSize) -> OffsetSize> = if start
-        >= zero
-    {
-        // count from the start of string
-        Box::new(|old_start: OffsetSize, end: OffsetSize| (old_start + start).min(end))
-    } else {
-        // count from the end of string
-        Box::new(|old_start: OffsetSize, end: OffsetSize| (end + start).max(old_start))
+    // check the `idx` is a valid char boundary or not
+    // If yes, return `idx`, else return error
+    let check_char_boundary = |idx: OffsetSize| {
+        let idx_usize = idx.to_usize().unwrap();
+        if idx_usize == data.len() || data[idx_usize] >> 6 != 0b10_u8 {
+            Ok(idx)
+        } else {
+            Err(ArrowError::ComputeError(format!("The index {} is invalid because {:#010b} is not the head byte of a char.", idx_usize, data[idx_usize])))
+        }
     };
 
-    let cal_new_length: Box<dyn Fn(OffsetSize, OffsetSize) -> OffsetSize> =
-        if let Some(length) = length {
-            Box::new(|start: OffsetSize, end: OffsetSize| (*length).min(end - start))
-        } else {
-            Box::new(|start: OffsetSize, end: OffsetSize| end - start)
+    // function for calculating the start index of each string

Review Comment:
   Done!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] HaoYang670 commented on a diff in pull request #1577: Add utf-8 validation checking for `substring`

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on code in PR #1577:
URL: https://github.com/apache/arrow-rs/pull/1577#discussion_r852010303


##########
arrow/benches/string_kernels.rs:
##########
@@ -39,10 +39,10 @@ fn add_benchmark(c: &mut Criterion) {
     let str_len = 1000;
 
     let arr_string = create_string_array_with_len::<i32>(size, 0.0, str_len);
-    let start = 0;
+    let start = 1;

Review Comment:
   Done! And I have updated the benchmark result.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1577: Add utf-8 validation checking for `substring`

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1577:
URL: https://github.com/apache/arrow-rs/pull/1577#discussion_r852076274


##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -59,13 +78,14 @@ fn generic_substring<OffsetSize: StringOffsetSizeTrait>(
     let mut len_so_far = zero;
     new_offsets.push(zero);
 
-    offsets.windows(2).for_each(|pair| {
-        let new_start = cal_new_start(pair[0], pair[1]);
-        let new_length = cal_new_length(new_start, pair[1]);
-        len_so_far += new_length;
-        new_starts_ends.push((new_start, new_start + new_length));
+    offsets.windows(2).try_for_each(|pair| -> Result<()> {

Review Comment:
   My point was that modern optimising compilers will automatically lift conditionals out of loops, in a process called loop unswitching, and that letting it do this can be faster not just from eliminating dynamic dispatch, but also letting it further optimise the code.
   
   I don't feel strongly about it, but thought I'd mention that you might be able to make the code clearer and faster by letting the compiler handle this



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1577: Add utf-8 validation checking for `substring`

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1577:
URL: https://github.com/apache/arrow-rs/pull/1577#discussion_r852752035


##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -35,37 +36,45 @@ fn generic_substring<OffsetSize: StringOffsetSizeTrait>(
     let data = values.as_slice();
     let zero = OffsetSize::zero();
 
-    let cal_new_start: Box<dyn Fn(OffsetSize, OffsetSize) -> OffsetSize> = if start
-        >= zero
-    {
-        // count from the start of string
-        Box::new(|old_start: OffsetSize, end: OffsetSize| (old_start + start).min(end))
-    } else {
-        // count from the end of string
-        Box::new(|old_start: OffsetSize, end: OffsetSize| (end + start).max(old_start))
+    // Check if `offset` is at a valid char boundary.
+    // If yes, return `offset`, else return error
+    let check_char_boundary = {
+        let data_str = unsafe { std::str::from_utf8_unchecked(data) };

Review Comment:
   ```suggestion
           // Safety: a StringArray must contain valid UTF8 data
           let data_str = unsafe { std::str::from_utf8_unchecked(data) };
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold merged pull request #1577: Add utf-8 validation checking for `substring`

Posted by GitBox <gi...@apache.org>.
tustvold merged PR #1577:
URL: https://github.com/apache/arrow-rs/pull/1577


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on a diff in pull request #1577: Add utf-8 validation checking for `substring`

Posted by GitBox <gi...@apache.org>.
viirya commented on code in PR #1577:
URL: https://github.com/apache/arrow-rs/pull/1577#discussion_r851800150


##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -35,21 +36,39 @@ fn generic_substring<OffsetSize: StringOffsetSizeTrait>(
     let data = values.as_slice();
     let zero = OffsetSize::zero();
 
-    let cal_new_start: Box<dyn Fn(OffsetSize, OffsetSize) -> OffsetSize> = if start
-        >= zero
-    {
-        // count from the start of string
-        Box::new(|old_start: OffsetSize, end: OffsetSize| (old_start + start).min(end))
-    } else {
-        // count from the end of string
-        Box::new(|old_start: OffsetSize, end: OffsetSize| (end + start).max(old_start))
+    // check the `idx` is a valid char boundary or not
+    // If yes, return `idx`, else return error
+    let check_char_boundary = |idx: OffsetSize| {
+        let idx_usize = idx.to_usize().unwrap();
+        if idx_usize == data.len() || data[idx_usize] >> 6 != 0b10_u8 {
+            Ok(idx)
+        } else {
+            Err(ArrowError::ComputeError(format!("The index {} is invalid because {:#010b} is not the head byte of a char.", idx_usize, data[idx_usize])))

Review Comment:
   The error message may not be understood by end-users. Maybe just say "The substring begins at index {} is an invalid utf8 char because it is not the first byte in a utf8 code point sequence"?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] codecov-commenter commented on pull request #1577: Add utf-8 validation checking for `substring`

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on PR #1577:
URL: https://github.com/apache/arrow-rs/pull/1577#issuecomment-1100845065

   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1577?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#1577](https://codecov.io/gh/apache/arrow-rs/pull/1577?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (a74dbea) into [master](https://codecov.io/gh/apache/arrow-rs/commit/786792c750b893781939d8d507b464db09ccc634?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (786792c) will **increase** coverage by `0.00%`.
   > The diff coverage is `100.00%`.
   
   > :exclamation: Current head a74dbea differs from pull request most recent head fce157b. Consider uploading reports for the commit fce157b to get more accurate results
   
   ```diff
   @@           Coverage Diff           @@
   ##           master    #1577   +/-   ##
   =======================================
     Coverage   82.88%   82.88%           
   =======================================
     Files         193      193           
     Lines       55307    55328   +21     
   =======================================
   + Hits        45839    45861   +22     
   + Misses       9468     9467    -1     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/1577?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [arrow/src/compute/kernels/substring.rs](https://codecov.io/gh/apache/arrow-rs/pull/1577/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy9zdWJzdHJpbmcucnM=) | `100.00% <100.00%> (+1.68%)` | :arrow_up: |
   | [parquet\_derive/src/parquet\_field.rs](https://codecov.io/gh/apache/arrow-rs/pull/1577/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldF9kZXJpdmUvc3JjL3BhcnF1ZXRfZmllbGQucnM=) | `65.98% <0.00%> (-0.23%)` | :arrow_down: |
   | [arrow/src/array/transform/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/1577/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9tb2QucnM=) | `86.46% <0.00%> (-0.12%)` | :arrow_down: |
   | [arrow/src/datatypes/datatype.rs](https://codecov.io/gh/apache/arrow-rs/pull/1577/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9kYXRhdHlwZS5ycw==) | `66.80% <0.00%> (+0.39%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1577?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1577?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [786792c...fce157b](https://codecov.io/gh/apache/arrow-rs/pull/1577?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] HaoYang670 commented on a diff in pull request #1577: Add utf-8 validation checking for `substring`

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on code in PR #1577:
URL: https://github.com/apache/arrow-rs/pull/1577#discussion_r851844449


##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -35,21 +36,39 @@ fn generic_substring<OffsetSize: StringOffsetSizeTrait>(
     let data = values.as_slice();
     let zero = OffsetSize::zero();
 
-    let cal_new_start: Box<dyn Fn(OffsetSize, OffsetSize) -> OffsetSize> = if start
-        >= zero
-    {
-        // count from the start of string
-        Box::new(|old_start: OffsetSize, end: OffsetSize| (old_start + start).min(end))
-    } else {
-        // count from the end of string
-        Box::new(|old_start: OffsetSize, end: OffsetSize| (end + start).max(old_start))
+    // check the `idx` is a valid char boundary or not
+    // If yes, return `idx`, else return error
+    let check_char_boundary = |idx: OffsetSize| {
+        let idx_usize = idx.to_usize().unwrap();
+        if idx_usize == data.len() || data[idx_usize] >> 6 != 0b10_u8 {
+            Ok(idx)
+        } else {
+            Err(ArrowError::ComputeError(format!("The index {} is invalid because {:#010b} is not the head byte of a char.", idx_usize, data[idx_usize])))

Review Comment:
   Done!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] HaoYang670 commented on a diff in pull request #1577: Add utf-8 validation checking for `substring`

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on code in PR #1577:
URL: https://github.com/apache/arrow-rs/pull/1577#discussion_r851894731


##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -35,21 +36,39 @@ fn generic_substring<OffsetSize: StringOffsetSizeTrait>(
     let data = values.as_slice();
     let zero = OffsetSize::zero();
 
-    let cal_new_start: Box<dyn Fn(OffsetSize, OffsetSize) -> OffsetSize> = if start
-        >= zero
-    {
-        // count from the start of string
-        Box::new(|old_start: OffsetSize, end: OffsetSize| (old_start + start).min(end))
-    } else {
-        // count from the end of string
-        Box::new(|old_start: OffsetSize, end: OffsetSize| (end + start).max(old_start))
+    // check the `idx` is a valid char boundary or not
+    // If yes, return `idx`, else return error
+    let check_char_boundary = |idx: OffsetSize| {
+        let idx_usize = idx.to_usize().unwrap();
+        if idx_usize == data.len() || data[idx_usize] >> 6 != 0b10_u8 {
+            Ok(idx)
+        } else {
+            Err(ArrowError::ComputeError(format!("The index {} is invalid because {:#010b} is not the head byte of a char.", idx_usize, data[idx_usize])))
+        }
     };
 
-    let cal_new_length: Box<dyn Fn(OffsetSize, OffsetSize) -> OffsetSize> =
-        if let Some(length) = length {
-            Box::new(|start: OffsetSize, end: OffsetSize| (*length).min(end - start))
-        } else {
-            Box::new(|start: OffsetSize, end: OffsetSize| end - start)
+    // function for calculating the start index of each string
+    let cal_new_start: Box<dyn Fn(OffsetSize, OffsetSize) -> Result<OffsetSize>> =
+        match start.cmp(&zero) {
+            // count from the start of string
+            // and check it is a valid char boundary
+            Ordering::Greater => Box::new(|old_start, end| {
+                check_char_boundary((old_start + start).min(end))
+            }),
+            Ordering::Equal => Box::new(|old_start, _| Ok(old_start)),
+            // count from the end of string
+            Ordering::Less => Box::new(|old_start, end| {
+                check_char_boundary((end + start).max(old_start))
+            }),
+        };
+
+    // function for calculating the end index of each string

Review Comment:
   Done!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] HaoYang670 commented on a diff in pull request #1577: Add utf-8 validation checking for `substring`

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on code in PR #1577:
URL: https://github.com/apache/arrow-rs/pull/1577#discussion_r851913108


##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -35,21 +36,39 @@ fn generic_substring<OffsetSize: StringOffsetSizeTrait>(
     let data = values.as_slice();
     let zero = OffsetSize::zero();
 
-    let cal_new_start: Box<dyn Fn(OffsetSize, OffsetSize) -> OffsetSize> = if start
-        >= zero
-    {
-        // count from the start of string
-        Box::new(|old_start: OffsetSize, end: OffsetSize| (old_start + start).min(end))
-    } else {
-        // count from the end of string
-        Box::new(|old_start: OffsetSize, end: OffsetSize| (end + start).max(old_start))
+    // check the `idx` is a valid char boundary or not
+    // If yes, return `idx`, else return error
+    let check_char_boundary = |idx: OffsetSize| {

Review Comment:
   Done!



##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -35,21 +36,39 @@ fn generic_substring<OffsetSize: StringOffsetSizeTrait>(
     let data = values.as_slice();
     let zero = OffsetSize::zero();
 
-    let cal_new_start: Box<dyn Fn(OffsetSize, OffsetSize) -> OffsetSize> = if start
-        >= zero
-    {
-        // count from the start of string
-        Box::new(|old_start: OffsetSize, end: OffsetSize| (old_start + start).min(end))
-    } else {
-        // count from the end of string
-        Box::new(|old_start: OffsetSize, end: OffsetSize| (end + start).max(old_start))
+    // check the `idx` is a valid char boundary or not
+    // If yes, return `idx`, else return error
+    let check_char_boundary = |idx: OffsetSize| {
+        let idx_usize = idx.to_usize().unwrap();
+        if idx_usize == data.len() || data[idx_usize] >> 6 != 0b10_u8 {

Review Comment:
   Done!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] HaoYang670 commented on a diff in pull request #1577: Add utf-8 validation checking for `substring`

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on code in PR #1577:
URL: https://github.com/apache/arrow-rs/pull/1577#discussion_r852514048


##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -59,13 +78,14 @@ fn generic_substring<OffsetSize: StringOffsetSizeTrait>(
     let mut len_so_far = zero;
     new_offsets.push(zero);
 
-    offsets.windows(2).for_each(|pair| {
-        let new_start = cal_new_start(pair[0], pair[1]);
-        let new_length = cal_new_length(new_start, pair[1]);
-        len_so_far += new_length;
-        new_starts_ends.push((new_start, new_start + new_length));
+    offsets.windows(2).try_for_each(|pair| -> Result<()> {

Review Comment:
   Done!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] HaoYang670 commented on pull request #1577: Add utf-8 validation checking for `substring`

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on PR #1577:
URL: https://github.com/apache/arrow-rs/pull/1577#issuecomment-1101894598

   @viirya @tustvold please review. Thank you very much!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1577: Add utf-8 validation checking for `substring`

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1577:
URL: https://github.com/apache/arrow-rs/pull/1577#discussion_r851737222


##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -59,13 +78,14 @@ fn generic_substring<OffsetSize: StringOffsetSizeTrait>(
     let mut len_so_far = zero;
     new_offsets.push(zero);
 
-    offsets.windows(2).for_each(|pair| {
-        let new_start = cal_new_start(pair[0], pair[1]);
-        let new_length = cal_new_length(new_start, pair[1]);
-        len_so_far += new_length;
-        new_starts_ends.push((new_start, new_start + new_length));
+    offsets.windows(2).try_for_each(|pair| -> Result<()> {

Review Comment:
   I'm curious if the dyn dispatch functions perform better than letting LLVM perform loop unswitching, I would have thought eliminating the dyn-dispatch would be both clearer and faster



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] HaoYang670 commented on a diff in pull request #1577: Add utf-8 validation checking for `substring`

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on code in PR #1577:
URL: https://github.com/apache/arrow-rs/pull/1577#discussion_r851839980


##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -308,4 +325,25 @@ mod tests {
     fn without_nulls_large_string() -> Result<()> {
         without_nulls::<LargeStringArray>()
     }
+
+    #[test]
+    fn check_invalid_array_type() {
+        let array = Int32Array::from(vec![Some(1), Some(2), Some(3)]);
+        assert_eq!(substring(&array, 0, None).is_err(), true);

Review Comment:
   Done!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] HaoYang670 commented on a diff in pull request #1577: Add utf-8 validation checking for `substring`

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on code in PR #1577:
URL: https://github.com/apache/arrow-rs/pull/1577#discussion_r851909480


##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -107,29 +127,26 @@ fn generic_substring<OffsetSize: StringOffsetSizeTrait>(
 ///
 /// Attention: Both `start` and `length` are counted by byte, not by char.
 ///
-/// # Warning
-///
-/// This function **might** return in invalid utf-8 format if the
-/// character length falls on a non-utf8 boundary, which we
-/// [hope to fix](https://github.com/apache/arrow-rs/issues/1531)
-/// in a future release.
-///
-/// ## Example of getting an invalid substring
+/// # Basic usage
 /// ```
-/// # // Doesn't pass due to  https://github.com/apache/arrow-rs/issues/1531
-/// # #[cfg(not(feature = "force_validate"))]
-/// # {
 /// # use arrow::array::StringArray;
 /// # use arrow::compute::kernels::substring::substring;
-/// let array = StringArray::from(vec![Some("E=mc²")]);
-/// let result = substring(&array, -1, None).unwrap();
+/// let array = StringArray::from(vec![Some("arrow"), None, Some("rust")]);
+/// let result = substring(&array, 1, Some(&4)).unwrap();
 /// let result = result.as_any().downcast_ref::<StringArray>().unwrap();
-/// assert_eq!(result.value(0).as_bytes(), &[0x00B2]); // invalid utf-8 format
-/// # }
+/// assert_eq!(result, &StringArray::from(vec![Some("rrow"), None, Some("ust")]));
 /// ```
 ///
 /// # Error
-/// this function errors when the passed array is not a \[Large\]String array.
+/// - The function errors when the passed array is not a \[Large\]String array.
+/// - The function errors when you try to create a substring in the middle of a multibyte character.

Review Comment:
   Done!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] HaoYang670 commented on a diff in pull request #1577: Add utf-8 validation checking for `substring`

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on code in PR #1577:
URL: https://github.com/apache/arrow-rs/pull/1577#discussion_r852073252


##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -59,13 +78,14 @@ fn generic_substring<OffsetSize: StringOffsetSizeTrait>(
     let mut len_so_far = zero;
     new_offsets.push(zero);
 
-    offsets.windows(2).for_each(|pair| {
-        let new_start = cal_new_start(pair[0], pair[1]);
-        let new_length = cal_new_length(new_start, pair[1]);
-        len_so_far += new_length;
-        new_starts_ends.push((new_start, new_start + new_length));
+    offsets.windows(2).try_for_each(|pair| -> Result<()> {

Review Comment:
   The reasons that I use `dyn Fn` are:
   1.  We should do the comparison for `start` and the destruction for `length` outside the for loop. And using closure is a convenient way, otherwise there would be lots of duplicate code.
   2. Closures have different sizes because they capture different environment.
   3. The dynamic dispatching doesn't impact the performance in this function because they are not hotspots.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1577: Add utf-8 validation checking for `substring`

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1577:
URL: https://github.com/apache/arrow-rs/pull/1577#discussion_r851734570


##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -35,21 +36,39 @@ fn generic_substring<OffsetSize: StringOffsetSizeTrait>(
     let data = values.as_slice();
     let zero = OffsetSize::zero();
 
-    let cal_new_start: Box<dyn Fn(OffsetSize, OffsetSize) -> OffsetSize> = if start
-        >= zero
-    {
-        // count from the start of string
-        Box::new(|old_start: OffsetSize, end: OffsetSize| (old_start + start).min(end))
-    } else {
-        // count from the end of string
-        Box::new(|old_start: OffsetSize, end: OffsetSize| (end + start).max(old_start))
+    // check the `idx` is a valid char boundary or not
+    // If yes, return `idx`, else return error
+    let check_char_boundary = |idx: OffsetSize| {

Review Comment:
   I wonder if we could instead interpret `data` as `&str` with `std::str::from_utf8_unchecked`, which should be valid if this is a valid string array, and then use the standard library [is_char_boundary](https://doc.rust-lang.org/std/primitive.str.html#method.is_char_boundary) instead of implementing our own



##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -59,13 +78,14 @@ fn generic_substring<OffsetSize: StringOffsetSizeTrait>(
     let mut len_so_far = zero;
     new_offsets.push(zero);
 
-    offsets.windows(2).for_each(|pair| {
-        let new_start = cal_new_start(pair[0], pair[1]);
-        let new_length = cal_new_length(new_start, pair[1]);
-        len_so_far += new_length;
-        new_starts_ends.push((new_start, new_start + new_length));
+    offsets.windows(2).try_for_each(|pair| -> Result<()> {

Review Comment:
   I'm curious if the dyn dispatch functions perform better than letting LLVM perform loop unswitching, I would have thought this would be both clearer and faster



##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -35,21 +36,39 @@ fn generic_substring<OffsetSize: StringOffsetSizeTrait>(
     let data = values.as_slice();
     let zero = OffsetSize::zero();
 
-    let cal_new_start: Box<dyn Fn(OffsetSize, OffsetSize) -> OffsetSize> = if start
-        >= zero
-    {
-        // count from the start of string
-        Box::new(|old_start: OffsetSize, end: OffsetSize| (old_start + start).min(end))
-    } else {
-        // count from the end of string
-        Box::new(|old_start: OffsetSize, end: OffsetSize| (end + start).max(old_start))
+    // check the `idx` is a valid char boundary or not
+    // If yes, return `idx`, else return error
+    let check_char_boundary = |idx: OffsetSize| {
+        let idx_usize = idx.to_usize().unwrap();
+        if idx_usize == data.len() || data[idx_usize] >> 6 != 0b10_u8 {

Review Comment:
   ```suggestion
           // A valid code-point iff it does not start with 0b10xxxxxx
           // Bit-magic taken from `std::str::is_char_boundary`
           if idx_usize == data.len() || (data[idx_usize] as i8) < -0x40 {
   ```
   
   There is a bit-magic trick to save an operator, although it is possible LLVM is smart enough to notice this. I also think it would be better to just use the standard library version directly if possible.



##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -35,21 +36,39 @@ fn generic_substring<OffsetSize: StringOffsetSizeTrait>(
     let data = values.as_slice();
     let zero = OffsetSize::zero();
 
-    let cal_new_start: Box<dyn Fn(OffsetSize, OffsetSize) -> OffsetSize> = if start
-        >= zero
-    {
-        // count from the start of string
-        Box::new(|old_start: OffsetSize, end: OffsetSize| (old_start + start).min(end))
-    } else {
-        // count from the end of string
-        Box::new(|old_start: OffsetSize, end: OffsetSize| (end + start).max(old_start))
+    // check the `idx` is a valid char boundary or not
+    // If yes, return `idx`, else return error
+    let check_char_boundary = |idx: OffsetSize| {
+        let idx_usize = idx.to_usize().unwrap();
+        if idx_usize == data.len() || data[idx_usize] >> 6 != 0b10_u8 {
+            Ok(idx)
+        } else {
+            Err(ArrowError::ComputeError(format!("The index {} is invalid because {:#010b} is not the head byte of a char.", idx_usize, data[idx_usize])))
+        }
     };
 
-    let cal_new_length: Box<dyn Fn(OffsetSize, OffsetSize) -> OffsetSize> =
-        if let Some(length) = length {
-            Box::new(|start: OffsetSize, end: OffsetSize| (*length).min(end - start))
-        } else {
-            Box::new(|start: OffsetSize, end: OffsetSize| end - start)
+    // function for calculating the start index of each string
+    let cal_new_start: Box<dyn Fn(OffsetSize, OffsetSize) -> Result<OffsetSize>> =
+        match start.cmp(&zero) {
+            // count from the start of string
+            // and check it is a valid char boundary
+            Ordering::Greater => Box::new(|old_start, end| {
+                check_char_boundary((old_start + start).min(end))
+            }),
+            Ordering::Equal => Box::new(|old_start, _| Ok(old_start)),

Review Comment:
   :ok_hand: 



##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -308,4 +325,25 @@ mod tests {
     fn without_nulls_large_string() -> Result<()> {
         without_nulls::<LargeStringArray>()
     }
+
+    #[test]
+    fn check_invalid_array_type() {
+        let array = Int32Array::from(vec![Some(1), Some(2), Some(3)]);
+        assert_eq!(substring(&array, 0, None).is_err(), true);

Review Comment:
   I think it would make these tests easier to read, and also potentially less fragile if they checked the error message, e.g. something like
   
   ```
   let err = substring(&array, 0, None).unwrap_err().to_string();
   assert!(err.contains("foo"), "{}", err);
   ```



##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -35,21 +36,39 @@ fn generic_substring<OffsetSize: StringOffsetSizeTrait>(
     let data = values.as_slice();
     let zero = OffsetSize::zero();
 
-    let cal_new_start: Box<dyn Fn(OffsetSize, OffsetSize) -> OffsetSize> = if start
-        >= zero
-    {
-        // count from the start of string
-        Box::new(|old_start: OffsetSize, end: OffsetSize| (old_start + start).min(end))
-    } else {
-        // count from the end of string
-        Box::new(|old_start: OffsetSize, end: OffsetSize| (end + start).max(old_start))
+    // check the `idx` is a valid char boundary or not
+    // If yes, return `idx`, else return error
+    let check_char_boundary = |idx: OffsetSize| {
+        let idx_usize = idx.to_usize().unwrap();
+        if idx_usize == data.len() || data[idx_usize] >> 6 != 0b10_u8 {
+            Ok(idx)
+        } else {
+            Err(ArrowError::ComputeError(format!("The index {} is invalid because {:#010b} is not the head byte of a char.", idx_usize, data[idx_usize])))
+        }
     };
 
-    let cal_new_length: Box<dyn Fn(OffsetSize, OffsetSize) -> OffsetSize> =
-        if let Some(length) = length {
-            Box::new(|start: OffsetSize, end: OffsetSize| (*length).min(end - start))
-        } else {
-            Box::new(|start: OffsetSize, end: OffsetSize| end - start)
+    // function for calculating the start index of each string
+    let cal_new_start: Box<dyn Fn(OffsetSize, OffsetSize) -> Result<OffsetSize>> =
+        match start.cmp(&zero) {
+            // count from the start of string
+            // and check it is a valid char boundary
+            Ordering::Greater => Box::new(|old_start, end| {
+                check_char_boundary((old_start + start).min(end))
+            }),
+            Ordering::Equal => Box::new(|old_start, _| Ok(old_start)),
+            // count from the end of string
+            Ordering::Less => Box::new(|old_start, end| {
+                check_char_boundary((end + start).max(old_start))
+            }),
+        };
+
+    // function for calculating the end index of each string

Review Comment:
   Some documentation of this functions inputs and output would be helpful I think



##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -35,21 +36,39 @@ fn generic_substring<OffsetSize: StringOffsetSizeTrait>(
     let data = values.as_slice();
     let zero = OffsetSize::zero();
 
-    let cal_new_start: Box<dyn Fn(OffsetSize, OffsetSize) -> OffsetSize> = if start
-        >= zero
-    {
-        // count from the start of string
-        Box::new(|old_start: OffsetSize, end: OffsetSize| (old_start + start).min(end))
-    } else {
-        // count from the end of string
-        Box::new(|old_start: OffsetSize, end: OffsetSize| (end + start).max(old_start))
+    // check the `idx` is a valid char boundary or not
+    // If yes, return `idx`, else return error
+    let check_char_boundary = |idx: OffsetSize| {
+        let idx_usize = idx.to_usize().unwrap();
+        if idx_usize == data.len() || data[idx_usize] >> 6 != 0b10_u8 {
+            Ok(idx)
+        } else {
+            Err(ArrowError::ComputeError(format!("The index {} is invalid because {:#010b} is not the head byte of a char.", idx_usize, data[idx_usize])))
+        }
     };
 
-    let cal_new_length: Box<dyn Fn(OffsetSize, OffsetSize) -> OffsetSize> =
-        if let Some(length) = length {
-            Box::new(|start: OffsetSize, end: OffsetSize| (*length).min(end - start))
-        } else {
-            Box::new(|start: OffsetSize, end: OffsetSize| end - start)
+    // function for calculating the start index of each string

Review Comment:
   ```suggestion
       // Function for calculating the start index of each string
       // 
       // Takes the start and end offsets of the original string, and returns the byte offset
       // to start copying data from the source values array
       //
       // Returns an error if this offset is not at a valid UTF-8 char boundary
   ```
   
   Or something to that effect, it wasn't clear to me whether this was the offset in the new array or the old



##########
arrow/benches/string_kernels.rs:
##########
@@ -39,10 +39,10 @@ fn add_benchmark(c: &mut Criterion) {
     let str_len = 1000;
 
     let arr_string = create_string_array_with_len::<i32>(size, 0.0, str_len);
-    let start = 0;
+    let start = 1;

Review Comment:
   Perhaps we could benchmark the various permutations of no start offset and no length?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on a diff in pull request #1577: Add utf-8 validation checking for `substring`

Posted by GitBox <gi...@apache.org>.
viirya commented on code in PR #1577:
URL: https://github.com/apache/arrow-rs/pull/1577#discussion_r851798579


##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -107,29 +127,26 @@ fn generic_substring<OffsetSize: StringOffsetSizeTrait>(
 ///
 /// Attention: Both `start` and `length` are counted by byte, not by char.
 ///
-/// # Warning
-///
-/// This function **might** return in invalid utf-8 format if the
-/// character length falls on a non-utf8 boundary, which we
-/// [hope to fix](https://github.com/apache/arrow-rs/issues/1531)
-/// in a future release.
-///
-/// ## Example of getting an invalid substring
+/// # Basic usage
 /// ```
-/// # // Doesn't pass due to  https://github.com/apache/arrow-rs/issues/1531
-/// # #[cfg(not(feature = "force_validate"))]
-/// # {
 /// # use arrow::array::StringArray;
 /// # use arrow::compute::kernels::substring::substring;
-/// let array = StringArray::from(vec![Some("E=mc²")]);
-/// let result = substring(&array, -1, None).unwrap();
+/// let array = StringArray::from(vec![Some("arrow"), None, Some("rust")]);
+/// let result = substring(&array, 1, Some(&4)).unwrap();
 /// let result = result.as_any().downcast_ref::<StringArray>().unwrap();
-/// assert_eq!(result.value(0).as_bytes(), &[0x00B2]); // invalid utf-8 format
-/// # }
+/// assert_eq!(result, &StringArray::from(vec![Some("rrow"), None, Some("ust")]));
 /// ```
 ///
 /// # Error
-/// this function errors when the passed array is not a \[Large\]String array.
+/// - The function errors when the passed array is not a \[Large\]String array.
+/// - The function errors when you try to create a substring in the middle of a multibyte character.

Review Comment:
   Hm, multibyte character sounds a bit vague, maybe just say invalid utf-8 format?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org