You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/06/04 09:15:18 UTC

[GitHub] [arrow-rs] HaoYang670 opened a new pull request, #1784: Add `Substring_by_char`

HaoYang670 opened a new pull request, #1784:
URL: https://github.com/apache/arrow-rs/pull/1784

   # Which issue does this PR close?
   
   <!---
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #1768.
   
   # Rationale for this change
   Support substring by char
   
   # What changes are included in this PR?
   1. new pub function `substring_by_char`
   2. tests
   3. benchmark
   4. update docs
   
   # Are there any user-facing changes?
   Yes.
   
   <!---
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!---
   If there are any breaking changes to public APIs, please add the `breaking change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] HaoYang670 commented on a diff in pull request #1784: Add `Substring_by_char`

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on code in PR #1784:
URL: https://github.com/apache/arrow-rs/pull/1784#discussion_r890102648


##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -1083,6 +1135,164 @@ mod tests {
         generic_string_with_non_zero_offset::<i64>()
     }
 
+    fn with_nulls_generic_string_by_char<O: OffsetSizeTrait>() -> Result<()> {
+        let input_vals = vec![Some("hello"), None, Some("Γ ⊢x:T")];
+        let cases = vec![
+            // all-nulls array is always identical
+            (vec![None, None, None], 0, None, vec![None, None, None]),
+            // identity
+            (
+                input_vals.clone(),
+                0,
+                None,
+                vec![Some("hello"), None, Some("Γ ⊢x:T")],
+            ),
+            // 0 length -> Nothing
+            (
+                input_vals.clone(),
+                0,
+                Some(0),
+                vec![Some(""), None, Some("")],
+            ),
+            // high start -> Nothing
+            (
+                input_vals.clone(),
+                1000,
+                Some(0),
+                vec![Some(""), None, Some("")],
+            ),
+            // high negative start -> identity
+            (
+                input_vals.clone(),
+                -1000,
+                None,
+                vec![Some("hello"), None, Some("Γ ⊢x:T")],
+            ),
+            // high length -> identity
+            (
+                input_vals.clone(),
+                0,
+                Some(1000),
+                vec![Some("hello"), None, Some("Γ ⊢x:T")],
+            ),
+        ];
+
+        cases.into_iter().try_for_each::<_, Result<()>>(
+            |(array, start, length, expected)| {
+                let array = GenericStringArray::<O>::from(array);
+                let result = substring_by_char(&array, start, length)?;
+                assert_eq!(array.len(), result.len());
+
+                let expected = GenericStringArray::<O>::from(expected);
+                assert_eq!(expected, result);
+                Ok(())
+            },
+        )?;
+
+        Ok(())
+    }
+
+    #[test]
+    fn with_nulls_string_by_char() -> Result<()> {
+        with_nulls_generic_string_by_char::<i32>()
+    }
+
+    #[test]
+    fn with_nulls_large_string_by_char() -> Result<()> {
+        with_nulls_generic_string_by_char::<i64>()
+    }
+
+    fn without_nulls_generic_string_by_char<O: OffsetSizeTrait>() -> Result<()> {
+        let input_vals = vec!["hello", "", "Γ ⊢x:T"];
+        let cases = vec![
+            // empty array is always identical
+            (vec!["", "", ""], 0, None, vec!["", "", ""]),
+            // increase start
+            (input_vals.clone(), 0, None, vec!["hello", "", "Γ ⊢x:T"]),
+            (input_vals.clone(), 1, None, vec!["ello", "", " ⊢x:T"]),
+            (input_vals.clone(), 2, None, vec!["llo", "", "⊢x:T"]),
+            (input_vals.clone(), 3, None, vec!["lo", "", "x:T"]),
+            (input_vals.clone(), 10, None, vec!["", "", ""]),
+            // increase start negatively
+            (input_vals.clone(), -1, None, vec!["o", "", "T"]),
+            (input_vals.clone(), -2, None, vec!["lo", "", ":T"]),
+            (input_vals.clone(), -4, None, vec!["ello", "", "⊢x:T"]),
+            (input_vals.clone(), -10, None, vec!["hello", "", "Γ ⊢x:T"]),
+            // increase length
+            (input_vals.clone(), 1, Some(1), vec!["e", "", " "]),
+            (input_vals.clone(), 1, Some(2), vec!["el", "", " ⊢"]),
+            (input_vals.clone(), 1, Some(3), vec!["ell", "", " ⊢x"]),
+            (input_vals.clone(), 1, Some(6), vec!["ello", "", " ⊢x:T"]),
+            (input_vals.clone(), -4, Some(1), vec!["e", "", "⊢"]),
+            (input_vals.clone(), -4, Some(2), vec!["el", "", "⊢x"]),
+            (input_vals.clone(), -4, Some(3), vec!["ell", "", "⊢x:"]),
+            (input_vals.clone(), -4, Some(4), vec!["ello", "", "⊢x:T"]),
+        ];
+
+        cases.into_iter().try_for_each::<_, Result<()>>(

Review Comment:
   Tracked by #1801



##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -1083,6 +1135,164 @@ mod tests {
         generic_string_with_non_zero_offset::<i64>()
     }
 
+    fn with_nulls_generic_string_by_char<O: OffsetSizeTrait>() -> Result<()> {
+        let input_vals = vec![Some("hello"), None, Some("Γ ⊢x:T")];
+        let cases = vec![
+            // all-nulls array is always identical
+            (vec![None, None, None], 0, None, vec![None, None, None]),
+            // identity
+            (
+                input_vals.clone(),
+                0,
+                None,
+                vec![Some("hello"), None, Some("Γ ⊢x:T")],
+            ),
+            // 0 length -> Nothing
+            (
+                input_vals.clone(),
+                0,
+                Some(0),
+                vec![Some(""), None, Some("")],
+            ),
+            // high start -> Nothing
+            (
+                input_vals.clone(),
+                1000,
+                Some(0),
+                vec![Some(""), None, Some("")],
+            ),
+            // high negative start -> identity
+            (
+                input_vals.clone(),
+                -1000,
+                None,
+                vec![Some("hello"), None, Some("Γ ⊢x:T")],
+            ),
+            // high length -> identity
+            (
+                input_vals.clone(),
+                0,
+                Some(1000),
+                vec![Some("hello"), None, Some("Γ ⊢x:T")],
+            ),
+        ];
+
+        cases.into_iter().try_for_each::<_, Result<()>>(
+            |(array, start, length, expected)| {
+                let array = GenericStringArray::<O>::from(array);
+                let result = substring_by_char(&array, start, length)?;
+                assert_eq!(array.len(), result.len());
+
+                let expected = GenericStringArray::<O>::from(expected);
+                assert_eq!(expected, result);
+                Ok(())
+            },
+        )?;
+
+        Ok(())
+    }
+
+    #[test]
+    fn with_nulls_string_by_char() -> Result<()> {
+        with_nulls_generic_string_by_char::<i32>()
+    }
+
+    #[test]
+    fn with_nulls_large_string_by_char() -> Result<()> {
+        with_nulls_generic_string_by_char::<i64>()
+    }
+
+    fn without_nulls_generic_string_by_char<O: OffsetSizeTrait>() -> Result<()> {
+        let input_vals = vec!["hello", "", "Γ ⊢x:T"];
+        let cases = vec![
+            // empty array is always identical
+            (vec!["", "", ""], 0, None, vec!["", "", ""]),
+            // increase start
+            (input_vals.clone(), 0, None, vec!["hello", "", "Γ ⊢x:T"]),
+            (input_vals.clone(), 1, None, vec!["ello", "", " ⊢x:T"]),
+            (input_vals.clone(), 2, None, vec!["llo", "", "⊢x:T"]),
+            (input_vals.clone(), 3, None, vec!["lo", "", "x:T"]),
+            (input_vals.clone(), 10, None, vec!["", "", ""]),
+            // increase start negatively
+            (input_vals.clone(), -1, None, vec!["o", "", "T"]),
+            (input_vals.clone(), -2, None, vec!["lo", "", ":T"]),
+            (input_vals.clone(), -4, None, vec!["ello", "", "⊢x:T"]),
+            (input_vals.clone(), -10, None, vec!["hello", "", "Γ ⊢x:T"]),
+            // increase length
+            (input_vals.clone(), 1, Some(1), vec!["e", "", " "]),
+            (input_vals.clone(), 1, Some(2), vec!["el", "", " ⊢"]),
+            (input_vals.clone(), 1, Some(3), vec!["ell", "", " ⊢x"]),
+            (input_vals.clone(), 1, Some(6), vec!["ello", "", " ⊢x:T"]),
+            (input_vals.clone(), -4, Some(1), vec!["e", "", "⊢"]),
+            (input_vals.clone(), -4, Some(2), vec!["el", "", "⊢x"]),
+            (input_vals.clone(), -4, Some(3), vec!["ell", "", "⊢x:"]),
+            (input_vals.clone(), -4, Some(4), vec!["ello", "", "⊢x:T"]),
+        ];
+
+        cases.into_iter().try_for_each::<_, Result<()>>(
+            |(array, start, length, expected)| {
+                let array = GenericStringArray::<O>::from(array);
+                let result = substring_by_char(&array, start, length)?;
+                assert_eq!(array.len(), result.len());
+                let expected = GenericStringArray::<O>::from(expected);
+                assert_eq!(expected, result);
+                Ok(())
+            },
+        )?;
+
+        Ok(())
+    }
+
+    #[test]
+    fn without_nulls_string_by_char() -> Result<()> {
+        without_nulls_generic_string_by_char::<i32>()
+    }
+
+    #[test]
+    fn without_nulls_large_string_by_char() -> Result<()> {
+        without_nulls_generic_string_by_char::<i64>()
+    }
+
+    fn generic_string_by_char_with_non_zero_offset<O: OffsetSizeTrait>() -> Result<()> {
+        let values = "S→T = Πx:S.T";

Review Comment:
   Tracked by #1801



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] HaoYang670 commented on a diff in pull request #1784: Add `Substring_by_char`

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on code in PR #1784:
URL: https://github.com/apache/arrow-rs/pull/1784#discussion_r890100818


##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -150,6 +152,56 @@ pub fn substring(array: &dyn Array, start: i64, length: Option<u64>) -> Result<A
     }
 }
 
+/// # Arguments
+/// * `array` - The input string array
+///
+/// * `start` - The start index of all substrings.
+/// If `start >= 0`, then count from the start of the string,
+/// otherwise count from the end of the string.
+///
+/// * `length`(option) - The length of all substrings.
+/// If `length` is `None`, then the substring is from `start` to the end of the string.
+///
+/// Attention: Both `start` and `length` are counted by char.
+///
+/// # Performance
+/// This function is slower than [substring].
+/// Theoretically, the time complexity is `O(n)` where `n` is the length of the value buffer.
+/// It is recommended to use [substring] if the input array only contains ASCII chars.
+///
+/// # Basic usage
+/// ```
+/// # use arrow::array::StringArray;
+/// # use arrow::compute::kernels::substring::substring_by_char;
+/// let array = StringArray::from(vec![Some("arrow"), None, Some("Γ ⊢x:T")]);
+/// let result = substring_by_char(&array, 1, Some(4)).unwrap();
+/// assert_eq!(result, StringArray::from(vec![Some("rrow"), None, Some(" ⊢x:")]));
+/// ```
+pub fn substring_by_char<OffsetSize: OffsetSizeTrait>(
+    array: &GenericStringArray<OffsetSize>,
+    start: i64,
+    length: Option<u64>,
+) -> Result<GenericStringArray<OffsetSize>> {
+    Ok(array
+        .iter()
+        .map(|val| {
+            val.map(|val| {
+                let char_count = val.chars().count();
+                let start = if start >= 0 {
+                    start.to_usize().unwrap().min(char_count)
+                } else {
+                    char_count - (-start).to_usize().unwrap().min(char_count)
+                };
+                let length = length.map_or(char_count - start, |length| {
+                    length.to_usize().unwrap().min(char_count - start)
+                });
+
+                val.chars().skip(start).take(length).collect::<String>()

Review Comment:
   Tracked by #1800



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold merged pull request #1784: Add `Substring_by_char`

Posted by GitBox <gi...@apache.org>.
tustvold merged PR #1784:
URL: https://github.com/apache/arrow-rs/pull/1784


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] codecov-commenter commented on pull request #1784: Add `Substring_by_char`

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on PR #1784:
URL: https://github.com/apache/arrow-rs/pull/1784#issuecomment-1146577763

   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1784?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#1784](https://codecov.io/gh/apache/arrow-rs/pull/1784?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (f91ff1b) into [master](https://codecov.io/gh/apache/arrow-rs/commit/1a64677b7e368b1fed586d4f2d590eaefc5a7e1e?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (1a64677) will **increase** coverage by `0.02%`.
   > The diff coverage is `100.00%`.
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #1784      +/-   ##
   ==========================================
   + Coverage   83.39%   83.42%   +0.02%     
   ==========================================
     Files         198      198              
     Lines       56142    56269     +127     
   ==========================================
   + Hits        46822    46940     +118     
   - Misses       9320     9329       +9     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/1784?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [arrow/src/compute/kernels/substring.rs](https://codecov.io/gh/apache/arrow-rs/pull/1784/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy9zdWJzdHJpbmcucnM=) | `99.51% <100.00%> (+0.09%)` | :arrow_up: |
   | [parquet/src/util/cursor.rs](https://codecov.io/gh/apache/arrow-rs/pull/1784/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvdXRpbC9jdXJzb3IucnM=) | `62.18% <0.00%> (-1.69%)` | :arrow_down: |
   | [parquet/src/file/serialized\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/1784/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZmlsZS9zZXJpYWxpemVkX3JlYWRlci5ycw==) | `94.46% <0.00%> (-1.17%)` | :arrow_down: |
   | [parquet\_derive/src/parquet\_field.rs](https://codecov.io/gh/apache/arrow-rs/pull/1784/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldF9kZXJpdmUvc3JjL3BhcnF1ZXRfZmllbGQucnM=) | `65.53% <0.00%> (-0.69%)` | :arrow_down: |
   | [arrow/src/datatypes/datatype.rs](https://codecov.io/gh/apache/arrow-rs/pull/1784/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9kYXRhdHlwZS5ycw==) | `65.42% <0.00%> (-0.38%)` | :arrow_down: |
   | [parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow-rs/pull/1784/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZW5jb2RpbmdzL2VuY29kaW5nLnJz) | `93.46% <0.00%> (-0.20%)` | :arrow_down: |
   | [parquet/src/file/writer.rs](https://codecov.io/gh/apache/arrow-rs/pull/1784/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZmlsZS93cml0ZXIucnM=) | `92.84% <0.00%> (-0.02%)` | :arrow_down: |
   | [parquet/src/arrow/arrow\_writer.rs](https://codecov.io/gh/apache/arrow-rs/pull/1784/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyb3dfd3JpdGVyLnJz) | `97.76% <0.00%> (-0.02%)` | :arrow_down: |
   | [parquet/src/arrow/schema.rs](https://codecov.io/gh/apache/arrow-rs/pull/1784/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvc2NoZW1hLnJz) | `96.81% <0.00%> (-0.01%)` | :arrow_down: |
   | [arrow/src/ffi.rs](https://codecov.io/gh/apache/arrow-rs/pull/1784/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2ZmaS5ycw==) | `86.97% <0.00%> (ø)` | |
   | ... and [10 more](https://codecov.io/gh/apache/arrow-rs/pull/1784/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1784?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1784?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [1a64677...f91ff1b](https://codecov.io/gh/apache/arrow-rs/pull/1784?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1784: Add `Substring_by_char`

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1784:
URL: https://github.com/apache/arrow-rs/pull/1784#discussion_r890006811


##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -29,7 +30,7 @@ use crate::{
 use std::cmp::Ordering;
 use std::sync::Arc;
 
-/// Returns an ArrayRef with substrings of all the elements in `array`.
+/// Returns an [ArrayRef] with substrings of all the elements in `array`.

Review Comment:
   ```suggestion
   /// Returns an [`ArrayRef`] with substrings of all the elements in `array`.
   ```



##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -53,9 +54,10 @@ use std::sync::Arc;
 /// ```
 ///
 /// # Error
-/// - The function errors when the passed array is not a \[Large\]String array, \[Large\]Binary
-///   array, or DictionaryArray with \[Large\]String or \[Large\]Binary as its value type.
+/// - The function errors when the passed array is not a [GenericStringArray], [GenericBinaryArray], [FixedSizeBinaryArray]
+///   or [DictionaryArray] with supported array type as its value type.

Review Comment:
   ```suggestion
   ///   or [`DictionaryArray`] with supported array type as its value type.
   ```



##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -53,9 +54,10 @@ use std::sync::Arc;
 /// ```
 ///
 /// # Error
-/// - The function errors when the passed array is not a \[Large\]String array, \[Large\]Binary
-///   array, or DictionaryArray with \[Large\]String or \[Large\]Binary as its value type.
+/// - The function errors when the passed array is not a [GenericStringArray], [GenericBinaryArray], [FixedSizeBinaryArray]

Review Comment:
   ```suggestion
   /// - The function errors when the passed array is not a [`GenericStringArray`], [`GenericBinaryArray`], [`FixedSizeBinaryArray`]
   ```



##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -1083,6 +1135,164 @@ mod tests {
         generic_string_with_non_zero_offset::<i64>()
     }
 
+    fn with_nulls_generic_string_by_char<O: OffsetSizeTrait>() -> Result<()> {
+        let input_vals = vec![Some("hello"), None, Some("Γ ⊢x:T")];
+        let cases = vec![
+            // all-nulls array is always identical
+            (vec![None, None, None], 0, None, vec![None, None, None]),
+            // identity
+            (
+                input_vals.clone(),
+                0,
+                None,
+                vec![Some("hello"), None, Some("Γ ⊢x:T")],
+            ),
+            // 0 length -> Nothing
+            (
+                input_vals.clone(),
+                0,
+                Some(0),
+                vec![Some(""), None, Some("")],
+            ),
+            // high start -> Nothing
+            (
+                input_vals.clone(),
+                1000,
+                Some(0),
+                vec![Some(""), None, Some("")],
+            ),
+            // high negative start -> identity
+            (
+                input_vals.clone(),
+                -1000,
+                None,
+                vec![Some("hello"), None, Some("Γ ⊢x:T")],
+            ),
+            // high length -> identity
+            (
+                input_vals.clone(),
+                0,
+                Some(1000),
+                vec![Some("hello"), None, Some("Γ ⊢x:T")],
+            ),
+        ];
+
+        cases.into_iter().try_for_each::<_, Result<()>>(
+            |(array, start, length, expected)| {
+                let array = GenericStringArray::<O>::from(array);
+                let result = substring_by_char(&array, start, length)?;
+                assert_eq!(array.len(), result.len());
+
+                let expected = GenericStringArray::<O>::from(expected);
+                assert_eq!(expected, result);
+                Ok(())
+            },
+        )?;
+
+        Ok(())
+    }
+
+    #[test]
+    fn with_nulls_string_by_char() -> Result<()> {
+        with_nulls_generic_string_by_char::<i32>()
+    }
+
+    #[test]
+    fn with_nulls_large_string_by_char() -> Result<()> {
+        with_nulls_generic_string_by_char::<i64>()
+    }
+
+    fn without_nulls_generic_string_by_char<O: OffsetSizeTrait>() -> Result<()> {
+        let input_vals = vec!["hello", "", "Γ ⊢x:T"];
+        let cases = vec![
+            // empty array is always identical
+            (vec!["", "", ""], 0, None, vec!["", "", ""]),
+            // increase start
+            (input_vals.clone(), 0, None, vec!["hello", "", "Γ ⊢x:T"]),
+            (input_vals.clone(), 1, None, vec!["ello", "", " ⊢x:T"]),
+            (input_vals.clone(), 2, None, vec!["llo", "", "⊢x:T"]),
+            (input_vals.clone(), 3, None, vec!["lo", "", "x:T"]),
+            (input_vals.clone(), 10, None, vec!["", "", ""]),
+            // increase start negatively
+            (input_vals.clone(), -1, None, vec!["o", "", "T"]),
+            (input_vals.clone(), -2, None, vec!["lo", "", ":T"]),
+            (input_vals.clone(), -4, None, vec!["ello", "", "⊢x:T"]),
+            (input_vals.clone(), -10, None, vec!["hello", "", "Γ ⊢x:T"]),
+            // increase length
+            (input_vals.clone(), 1, Some(1), vec!["e", "", " "]),
+            (input_vals.clone(), 1, Some(2), vec!["el", "", " ⊢"]),
+            (input_vals.clone(), 1, Some(3), vec!["ell", "", " ⊢x"]),
+            (input_vals.clone(), 1, Some(6), vec!["ello", "", " ⊢x:T"]),
+            (input_vals.clone(), -4, Some(1), vec!["e", "", "⊢"]),
+            (input_vals.clone(), -4, Some(2), vec!["el", "", "⊢x"]),
+            (input_vals.clone(), -4, Some(3), vec!["ell", "", "⊢x:"]),
+            (input_vals.clone(), -4, Some(4), vec!["ello", "", "⊢x:T"]),
+        ];
+
+        cases.into_iter().try_for_each::<_, Result<()>>(

Review Comment:
   Perhaps we could extract this as a function instead of duplicating it?



##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -53,9 +54,10 @@ use std::sync::Arc;
 /// ```
 ///
 /// # Error
-/// - The function errors when the passed array is not a \[Large\]String array, \[Large\]Binary
-///   array, or DictionaryArray with \[Large\]String or \[Large\]Binary as its value type.
+/// - The function errors when the passed array is not a [GenericStringArray], [GenericBinaryArray], [FixedSizeBinaryArray]
+///   or [DictionaryArray] with supported array type as its value type.
 /// - The function errors if the offset of a substring in the input array is at invalid char boundary (only for \[Large\]String array).
+/// It is recommended to use [substring_by_char] if the input array contains non-ASCII chars.

Review Comment:
   ```suggestion
   /// It is recommended to use [`substring_by_char`] if the input array may contain non-ASCII chars.
   ```



##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -1083,6 +1135,164 @@ mod tests {
         generic_string_with_non_zero_offset::<i64>()
     }
 
+    fn with_nulls_generic_string_by_char<O: OffsetSizeTrait>() -> Result<()> {
+        let input_vals = vec![Some("hello"), None, Some("Γ ⊢x:T")];
+        let cases = vec![
+            // all-nulls array is always identical
+            (vec![None, None, None], 0, None, vec![None, None, None]),
+            // identity
+            (
+                input_vals.clone(),
+                0,
+                None,
+                vec![Some("hello"), None, Some("Γ ⊢x:T")],
+            ),
+            // 0 length -> Nothing
+            (
+                input_vals.clone(),
+                0,
+                Some(0),
+                vec![Some(""), None, Some("")],
+            ),
+            // high start -> Nothing
+            (
+                input_vals.clone(),
+                1000,
+                Some(0),
+                vec![Some(""), None, Some("")],
+            ),
+            // high negative start -> identity
+            (
+                input_vals.clone(),
+                -1000,
+                None,
+                vec![Some("hello"), None, Some("Γ ⊢x:T")],
+            ),
+            // high length -> identity
+            (
+                input_vals.clone(),
+                0,
+                Some(1000),
+                vec![Some("hello"), None, Some("Γ ⊢x:T")],
+            ),
+        ];
+
+        cases.into_iter().try_for_each::<_, Result<()>>(
+            |(array, start, length, expected)| {
+                let array = GenericStringArray::<O>::from(array);
+                let result = substring_by_char(&array, start, length)?;
+                assert_eq!(array.len(), result.len());
+
+                let expected = GenericStringArray::<O>::from(expected);
+                assert_eq!(expected, result);
+                Ok(())
+            },
+        )?;
+
+        Ok(())
+    }
+
+    #[test]
+    fn with_nulls_string_by_char() -> Result<()> {
+        with_nulls_generic_string_by_char::<i32>()
+    }
+
+    #[test]
+    fn with_nulls_large_string_by_char() -> Result<()> {
+        with_nulls_generic_string_by_char::<i64>()
+    }
+
+    fn without_nulls_generic_string_by_char<O: OffsetSizeTrait>() -> Result<()> {
+        let input_vals = vec!["hello", "", "Γ ⊢x:T"];
+        let cases = vec![
+            // empty array is always identical
+            (vec!["", "", ""], 0, None, vec!["", "", ""]),
+            // increase start
+            (input_vals.clone(), 0, None, vec!["hello", "", "Γ ⊢x:T"]),
+            (input_vals.clone(), 1, None, vec!["ello", "", " ⊢x:T"]),
+            (input_vals.clone(), 2, None, vec!["llo", "", "⊢x:T"]),
+            (input_vals.clone(), 3, None, vec!["lo", "", "x:T"]),
+            (input_vals.clone(), 10, None, vec!["", "", ""]),
+            // increase start negatively
+            (input_vals.clone(), -1, None, vec!["o", "", "T"]),
+            (input_vals.clone(), -2, None, vec!["lo", "", ":T"]),
+            (input_vals.clone(), -4, None, vec!["ello", "", "⊢x:T"]),
+            (input_vals.clone(), -10, None, vec!["hello", "", "Γ ⊢x:T"]),
+            // increase length
+            (input_vals.clone(), 1, Some(1), vec!["e", "", " "]),
+            (input_vals.clone(), 1, Some(2), vec!["el", "", " ⊢"]),
+            (input_vals.clone(), 1, Some(3), vec!["ell", "", " ⊢x"]),
+            (input_vals.clone(), 1, Some(6), vec!["ello", "", " ⊢x:T"]),
+            (input_vals.clone(), -4, Some(1), vec!["e", "", "⊢"]),
+            (input_vals.clone(), -4, Some(2), vec!["el", "", "⊢x"]),
+            (input_vals.clone(), -4, Some(3), vec!["ell", "", "⊢x:"]),
+            (input_vals.clone(), -4, Some(4), vec!["ello", "", "⊢x:T"]),
+        ];
+
+        cases.into_iter().try_for_each::<_, Result<()>>(
+            |(array, start, length, expected)| {
+                let array = GenericStringArray::<O>::from(array);
+                let result = substring_by_char(&array, start, length)?;
+                assert_eq!(array.len(), result.len());
+                let expected = GenericStringArray::<O>::from(expected);
+                assert_eq!(expected, result);
+                Ok(())
+            },
+        )?;
+
+        Ok(())
+    }
+
+    #[test]
+    fn without_nulls_string_by_char() -> Result<()> {
+        without_nulls_generic_string_by_char::<i32>()
+    }
+
+    #[test]
+    fn without_nulls_large_string_by_char() -> Result<()> {
+        without_nulls_generic_string_by_char::<i64>()
+    }
+
+    fn generic_string_by_char_with_non_zero_offset<O: OffsetSizeTrait>() -> Result<()> {
+        let values = "S→T = Πx:S.T";

Review Comment:
   I think it would be easier to follow to just construct the regular array, and then call `array.slice(1, 2)`



##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -1083,6 +1135,164 @@ mod tests {
         generic_string_with_non_zero_offset::<i64>()
     }
 
+    fn with_nulls_generic_string_by_char<O: OffsetSizeTrait>() -> Result<()> {

Review Comment:
   Loving the test coverage :+1:



##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -150,6 +152,56 @@ pub fn substring(array: &dyn Array, start: i64, length: Option<u64>) -> Result<A
     }
 }
 
+/// # Arguments
+/// * `array` - The input string array
+///
+/// * `start` - The start index of all substrings.
+/// If `start >= 0`, then count from the start of the string,
+/// otherwise count from the end of the string.
+///
+/// * `length`(option) - The length of all substrings.
+/// If `length` is `None`, then the substring is from `start` to the end of the string.
+///
+/// Attention: Both `start` and `length` are counted by char.
+///
+/// # Performance
+/// This function is slower than [substring].
+/// Theoretically, the time complexity is `O(n)` where `n` is the length of the value buffer.
+/// It is recommended to use [substring] if the input array only contains ASCII chars.
+///
+/// # Basic usage
+/// ```
+/// # use arrow::array::StringArray;
+/// # use arrow::compute::kernels::substring::substring_by_char;
+/// let array = StringArray::from(vec![Some("arrow"), None, Some("Γ ⊢x:T")]);
+/// let result = substring_by_char(&array, 1, Some(4)).unwrap();
+/// assert_eq!(result, StringArray::from(vec![Some("rrow"), None, Some(" ⊢x:")]));
+/// ```
+pub fn substring_by_char<OffsetSize: OffsetSizeTrait>(
+    array: &GenericStringArray<OffsetSize>,
+    start: i64,
+    length: Option<u64>,
+) -> Result<GenericStringArray<OffsetSize>> {
+    Ok(array
+        .iter()
+        .map(|val| {
+            val.map(|val| {
+                let char_count = val.chars().count();
+                let start = if start >= 0 {
+                    start.to_usize().unwrap().min(char_count)
+                } else {
+                    char_count - (-start).to_usize().unwrap().min(char_count)
+                };
+                let length = length.map_or(char_count - start, |length| {
+                    length.to_usize().unwrap().min(char_count - start)
+                });
+
+                val.chars().skip(start).take(length).collect::<String>()

Review Comment:
   Allocating a string temporary, only to copy out of it, is likely a significant portion of the slow-down. That combined with the null handling.
   
   This could definitely be handled as a separate PR, but you might want to consider doing something like (not tested).
   
   ```
   let nulls = // align bitmap to 0 offset, copying if already aligned
   let mut vals = BufferBuilder::<u8>::new(array.value_data().len());
   let mut indexes = BufferBuilder::<OffsetSize>::new(array.len() + 1);
   indexes.append(0);
   
   for val in array.iter() {
     let char_count = val.chars().count();
     let start = if start >= 0 {
         start.to_usize().unwrap().min(char_count)
     } else {
         char_count - (-start).to_usize().unwrap().min(char_count)
     };
     let length = length.map_or(char_count - start, |length| {
         length.to_usize().unwrap().min(char_count - start)
     });
   
     let mut start_byte = 0;
     let mut end_byte = val.len();
     for ((idx, (byte_idx, _)) in val.char_indices().enumerate() {
       if idx == start {
         start_byte = byte_idx;
       } else if idx == start + length {
         end_byte = byte_idx;
         break
       }
     }
     // Could even be unchecked
     vals.append_slice(&val[start_byte..end_byte]);
     indexes.append(vals.len() as _);
   }
   
   let data = ArrayDataBuilder::new(array.data_type()).len(array.len()).add_buffer(vals.finish())
     .add_buffer(indexes.finish()).add_buffer(vals.finish());
   
   Ok(GenericStringArray::<OffsetSize>::from(unsafe {data.build_unchecked()}))
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org