You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/06 23:30:52 UTC

[GitHub] [arrow-rs] sunchao opened a new pull request, #1665: Add dictionary array support for substring function

sunchao opened a new pull request, #1665:
URL: https://github.com/apache/arrow-rs/pull/1665

   # Which issue does this PR close?
   
   <!---
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #1656.
   
   # Rationale for this change
    
   Currently the `substring` kernel only support "plain" arrays but not dictionary encoded ones. With dictionary array, the compute could be much more efficient since it only needs to be done on the dictionary values.
   
    <!---
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.
   -->
   
   # What changes are included in this PR?
   
   <!---
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   This PR adds the support of dictionary array for `substring` kernel.
   
   # Are there any user-facing changes?
   
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] sunchao commented on a diff in pull request #1665: Add dictionary array support for substring function

Posted by GitBox <gi...@apache.org>.
sunchao commented on code in PR #1665:
URL: https://github.com/apache/arrow-rs/pull/1665#discussion_r867265840


##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -18,13 +18,135 @@
 //! Defines kernel to extract a substring of an Array
 //! Supported array types: \[Large\]StringArray, \[Large\]BinaryArray
 
+use crate::array::DictionaryArray;
 use crate::buffer::MutableBuffer;
+use crate::datatypes::*;
 use crate::{array::*, buffer::Buffer};
 use crate::{
     datatypes::DataType,
     error::{ArrowError, Result},
 };
 use std::cmp::Ordering;
+use std::sync::Arc;
+
+/// Returns an ArrayRef with substrings of all the elements in `array`.
+///
+/// # Arguments
+///
+/// * `start` - The start index of all substrings.
+/// If `start >= 0`, then count from the start of the string,
+/// otherwise count from the end of the string.
+///
+/// * `length`(option) - The length of all substrings.
+/// If `length` is `None`, then the substring is from `start` to the end of the string.
+///
+/// Attention: Both `start` and `length` are counted by byte, not by char.
+///
+/// # Basic usage
+/// ```
+/// # use arrow::array::StringArray;
+/// # use arrow::compute::kernels::substring::substring;
+/// let array = StringArray::from(vec![Some("arrow"), None, Some("rust")]);
+/// let result = substring(&array, 1, Some(4)).unwrap();
+/// let result = result.as_any().downcast_ref::<StringArray>().unwrap();
+/// assert_eq!(result, &StringArray::from(vec![Some("rrow"), None, Some("ust")]));
+/// ```
+///
+/// # Error
+/// - The function errors when the passed array is not a \[Large\]String array or \[Large\]Binary array.
+/// - The function errors if the offset of a substring in the input array is at invalid char boundary (only for \[Large\]String array).
+///
+/// ## Example of trying to get an invalid utf-8 format substring
+/// ```
+/// # use arrow::array::StringArray;
+/// # use arrow::compute::kernels::substring::substring;
+/// let array = StringArray::from(vec![Some("E=mc²")]);
+/// let error = substring(&array, 0, Some(5)).unwrap_err().to_string();
+/// assert!(error.contains("invalid utf-8 boundary"));
+/// ```
+pub fn substring(array: &dyn Array, start: i64, length: Option<u64>) -> Result<ArrayRef> {
+    macro_rules! substring_dict {
+        ($kt: ident, $($t: ident: $gt: ident), *) => {
+            match $kt.as_ref() {
+                $(
+                    &DataType::$t => {
+                        let dict = array
+                            .as_any()
+                            .downcast_ref::<DictionaryArray<$gt>>()
+                            .unwrap_or_else(|| {
+                                panic!("Expect 'DictionaryArray<$gt>' but got {:?}", array)
+                            });
+                        let values = substring(dict.values(), start, length)?;
+                        let result = DictionaryArray::try_new(dict.keys(), &values)?;
+                        Ok(Arc::new(result))
+                    },
+                )*
+                    t => panic!("Unsupported dictionary key type: {}", t)
+            }
+        }
+    }
+
+    match array.data_type() {
+        DataType::Dictionary(kt, _) => {
+            substring_dict!(
+                kt,
+                Int8: Int8Type,

Review Comment:
   We may make this shorter via `concat_idents` (e.g., `concat_idents($t, Type)`) but it's only available in nightly.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] HaoYang670 commented on a diff in pull request #1665: Add dictionary array support for substring function

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on code in PR #1665:
URL: https://github.com/apache/arrow-rs/pull/1665#discussion_r867324877


##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -18,13 +18,136 @@
 //! Defines kernel to extract a substring of an Array
 //! Supported array types: \[Large\]StringArray, \[Large\]BinaryArray
 
+use crate::array::DictionaryArray;
 use crate::buffer::MutableBuffer;
+use crate::datatypes::*;
 use crate::{array::*, buffer::Buffer};
 use crate::{
     datatypes::DataType,
     error::{ArrowError, Result},
 };
 use std::cmp::Ordering;
+use std::sync::Arc;
+
+/// Returns an ArrayRef with substrings of all the elements in `array`.
+///
+/// # Arguments
+///
+/// * `start` - The start index of all substrings.
+/// If `start >= 0`, then count from the start of the string,
+/// otherwise count from the end of the string.
+///
+/// * `length`(option) - The length of all substrings.
+/// If `length` is `None`, then the substring is from `start` to the end of the string.
+///
+/// Attention: Both `start` and `length` are counted by byte, not by char.
+///
+/// # Basic usage
+/// ```
+/// # use arrow::array::StringArray;
+/// # use arrow::compute::kernels::substring::substring;
+/// let array = StringArray::from(vec![Some("arrow"), None, Some("rust")]);
+/// let result = substring(&array, 1, Some(4)).unwrap();
+/// let result = result.as_any().downcast_ref::<StringArray>().unwrap();
+/// assert_eq!(result, &StringArray::from(vec![Some("rrow"), None, Some("ust")]));
+/// ```
+///
+/// # Error
+/// - The function errors when the passed array is not a \[Large\]String array or \[Large\]Binary array.
+/// - The function errors if the offset of a substring in the input array is at invalid char boundary (only for \[Large\]String array).
+///
+/// ## Example of trying to get an invalid utf-8 format substring
+/// ```
+/// # use arrow::array::StringArray;
+/// # use arrow::compute::kernels::substring::substring;
+/// let array = StringArray::from(vec![Some("E=mc²")]);
+/// let error = substring(&array, 0, Some(5)).unwrap_err().to_string();
+/// assert!(error.contains("invalid utf-8 boundary"));
+/// ```
+pub fn substring(array: &dyn Array, start: i64, length: Option<u64>) -> Result<ArrayRef> {

Review Comment:
   Just a nit: Maybe we could let `length` be `Option<u32>`. Because the longest length will not exceed `1<<31 - 1` (for `LargeBinaryArray` and `LargeStringArray`)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] sunchao commented on a diff in pull request #1665: Add dictionary array support for substring function

Posted by GitBox <gi...@apache.org>.
sunchao commented on code in PR #1665:
URL: https://github.com/apache/arrow-rs/pull/1665#discussion_r867274835


##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -18,13 +18,135 @@
 //! Defines kernel to extract a substring of an Array
 //! Supported array types: \[Large\]StringArray, \[Large\]BinaryArray
 
+use crate::array::DictionaryArray;
 use crate::buffer::MutableBuffer;
+use crate::datatypes::*;
 use crate::{array::*, buffer::Buffer};
 use crate::{
     datatypes::DataType,
     error::{ArrowError, Result},
 };
 use std::cmp::Ordering;
+use std::sync::Arc;
+
+/// Returns an ArrayRef with substrings of all the elements in `array`.
+///
+/// # Arguments
+///
+/// * `start` - The start index of all substrings.
+/// If `start >= 0`, then count from the start of the string,
+/// otherwise count from the end of the string.
+///
+/// * `length`(option) - The length of all substrings.
+/// If `length` is `None`, then the substring is from `start` to the end of the string.
+///
+/// Attention: Both `start` and `length` are counted by byte, not by char.
+///
+/// # Basic usage
+/// ```
+/// # use arrow::array::StringArray;
+/// # use arrow::compute::kernels::substring::substring;
+/// let array = StringArray::from(vec![Some("arrow"), None, Some("rust")]);
+/// let result = substring(&array, 1, Some(4)).unwrap();
+/// let result = result.as_any().downcast_ref::<StringArray>().unwrap();
+/// assert_eq!(result, &StringArray::from(vec![Some("rrow"), None, Some("ust")]));
+/// ```
+///
+/// # Error
+/// - The function errors when the passed array is not a \[Large\]String array or \[Large\]Binary array.
+/// - The function errors if the offset of a substring in the input array is at invalid char boundary (only for \[Large\]String array).
+///
+/// ## Example of trying to get an invalid utf-8 format substring
+/// ```
+/// # use arrow::array::StringArray;
+/// # use arrow::compute::kernels::substring::substring;
+/// let array = StringArray::from(vec![Some("E=mc²")]);
+/// let error = substring(&array, 0, Some(5)).unwrap_err().to_string();
+/// assert!(error.contains("invalid utf-8 boundary"));
+/// ```
+pub fn substring(array: &dyn Array, start: i64, length: Option<u64>) -> Result<ArrayRef> {
+    macro_rules! substring_dict {
+        ($kt: ident, $($t: ident: $gt: ident), *) => {
+            match $kt.as_ref() {
+                $(
+                    &DataType::$t => {
+                        let dict = array
+                            .as_any()
+                            .downcast_ref::<DictionaryArray<$gt>>()
+                            .unwrap_or_else(|| {
+                                panic!("Expect 'DictionaryArray<$gt>' but got {:?}", array)

Review Comment:
   Good catch. Let me just use the data type from the array then.



##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -18,13 +18,135 @@
 //! Defines kernel to extract a substring of an Array
 //! Supported array types: \[Large\]StringArray, \[Large\]BinaryArray
 
+use crate::array::DictionaryArray;
 use crate::buffer::MutableBuffer;
+use crate::datatypes::*;
 use crate::{array::*, buffer::Buffer};
 use crate::{
     datatypes::DataType,
     error::{ArrowError, Result},
 };
 use std::cmp::Ordering;
+use std::sync::Arc;
+
+/// Returns an ArrayRef with substrings of all the elements in `array`.
+///
+/// # Arguments
+///
+/// * `start` - The start index of all substrings.
+/// If `start >= 0`, then count from the start of the string,
+/// otherwise count from the end of the string.
+///
+/// * `length`(option) - The length of all substrings.
+/// If `length` is `None`, then the substring is from `start` to the end of the string.
+///
+/// Attention: Both `start` and `length` are counted by byte, not by char.
+///
+/// # Basic usage
+/// ```
+/// # use arrow::array::StringArray;
+/// # use arrow::compute::kernels::substring::substring;
+/// let array = StringArray::from(vec![Some("arrow"), None, Some("rust")]);
+/// let result = substring(&array, 1, Some(4)).unwrap();
+/// let result = result.as_any().downcast_ref::<StringArray>().unwrap();
+/// assert_eq!(result, &StringArray::from(vec![Some("rrow"), None, Some("ust")]));
+/// ```
+///
+/// # Error
+/// - The function errors when the passed array is not a \[Large\]String array or \[Large\]Binary array.
+/// - The function errors if the offset of a substring in the input array is at invalid char boundary (only for \[Large\]String array).
+///
+/// ## Example of trying to get an invalid utf-8 format substring
+/// ```
+/// # use arrow::array::StringArray;
+/// # use arrow::compute::kernels::substring::substring;
+/// let array = StringArray::from(vec![Some("E=mc²")]);
+/// let error = substring(&array, 0, Some(5)).unwrap_err().to_string();
+/// assert!(error.contains("invalid utf-8 boundary"));
+/// ```
+pub fn substring(array: &dyn Array, start: i64, length: Option<u64>) -> Result<ArrayRef> {
+    macro_rules! substring_dict {
+        ($kt: ident, $($t: ident: $gt: ident), *) => {
+            match $kt.as_ref() {
+                $(
+                    &DataType::$t => {
+                        let dict = array
+                            .as_any()
+                            .downcast_ref::<DictionaryArray<$gt>>()
+                            .unwrap_or_else(|| {
+                                panic!("Expect 'DictionaryArray<$gt>' but got {:?}", array)

Review Comment:
   Good point. Will fix.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] sunchao commented on a diff in pull request #1665: Add dictionary array support for substring function

Posted by GitBox <gi...@apache.org>.
sunchao commented on code in PR #1665:
URL: https://github.com/apache/arrow-rs/pull/1665#discussion_r868189577


##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -18,13 +18,136 @@
 //! Defines kernel to extract a substring of an Array
 //! Supported array types: \[Large\]StringArray, \[Large\]BinaryArray
 
+use crate::array::DictionaryArray;
 use crate::buffer::MutableBuffer;
+use crate::datatypes::*;
 use crate::{array::*, buffer::Buffer};
 use crate::{
     datatypes::DataType,
     error::{ArrowError, Result},
 };
 use std::cmp::Ordering;
+use std::sync::Arc;
+
+/// Returns an ArrayRef with substrings of all the elements in `array`.
+///
+/// # Arguments
+///
+/// * `start` - The start index of all substrings.
+/// If `start >= 0`, then count from the start of the string,
+/// otherwise count from the end of the string.
+///
+/// * `length`(option) - The length of all substrings.
+/// If `length` is `None`, then the substring is from `start` to the end of the string.
+///
+/// Attention: Both `start` and `length` are counted by byte, not by char.
+///
+/// # Basic usage
+/// ```
+/// # use arrow::array::StringArray;
+/// # use arrow::compute::kernels::substring::substring;
+/// let array = StringArray::from(vec![Some("arrow"), None, Some("rust")]);
+/// let result = substring(&array, 1, Some(4)).unwrap();
+/// let result = result.as_any().downcast_ref::<StringArray>().unwrap();
+/// assert_eq!(result, &StringArray::from(vec![Some("rrow"), None, Some("ust")]));
+/// ```
+///
+/// # Error
+/// - The function errors when the passed array is not a \[Large\]String array or \[Large\]Binary array.
+/// - The function errors if the offset of a substring in the input array is at invalid char boundary (only for \[Large\]String array).
+///
+/// ## Example of trying to get an invalid utf-8 format substring
+/// ```
+/// # use arrow::array::StringArray;
+/// # use arrow::compute::kernels::substring::substring;
+/// let array = StringArray::from(vec![Some("E=mc²")]);
+/// let error = substring(&array, 0, Some(5)).unwrap_err().to_string();
+/// assert!(error.contains("invalid utf-8 boundary"));
+/// ```
+pub fn substring(array: &dyn Array, start: i64, length: Option<u64>) -> Result<ArrayRef> {

Review Comment:
   Hmm I think this is not quite related to this PR. I can open another one for the change.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on pull request #1665: Add dictionary array support for substring function

Posted by GitBox <gi...@apache.org>.
viirya commented on PR #1665:
URL: https://github.com/apache/arrow-rs/pull/1665#issuecomment-1121405749

   Merged, thanks @sunchao 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on pull request #1665: Add dictionary array support for substring function

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #1665:
URL: https://github.com/apache/arrow-rs/pull/1665#issuecomment-1121070147

   Thank you @sunchao ❤️ 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on a diff in pull request #1665: Add dictionary array support for substring function

Posted by GitBox <gi...@apache.org>.
viirya commented on code in PR #1665:
URL: https://github.com/apache/arrow-rs/pull/1665#discussion_r867392345


##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -18,13 +18,136 @@
 //! Defines kernel to extract a substring of an Array
 //! Supported array types: \[Large\]StringArray, \[Large\]BinaryArray
 
+use crate::array::DictionaryArray;
 use crate::buffer::MutableBuffer;
+use crate::datatypes::*;
 use crate::{array::*, buffer::Buffer};
 use crate::{
     datatypes::DataType,
     error::{ArrowError, Result},
 };
 use std::cmp::Ordering;
+use std::sync::Arc;
+
+/// Returns an ArrayRef with substrings of all the elements in `array`.
+///
+/// # Arguments
+///
+/// * `start` - The start index of all substrings.
+/// If `start >= 0`, then count from the start of the string,
+/// otherwise count from the end of the string.
+///
+/// * `length`(option) - The length of all substrings.
+/// If `length` is `None`, then the substring is from `start` to the end of the string.
+///
+/// Attention: Both `start` and `length` are counted by byte, not by char.
+///
+/// # Basic usage
+/// ```
+/// # use arrow::array::StringArray;
+/// # use arrow::compute::kernels::substring::substring;
+/// let array = StringArray::from(vec![Some("arrow"), None, Some("rust")]);
+/// let result = substring(&array, 1, Some(4)).unwrap();
+/// let result = result.as_any().downcast_ref::<StringArray>().unwrap();
+/// assert_eq!(result, &StringArray::from(vec![Some("rrow"), None, Some("ust")]));
+/// ```
+///
+/// # Error
+/// - The function errors when the passed array is not a \[Large\]String array or \[Large\]Binary array.

Review Comment:
   We may also update that `Dictionary` arrays with [large]string/[large]binary values are also accepted here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya merged pull request #1665: Add dictionary array support for substring function

Posted by GitBox <gi...@apache.org>.
viirya merged PR #1665:
URL: https://github.com/apache/arrow-rs/pull/1665


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] sunchao commented on a diff in pull request #1665: Add dictionary array support for substring function

Posted by GitBox <gi...@apache.org>.
sunchao commented on code in PR #1665:
URL: https://github.com/apache/arrow-rs/pull/1665#discussion_r867275202


##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -954,6 +992,56 @@ mod tests {
         without_nulls_generic_string::<i64>()
     }
 
+    #[test]
+    fn dictionary() -> Result<()> {

Review Comment:
   I think it's not necessary to add `test_` prefix for Rust tests since they are already under the `tests` module. The `substring` here also seem redundant since the full test name `compute::kernels::substring::tests::dictionary` already contain it.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] sunchao commented on a diff in pull request #1665: Add dictionary array support for substring function

Posted by GitBox <gi...@apache.org>.
sunchao commented on code in PR #1665:
URL: https://github.com/apache/arrow-rs/pull/1665#discussion_r867265638


##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -18,13 +18,135 @@
 //! Defines kernel to extract a substring of an Array
 //! Supported array types: \[Large\]StringArray, \[Large\]BinaryArray
 
+use crate::array::DictionaryArray;
 use crate::buffer::MutableBuffer;
+use crate::datatypes::*;
 use crate::{array::*, buffer::Buffer};
 use crate::{
     datatypes::DataType,
     error::{ArrowError, Result},
 };
 use std::cmp::Ordering;
+use std::sync::Arc;
+
+/// Returns an ArrayRef with substrings of all the elements in `array`.
+///
+/// # Arguments
+///
+/// * `start` - The start index of all substrings.
+/// If `start >= 0`, then count from the start of the string,
+/// otherwise count from the end of the string.
+///
+/// * `length`(option) - The length of all substrings.
+/// If `length` is `None`, then the substring is from `start` to the end of the string.
+///
+/// Attention: Both `start` and `length` are counted by byte, not by char.
+///
+/// # Basic usage
+/// ```
+/// # use arrow::array::StringArray;
+/// # use arrow::compute::kernels::substring::substring;
+/// let array = StringArray::from(vec![Some("arrow"), None, Some("rust")]);
+/// let result = substring(&array, 1, Some(4)).unwrap();
+/// let result = result.as_any().downcast_ref::<StringArray>().unwrap();
+/// assert_eq!(result, &StringArray::from(vec![Some("rrow"), None, Some("ust")]));
+/// ```
+///
+/// # Error
+/// - The function errors when the passed array is not a \[Large\]String array or \[Large\]Binary array.
+/// - The function errors if the offset of a substring in the input array is at invalid char boundary (only for \[Large\]String array).
+///
+/// ## Example of trying to get an invalid utf-8 format substring
+/// ```
+/// # use arrow::array::StringArray;
+/// # use arrow::compute::kernels::substring::substring;
+/// let array = StringArray::from(vec![Some("E=mc²")]);
+/// let error = substring(&array, 0, Some(5)).unwrap_err().to_string();
+/// assert!(error.contains("invalid utf-8 boundary"));
+/// ```
+pub fn substring(array: &dyn Array, start: i64, length: Option<u64>) -> Result<ArrayRef> {

Review Comment:
   I moved this func to the beginning of the file, before all other non-public ones, for better readability.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on a diff in pull request #1665: Add dictionary array support for substring function

Posted by GitBox <gi...@apache.org>.
viirya commented on code in PR #1665:
URL: https://github.com/apache/arrow-rs/pull/1665#discussion_r867267938


##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -18,13 +18,135 @@
 //! Defines kernel to extract a substring of an Array
 //! Supported array types: \[Large\]StringArray, \[Large\]BinaryArray
 
+use crate::array::DictionaryArray;
 use crate::buffer::MutableBuffer;
+use crate::datatypes::*;
 use crate::{array::*, buffer::Buffer};
 use crate::{
     datatypes::DataType,
     error::{ArrowError, Result},
 };
 use std::cmp::Ordering;
+use std::sync::Arc;
+
+/// Returns an ArrayRef with substrings of all the elements in `array`.
+///
+/// # Arguments
+///
+/// * `start` - The start index of all substrings.
+/// If `start >= 0`, then count from the start of the string,
+/// otherwise count from the end of the string.
+///
+/// * `length`(option) - The length of all substrings.
+/// If `length` is `None`, then the substring is from `start` to the end of the string.
+///
+/// Attention: Both `start` and `length` are counted by byte, not by char.
+///
+/// # Basic usage
+/// ```
+/// # use arrow::array::StringArray;
+/// # use arrow::compute::kernels::substring::substring;
+/// let array = StringArray::from(vec![Some("arrow"), None, Some("rust")]);
+/// let result = substring(&array, 1, Some(4)).unwrap();
+/// let result = result.as_any().downcast_ref::<StringArray>().unwrap();
+/// assert_eq!(result, &StringArray::from(vec![Some("rrow"), None, Some("ust")]));
+/// ```
+///
+/// # Error
+/// - The function errors when the passed array is not a \[Large\]String array or \[Large\]Binary array.
+/// - The function errors if the offset of a substring in the input array is at invalid char boundary (only for \[Large\]String array).
+///
+/// ## Example of trying to get an invalid utf-8 format substring
+/// ```
+/// # use arrow::array::StringArray;
+/// # use arrow::compute::kernels::substring::substring;
+/// let array = StringArray::from(vec![Some("E=mc²")]);
+/// let error = substring(&array, 0, Some(5)).unwrap_err().to_string();
+/// assert!(error.contains("invalid utf-8 boundary"));
+/// ```
+pub fn substring(array: &dyn Array, start: i64, length: Option<u64>) -> Result<ArrayRef> {
+    macro_rules! substring_dict {
+        ($kt: ident, $($t: ident: $gt: ident), *) => {
+            match $kt.as_ref() {
+                $(
+                    &DataType::$t => {
+                        let dict = array
+                            .as_any()
+                            .downcast_ref::<DictionaryArray<$gt>>()
+                            .unwrap_or_else(|| {
+                                panic!("Expect 'DictionaryArray<$gt>' but got {:?}", array)

Review Comment:
   Is it too verbose by directly outputting `array` in panic message? Will it print out full content of the array?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on a diff in pull request #1665: Add dictionary array support for substring function

Posted by GitBox <gi...@apache.org>.
viirya commented on code in PR #1665:
URL: https://github.com/apache/arrow-rs/pull/1665#discussion_r867268156


##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -18,13 +18,135 @@
 //! Defines kernel to extract a substring of an Array
 //! Supported array types: \[Large\]StringArray, \[Large\]BinaryArray
 
+use crate::array::DictionaryArray;
 use crate::buffer::MutableBuffer;
+use crate::datatypes::*;
 use crate::{array::*, buffer::Buffer};
 use crate::{
     datatypes::DataType,
     error::{ArrowError, Result},
 };
 use std::cmp::Ordering;
+use std::sync::Arc;
+
+/// Returns an ArrayRef with substrings of all the elements in `array`.
+///
+/// # Arguments
+///
+/// * `start` - The start index of all substrings.
+/// If `start >= 0`, then count from the start of the string,
+/// otherwise count from the end of the string.
+///
+/// * `length`(option) - The length of all substrings.
+/// If `length` is `None`, then the substring is from `start` to the end of the string.
+///
+/// Attention: Both `start` and `length` are counted by byte, not by char.
+///
+/// # Basic usage
+/// ```
+/// # use arrow::array::StringArray;
+/// # use arrow::compute::kernels::substring::substring;
+/// let array = StringArray::from(vec![Some("arrow"), None, Some("rust")]);
+/// let result = substring(&array, 1, Some(4)).unwrap();
+/// let result = result.as_any().downcast_ref::<StringArray>().unwrap();
+/// assert_eq!(result, &StringArray::from(vec![Some("rrow"), None, Some("ust")]));
+/// ```
+///
+/// # Error
+/// - The function errors when the passed array is not a \[Large\]String array or \[Large\]Binary array.
+/// - The function errors if the offset of a substring in the input array is at invalid char boundary (only for \[Large\]String array).
+///
+/// ## Example of trying to get an invalid utf-8 format substring
+/// ```
+/// # use arrow::array::StringArray;
+/// # use arrow::compute::kernels::substring::substring;
+/// let array = StringArray::from(vec![Some("E=mc²")]);
+/// let error = substring(&array, 0, Some(5)).unwrap_err().to_string();
+/// assert!(error.contains("invalid utf-8 boundary"));
+/// ```
+pub fn substring(array: &dyn Array, start: i64, length: Option<u64>) -> Result<ArrayRef> {
+    macro_rules! substring_dict {
+        ($kt: ident, $($t: ident: $gt: ident), *) => {
+            match $kt.as_ref() {
+                $(
+                    &DataType::$t => {
+                        let dict = array
+                            .as_any()
+                            .downcast_ref::<DictionaryArray<$gt>>()
+                            .unwrap_or_else(|| {
+                                panic!("Expect 'DictionaryArray<$gt>' but got {:?}", array)

Review Comment:
   `$gt` will be interpolated?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on a diff in pull request #1665: Add dictionary array support for substring function

Posted by GitBox <gi...@apache.org>.
viirya commented on code in PR #1665:
URL: https://github.com/apache/arrow-rs/pull/1665#discussion_r867268461


##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -954,6 +992,56 @@ mod tests {
         without_nulls_generic_string::<i64>()
     }
 
+    #[test]
+    fn dictionary() -> Result<()> {

Review Comment:
   ```suggestion
       fn test_string_dictionary() -> Result<()> {
   ```



##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -954,6 +992,56 @@ mod tests {
         without_nulls_generic_string::<i64>()
     }
 
+    #[test]
+    fn dictionary() -> Result<()> {

Review Comment:
   ```suggestion
       fn test_substring_dictionary() -> Result<()> {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] codecov-commenter commented on pull request #1665: Add dictionary array support for substring function

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on PR #1665:
URL: https://github.com/apache/arrow-rs/pull/1665#issuecomment-1120081463

   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1665?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#1665](https://codecov.io/gh/apache/arrow-rs/pull/1665?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (17d4b89) into [master](https://codecov.io/gh/apache/arrow-rs/commit/a38e460aca6cb247eac8bb8a6461947f00fe89d0?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (a38e460) will **increase** coverage by `0.02%`.
   > The diff coverage is `95.65%`.
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #1665      +/-   ##
   ==========================================
   + Coverage   83.10%   83.13%   +0.02%     
   ==========================================
     Files         193      193              
     Lines       55864    55881      +17     
   ==========================================
   + Hits        46425    46454      +29     
   + Misses       9439     9427      -12     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/1665?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [arrow/src/compute/kernels/substring.rs](https://codecov.io/gh/apache/arrow-rs/pull/1665/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy9zdWJzdHJpbmcucnM=) | `99.57% <95.65%> (-0.43%)` | :arrow_down: |
   | [arrow/src/array/array.rs](https://codecov.io/gh/apache/arrow-rs/pull/1665/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2FycmF5LnJz) | `89.67% <0.00%> (-3.04%)` | :arrow_down: |
   | [arrow/src/array/transform/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/1665/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9tb2QucnM=) | `86.68% <0.00%> (ø)` | |
   | [parquet\_derive/src/parquet\_field.rs](https://codecov.io/gh/apache/arrow-rs/pull/1665/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldF9kZXJpdmUvc3JjL3BhcnF1ZXRfZmllbGQucnM=) | `65.98% <0.00%> (ø)` | |
   | [arrow/src/datatypes/datatype.rs](https://codecov.io/gh/apache/arrow-rs/pull/1665/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9kYXRhdHlwZS5ycw==) | `66.80% <0.00%> (+0.39%)` | :arrow_up: |
   | [parquet/src/arrow/converter.rs](https://codecov.io/gh/apache/arrow-rs/pull/1665/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvY29udmVydGVyLnJz) | `83.33% <0.00%> (+19.36%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1665?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1665?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [a38e460...17d4b89](https://codecov.io/gh/apache/arrow-rs/pull/1665?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] HaoYang670 commented on a diff in pull request #1665: Add dictionary array support for substring function

Posted by GitBox <gi...@apache.org>.
HaoYang670 commented on code in PR #1665:
URL: https://github.com/apache/arrow-rs/pull/1665#discussion_r867324877


##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -18,13 +18,136 @@
 //! Defines kernel to extract a substring of an Array
 //! Supported array types: \[Large\]StringArray, \[Large\]BinaryArray
 
+use crate::array::DictionaryArray;
 use crate::buffer::MutableBuffer;
+use crate::datatypes::*;
 use crate::{array::*, buffer::Buffer};
 use crate::{
     datatypes::DataType,
     error::{ArrowError, Result},
 };
 use std::cmp::Ordering;
+use std::sync::Arc;
+
+/// Returns an ArrayRef with substrings of all the elements in `array`.
+///
+/// # Arguments
+///
+/// * `start` - The start index of all substrings.
+/// If `start >= 0`, then count from the start of the string,
+/// otherwise count from the end of the string.
+///
+/// * `length`(option) - The length of all substrings.
+/// If `length` is `None`, then the substring is from `start` to the end of the string.
+///
+/// Attention: Both `start` and `length` are counted by byte, not by char.
+///
+/// # Basic usage
+/// ```
+/// # use arrow::array::StringArray;
+/// # use arrow::compute::kernels::substring::substring;
+/// let array = StringArray::from(vec![Some("arrow"), None, Some("rust")]);
+/// let result = substring(&array, 1, Some(4)).unwrap();
+/// let result = result.as_any().downcast_ref::<StringArray>().unwrap();
+/// assert_eq!(result, &StringArray::from(vec![Some("rrow"), None, Some("ust")]));
+/// ```
+///
+/// # Error
+/// - The function errors when the passed array is not a \[Large\]String array or \[Large\]Binary array.
+/// - The function errors if the offset of a substring in the input array is at invalid char boundary (only for \[Large\]String array).
+///
+/// ## Example of trying to get an invalid utf-8 format substring
+/// ```
+/// # use arrow::array::StringArray;
+/// # use arrow::compute::kernels::substring::substring;
+/// let array = StringArray::from(vec![Some("E=mc²")]);
+/// let error = substring(&array, 0, Some(5)).unwrap_err().to_string();
+/// assert!(error.contains("invalid utf-8 boundary"));
+/// ```
+pub fn substring(array: &dyn Array, start: i64, length: Option<u64>) -> Result<ArrayRef> {

Review Comment:
   Just a nit: Maybe we could let `length` be `Option<u32>`. Because the longest length will not exceed `1<<32 - 1` (for `LargeBinaryArray` and `LargeStringArray`)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] sunchao commented on a diff in pull request #1665: Add dictionary array support for substring function

Posted by GitBox <gi...@apache.org>.
sunchao commented on code in PR #1665:
URL: https://github.com/apache/arrow-rs/pull/1665#discussion_r868189795


##########
arrow/src/compute/kernels/substring.rs:
##########
@@ -18,13 +18,136 @@
 //! Defines kernel to extract a substring of an Array
 //! Supported array types: \[Large\]StringArray, \[Large\]BinaryArray
 
+use crate::array::DictionaryArray;
 use crate::buffer::MutableBuffer;
+use crate::datatypes::*;
 use crate::{array::*, buffer::Buffer};
 use crate::{
     datatypes::DataType,
     error::{ArrowError, Result},
 };
 use std::cmp::Ordering;
+use std::sync::Arc;
+
+/// Returns an ArrayRef with substrings of all the elements in `array`.
+///
+/// # Arguments
+///
+/// * `start` - The start index of all substrings.
+/// If `start >= 0`, then count from the start of the string,
+/// otherwise count from the end of the string.
+///
+/// * `length`(option) - The length of all substrings.
+/// If `length` is `None`, then the substring is from `start` to the end of the string.
+///
+/// Attention: Both `start` and `length` are counted by byte, not by char.
+///
+/// # Basic usage
+/// ```
+/// # use arrow::array::StringArray;
+/// # use arrow::compute::kernels::substring::substring;
+/// let array = StringArray::from(vec![Some("arrow"), None, Some("rust")]);
+/// let result = substring(&array, 1, Some(4)).unwrap();
+/// let result = result.as_any().downcast_ref::<StringArray>().unwrap();
+/// assert_eq!(result, &StringArray::from(vec![Some("rrow"), None, Some("ust")]));
+/// ```
+///
+/// # Error
+/// - The function errors when the passed array is not a \[Large\]String array or \[Large\]Binary array.

Review Comment:
   Good catch. Updated.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org