You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/18 06:55:32 UTC

[GitHub] [arrow] seddonm1 opened a new pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions

seddonm1 opened a new pull request #9243:
URL: https://github.com/apache/arrow/pull/9243


   I am throwing up a draft PR just to give people a heads up that I am working through these.
   
   **Changes**
   I have had to make some changes to the existing implementations:
   - `concat` had the incorrect behavior for how to handle NULLs where any null would result in a NULL where the Postgres implementation documents: `NULL arguments are ignored.`.
   - `ltrim` and `rtrim` were implemented to support only the default space character whereas Postgres supports an optional second parameter: `ltrim('zzzytest', 'xyz')` so that has been updated.
   
   **Questions**
   - @jorgecarleitao I think we need this `Signature::Uniform` vs `Signature::UniformEqual`. This came up with a `left` function that takes a `(utf8, int64)` signature and it is not correct to try to cast both to `utf8`. You can see my implementation here but perhaps you have a better method.
   - @jorgecarleitao Do you have a nice way of throwing errors in the map itself? See commented out Chr that I would appreciate your assistance: https://github.com/apache/arrow/compare/master...seddonm1:postgres-string-functions?expand=1#diff-abe8768fe7124198cca7a84ad7b2c678b3cc8e5de3d1bc867d498536a2fdddc7R287
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#issuecomment-778599946


   I am going in to review this PR -- I am getting second cup of ☕  and settling down for a good read 👓 ...


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [Splitting to separate PRs]

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#issuecomment-801488843


   Closed after splitting.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#discussion_r575654714



##########
File path: rust/arrow/src/compute/kernels/length.rs
##########
@@ -62,7 +62,7 @@ where
     Ok(make_array(Arc::new(data)))
 }
 
-/// Returns an array of Int32/Int64 denoting the number of characters in each string in the array.

Review comment:
       👍 

##########
File path: rust/arrow/src/compute/kernels/bit_length.rs
##########
@@ -0,0 +1,210 @@
+// Licensed to the Apache Software Foundation (ASF) under one

Review comment:
       FWIW I think extracting the `bit_length` kernel into its own PR for review /merge would be fairly easy. 

##########
File path: rust/datafusion/src/physical_plan/string_expressions.rs
##########
@@ -34,40 +38,1167 @@ macro_rules! downcast_vec {
     }};
 }
 
-/// concatenate string columns together.
-pub fn concatenate(args: &[ArrayRef]) -> Result<StringArray> {
+/// Returns the numeric code of the first character of the argument.
+pub fn ascii<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<Int32Array> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| {
+            x.map(|x: &str| {
+                let mut chars = x.chars();
+                chars.next().map_or(0, |v| v as i32)
+            })
+        })
+        .collect())
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the start and end of string.
+pub fn btrim<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    match args.len() {
+        1 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .map(|x| x.map(|x: &str| x.trim()))
+                .collect())
+        }
+        2 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let characters_array: &GenericStringArray<T> = args[1]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if characters_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let chars: Vec<char> =
+                                characters_array.value(i).chars().collect();
+                            x.trim_start_matches(&chars[..])
+                                .trim_end_matches(&chars[..])
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "btrim was called with {} arguments. It requires at most 2.",
+            other
+        ))),
+    }
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i32(args: &[ArrayRef]) -> Result<Int32Array> {
+    let string_array: &GenericStringArray<i32> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i32>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i32))
+        .collect())
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i64(args: &[ArrayRef]) -> Result<Int64Array> {
+    let string_array: &GenericStringArray<i64> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i64>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i64))
+        .collect())
+}
+
+/// Returns the character with the given code.
+pub fn chr(args: &[ArrayRef]) -> Result<StringArray> {
+    let integer_array: &Int64Array =
+        args[0].as_any().downcast_ref::<Int64Array>().unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    integer_array
+        .iter()
+        .map(|x: Option<i64>| {
+            x.map(|x| {
+                if x == 0 {
+                    Err(DataFusionError::Internal(
+                        "null character not permitted.".to_string(),
+                    ))
+                } else {
+                    match core::char::from_u32(x as u32) {
+                        Some(x) => Ok(x.to_string()),
+                        None => Err(DataFusionError::Internal(
+                            "requested character too large for encoding.".to_string(),
+                        )),
+                    }
+                }
+            })
+            .transpose()
+        })
+        .collect()
+}
+
+/// Concatenates the text representations of all the arguments. NULL arguments are ignored.
+pub fn concat(args: &[ArrayRef]) -> Result<StringArray> {
     // downcast all arguments to strings
     let args = downcast_vec!(args, StringArray).collect::<Result<Vec<&StringArray>>>()?;
     // do not accept 0 arguments.
     if args.is_empty() {
         return Err(DataFusionError::Internal(
-            "Concatenate was called with 0 arguments. It requires at least one."
-                .to_string(),
+            "concat was called with 0 arguments. It requires at least 2.".to_string(),

Review comment:
       the comments (and code) suggest that only 1 argument is required, but now the error message says 2 are required

##########
File path: rust/datafusion/src/physical_plan/functions.rs
##########
@@ -702,14 +1197,912 @@ mod tests {
         let result = result.as_any().downcast_ref::<StringArray>().unwrap();
 
         // value is correct
-        assert_eq!(result.value(0).to_string(), expected);
+        match expected {
+            Some(v) => assert_eq!(result.value(0), v),
+            None => assert!(result.is_null(0)),
+        };
 
         Ok(())
     }
 
     #[test]
-    fn test_concat_utf8() -> Result<()> {
-        test_concat(ScalarValue::Utf8(Some("aa".to_string())), "aaaa")
+    fn test_string_functions() -> Result<()> {
+        generic_string_function(
+            BuiltinScalarFunction::Btrim,
+            vec![
+                lit(ScalarValue::Utf8(Some("xyxtrimyyx".to_string()))),
+                lit(ScalarValue::Utf8(Some("xyz".to_string()))),
+            ],
+            Some("trim"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Btrim,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("xyz".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Btrim,
+            vec![
+                lit(ScalarValue::Utf8(Some("xyxtrimyyx".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Chr,
+            vec![lit(ScalarValue::Int64(Some(128175)))],
+            Some("💯"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Chr,
+            vec![lit(ScalarValue::Int64(None))],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Chr,
+            vec![lit(ScalarValue::Int64(Some(120)))],
+            Some("x"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Chr,
+            vec![lit(ScalarValue::Int64(Some(128175)))],
+            Some("💯"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Chr,
+            vec![lit(ScalarValue::Int64(None))],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Concat,
+            vec![
+                lit(ScalarValue::Utf8(Some("aa".to_string()))),
+                lit(ScalarValue::Utf8(Some("bb".to_string()))),
+                lit(ScalarValue::Utf8(Some("cc".to_string()))),
+            ],
+            Some("aabbcc"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Concat,
+            vec![
+                lit(ScalarValue::Utf8(Some("aa".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("cc".to_string()))),
+            ],
+            Some("aacc"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Concat,
+            vec![lit(ScalarValue::Utf8(None))],
+            Some(""),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::ConcatWithSeparator,
+            vec![
+                lit(ScalarValue::Utf8(Some("|".to_string()))),
+                lit(ScalarValue::Utf8(Some("aa".to_string()))),
+                lit(ScalarValue::Utf8(Some("bb".to_string()))),
+                lit(ScalarValue::Utf8(Some("cc".to_string()))),
+            ],
+            Some("aa|bb|cc"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::ConcatWithSeparator,
+            vec![
+                lit(ScalarValue::Utf8(Some("|".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+            ],
+            Some(""),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::ConcatWithSeparator,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("aa".to_string()))),
+                lit(ScalarValue::Utf8(Some("bb".to_string()))),
+                lit(ScalarValue::Utf8(Some("cc".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::ConcatWithSeparator,
+            vec![
+                lit(ScalarValue::Utf8(Some("|".to_string()))),
+                lit(ScalarValue::Utf8(Some("aa".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("cc".to_string()))),
+            ],
+            Some("aa|cc"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::InitCap,
+            vec![lit(ScalarValue::Utf8(Some("hi THOMAS".to_string())))],
+            Some("Hi Thomas"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::InitCap,
+            vec![lit(ScalarValue::Utf8(Some("".to_string())))],
+            Some(""),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::InitCap,
+            vec![lit(ScalarValue::Utf8(None))],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Left,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int8(Some(2))),
+            ],
+            Some("ab"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Left,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int64(Some(200))),
+            ],
+            Some("abcde"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Left,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int64(Some(-2))),
+            ],
+            Some("abc"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Left,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int64(Some(-200))),
+            ],
+            Some(""),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Left,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int64(Some(0))),
+            ],
+            Some(""),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Left,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Int64(Some(2))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Left,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int64(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(5))),
+            ],
+            Some("   hi"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(0))),
+            ],
+            Some(""),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Int64(Some(5))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(5))),
+                lit(ScalarValue::Utf8(Some("xy".to_string()))),
+            ],
+            Some("xyxhi"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(21))),
+                lit(ScalarValue::Utf8(Some("abcdef".to_string()))),
+            ],
+            Some("abcdefabcdefabcdefahi"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(5))),
+                lit(ScalarValue::Utf8(Some(" ".to_string()))),
+            ],
+            Some("   hi"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(5))),
+                lit(ScalarValue::Utf8(Some("".to_string()))),
+            ],
+            Some("hi"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Int64(Some(5))),
+                lit(ScalarValue::Utf8(Some("xy".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(None)),
+                lit(ScalarValue::Utf8(Some("xy".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(5))),
+                lit(ScalarValue::Utf8(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Utf8(Some("5".to_string()))),
+                lit(ScalarValue::Utf8(Some("xy".to_string()))),
+            ],
+            Some("xyxhi"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lower,
+            vec![lit(ScalarValue::Utf8(Some("LOWER".to_string())))],
+            Some("lower"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lower,
+            vec![lit(ScalarValue::Utf8(Some("lower".to_string())))],
+            Some("lower"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lower,
+            vec![lit(ScalarValue::Utf8(None))],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Ltrim,
+            vec![lit(ScalarValue::Utf8(Some(" trim".to_string())))],
+            Some("trim"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Ltrim,
+            vec![lit(ScalarValue::Utf8(Some("trim ".to_string())))],
+            Some("trim "),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Ltrim,
+            vec![lit(ScalarValue::Utf8(Some("trim".to_string())))],
+            Some("trim"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Ltrim,
+            vec![lit(ScalarValue::Utf8(None))],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::RegexpReplace,
+            vec![
+                lit(ScalarValue::Utf8(Some("Thomas".to_string()))),
+                lit(ScalarValue::Utf8(Some(".[mN]a.".to_string()))),
+                lit(ScalarValue::Utf8(Some("M".to_string()))),
+            ],
+            Some("ThM"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::RegexpReplace,
+            vec![
+                lit(ScalarValue::Utf8(Some("foobarbaz".to_string()))),
+                lit(ScalarValue::Utf8(Some("b..".to_string()))),
+                lit(ScalarValue::Utf8(Some("X".to_string()))),
+            ],
+            Some("fooXbaz"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::RegexpReplace,
+            vec![
+                lit(ScalarValue::Utf8(Some("foobarbaz".to_string()))),
+                lit(ScalarValue::Utf8(Some("b..".to_string()))),
+                lit(ScalarValue::Utf8(Some("X".to_string()))),
+                lit(ScalarValue::Utf8(Some("g".to_string()))),
+            ],
+            Some("fooXX"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::RegexpReplace,
+            vec![
+                lit(ScalarValue::Utf8(Some("foobarbaz".to_string()))),
+                lit(ScalarValue::Utf8(Some("b(..)".to_string()))),
+                lit(ScalarValue::Utf8(Some("X\\1Y".to_string()))),
+                lit(ScalarValue::Utf8(Some("g".to_string()))),
+            ],
+            Some("fooXarYXazY"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::RegexpReplace,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("b(..)".to_string()))),
+                lit(ScalarValue::Utf8(Some("X\\1Y".to_string()))),
+                lit(ScalarValue::Utf8(Some("g".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::RegexpReplace,
+            vec![
+                lit(ScalarValue::Utf8(Some("foobarbaz".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("X\\1Y".to_string()))),
+                lit(ScalarValue::Utf8(Some("g".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::RegexpReplace,
+            vec![
+                lit(ScalarValue::Utf8(Some("foobarbaz".to_string()))),
+                lit(ScalarValue::Utf8(Some("b(..)".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("g".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::RegexpReplace,
+            vec![
+                lit(ScalarValue::Utf8(Some("foobarbaz".to_string()))),
+                lit(ScalarValue::Utf8(Some("b(..)".to_string()))),
+                lit(ScalarValue::Utf8(Some("X\\1Y".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::RegexpReplace,
+            vec![
+                lit(ScalarValue::Utf8(Some("ABCabcABC".to_string()))),
+                lit(ScalarValue::Utf8(Some("(abc)".to_string()))),
+                lit(ScalarValue::Utf8(Some("X".to_string()))),
+                lit(ScalarValue::Utf8(Some("gi".to_string()))),
+            ],
+            Some("XXX"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::RegexpReplace,
+            vec![
+                lit(ScalarValue::Utf8(Some("ABCabcABC".to_string()))),
+                lit(ScalarValue::Utf8(Some("(abc)".to_string()))),
+                lit(ScalarValue::Utf8(Some("X".to_string()))),
+                lit(ScalarValue::Utf8(Some("i".to_string()))),
+            ],
+            Some("XabcABC"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Repeat,
+            vec![
+                lit(ScalarValue::Utf8(Some("Pg".to_string()))),
+                lit(ScalarValue::Int64(Some(4))),
+            ],
+            Some("PgPgPgPg"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Repeat,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Int64(Some(4))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Repeat,
+            vec![
+                lit(ScalarValue::Utf8(Some("Pg".to_string()))),
+                lit(ScalarValue::Int64(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Replace,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcdefabcdef".to_string()))),
+                lit(ScalarValue::Utf8(Some("cd".to_string()))),
+                lit(ScalarValue::Utf8(Some("XX".to_string()))),
+            ],
+            Some("abXXefabXXef"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Replace,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcdefabcdef".to_string()))),
+                lit(ScalarValue::Utf8(Some("notmatch".to_string()))),
+                lit(ScalarValue::Utf8(Some("XX".to_string()))),
+            ],
+            Some("abcdefabcdef"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Replace,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("cd".to_string()))),
+                lit(ScalarValue::Utf8(Some("XX".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Replace,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcdefabcdef".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("XX".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Replace,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcdefabcdef".to_string()))),
+                lit(ScalarValue::Utf8(Some("cd".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Reverse,
+            vec![lit(ScalarValue::Utf8(Some("abcde".to_string())))],
+            Some("edcba"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Reverse,
+            vec![lit(ScalarValue::Utf8(Some("loẅks".to_string())))],
+            Some("skẅol"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Reverse,
+            vec![lit(ScalarValue::Utf8(None))],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Right,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int8(Some(2))),
+            ],
+            Some("de"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Right,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int64(Some(200))),
+            ],
+            Some("abcde"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Right,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int64(Some(-2))),
+            ],
+            Some("cde"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Right,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int64(Some(-200))),
+            ],
+            Some(""),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Right,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int64(Some(0))),
+            ],
+            Some(""),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Right,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Int64(Some(2))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Right,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int64(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(5))),
+            ],
+            Some("hi   "),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(0))),
+            ],
+            Some(""),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Int64(Some(5))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(5))),
+                lit(ScalarValue::Utf8(Some("xy".to_string()))),
+            ],
+            Some("hixyx"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(21))),
+                lit(ScalarValue::Utf8(Some("abcdef".to_string()))),
+            ],
+            Some("hiabcdefabcdefabcdefa"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(5))),
+                lit(ScalarValue::Utf8(Some(" ".to_string()))),
+            ],
+            Some("hi   "),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(5))),
+                lit(ScalarValue::Utf8(Some("".to_string()))),
+            ],
+            Some("hi"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Int64(Some(5))),
+                lit(ScalarValue::Utf8(Some("xy".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(None)),
+                lit(ScalarValue::Utf8(Some("xy".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(5))),
+                lit(ScalarValue::Utf8(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rtrim,
+            vec![lit(ScalarValue::Utf8(Some("trim ".to_string())))],
+            Some("trim"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rtrim,
+            vec![lit(ScalarValue::Utf8(Some(" trim".to_string())))],
+            Some(" trim"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rtrim,
+            vec![lit(ScalarValue::Utf8(Some("trim".to_string())))],
+            Some("trim"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rtrim,
+            vec![lit(ScalarValue::Utf8(None))],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Trim,
+            vec![lit(ScalarValue::Utf8(Some(" trim ".to_string())))],
+            Some("trim"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Trim,
+            vec![lit(ScalarValue::Utf8(Some("trim ".to_string())))],
+            Some("trim"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Trim,
+            vec![lit(ScalarValue::Utf8(Some(" trim".to_string())))],
+            Some("trim"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Trim,
+            vec![lit(ScalarValue::Utf8(None))],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::SplitPart,
+            vec![
+                lit(ScalarValue::Utf8(Some("abc~@~def~@~ghi".to_string()))),
+                lit(ScalarValue::Utf8(Some("~@~".to_string()))),
+                lit(ScalarValue::Int64(Some(2))),
+            ],
+            Some("def"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::SplitPart,
+            vec![
+                lit(ScalarValue::Utf8(Some("abc~@~def~@~ghi".to_string()))),
+                lit(ScalarValue::Utf8(Some("~@~".to_string()))),
+                lit(ScalarValue::Int64(Some(20))),
+            ],
+            Some(""),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Substr,
+            vec![
+                lit(ScalarValue::Utf8(Some("alphabet".to_string()))),
+                lit(ScalarValue::Int64(Some(0))),
+            ],
+            Some("alphabet"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Substr,
+            vec![
+                lit(ScalarValue::Utf8(Some("alphabet".to_string()))),
+                lit(ScalarValue::Int64(Some(1))),
+            ],
+            Some("alphabet"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Substr,
+            vec![
+                lit(ScalarValue::Utf8(Some("alphabet".to_string()))),
+                lit(ScalarValue::Int64(Some(2))),
+            ],
+            Some("lphabet"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Substr,
+            vec![
+                lit(ScalarValue::Utf8(Some("alphabet".to_string()))),
+                lit(ScalarValue::Int64(Some(3))),
+            ],
+            Some("phabet"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Substr,
+            vec![
+                lit(ScalarValue::Utf8(Some("alphabet".to_string()))),
+                lit(ScalarValue::Int64(Some(-3))),
+            ],
+            Some("alphabet"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Substr,
+            vec![
+                lit(ScalarValue::Utf8(Some("alphabet".to_string()))),
+                lit(ScalarValue::Int64(Some(30))),
+            ],
+            Some(""),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Substr,
+            vec![
+                lit(ScalarValue::Utf8(Some("alphabet".to_string()))),
+                lit(ScalarValue::Int64(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Substr,
+            vec![
+                lit(ScalarValue::Utf8(Some("alphabet".to_string()))),
+                lit(ScalarValue::Int64(Some(3))),
+                lit(ScalarValue::Int64(Some(2))),
+            ],
+            Some("ph"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Substr,
+            vec![
+                lit(ScalarValue::Utf8(Some("alphabet".to_string()))),
+                lit(ScalarValue::Int64(Some(3))),
+                lit(ScalarValue::Int64(Some(20))),
+            ],
+            Some("phabet"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Substr,
+            vec![
+                lit(ScalarValue::Utf8(Some("alphabet".to_string()))),
+                lit(ScalarValue::Int64(None)),
+                lit(ScalarValue::Int64(Some(20))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Substr,
+            vec![
+                lit(ScalarValue::Utf8(Some("alphabet".to_string()))),
+                lit(ScalarValue::Int64(Some(3))),
+                lit(ScalarValue::Int64(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Translate,
+            vec![
+                lit(ScalarValue::Utf8(Some("12345".to_string()))),
+                lit(ScalarValue::Utf8(Some("143".to_string()))),
+                lit(ScalarValue::Utf8(Some("ax".to_string()))),
+            ],
+            Some("a2x5"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Translate,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("143".to_string()))),
+                lit(ScalarValue::Utf8(Some("ax".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Translate,
+            vec![
+                lit(ScalarValue::Utf8(Some("12345".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("ax".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Translate,
+            vec![
+                lit(ScalarValue::Utf8(Some("12345".to_string()))),
+                lit(ScalarValue::Utf8(Some("143".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Upper,
+            vec![lit(ScalarValue::Utf8(Some("upper".to_string())))],
+            Some("UPPER"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Upper,
+            vec![lit(ScalarValue::Utf8(Some("UPPER".to_string())))],
+            Some("UPPER"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Upper,
+            vec![lit(ScalarValue::Utf8(None))],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::ToHex,
+            vec![lit(ScalarValue::Int32(Some(2147483647)))],
+            Some("7fffffff"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::ToHex,
+            vec![lit(ScalarValue::Int32(None))],
+            None,
+        )?;
+
+        Ok(())
+    }
+
+    fn generic_string_i32_function(

Review comment:
       It might be cool to make this a macro so if the test fails he correct line number would be reported.

##########
File path: rust/datafusion/src/physical_plan/functions.rs
##########
@@ -496,22 +919,77 @@ pub fn create_physical_expr(
 fn signature(fun: &BuiltinScalarFunction) -> Signature {
     // note: the physical expression must accept the type returned by this function or the execution panics.
 
-    // for now, the list is small, as we do not have many built-in functions.

Review comment:
       :(
   
   I do wonder given the several `match` statements we have for each built in function if there is some better way to represent functions in a uniform way (perhaps following the user defined function model) where we can just define a function in one place and then register it with the execution context

##########
File path: rust/datafusion/src/physical_plan/functions.rs
##########
@@ -702,14 +1197,912 @@ mod tests {
         let result = result.as_any().downcast_ref::<StringArray>().unwrap();
 
         // value is correct
-        assert_eq!(result.value(0).to_string(), expected);
+        match expected {
+            Some(v) => assert_eq!(result.value(0), v),
+            None => assert!(result.is_null(0)),
+        };
 
         Ok(())
     }
 
     #[test]
-    fn test_concat_utf8() -> Result<()> {
-        test_concat(ScalarValue::Utf8(Some("aa".to_string())), "aaaa")
+    fn test_string_functions() -> Result<()> {
+        generic_string_function(
+            BuiltinScalarFunction::Btrim,
+            vec![
+                lit(ScalarValue::Utf8(Some("xyxtrimyyx".to_string()))),
+                lit(ScalarValue::Utf8(Some("xyz".to_string()))),
+            ],
+            Some("trim"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Btrim,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("xyz".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Btrim,
+            vec![
+                lit(ScalarValue::Utf8(Some("xyxtrimyyx".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Chr,
+            vec![lit(ScalarValue::Int64(Some(128175)))],
+            Some("💯"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Chr,
+            vec![lit(ScalarValue::Int64(None))],
+            None,
+        )?;
+        generic_string_function(

Review comment:
       👍  this pattern of testing using macros is very cool

##########
File path: rust/datafusion/Cargo.toml
##########
@@ -64,6 +64,9 @@ log = "^0.4"
 md-5 = "^0.9.1"
 sha2 = "^0.9.1"
 ordered-float = "2.0"
+unicode-segmentation = "^1.7.1"

Review comment:
       This is something I have been thinking a lot about -- how can we keep DataFusion's dependency stack reasonable (it is already pretty large and it just keeps getting larger). 
   
   One thing I was thinking about was making some of these dependencies optional (so that we had features like `regex` and `unicode` and `hash` which would only pull in the dependencies / have those functions if the features were enabled.
   
   What do you think @jorgecarleitao  / @andygrove / @ovr  ? If it is a reasonable idea (I think we mentioned it before) I will file a JIRA to track?

##########
File path: rust/datafusion/src/physical_plan/string_expressions.rs
##########
@@ -34,40 +38,1167 @@ macro_rules! downcast_vec {
     }};
 }
 
-/// concatenate string columns together.
-pub fn concatenate(args: &[ArrayRef]) -> Result<StringArray> {
+/// Returns the numeric code of the first character of the argument.
+pub fn ascii<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<Int32Array> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| {
+            x.map(|x: &str| {
+                let mut chars = x.chars();
+                chars.next().map_or(0, |v| v as i32)
+            })
+        })
+        .collect())
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the start and end of string.
+pub fn btrim<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    match args.len() {
+        1 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .map(|x| x.map(|x: &str| x.trim()))
+                .collect())
+        }
+        2 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let characters_array: &GenericStringArray<T> = args[1]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if characters_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let chars: Vec<char> =
+                                characters_array.value(i).chars().collect();
+                            x.trim_start_matches(&chars[..])
+                                .trim_end_matches(&chars[..])
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "btrim was called with {} arguments. It requires at most 2.",
+            other
+        ))),
+    }
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i32(args: &[ArrayRef]) -> Result<Int32Array> {
+    let string_array: &GenericStringArray<i32> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i32>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i32))

Review comment:
       TIL "graphemes" -- one of the needs of the new dependency.
   I wonder if there is a reason that `graphemes` is used here but [char_indicies](https://doc.rust-lang.org/std/primitive.str.html#method.char_indices) is used in other functions such as `left` and `right`  to find "character" boundaries
   
   Or put another way I wonder how important it is to distinguish between "characters" and "graphemes" in general and why not always one way or the other. 

##########
File path: rust/datafusion/src/physical_plan/functions.rs
##########
@@ -702,14 +1197,912 @@ mod tests {
         let result = result.as_any().downcast_ref::<StringArray>().unwrap();
 
         // value is correct
-        assert_eq!(result.value(0).to_string(), expected);
+        match expected {
+            Some(v) => assert_eq!(result.value(0), v),
+            None => assert!(result.is_null(0)),
+        };
 
         Ok(())
     }
 
     #[test]
-    fn test_concat_utf8() -> Result<()> {
-        test_concat(ScalarValue::Utf8(Some("aa".to_string())), "aaaa")
+    fn test_string_functions() -> Result<()> {
+        generic_string_function(
+            BuiltinScalarFunction::Btrim,
+            vec![
+                lit(ScalarValue::Utf8(Some("xyxtrimyyx".to_string()))),
+                lit(ScalarValue::Utf8(Some("xyz".to_string()))),
+            ],
+            Some("trim"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Btrim,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("xyz".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Btrim,
+            vec![
+                lit(ScalarValue::Utf8(Some("xyxtrimyyx".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Chr,
+            vec![lit(ScalarValue::Int64(Some(128175)))],
+            Some("💯"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Chr,
+            vec![lit(ScalarValue::Int64(None))],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Chr,
+            vec![lit(ScalarValue::Int64(Some(120)))],
+            Some("x"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Chr,
+            vec![lit(ScalarValue::Int64(Some(128175)))],
+            Some("💯"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Chr,
+            vec![lit(ScalarValue::Int64(None))],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Concat,
+            vec![
+                lit(ScalarValue::Utf8(Some("aa".to_string()))),
+                lit(ScalarValue::Utf8(Some("bb".to_string()))),
+                lit(ScalarValue::Utf8(Some("cc".to_string()))),
+            ],
+            Some("aabbcc"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Concat,
+            vec![
+                lit(ScalarValue::Utf8(Some("aa".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("cc".to_string()))),
+            ],
+            Some("aacc"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Concat,
+            vec![lit(ScalarValue::Utf8(None))],
+            Some(""),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::ConcatWithSeparator,
+            vec![
+                lit(ScalarValue::Utf8(Some("|".to_string()))),
+                lit(ScalarValue::Utf8(Some("aa".to_string()))),
+                lit(ScalarValue::Utf8(Some("bb".to_string()))),
+                lit(ScalarValue::Utf8(Some("cc".to_string()))),
+            ],
+            Some("aa|bb|cc"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::ConcatWithSeparator,
+            vec![
+                lit(ScalarValue::Utf8(Some("|".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+            ],
+            Some(""),

Review comment:
       that is interesting that an input of `Null` produces an empty string

##########
File path: rust/datafusion/src/physical_plan/string_expressions.rs
##########
@@ -34,40 +38,1167 @@ macro_rules! downcast_vec {
     }};
 }
 
-/// concatenate string columns together.
-pub fn concatenate(args: &[ArrayRef]) -> Result<StringArray> {
+/// Returns the numeric code of the first character of the argument.
+pub fn ascii<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<Int32Array> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| {
+            x.map(|x: &str| {
+                let mut chars = x.chars();
+                chars.next().map_or(0, |v| v as i32)
+            })
+        })
+        .collect())
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the start and end of string.
+pub fn btrim<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    match args.len() {
+        1 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .map(|x| x.map(|x: &str| x.trim()))
+                .collect())
+        }
+        2 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let characters_array: &GenericStringArray<T> = args[1]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if characters_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let chars: Vec<char> =
+                                characters_array.value(i).chars().collect();
+                            x.trim_start_matches(&chars[..])
+                                .trim_end_matches(&chars[..])
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "btrim was called with {} arguments. It requires at most 2.",
+            other
+        ))),
+    }
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i32(args: &[ArrayRef]) -> Result<Int32Array> {
+    let string_array: &GenericStringArray<i32> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i32>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i32))
+        .collect())
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i64(args: &[ArrayRef]) -> Result<Int64Array> {
+    let string_array: &GenericStringArray<i64> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i64>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i64))
+        .collect())
+}
+
+/// Returns the character with the given code.
+pub fn chr(args: &[ArrayRef]) -> Result<StringArray> {
+    let integer_array: &Int64Array =
+        args[0].as_any().downcast_ref::<Int64Array>().unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    integer_array
+        .iter()
+        .map(|x: Option<i64>| {
+            x.map(|x| {
+                if x == 0 {
+                    Err(DataFusionError::Internal(
+                        "null character not permitted.".to_string(),
+                    ))
+                } else {
+                    match core::char::from_u32(x as u32) {
+                        Some(x) => Ok(x.to_string()),
+                        None => Err(DataFusionError::Internal(
+                            "requested character too large for encoding.".to_string(),
+                        )),
+                    }
+                }
+            })
+            .transpose()
+        })
+        .collect()
+}
+
+/// Concatenates the text representations of all the arguments. NULL arguments are ignored.
+pub fn concat(args: &[ArrayRef]) -> Result<StringArray> {
     // downcast all arguments to strings
     let args = downcast_vec!(args, StringArray).collect::<Result<Vec<&StringArray>>>()?;
     // do not accept 0 arguments.
     if args.is_empty() {
         return Err(DataFusionError::Internal(
-            "Concatenate was called with 0 arguments. It requires at least one."
-                .to_string(),
+            "concat was called with 0 arguments. It requires at least 2.".to_string(),
         ));
     }
 
-    let mut builder = StringBuilder::new(args.len());
-    // for each entry in the array
-    for index in 0..args[0].len() {
-        let mut owned_string: String = "".to_owned();
-
-        // if any is null, the result is null
-        let mut is_null = false;
-        for arg in &args {
-            if arg.is_null(index) {
-                is_null = true;
-                break; // short-circuit as we already know the result
+    Ok((0..args[0].len())
+        .map(|index| {
+            let mut owned_string: String = "".to_owned();
+            for arg in &args {
+                if arg.is_valid(index) {
+                    owned_string.push_str(&arg.value(index));
+                };
+            }
+            Some(owned_string)
+        })
+        .collect())
+}
+
+/// Concatenates all but the first argument, with separators. The first argument is used as the separator string, and should not be NULL. Other NULL arguments are ignored.
+pub fn concat_ws(args: &[ArrayRef]) -> Result<StringArray> {
+    // downcast all arguments to strings
+    let args = downcast_vec!(args, StringArray).collect::<Result<Vec<&StringArray>>>()?;
+    // do not accept 0 or 1 arguments.
+    if args.len() < 2 {
+        return Err(DataFusionError::Internal(format!(
+            "concat_ws was called with {} arguments. It requires at least 2.",
+            args.len()
+        )));
+    }
+
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(args[0]
+        .iter()
+        .enumerate()
+        .map(|(index, x)| {
+            x.map(|sep: &str| {
+                let mut owned_string: String = "".to_owned();
+                for arg_index in 1..args.len() {
+                    let arg = &args[arg_index];
+                    if !arg.is_null(index) {
+                        owned_string.push_str(&arg.value(index));
+                        // if not last push separator
+                        if arg_index != args.len() - 1 {
+                            owned_string.push_str(&sep);
+                        }
+                    }
+                }
+                owned_string
+            })
+        })
+        .collect())
+}
+
+/// Converts the first letter of each word to upper case and the rest to lower case. Words are sequences of alphanumeric characters separated by non-alphanumeric characters.
+pub fn initcap<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| {
+            x.map(|x: &str| {
+                let mut char_vector = Vec::<char>::new();
+                let mut wasalnum = false;
+                for c in x.chars() {
+                    if wasalnum {
+                        char_vector.push(c.to_ascii_lowercase());

Review comment:
       I wonder if `initcap` in postgres handles Unicode upper casing as well? As in should this function be uppercasing according to unicode rather than ASCII?

##########
File path: rust/arrow/src/compute/kernels/bit_length.rs
##########
@@ -0,0 +1,210 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Defines kernel for length of a string array
+
+use crate::{array::*, buffer::Buffer};
+use crate::{
+    datatypes::DataType,
+    error::{ArrowError, Result},
+};
+use std::sync::Arc;
+
+fn bit_length_string<OffsetSize>(array: &Array, data_type: DataType) -> Result<ArrayRef>
+where
+    OffsetSize: OffsetSizeTrait,
+{
+    // note: offsets are stored as u8, but they can be interpreted as OffsetSize
+    let offsets = &array.data_ref().buffers()[0];
+    // this is a 30% improvement over iterating over u8s and building OffsetSize, which
+    // justifies the usage of `unsafe`.
+    let slice: &[OffsetSize] =
+        &unsafe { offsets.typed_data::<OffsetSize>() }[array.offset()..];
+
+    let bit_size = OffsetSize::from_usize(8).unwrap();
+    let lengths = slice
+        .windows(2)
+        .map(|offset| (offset[1] - offset[0]) * bit_size);
+
+    // JUSTIFICATION
+    //  Benefit
+    //      ~60% speedup
+    //  Soundness
+    //      `values` is an iterator with a known size.
+    let buffer = unsafe { Buffer::from_trusted_len_iter(lengths) };
+
+    let null_bit_buffer = array
+        .data_ref()
+        .null_bitmap()
+        .as_ref()
+        .map(|b| b.bits.clone());
+
+    let data = ArrayData::new(
+        data_type,
+        array.len(),
+        None,
+        null_bit_buffer,
+        0,
+        vec![buffer],
+        vec![],
+    );
+    Ok(make_array(Arc::new(data)))
+}
+
+/// Returns an array of Int32/Int64 denoting the number of bits in each string in the array.
+///
+/// * this only accepts StringArray/Utf8 and LargeString/LargeUtf8
+/// * bit_length of null is null.
+/// * bit_length is in number of bits
+pub fn bit_length(array: &Array) -> Result<ArrayRef> {
+    match array.data_type() {
+        DataType::Utf8 => bit_length_string::<i32>(array, DataType::Int32),
+        DataType::LargeUtf8 => bit_length_string::<i64>(array, DataType::Int64),
+        _ => Err(ArrowError::ComputeError(format!(
+            "bit_length not supported for {:?}",
+            array.data_type()
+        ))),
+    }
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    fn cases() -> Vec<(Vec<&'static str>, usize, Vec<i32>)> {
+        fn double_vec<T: Clone>(v: Vec<T>) -> Vec<T> {
+            [&v[..], &v[..]].concat()
+        }
+
+        // a large array
+        let mut values = vec!["one", "on", "o", ""];
+        let mut expected = vec![24, 16, 8, 0];
+        for _ in 0..10 {
+            values = double_vec(values);
+            expected = double_vec(expected);
+        }
+
+        vec![
+            (vec!["hello", " ", "world"], 3, vec![40, 8, 40]),

Review comment:
       What value does this case add over the next one (with a `"!"`) -- as in why have both?

##########
File path: rust/datafusion/src/physical_plan/functions.rs
##########
@@ -259,6 +392,76 @@ pub fn return_type(
                 ));
             }
         }),
+        BuiltinScalarFunction::OctetLength => Ok(match arg_types[0] {
+            DataType::LargeUtf8 => DataType::Int64,
+            DataType::Utf8 => DataType::Int32,
+            _ => {
+                // this error is internal as `data_types` should have captured this.
+                return Err(DataFusionError::Internal(
+                    "The length function can only accept strings.".to_string(),

Review comment:
       ```suggestion
                       "The octet_length function can only accept strings.".to_string(),
   ```

##########
File path: rust/datafusion/tests/sql.rs
##########
@@ -1986,157 +1975,377 @@ async fn csv_group_by_date() -> Result<()> {
     Ok(())
 }
 
-#[tokio::test]
-async fn string_expressions() -> Result<()> {
-    let mut ctx = ExecutionContext::new();
-    let sql = "SELECT
-        char_length('tom') AS char_length
-        ,char_length(NULL) AS char_length_null
-        ,character_length('tom') AS character_length
-        ,character_length(NULL) AS character_length_null
-        ,lower('TOM') AS lower
-        ,lower(NULL) AS lower_null
-        ,upper('tom') AS upper
-        ,upper(NULL) AS upper_null
-        ,trim(' tom ') AS trim
-        ,trim(NULL) AS trim_null
-        ,ltrim(' tom ') AS trim_left
-        ,rtrim(' tom ') AS trim_right
-    ";
-    let actual = execute(&mut ctx, sql).await;
-
-    let expected = vec![vec![
-        "3", "NULL", "3", "NULL", "tom", "NULL", "TOM", "NULL", "tom", "NULL", "tom ",
-        " tom",
-    ]];
-    assert_eq!(expected, actual);
-    Ok(())
-}
-
-#[tokio::test]
-async fn boolean_expressions() -> Result<()> {
-    let mut ctx = ExecutionContext::new();
-    let sql = "SELECT
-        true AS val_1,
-        false AS val_2
-    ";
-    let actual = execute(&mut ctx, sql).await;
-
-    let expected = vec![vec!["true", "false"]];
-    assert_eq!(expected, actual);
-    Ok(())
-}
-
-#[tokio::test]
-async fn interval_expressions() -> Result<()> {
-    let mut ctx = ExecutionContext::new();
-    let sql = "SELECT
-        (interval '1') as interval_1,
-        (interval '1 second') as interval_2,
-        (interval '500 milliseconds') as interval_3,
-        (interval '5 second') as interval_4,
-        (interval '1 minute') as interval_5,
-        (interval '0.5 minute') as interval_6,
-        (interval '.5 minute') as interval_7,
-        (interval '5 minute') as interval_8,
-        (interval '5 minute 1 second') as interval_9,
-        (interval '1 hour') as interval_10,
-        (interval '5 hour') as interval_11,
-        (interval '1 day') as interval_12,
-        (interval '1 day 1') as interval_13,
-        (interval '0.5') as interval_14,
-        (interval '0.5 day 1') as interval_15,
-        (interval '0.49 day') as interval_16,
-        (interval '0.499 day') as interval_17,
-        (interval '0.4999 day') as interval_18,
-        (interval '0.49999 day') as interval_19,
-        (interval '0.49999999999 day') as interval_20,
-        (interval '5 day') as interval_21,
-        (interval '5 day 4 hours 3 minutes 2 seconds 100 milliseconds') as interval_22,
-        (interval '0.5 month') as interval_23,
-        (interval '1 month') as interval_24,
-        (interval '5 month') as interval_25,
-        (interval '13 month') as interval_26,
-        (interval '0.5 year') as interval_27,
-        (interval '1 year') as interval_28,
-        (interval '2 year') as interval_29
-    ";
-    let actual = execute(&mut ctx, sql).await;
-
-    let expected = vec![vec![
-        "0 years 0 mons 0 days 0 hours 0 mins 1.00 secs",
-        "0 years 0 mons 0 days 0 hours 0 mins 1.00 secs",
-        "0 years 0 mons 0 days 0 hours 0 mins 0.500 secs",
-        "0 years 0 mons 0 days 0 hours 0 mins 5.00 secs",
-        "0 years 0 mons 0 days 0 hours 1 mins 0.00 secs",
-        "0 years 0 mons 0 days 0 hours 0 mins 30.00 secs",
-        "0 years 0 mons 0 days 0 hours 0 mins 30.00 secs",
-        "0 years 0 mons 0 days 0 hours 5 mins 0.00 secs",
-        "0 years 0 mons 0 days 0 hours 5 mins 1.00 secs",
-        "0 years 0 mons 0 days 1 hours 0 mins 0.00 secs",
-        "0 years 0 mons 0 days 5 hours 0 mins 0.00 secs",
-        "0 years 0 mons 1 days 0 hours 0 mins 0.00 secs",
-        "0 years 0 mons 1 days 0 hours 0 mins 1.00 secs",
-        "0 years 0 mons 0 days 0 hours 0 mins 0.500 secs",
-        "0 years 0 mons 0 days 12 hours 0 mins 1.00 secs",
-        "0 years 0 mons 0 days 11 hours 45 mins 36.00 secs",
-        "0 years 0 mons 0 days 11 hours 58 mins 33.596 secs",
-        "0 years 0 mons 0 days 11 hours 59 mins 51.364 secs",
-        "0 years 0 mons 0 days 11 hours 59 mins 59.136 secs",
-        "0 years 0 mons 0 days 12 hours 0 mins 0.00 secs",
-        "0 years 0 mons 5 days 0 hours 0 mins 0.00 secs",
-        "0 years 0 mons 5 days 4 hours 3 mins 2.100 secs",
-        "0 years 0 mons 15 days 0 hours 0 mins 0.00 secs",
-        "0 years 1 mons 0 days 0 hours 0 mins 0.00 secs",
-        "0 years 5 mons 0 days 0 hours 0 mins 0.00 secs",
-        "1 years 1 mons 0 days 0 hours 0 mins 0.00 secs",
-        "0 years 6 mons 0 days 0 hours 0 mins 0.00 secs",
-        "1 years 0 mons 0 days 0 hours 0 mins 0.00 secs",
-        "2 years 0 mons 0 days 0 hours 0 mins 0.00 secs",
-    ]];
-    assert_eq!(expected, actual);
-    Ok(())
-}
-
-#[tokio::test]
-async fn crypto_expressions() -> Result<()> {
-    let mut ctx = ExecutionContext::new();
-    let sql = "SELECT
-        md5('tom') AS md5_tom,
-        md5('') AS md5_empty_str,
-        md5(null) AS md5_null,
-        sha224('tom') AS sha224_tom,
-        sha224('') AS sha224_empty_str,
-        sha224(null) AS sha224_null,
-        sha256('tom') AS sha256_tom,
-        sha256('') AS sha256_empty_str,
-        sha384('tom') AS sha348_tom,
-        sha384('') AS sha384_empty_str,
-        sha512('tom') AS sha512_tom,
-        sha512('') AS sha512_empty_str
-    ";
-    let actual = execute(&mut ctx, sql).await;
-
-    let expected = vec![vec![
-        "34b7da764b21d298ef307d04d8152dc5",
-        "d41d8cd98f00b204e9800998ecf8427e",
-        "NULL",
-        "0bf6cb62649c42a9ae3876ab6f6d92ad36cb5414e495f8873292be4d",
-        "d14a028c2a3a2bc9476102bb288234c415a2b01f828ea62ac5b3e42f",
-        "NULL",
-        "e1608f75c5d7813f3d4031cb30bfb786507d98137538ff8e128a6ff74e84e643",
-        "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
-        "096f5b68aa77848e4fdf5c1c0b350de2dbfad60ffd7c25d9ea07c6c19b8a4d55a9187eb117c557883f58c16dfac3e343",
-        "38b060a751ac96384cd9327eb1b1e36a21fdb71114be07434c0cc7bf63f6e1da274edebfe76f65fbd51ad2f14898b95b",
-        "6e1b9b3fe840680e37051f7ad5e959d6f39ad0f8885d855166f55c659469d3c8b78118c44a2a49c72ddb481cd6d8731034e11cc030070ba843a90b3495cb8d3e",
-        "cf83e1357eefb8bdf1542850d66d8007d620e4050b5715dc83f4a921d36ce9ce47d0d13c5d85f2b0ff8318d2877eec2f63b931bd47417a81a538327af927da3e"
-    ]];
-    assert_eq!(expected, actual);
+macro_rules! test_expression {
+    ($SQL:expr, $EXPECTED:expr) => {
+        let mut ctx = ExecutionContext::new();
+        println!("test_expression: {}", $SQL);
+        let sql = format!("SELECT {}", $SQL);
+        let actual = execute(&mut ctx, sql.as_str()).await;
+        assert_eq!($EXPECTED, actual[0][0]);
+    };
+}
+
+#[tokio::test]
+async fn test_string_expressions() -> Result<()> {
+    test_expression!("ascii('')", "0");

Review comment:
       ❤️ 

##########
File path: rust/datafusion/src/physical_plan/type_coercion.rs
##########
@@ -69,13 +69,42 @@ pub fn data_types(
     signature: &Signature,
 ) -> Result<Vec<DataType>> {
     let valid_types = match signature {
-        Signature::Variadic(valid_types) => valid_types
+        Signature::Any(number) => {
+            if current_types.len() != *number {
+                return Err(DataFusionError::Plan(format!(
+                    "The function expected {} arguments but received {}",
+                    number,
+                    current_types.len()
+                )));
+            }
+            vec![(0..*number).map(|i| current_types[i].clone()).collect()]
+        }
+        Signature::Exact(valid_types) => vec![valid_types.clone()],
+        Signature::Uniform(valid_types) => {
+            let valid_signature = valid_types
+                .iter()
+                .filter(|x| x.len() == current_types.len())
+                .collect::<Vec<_>>();
+            if valid_signature.len() != 1 {
+                return Err(DataFusionError::Plan(format!(
+                    "The function expected {} arguments but received {}",
+                    valid_types
+                        .iter()
+                        .map(|x| x.len().to_string())
+                        .collect::<Vec<_>>()
+                        .join(" or "),
+                    current_types.len()
+                )));
+            }
+            cartesian_product(valid_signature.first().unwrap())

Review comment:
       As @jorgecarleitao says, this is another change from this PR that would be great to break out into its own PR. 

##########
File path: rust/datafusion/src/physical_plan/functions.rs
##########
@@ -702,14 +1197,912 @@ mod tests {
         let result = result.as_any().downcast_ref::<StringArray>().unwrap();
 
         // value is correct
-        assert_eq!(result.value(0).to_string(), expected);
+        match expected {
+            Some(v) => assert_eq!(result.value(0), v),
+            None => assert!(result.is_null(0)),
+        };
 
         Ok(())
     }
 
     #[test]
-    fn test_concat_utf8() -> Result<()> {
-        test_concat(ScalarValue::Utf8(Some("aa".to_string())), "aaaa")
+    fn test_string_functions() -> Result<()> {
+        generic_string_function(
+            BuiltinScalarFunction::Btrim,
+            vec![
+                lit(ScalarValue::Utf8(Some("xyxtrimyyx".to_string()))),
+                lit(ScalarValue::Utf8(Some("xyz".to_string()))),
+            ],
+            Some("trim"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Btrim,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("xyz".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Btrim,
+            vec![
+                lit(ScalarValue::Utf8(Some("xyxtrimyyx".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Chr,
+            vec![lit(ScalarValue::Int64(Some(128175)))],
+            Some("💯"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Chr,
+            vec![lit(ScalarValue::Int64(None))],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Chr,
+            vec![lit(ScalarValue::Int64(Some(120)))],
+            Some("x"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Chr,
+            vec![lit(ScalarValue::Int64(Some(128175)))],
+            Some("💯"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Chr,
+            vec![lit(ScalarValue::Int64(None))],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Concat,
+            vec![
+                lit(ScalarValue::Utf8(Some("aa".to_string()))),
+                lit(ScalarValue::Utf8(Some("bb".to_string()))),
+                lit(ScalarValue::Utf8(Some("cc".to_string()))),
+            ],
+            Some("aabbcc"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Concat,
+            vec![
+                lit(ScalarValue::Utf8(Some("aa".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("cc".to_string()))),
+            ],
+            Some("aacc"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Concat,
+            vec![lit(ScalarValue::Utf8(None))],
+            Some(""),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::ConcatWithSeparator,
+            vec![
+                lit(ScalarValue::Utf8(Some("|".to_string()))),
+                lit(ScalarValue::Utf8(Some("aa".to_string()))),
+                lit(ScalarValue::Utf8(Some("bb".to_string()))),
+                lit(ScalarValue::Utf8(Some("cc".to_string()))),
+            ],
+            Some("aa|bb|cc"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::ConcatWithSeparator,
+            vec![
+                lit(ScalarValue::Utf8(Some("|".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+            ],
+            Some(""),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::ConcatWithSeparator,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("aa".to_string()))),
+                lit(ScalarValue::Utf8(Some("bb".to_string()))),
+                lit(ScalarValue::Utf8(Some("cc".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::ConcatWithSeparator,
+            vec![
+                lit(ScalarValue::Utf8(Some("|".to_string()))),
+                lit(ScalarValue::Utf8(Some("aa".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("cc".to_string()))),
+            ],
+            Some("aa|cc"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::InitCap,
+            vec![lit(ScalarValue::Utf8(Some("hi THOMAS".to_string())))],
+            Some("Hi Thomas"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::InitCap,
+            vec![lit(ScalarValue::Utf8(Some("".to_string())))],
+            Some(""),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::InitCap,
+            vec![lit(ScalarValue::Utf8(None))],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Left,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int8(Some(2))),
+            ],
+            Some("ab"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Left,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int64(Some(200))),
+            ],
+            Some("abcde"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Left,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int64(Some(-2))),
+            ],
+            Some("abc"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Left,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int64(Some(-200))),
+            ],
+            Some(""),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Left,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int64(Some(0))),
+            ],
+            Some(""),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Left,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Int64(Some(2))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Left,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int64(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(5))),
+            ],
+            Some("   hi"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(0))),
+            ],
+            Some(""),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Int64(Some(5))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(5))),
+                lit(ScalarValue::Utf8(Some("xy".to_string()))),
+            ],
+            Some("xyxhi"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(21))),
+                lit(ScalarValue::Utf8(Some("abcdef".to_string()))),
+            ],
+            Some("abcdefabcdefabcdefahi"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(5))),
+                lit(ScalarValue::Utf8(Some(" ".to_string()))),
+            ],
+            Some("   hi"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(5))),
+                lit(ScalarValue::Utf8(Some("".to_string()))),
+            ],
+            Some("hi"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Int64(Some(5))),
+                lit(ScalarValue::Utf8(Some("xy".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(None)),
+                lit(ScalarValue::Utf8(Some("xy".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(5))),
+                lit(ScalarValue::Utf8(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Utf8(Some("5".to_string()))),
+                lit(ScalarValue::Utf8(Some("xy".to_string()))),
+            ],
+            Some("xyxhi"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lower,
+            vec![lit(ScalarValue::Utf8(Some("LOWER".to_string())))],
+            Some("lower"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lower,
+            vec![lit(ScalarValue::Utf8(Some("lower".to_string())))],
+            Some("lower"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lower,
+            vec![lit(ScalarValue::Utf8(None))],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Ltrim,
+            vec![lit(ScalarValue::Utf8(Some(" trim".to_string())))],
+            Some("trim"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Ltrim,
+            vec![lit(ScalarValue::Utf8(Some("trim ".to_string())))],
+            Some("trim "),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Ltrim,
+            vec![lit(ScalarValue::Utf8(Some("trim".to_string())))],
+            Some("trim"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Ltrim,
+            vec![lit(ScalarValue::Utf8(None))],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::RegexpReplace,
+            vec![
+                lit(ScalarValue::Utf8(Some("Thomas".to_string()))),
+                lit(ScalarValue::Utf8(Some(".[mN]a.".to_string()))),
+                lit(ScalarValue::Utf8(Some("M".to_string()))),
+            ],
+            Some("ThM"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::RegexpReplace,
+            vec![
+                lit(ScalarValue::Utf8(Some("foobarbaz".to_string()))),
+                lit(ScalarValue::Utf8(Some("b..".to_string()))),
+                lit(ScalarValue::Utf8(Some("X".to_string()))),
+            ],
+            Some("fooXbaz"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::RegexpReplace,
+            vec![
+                lit(ScalarValue::Utf8(Some("foobarbaz".to_string()))),
+                lit(ScalarValue::Utf8(Some("b..".to_string()))),
+                lit(ScalarValue::Utf8(Some("X".to_string()))),
+                lit(ScalarValue::Utf8(Some("g".to_string()))),
+            ],
+            Some("fooXX"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::RegexpReplace,
+            vec![
+                lit(ScalarValue::Utf8(Some("foobarbaz".to_string()))),
+                lit(ScalarValue::Utf8(Some("b(..)".to_string()))),
+                lit(ScalarValue::Utf8(Some("X\\1Y".to_string()))),
+                lit(ScalarValue::Utf8(Some("g".to_string()))),
+            ],
+            Some("fooXarYXazY"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::RegexpReplace,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("b(..)".to_string()))),
+                lit(ScalarValue::Utf8(Some("X\\1Y".to_string()))),
+                lit(ScalarValue::Utf8(Some("g".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::RegexpReplace,
+            vec![
+                lit(ScalarValue::Utf8(Some("foobarbaz".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("X\\1Y".to_string()))),
+                lit(ScalarValue::Utf8(Some("g".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::RegexpReplace,
+            vec![
+                lit(ScalarValue::Utf8(Some("foobarbaz".to_string()))),
+                lit(ScalarValue::Utf8(Some("b(..)".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("g".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::RegexpReplace,
+            vec![
+                lit(ScalarValue::Utf8(Some("foobarbaz".to_string()))),
+                lit(ScalarValue::Utf8(Some("b(..)".to_string()))),
+                lit(ScalarValue::Utf8(Some("X\\1Y".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::RegexpReplace,
+            vec![
+                lit(ScalarValue::Utf8(Some("ABCabcABC".to_string()))),
+                lit(ScalarValue::Utf8(Some("(abc)".to_string()))),
+                lit(ScalarValue::Utf8(Some("X".to_string()))),
+                lit(ScalarValue::Utf8(Some("gi".to_string()))),
+            ],
+            Some("XXX"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::RegexpReplace,
+            vec![
+                lit(ScalarValue::Utf8(Some("ABCabcABC".to_string()))),
+                lit(ScalarValue::Utf8(Some("(abc)".to_string()))),
+                lit(ScalarValue::Utf8(Some("X".to_string()))),
+                lit(ScalarValue::Utf8(Some("i".to_string()))),
+            ],
+            Some("XabcABC"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Repeat,
+            vec![
+                lit(ScalarValue::Utf8(Some("Pg".to_string()))),
+                lit(ScalarValue::Int64(Some(4))),
+            ],
+            Some("PgPgPgPg"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Repeat,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Int64(Some(4))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Repeat,
+            vec![
+                lit(ScalarValue::Utf8(Some("Pg".to_string()))),
+                lit(ScalarValue::Int64(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Replace,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcdefabcdef".to_string()))),
+                lit(ScalarValue::Utf8(Some("cd".to_string()))),
+                lit(ScalarValue::Utf8(Some("XX".to_string()))),
+            ],
+            Some("abXXefabXXef"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Replace,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcdefabcdef".to_string()))),
+                lit(ScalarValue::Utf8(Some("notmatch".to_string()))),
+                lit(ScalarValue::Utf8(Some("XX".to_string()))),
+            ],
+            Some("abcdefabcdef"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Replace,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("cd".to_string()))),
+                lit(ScalarValue::Utf8(Some("XX".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Replace,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcdefabcdef".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("XX".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Replace,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcdefabcdef".to_string()))),
+                lit(ScalarValue::Utf8(Some("cd".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Reverse,
+            vec![lit(ScalarValue::Utf8(Some("abcde".to_string())))],
+            Some("edcba"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Reverse,
+            vec![lit(ScalarValue::Utf8(Some("loẅks".to_string())))],
+            Some("skẅol"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Reverse,
+            vec![lit(ScalarValue::Utf8(None))],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Right,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int8(Some(2))),
+            ],
+            Some("de"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Right,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int64(Some(200))),
+            ],
+            Some("abcde"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Right,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int64(Some(-2))),
+            ],
+            Some("cde"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Right,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int64(Some(-200))),
+            ],
+            Some(""),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Right,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int64(Some(0))),
+            ],
+            Some(""),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Right,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Int64(Some(2))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Right,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int64(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(5))),
+            ],
+            Some("hi   "),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(0))),
+            ],
+            Some(""),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Int64(Some(5))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(5))),
+                lit(ScalarValue::Utf8(Some("xy".to_string()))),
+            ],
+            Some("hixyx"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(21))),
+                lit(ScalarValue::Utf8(Some("abcdef".to_string()))),
+            ],
+            Some("hiabcdefabcdefabcdefa"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(5))),
+                lit(ScalarValue::Utf8(Some(" ".to_string()))),
+            ],
+            Some("hi   "),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(5))),
+                lit(ScalarValue::Utf8(Some("".to_string()))),
+            ],
+            Some("hi"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Int64(Some(5))),
+                lit(ScalarValue::Utf8(Some("xy".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(None)),
+                lit(ScalarValue::Utf8(Some("xy".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(5))),
+                lit(ScalarValue::Utf8(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rtrim,
+            vec![lit(ScalarValue::Utf8(Some("trim ".to_string())))],
+            Some("trim"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rtrim,
+            vec![lit(ScalarValue::Utf8(Some(" trim".to_string())))],
+            Some(" trim"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rtrim,
+            vec![lit(ScalarValue::Utf8(Some("trim".to_string())))],
+            Some("trim"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rtrim,
+            vec![lit(ScalarValue::Utf8(None))],
+            None,
+        )?;
+        generic_string_function(

Review comment:
       the trim functions might benefit from also testing more than one `" "` on each side as well as some other whitespace characters. (like `"\n"`)

##########
File path: rust/datafusion/src/physical_plan/string_expressions.rs
##########
@@ -34,40 +38,1167 @@ macro_rules! downcast_vec {
     }};
 }
 
-/// concatenate string columns together.
-pub fn concatenate(args: &[ArrayRef]) -> Result<StringArray> {
+/// Returns the numeric code of the first character of the argument.
+pub fn ascii<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<Int32Array> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| {
+            x.map(|x: &str| {
+                let mut chars = x.chars();
+                chars.next().map_or(0, |v| v as i32)
+            })
+        })
+        .collect())
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the start and end of string.
+pub fn btrim<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    match args.len() {
+        1 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .map(|x| x.map(|x: &str| x.trim()))
+                .collect())
+        }
+        2 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let characters_array: &GenericStringArray<T> = args[1]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if characters_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let chars: Vec<char> =
+                                characters_array.value(i).chars().collect();
+                            x.trim_start_matches(&chars[..])
+                                .trim_end_matches(&chars[..])
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "btrim was called with {} arguments. It requires at most 2.",
+            other
+        ))),
+    }
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i32(args: &[ArrayRef]) -> Result<Int32Array> {
+    let string_array: &GenericStringArray<i32> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i32>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i32))
+        .collect())
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i64(args: &[ArrayRef]) -> Result<Int64Array> {
+    let string_array: &GenericStringArray<i64> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i64>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i64))
+        .collect())
+}
+
+/// Returns the character with the given code.
+pub fn chr(args: &[ArrayRef]) -> Result<StringArray> {
+    let integer_array: &Int64Array =
+        args[0].as_any().downcast_ref::<Int64Array>().unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    integer_array
+        .iter()
+        .map(|x: Option<i64>| {
+            x.map(|x| {
+                if x == 0 {
+                    Err(DataFusionError::Internal(
+                        "null character not permitted.".to_string(),
+                    ))
+                } else {
+                    match core::char::from_u32(x as u32) {
+                        Some(x) => Ok(x.to_string()),
+                        None => Err(DataFusionError::Internal(
+                            "requested character too large for encoding.".to_string(),
+                        )),
+                    }
+                }
+            })
+            .transpose()
+        })
+        .collect()
+}
+
+/// Concatenates the text representations of all the arguments. NULL arguments are ignored.
+pub fn concat(args: &[ArrayRef]) -> Result<StringArray> {
     // downcast all arguments to strings
     let args = downcast_vec!(args, StringArray).collect::<Result<Vec<&StringArray>>>()?;
     // do not accept 0 arguments.
     if args.is_empty() {
         return Err(DataFusionError::Internal(
-            "Concatenate was called with 0 arguments. It requires at least one."
-                .to_string(),
+            "concat was called with 0 arguments. It requires at least 2.".to_string(),
         ));
     }
 
-    let mut builder = StringBuilder::new(args.len());
-    // for each entry in the array
-    for index in 0..args[0].len() {
-        let mut owned_string: String = "".to_owned();
-
-        // if any is null, the result is null
-        let mut is_null = false;
-        for arg in &args {
-            if arg.is_null(index) {
-                is_null = true;
-                break; // short-circuit as we already know the result
+    Ok((0..args[0].len())
+        .map(|index| {
+            let mut owned_string: String = "".to_owned();
+            for arg in &args {
+                if arg.is_valid(index) {
+                    owned_string.push_str(&arg.value(index));
+                };
+            }
+            Some(owned_string)
+        })
+        .collect())
+}
+
+/// Concatenates all but the first argument, with separators. The first argument is used as the separator string, and should not be NULL. Other NULL arguments are ignored.
+pub fn concat_ws(args: &[ArrayRef]) -> Result<StringArray> {
+    // downcast all arguments to strings
+    let args = downcast_vec!(args, StringArray).collect::<Result<Vec<&StringArray>>>()?;
+    // do not accept 0 or 1 arguments.
+    if args.len() < 2 {
+        return Err(DataFusionError::Internal(format!(
+            "concat_ws was called with {} arguments. It requires at least 2.",
+            args.len()
+        )));
+    }
+
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(args[0]
+        .iter()
+        .enumerate()
+        .map(|(index, x)| {
+            x.map(|sep: &str| {
+                let mut owned_string: String = "".to_owned();
+                for arg_index in 1..args.len() {
+                    let arg = &args[arg_index];
+                    if !arg.is_null(index) {
+                        owned_string.push_str(&arg.value(index));
+                        // if not last push separator
+                        if arg_index != args.len() - 1 {
+                            owned_string.push_str(&sep);
+                        }
+                    }
+                }
+                owned_string
+            })
+        })
+        .collect())
+}
+
+/// Converts the first letter of each word to upper case and the rest to lower case. Words are sequences of alphanumeric characters separated by non-alphanumeric characters.
+pub fn initcap<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| {
+            x.map(|x: &str| {
+                let mut char_vector = Vec::<char>::new();
+                let mut wasalnum = false;
+                for c in x.chars() {
+                    if wasalnum {
+                        char_vector.push(c.to_ascii_lowercase());
+                    } else {
+                        char_vector.push(c.to_ascii_uppercase());
+                    }
+                    wasalnum = ('A'..='Z').contains(&c)
+                        || ('a'..='z').contains(&c)
+                        || ('0'..='9').contains(&c);
+                }
+                char_vector.iter().collect::<String>()
+            })
+        })
+        .collect())
+}
+
+/// Returns first n characters in the string, or when n is negative, returns all but last |n| characters.
+pub fn left<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .ok_or_else(|| {
+            DataFusionError::Internal("could not cast string to StringArray".to_string())
+        })?;
+
+    let n_array: &Int64Array =
+        args[1]
+            .as_any()
+            .downcast_ref::<Int64Array>()
+            .ok_or_else(|| {
+                DataFusionError::Internal("could not cast n to Int64Array".to_string())
+            })?;
+
+    Ok(string_array
+        .iter()
+        .enumerate()
+        .map(|(i, x)| {
+            if n_array.is_null(i) {
+                None
             } else {
-                owned_string.push_str(&arg.value(index));
+                x.map(|x: &str| {
+                    let n: i64 = n_array.value(i);
+                    match n.cmp(&0) {
+                        Ordering::Equal => "",
+                        Ordering::Greater => {
+                            x.char_indices().nth(n as usize).map_or(x, |(i, _)| &x[..i])
+                        }
+                        Ordering::Less => x
+                            .char_indices()
+                            .rev()
+                            .nth(n.abs() as usize)
+                            .map_or("", |(i, _)| &x[..i + 1]),
+                    }
+                })
             }
+        })
+        .collect())
+}
+
+/// Extends the string to length length by prepending the characters fill (a space by default). If the string is already longer than length then it is truncated (on the right).
+pub fn lpad<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    match args.len() {
+        2 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let length_array: &Int64Array = args[1]
+                .as_any()
+                .downcast_ref::<Int64Array>()
+                .ok_or_else(|| {
+                    DataFusionError::Internal(
+                        "could not cast length to Int64Array".to_string(),
+                    )
+                })?;
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if length_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let length = length_array.value(i) as usize;
+                            if length == 0 {
+                                "".to_string()
+                            } else if length < x.len() {
+                                x[..length].to_string()
+                            } else {
+                                let mut s = x.to_string();
+                                s.insert_str(0, " ".repeat(length - x.len()).as_str());
+                                s
+                            }
+                        })
+                    }
+                })
+                .collect())
         }
-        if is_null {
-            builder.append_null()?;
-        } else {
-            builder.append_value(&owned_string)?;
+        3 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let length_array: &Int64Array =
+                args[1].as_any().downcast_ref::<Int64Array>().unwrap();
+
+            let fill_array: &GenericStringArray<T> = args[2]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if length_array.is_null(i) || fill_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let length = length_array.value(i) as usize;
+                            let fill_chars =
+                                fill_array.value(i).chars().collect::<Vec<char>>();
+                            if length == 0 {
+                                "".to_string()
+                            } else if length < x.len() {
+                                x[..length].to_string()
+                            } else if fill_chars.is_empty() {
+                                x.to_string()
+                            } else {
+                                let mut s = x.to_string();
+                                let mut char_vector =
+                                    Vec::<char>::with_capacity(length - x.len());
+                                for l in 0..length - x.len() {
+                                    char_vector.push(
+                                        *fill_chars.get(l % fill_chars.len()).unwrap(),
+                                    );
+                                }
+                                s.insert_str(
+                                    0,
+                                    char_vector.iter().collect::<String>().as_str(),
+                                );
+                                s
+                            }
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "lpad was called with {} arguments. It requires at least 2 and at most 3.",
+            other
+        ))),
+    }
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the start of string.
+pub fn ltrim<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    match args.len() {
+        1 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .map(|x| x.map(|x: &str| x.trim_start()))
+                .collect())
+        }
+        2 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let characters_array: &GenericStringArray<T> = args[1]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if characters_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let chars: Vec<char> =
+                                characters_array.value(i).chars().collect();
+                            x.trim_start_matches(&chars[..])
+                        })
+                    }
+                })
+                .collect())
         }
+        other => Err(DataFusionError::Internal(format!(
+            "ltrim was called with {} arguments. It requires at most 2.",
+            other
+        ))),
+    }
+}
+
+/// Repeats string the specified number of times.
+pub fn repeat<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    let number_array: &Int64Array =
+        args[1].as_any().downcast_ref::<Int64Array>().unwrap();
+
+    Ok(string_array
+        .iter()
+        .enumerate()
+        .map(|(i, x)| {
+            if number_array.is_null(i) {
+                None
+            } else {
+                x.map(|x: &str| x.repeat(number_array.value(i) as usize))
+            }
+        })
+        .collect())
+}
+
+/// Replaces all occurrences in string of substring from with substring to.
+pub fn replace<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    let from_array: &GenericStringArray<T> = args[1]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    let to_array: &GenericStringArray<T> = args[2]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    Ok(string_array
+        .iter()
+        .enumerate()
+        .map(|(i, x)| {
+            if from_array.is_null(i) || to_array.is_null(i) {
+                None
+            } else {
+                x.map(|x: &str| x.replace(from_array.value(i), to_array.value(i)))
+            }
+        })
+        .collect())
+}
+
+// used to replace POSIX capture groups (like \1) with Rust Regex group (like ${1})
+fn regex_replace_posix_groups(replacement: &str) -> String {
+    lazy_static! {
+        static ref CAPTURE_GROUPS_RE: Regex = Regex::new("(\\\\)(\\d*)").unwrap();
     }
-    Ok(builder.finish())
+    CAPTURE_GROUPS_RE
+        .replace_all(replacement, "$${$2}")
+        .into_owned()
+}
+
+/// Replaces substring(s) matching a POSIX regular expression
+pub fn regexp_replace<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    // creating Regex is expensive so create hashmap for memoization
+    let mut patterns: HashMap<String, Regex> = HashMap::new();
+
+    match args.len() {
+        3 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let pattern_array: &StringArray = args[1]
+                .as_any()
+                .downcast_ref::<StringArray>()
+                .unwrap();
+
+            let replacement_array: &StringArray = args[2]
+                .as_any()
+                .downcast_ref::<StringArray>()
+                .unwrap();
+
+            string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if pattern_array.is_null(i) || replacement_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let pattern = pattern_array.value(i).to_string();
+                            let replacement = regex_replace_posix_groups(replacement_array.value(i));
+                            let re = match patterns.get(pattern_array.value(i)) {
+                                Some(re) => Ok(re.clone()),
+                                None => {
+                                    match Regex::new(pattern.as_str()) {
+                                        Ok(re) => {
+                                            patterns.insert(pattern, re.clone());
+                                            Ok(re)
+                                        },
+                                        Err(err) => Err(DataFusionError::Execution(err.to_string())),
+                                    }
+                                }
+                            };
+                            re.map(|re| re.replace(x, replacement.as_str()))
+                        })
+                    }.transpose()
+                })
+                .collect()
+        }
+        4 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let pattern_array: &StringArray = args[1]
+                .as_any()
+                .downcast_ref::<StringArray>()
+                .unwrap();
+
+            let replacement_array: &StringArray = args[2]
+                .as_any()
+                .downcast_ref::<StringArray>()
+                .unwrap();
+
+            let flags_array: &StringArray = args[3]
+                .as_any()
+                .downcast_ref::<StringArray>()
+                .unwrap();
+
+            string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if pattern_array.is_null(i) || replacement_array.is_null(i) || flags_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let replacement = regex_replace_posix_groups(replacement_array.value(i));
+
+                            let flags = flags_array.value(i);
+                            let (pattern, replace_all) = if flags == "g" {
+                                (pattern_array.value(i).to_string(), true)
+                            } else if flags.contains('g') {
+                                (format!("(?{}){}", flags.to_string().replace("g", ""), pattern_array.value(i)), true)
+                            } else {
+                                (format!("(?{}){}", flags, pattern_array.value(i)), false)
+                            };
+
+                            let re = match patterns.get(pattern_array.value(i)) {
+                                Some(re) => Ok(re.clone()),
+                                None => {
+                                    match Regex::new(pattern.as_str()) {
+                                        Ok(re) => {
+                                            patterns.insert(pattern, re.clone());
+                                            Ok(re)
+                                        },
+                                        Err(err) => Err(DataFusionError::Execution(err.to_string())),
+                                    }
+                                }
+                            };
+
+                            re.map(|re| {
+                                if replace_all {
+                                    re.replace_all(x, replacement.as_str())
+                                } else {
+                                    re.replace(x, replacement.as_str())
+                                }
+                            })
+                        })
+                    }.transpose()
+                })
+                .collect()
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "regexp_replace was called with {} arguments. It requires at least 3 and at most 4.",
+            other
+        ))),
+    }
+}
+
+/// Reverses the order of the characters in the string.
+pub fn reverse<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).rev().collect::<String>()))
+        .collect())
+}
+
+/// Returns last n characters in the string, or when n is negative, returns all but first |n| characters.
+pub fn right<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .ok_or_else(|| {
+            DataFusionError::Internal("could not cast string to StringArray".to_string())
+        })?;
+
+    let n_array: &Int64Array =
+        args[1]
+            .as_any()
+            .downcast_ref::<Int64Array>()
+            .ok_or_else(|| {
+                DataFusionError::Internal("could not cast n to Int64Array".to_string())
+            })?;
+
+    Ok(string_array
+        .iter()
+        .enumerate()
+        .map(|(i, x)| {
+            if n_array.is_null(i) {
+                None
+            } else {
+                x.map(|x: &str| {
+                    let n: i64 = n_array.value(i);
+                    match n.cmp(&0) {
+                        Ordering::Equal => "",
+                        Ordering::Greater => x
+                            .char_indices()
+                            .nth(n as usize)
+                            .map_or(x, |(i, _)| &x[i + 1..]),
+                        Ordering::Less => x
+                            .char_indices()
+                            .rev()
+                            .nth(n.abs() as usize)
+                            .map_or("", |(i, _)| &x[i..]),
+                    }
+                })
+            }
+        })
+        .collect())
+}
+
+/// Extends the string to length length by appending the characters fill (a space by default). If the string is already longer than length then it is truncated.
+pub fn rpad<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    match args.len() {
+        2 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let length_array: &Int64Array = args[1]
+                .as_any()
+                .downcast_ref::<Int64Array>()
+                .ok_or_else(|| {
+                    DataFusionError::Internal(
+                        "could not cast length to Int64Array".to_string(),
+                    )
+                })?;
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if length_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let length = length_array.value(i) as usize;
+                            if length == 0 {
+                                "".to_string()
+                            } else if length < x.len() {
+                                x[..length].to_string()
+                            } else {
+                                let mut s = x.to_string();
+                                s.push_str(" ".repeat(length - x.len()).as_str());
+                                s
+                            }
+                        })
+                    }
+                })
+                .collect())
+        }
+        3 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let length_array: &Int64Array =
+                args[1].as_any().downcast_ref::<Int64Array>().unwrap();
+
+            let fill_array: &GenericStringArray<T> = args[2]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if length_array.is_null(i) || fill_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let length = length_array.value(i) as usize;
+                            let fill_chars =
+                                fill_array.value(i).chars().collect::<Vec<char>>();
+                            if length == 0 {
+                                "".to_string()
+                            } else if length < x.len() {
+                                x[..length].to_string()
+                            } else if fill_chars.is_empty() {
+                                x.to_string()
+                            } else {
+                                let mut s = x.to_string();
+                                let mut char_vector =
+                                    Vec::<char>::with_capacity(length - x.len());
+                                for l in 0..length - x.len() {
+                                    char_vector.push(
+                                        *fill_chars.get(l % fill_chars.len()).unwrap(),
+                                    );
+                                }
+                                s.push_str(
+                                    char_vector.iter().collect::<String>().as_str(),
+                                );
+                                s
+                            }
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "rpad was called with {} arguments. It requires at least 2 and at most 3.",
+            other
+        ))),
+    }
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the end of string.
+pub fn rtrim<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    match args.len() {
+        1 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .map(|x| x.map(|x: &str| x.trim_end()))
+                .collect())
+        }
+        2 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let characters_array: &GenericStringArray<T> = args[1]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if characters_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let chars: Vec<char> =
+                                characters_array.value(i).chars().collect();
+                            x.trim_end_matches(&chars[..])
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "rtrim was called with {} arguments. It requires at most two.",
+            other
+        ))),
+    }
+}
+
+/// Repeats string the specified number of times.

Review comment:
       ```suggestion
   /// Splits a string using a given delimiter and returns a part by index
   ```

##########
File path: rust/datafusion/src/physical_plan/string_expressions.rs
##########
@@ -34,40 +38,1167 @@ macro_rules! downcast_vec {
     }};
 }
 
-/// concatenate string columns together.
-pub fn concatenate(args: &[ArrayRef]) -> Result<StringArray> {
+/// Returns the numeric code of the first character of the argument.
+pub fn ascii<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<Int32Array> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| {
+            x.map(|x: &str| {
+                let mut chars = x.chars();
+                chars.next().map_or(0, |v| v as i32)
+            })
+        })
+        .collect())
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the start and end of string.
+pub fn btrim<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    match args.len() {
+        1 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .map(|x| x.map(|x: &str| x.trim()))
+                .collect())
+        }
+        2 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let characters_array: &GenericStringArray<T> = args[1]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if characters_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let chars: Vec<char> =
+                                characters_array.value(i).chars().collect();
+                            x.trim_start_matches(&chars[..])
+                                .trim_end_matches(&chars[..])
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "btrim was called with {} arguments. It requires at most 2.",
+            other
+        ))),
+    }
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i32(args: &[ArrayRef]) -> Result<Int32Array> {
+    let string_array: &GenericStringArray<i32> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i32>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i32))
+        .collect())
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i64(args: &[ArrayRef]) -> Result<Int64Array> {
+    let string_array: &GenericStringArray<i64> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i64>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i64))
+        .collect())
+}
+
+/// Returns the character with the given code.
+pub fn chr(args: &[ArrayRef]) -> Result<StringArray> {
+    let integer_array: &Int64Array =
+        args[0].as_any().downcast_ref::<Int64Array>().unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    integer_array
+        .iter()
+        .map(|x: Option<i64>| {
+            x.map(|x| {
+                if x == 0 {
+                    Err(DataFusionError::Internal(
+                        "null character not permitted.".to_string(),
+                    ))
+                } else {
+                    match core::char::from_u32(x as u32) {
+                        Some(x) => Ok(x.to_string()),
+                        None => Err(DataFusionError::Internal(
+                            "requested character too large for encoding.".to_string(),
+                        )),
+                    }
+                }
+            })
+            .transpose()
+        })
+        .collect()
+}
+
+/// Concatenates the text representations of all the arguments. NULL arguments are ignored.
+pub fn concat(args: &[ArrayRef]) -> Result<StringArray> {
     // downcast all arguments to strings
     let args = downcast_vec!(args, StringArray).collect::<Result<Vec<&StringArray>>>()?;
     // do not accept 0 arguments.
     if args.is_empty() {
         return Err(DataFusionError::Internal(
-            "Concatenate was called with 0 arguments. It requires at least one."
-                .to_string(),
+            "concat was called with 0 arguments. It requires at least 2.".to_string(),
         ));
     }
 
-    let mut builder = StringBuilder::new(args.len());
-    // for each entry in the array
-    for index in 0..args[0].len() {
-        let mut owned_string: String = "".to_owned();
-
-        // if any is null, the result is null
-        let mut is_null = false;
-        for arg in &args {
-            if arg.is_null(index) {
-                is_null = true;
-                break; // short-circuit as we already know the result
+    Ok((0..args[0].len())
+        .map(|index| {
+            let mut owned_string: String = "".to_owned();
+            for arg in &args {
+                if arg.is_valid(index) {
+                    owned_string.push_str(&arg.value(index));
+                };
+            }
+            Some(owned_string)
+        })
+        .collect())
+}
+
+/// Concatenates all but the first argument, with separators. The first argument is used as the separator string, and should not be NULL. Other NULL arguments are ignored.
+pub fn concat_ws(args: &[ArrayRef]) -> Result<StringArray> {
+    // downcast all arguments to strings
+    let args = downcast_vec!(args, StringArray).collect::<Result<Vec<&StringArray>>>()?;
+    // do not accept 0 or 1 arguments.
+    if args.len() < 2 {
+        return Err(DataFusionError::Internal(format!(
+            "concat_ws was called with {} arguments. It requires at least 2.",
+            args.len()
+        )));
+    }
+
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(args[0]
+        .iter()
+        .enumerate()
+        .map(|(index, x)| {
+            x.map(|sep: &str| {
+                let mut owned_string: String = "".to_owned();
+                for arg_index in 1..args.len() {
+                    let arg = &args[arg_index];
+                    if !arg.is_null(index) {
+                        owned_string.push_str(&arg.value(index));
+                        // if not last push separator
+                        if arg_index != args.len() - 1 {
+                            owned_string.push_str(&sep);
+                        }
+                    }
+                }
+                owned_string
+            })
+        })
+        .collect())
+}
+
+/// Converts the first letter of each word to upper case and the rest to lower case. Words are sequences of alphanumeric characters separated by non-alphanumeric characters.
+pub fn initcap<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| {
+            x.map(|x: &str| {
+                let mut char_vector = Vec::<char>::new();
+                let mut wasalnum = false;
+                for c in x.chars() {
+                    if wasalnum {
+                        char_vector.push(c.to_ascii_lowercase());
+                    } else {
+                        char_vector.push(c.to_ascii_uppercase());
+                    }
+                    wasalnum = ('A'..='Z').contains(&c)
+                        || ('a'..='z').contains(&c)
+                        || ('0'..='9').contains(&c);
+                }
+                char_vector.iter().collect::<String>()
+            })
+        })
+        .collect())
+}
+
+/// Returns first n characters in the string, or when n is negative, returns all but last |n| characters.
+pub fn left<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .ok_or_else(|| {
+            DataFusionError::Internal("could not cast string to StringArray".to_string())
+        })?;
+
+    let n_array: &Int64Array =
+        args[1]
+            .as_any()
+            .downcast_ref::<Int64Array>()
+            .ok_or_else(|| {
+                DataFusionError::Internal("could not cast n to Int64Array".to_string())
+            })?;
+
+    Ok(string_array
+        .iter()
+        .enumerate()
+        .map(|(i, x)| {
+            if n_array.is_null(i) {
+                None
             } else {
-                owned_string.push_str(&arg.value(index));
+                x.map(|x: &str| {
+                    let n: i64 = n_array.value(i);
+                    match n.cmp(&0) {
+                        Ordering::Equal => "",
+                        Ordering::Greater => {
+                            x.char_indices().nth(n as usize).map_or(x, |(i, _)| &x[..i])
+                        }
+                        Ordering::Less => x
+                            .char_indices()
+                            .rev()
+                            .nth(n.abs() as usize)
+                            .map_or("", |(i, _)| &x[..i + 1]),
+                    }
+                })
             }
+        })
+        .collect())
+}
+
+/// Extends the string to length length by prepending the characters fill (a space by default). If the string is already longer than length then it is truncated (on the right).
+pub fn lpad<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    match args.len() {
+        2 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let length_array: &Int64Array = args[1]
+                .as_any()
+                .downcast_ref::<Int64Array>()
+                .ok_or_else(|| {
+                    DataFusionError::Internal(
+                        "could not cast length to Int64Array".to_string(),
+                    )
+                })?;
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if length_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let length = length_array.value(i) as usize;
+                            if length == 0 {
+                                "".to_string()
+                            } else if length < x.len() {
+                                x[..length].to_string()
+                            } else {
+                                let mut s = x.to_string();
+                                s.insert_str(0, " ".repeat(length - x.len()).as_str());
+                                s
+                            }
+                        })
+                    }
+                })
+                .collect())
         }
-        if is_null {
-            builder.append_null()?;
-        } else {
-            builder.append_value(&owned_string)?;
+        3 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let length_array: &Int64Array =
+                args[1].as_any().downcast_ref::<Int64Array>().unwrap();
+
+            let fill_array: &GenericStringArray<T> = args[2]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if length_array.is_null(i) || fill_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let length = length_array.value(i) as usize;
+                            let fill_chars =
+                                fill_array.value(i).chars().collect::<Vec<char>>();
+                            if length == 0 {
+                                "".to_string()
+                            } else if length < x.len() {
+                                x[..length].to_string()
+                            } else if fill_chars.is_empty() {
+                                x.to_string()
+                            } else {
+                                let mut s = x.to_string();
+                                let mut char_vector =
+                                    Vec::<char>::with_capacity(length - x.len());
+                                for l in 0..length - x.len() {
+                                    char_vector.push(
+                                        *fill_chars.get(l % fill_chars.len()).unwrap(),
+                                    );
+                                }
+                                s.insert_str(
+                                    0,
+                                    char_vector.iter().collect::<String>().as_str(),
+                                );
+                                s
+                            }
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "lpad was called with {} arguments. It requires at least 2 and at most 3.",
+            other
+        ))),
+    }
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the start of string.
+pub fn ltrim<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    match args.len() {
+        1 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .map(|x| x.map(|x: &str| x.trim_start()))
+                .collect())
+        }
+        2 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let characters_array: &GenericStringArray<T> = args[1]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if characters_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let chars: Vec<char> =
+                                characters_array.value(i).chars().collect();
+                            x.trim_start_matches(&chars[..])
+                        })
+                    }
+                })
+                .collect())
         }
+        other => Err(DataFusionError::Internal(format!(
+            "ltrim was called with {} arguments. It requires at most 2.",
+            other
+        ))),
+    }
+}
+
+/// Repeats string the specified number of times.
+pub fn repeat<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    let number_array: &Int64Array =
+        args[1].as_any().downcast_ref::<Int64Array>().unwrap();
+
+    Ok(string_array
+        .iter()
+        .enumerate()
+        .map(|(i, x)| {
+            if number_array.is_null(i) {
+                None
+            } else {
+                x.map(|x: &str| x.repeat(number_array.value(i) as usize))
+            }
+        })
+        .collect())
+}
+
+/// Replaces all occurrences in string of substring from with substring to.
+pub fn replace<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    let from_array: &GenericStringArray<T> = args[1]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    let to_array: &GenericStringArray<T> = args[2]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    Ok(string_array
+        .iter()
+        .enumerate()
+        .map(|(i, x)| {
+            if from_array.is_null(i) || to_array.is_null(i) {
+                None
+            } else {
+                x.map(|x: &str| x.replace(from_array.value(i), to_array.value(i)))
+            }
+        })
+        .collect())
+}
+
+// used to replace POSIX capture groups (like \1) with Rust Regex group (like ${1})
+fn regex_replace_posix_groups(replacement: &str) -> String {
+    lazy_static! {
+        static ref CAPTURE_GROUPS_RE: Regex = Regex::new("(\\\\)(\\d*)").unwrap();
     }
-    Ok(builder.finish())
+    CAPTURE_GROUPS_RE
+        .replace_all(replacement, "$${$2}")
+        .into_owned()
+}
+
+/// Replaces substring(s) matching a POSIX regular expression
+pub fn regexp_replace<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    // creating Regex is expensive so create hashmap for memoization
+    let mut patterns: HashMap<String, Regex> = HashMap::new();
+
+    match args.len() {
+        3 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let pattern_array: &StringArray = args[1]
+                .as_any()
+                .downcast_ref::<StringArray>()
+                .unwrap();
+
+            let replacement_array: &StringArray = args[2]
+                .as_any()
+                .downcast_ref::<StringArray>()
+                .unwrap();
+
+            string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if pattern_array.is_null(i) || replacement_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let pattern = pattern_array.value(i).to_string();
+                            let replacement = regex_replace_posix_groups(replacement_array.value(i));
+                            let re = match patterns.get(pattern_array.value(i)) {
+                                Some(re) => Ok(re.clone()),
+                                None => {
+                                    match Regex::new(pattern.as_str()) {
+                                        Ok(re) => {
+                                            patterns.insert(pattern, re.clone());
+                                            Ok(re)
+                                        },
+                                        Err(err) => Err(DataFusionError::Execution(err.to_string())),
+                                    }
+                                }
+                            };
+                            re.map(|re| re.replace(x, replacement.as_str()))
+                        })
+                    }.transpose()
+                })
+                .collect()
+        }
+        4 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let pattern_array: &StringArray = args[1]
+                .as_any()
+                .downcast_ref::<StringArray>()
+                .unwrap();
+
+            let replacement_array: &StringArray = args[2]
+                .as_any()
+                .downcast_ref::<StringArray>()
+                .unwrap();
+
+            let flags_array: &StringArray = args[3]
+                .as_any()
+                .downcast_ref::<StringArray>()
+                .unwrap();
+
+            string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if pattern_array.is_null(i) || replacement_array.is_null(i) || flags_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let replacement = regex_replace_posix_groups(replacement_array.value(i));
+
+                            let flags = flags_array.value(i);
+                            let (pattern, replace_all) = if flags == "g" {
+                                (pattern_array.value(i).to_string(), true)
+                            } else if flags.contains('g') {
+                                (format!("(?{}){}", flags.to_string().replace("g", ""), pattern_array.value(i)), true)
+                            } else {
+                                (format!("(?{}){}", flags, pattern_array.value(i)), false)
+                            };
+
+                            let re = match patterns.get(pattern_array.value(i)) {
+                                Some(re) => Ok(re.clone()),
+                                None => {
+                                    match Regex::new(pattern.as_str()) {
+                                        Ok(re) => {
+                                            patterns.insert(pattern, re.clone());
+                                            Ok(re)
+                                        },
+                                        Err(err) => Err(DataFusionError::Execution(err.to_string())),
+                                    }
+                                }
+                            };
+
+                            re.map(|re| {
+                                if replace_all {
+                                    re.replace_all(x, replacement.as_str())
+                                } else {
+                                    re.replace(x, replacement.as_str())
+                                }
+                            })
+                        })
+                    }.transpose()
+                })
+                .collect()
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "regexp_replace was called with {} arguments. It requires at least 3 and at most 4.",
+            other
+        ))),
+    }
+}
+
+/// Reverses the order of the characters in the string.
+pub fn reverse<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).rev().collect::<String>()))
+        .collect())
+}
+
+/// Returns last n characters in the string, or when n is negative, returns all but first |n| characters.
+pub fn right<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .ok_or_else(|| {
+            DataFusionError::Internal("could not cast string to StringArray".to_string())
+        })?;
+
+    let n_array: &Int64Array =
+        args[1]
+            .as_any()
+            .downcast_ref::<Int64Array>()
+            .ok_or_else(|| {
+                DataFusionError::Internal("could not cast n to Int64Array".to_string())
+            })?;
+
+    Ok(string_array
+        .iter()
+        .enumerate()
+        .map(|(i, x)| {
+            if n_array.is_null(i) {
+                None
+            } else {
+                x.map(|x: &str| {
+                    let n: i64 = n_array.value(i);
+                    match n.cmp(&0) {
+                        Ordering::Equal => "",
+                        Ordering::Greater => x
+                            .char_indices()
+                            .nth(n as usize)
+                            .map_or(x, |(i, _)| &x[i + 1..]),
+                        Ordering::Less => x
+                            .char_indices()
+                            .rev()
+                            .nth(n.abs() as usize)
+                            .map_or("", |(i, _)| &x[i..]),
+                    }
+                })
+            }
+        })
+        .collect())
+}
+
+/// Extends the string to length length by appending the characters fill (a space by default). If the string is already longer than length then it is truncated.
+pub fn rpad<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    match args.len() {
+        2 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let length_array: &Int64Array = args[1]
+                .as_any()
+                .downcast_ref::<Int64Array>()
+                .ok_or_else(|| {
+                    DataFusionError::Internal(
+                        "could not cast length to Int64Array".to_string(),
+                    )
+                })?;
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if length_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let length = length_array.value(i) as usize;
+                            if length == 0 {
+                                "".to_string()
+                            } else if length < x.len() {
+                                x[..length].to_string()
+                            } else {
+                                let mut s = x.to_string();
+                                s.push_str(" ".repeat(length - x.len()).as_str());
+                                s
+                            }
+                        })
+                    }
+                })
+                .collect())
+        }
+        3 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let length_array: &Int64Array =
+                args[1].as_any().downcast_ref::<Int64Array>().unwrap();
+
+            let fill_array: &GenericStringArray<T> = args[2]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if length_array.is_null(i) || fill_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let length = length_array.value(i) as usize;
+                            let fill_chars =
+                                fill_array.value(i).chars().collect::<Vec<char>>();
+                            if length == 0 {
+                                "".to_string()
+                            } else if length < x.len() {
+                                x[..length].to_string()
+                            } else if fill_chars.is_empty() {
+                                x.to_string()
+                            } else {
+                                let mut s = x.to_string();
+                                let mut char_vector =
+                                    Vec::<char>::with_capacity(length - x.len());
+                                for l in 0..length - x.len() {
+                                    char_vector.push(
+                                        *fill_chars.get(l % fill_chars.len()).unwrap(),
+                                    );
+                                }
+                                s.push_str(
+                                    char_vector.iter().collect::<String>().as_str(),
+                                );
+                                s
+                            }
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "rpad was called with {} arguments. It requires at least 2 and at most 3.",
+            other
+        ))),
+    }
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the end of string.
+pub fn rtrim<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    match args.len() {
+        1 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .map(|x| x.map(|x: &str| x.trim_end()))
+                .collect())
+        }
+        2 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let characters_array: &GenericStringArray<T> = args[1]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if characters_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let chars: Vec<char> =
+                                characters_array.value(i).chars().collect();
+                            x.trim_end_matches(&chars[..])
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "rtrim was called with {} arguments. It requires at most two.",
+            other
+        ))),
+    }
+}
+
+/// Repeats string the specified number of times.
+pub fn split_part<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    let delimiter_array: &GenericStringArray<T> = args[1]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    let n_array: &Int64Array = args[2].as_any().downcast_ref::<Int64Array>().unwrap();
+
+    string_array
+        .iter()
+        .enumerate()
+        .map(|(i, x)| {
+            if delimiter_array.is_null(i) || n_array.is_null(i) {
+                Ok(None)
+            } else {
+                x.map(|x: &str| {
+                    let delimiter = delimiter_array.value(i);
+                    let n = n_array.value(i);
+                    if n <= 0 {
+                        Err(DataFusionError::Execution(
+                            "negative substring length not allowed".to_string(),

Review comment:
       ```suggestion
                               format!("negative substring length {} not allowed in split_part", n),
   ```

##########
File path: rust/datafusion/src/physical_plan/string_expressions.rs
##########
@@ -34,40 +38,1167 @@ macro_rules! downcast_vec {
     }};
 }
 
-/// concatenate string columns together.
-pub fn concatenate(args: &[ArrayRef]) -> Result<StringArray> {
+/// Returns the numeric code of the first character of the argument.
+pub fn ascii<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<Int32Array> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| {
+            x.map(|x: &str| {
+                let mut chars = x.chars();
+                chars.next().map_or(0, |v| v as i32)
+            })
+        })
+        .collect())
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the start and end of string.
+pub fn btrim<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    match args.len() {
+        1 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .map(|x| x.map(|x: &str| x.trim()))
+                .collect())
+        }
+        2 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let characters_array: &GenericStringArray<T> = args[1]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if characters_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let chars: Vec<char> =
+                                characters_array.value(i).chars().collect();
+                            x.trim_start_matches(&chars[..])
+                                .trim_end_matches(&chars[..])
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "btrim was called with {} arguments. It requires at most 2.",
+            other
+        ))),
+    }
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i32(args: &[ArrayRef]) -> Result<Int32Array> {
+    let string_array: &GenericStringArray<i32> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i32>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i32))
+        .collect())
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i64(args: &[ArrayRef]) -> Result<Int64Array> {
+    let string_array: &GenericStringArray<i64> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i64>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i64))
+        .collect())
+}
+
+/// Returns the character with the given code.
+pub fn chr(args: &[ArrayRef]) -> Result<StringArray> {
+    let integer_array: &Int64Array =
+        args[0].as_any().downcast_ref::<Int64Array>().unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    integer_array
+        .iter()
+        .map(|x: Option<i64>| {
+            x.map(|x| {
+                if x == 0 {
+                    Err(DataFusionError::Internal(
+                        "null character not permitted.".to_string(),
+                    ))
+                } else {
+                    match core::char::from_u32(x as u32) {
+                        Some(x) => Ok(x.to_string()),
+                        None => Err(DataFusionError::Internal(
+                            "requested character too large for encoding.".to_string(),
+                        )),
+                    }
+                }
+            })
+            .transpose()
+        })
+        .collect()
+}
+
+/// Concatenates the text representations of all the arguments. NULL arguments are ignored.
+pub fn concat(args: &[ArrayRef]) -> Result<StringArray> {
     // downcast all arguments to strings
     let args = downcast_vec!(args, StringArray).collect::<Result<Vec<&StringArray>>>()?;
     // do not accept 0 arguments.
     if args.is_empty() {
         return Err(DataFusionError::Internal(
-            "Concatenate was called with 0 arguments. It requires at least one."
-                .to_string(),
+            "concat was called with 0 arguments. It requires at least 2.".to_string(),
         ));
     }
 
-    let mut builder = StringBuilder::new(args.len());
-    // for each entry in the array
-    for index in 0..args[0].len() {
-        let mut owned_string: String = "".to_owned();
-
-        // if any is null, the result is null
-        let mut is_null = false;
-        for arg in &args {
-            if arg.is_null(index) {
-                is_null = true;
-                break; // short-circuit as we already know the result
+    Ok((0..args[0].len())
+        .map(|index| {
+            let mut owned_string: String = "".to_owned();
+            for arg in &args {
+                if arg.is_valid(index) {
+                    owned_string.push_str(&arg.value(index));
+                };
+            }
+            Some(owned_string)
+        })
+        .collect())
+}
+
+/// Concatenates all but the first argument, with separators. The first argument is used as the separator string, and should not be NULL. Other NULL arguments are ignored.
+pub fn concat_ws(args: &[ArrayRef]) -> Result<StringArray> {
+    // downcast all arguments to strings
+    let args = downcast_vec!(args, StringArray).collect::<Result<Vec<&StringArray>>>()?;
+    // do not accept 0 or 1 arguments.
+    if args.len() < 2 {
+        return Err(DataFusionError::Internal(format!(
+            "concat_ws was called with {} arguments. It requires at least 2.",
+            args.len()
+        )));
+    }
+
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(args[0]
+        .iter()
+        .enumerate()
+        .map(|(index, x)| {
+            x.map(|sep: &str| {
+                let mut owned_string: String = "".to_owned();
+                for arg_index in 1..args.len() {
+                    let arg = &args[arg_index];
+                    if !arg.is_null(index) {
+                        owned_string.push_str(&arg.value(index));
+                        // if not last push separator
+                        if arg_index != args.len() - 1 {
+                            owned_string.push_str(&sep);
+                        }
+                    }
+                }
+                owned_string
+            })
+        })
+        .collect())
+}
+
+/// Converts the first letter of each word to upper case and the rest to lower case. Words are sequences of alphanumeric characters separated by non-alphanumeric characters.
+pub fn initcap<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| {
+            x.map(|x: &str| {
+                let mut char_vector = Vec::<char>::new();
+                let mut wasalnum = false;
+                for c in x.chars() {
+                    if wasalnum {
+                        char_vector.push(c.to_ascii_lowercase());
+                    } else {
+                        char_vector.push(c.to_ascii_uppercase());
+                    }
+                    wasalnum = ('A'..='Z').contains(&c)
+                        || ('a'..='z').contains(&c)
+                        || ('0'..='9').contains(&c);
+                }
+                char_vector.iter().collect::<String>()
+            })
+        })
+        .collect())
+}
+
+/// Returns first n characters in the string, or when n is negative, returns all but last |n| characters.
+pub fn left<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .ok_or_else(|| {
+            DataFusionError::Internal("could not cast string to StringArray".to_string())
+        })?;
+
+    let n_array: &Int64Array =
+        args[1]
+            .as_any()
+            .downcast_ref::<Int64Array>()
+            .ok_or_else(|| {
+                DataFusionError::Internal("could not cast n to Int64Array".to_string())
+            })?;
+
+    Ok(string_array
+        .iter()
+        .enumerate()
+        .map(|(i, x)| {
+            if n_array.is_null(i) {
+                None
             } else {
-                owned_string.push_str(&arg.value(index));
+                x.map(|x: &str| {
+                    let n: i64 = n_array.value(i);
+                    match n.cmp(&0) {
+                        Ordering::Equal => "",
+                        Ordering::Greater => {
+                            x.char_indices().nth(n as usize).map_or(x, |(i, _)| &x[..i])
+                        }
+                        Ordering::Less => x
+                            .char_indices()
+                            .rev()
+                            .nth(n.abs() as usize)
+                            .map_or("", |(i, _)| &x[..i + 1]),
+                    }
+                })
             }
+        })
+        .collect())
+}
+
+/// Extends the string to length length by prepending the characters fill (a space by default). If the string is already longer than length then it is truncated (on the right).
+pub fn lpad<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    match args.len() {
+        2 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let length_array: &Int64Array = args[1]
+                .as_any()
+                .downcast_ref::<Int64Array>()
+                .ok_or_else(|| {
+                    DataFusionError::Internal(
+                        "could not cast length to Int64Array".to_string(),
+                    )
+                })?;
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if length_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let length = length_array.value(i) as usize;
+                            if length == 0 {
+                                "".to_string()
+                            } else if length < x.len() {
+                                x[..length].to_string()
+                            } else {
+                                let mut s = x.to_string();
+                                s.insert_str(0, " ".repeat(length - x.len()).as_str());
+                                s
+                            }
+                        })
+                    }
+                })
+                .collect())
         }
-        if is_null {
-            builder.append_null()?;
-        } else {
-            builder.append_value(&owned_string)?;
+        3 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let length_array: &Int64Array =
+                args[1].as_any().downcast_ref::<Int64Array>().unwrap();
+
+            let fill_array: &GenericStringArray<T> = args[2]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if length_array.is_null(i) || fill_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let length = length_array.value(i) as usize;
+                            let fill_chars =
+                                fill_array.value(i).chars().collect::<Vec<char>>();
+                            if length == 0 {
+                                "".to_string()
+                            } else if length < x.len() {
+                                x[..length].to_string()
+                            } else if fill_chars.is_empty() {
+                                x.to_string()
+                            } else {
+                                let mut s = x.to_string();
+                                let mut char_vector =
+                                    Vec::<char>::with_capacity(length - x.len());
+                                for l in 0..length - x.len() {
+                                    char_vector.push(
+                                        *fill_chars.get(l % fill_chars.len()).unwrap(),
+                                    );
+                                }
+                                s.insert_str(
+                                    0,
+                                    char_vector.iter().collect::<String>().as_str(),
+                                );
+                                s
+                            }
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "lpad was called with {} arguments. It requires at least 2 and at most 3.",
+            other
+        ))),
+    }
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the start of string.
+pub fn ltrim<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    match args.len() {
+        1 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .map(|x| x.map(|x: &str| x.trim_start()))
+                .collect())
+        }
+        2 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let characters_array: &GenericStringArray<T> = args[1]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if characters_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let chars: Vec<char> =
+                                characters_array.value(i).chars().collect();
+                            x.trim_start_matches(&chars[..])
+                        })
+                    }
+                })
+                .collect())
         }
+        other => Err(DataFusionError::Internal(format!(
+            "ltrim was called with {} arguments. It requires at most 2.",
+            other
+        ))),
+    }
+}
+
+/// Repeats string the specified number of times.
+pub fn repeat<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    let number_array: &Int64Array =
+        args[1].as_any().downcast_ref::<Int64Array>().unwrap();
+
+    Ok(string_array
+        .iter()
+        .enumerate()
+        .map(|(i, x)| {
+            if number_array.is_null(i) {
+                None
+            } else {
+                x.map(|x: &str| x.repeat(number_array.value(i) as usize))
+            }
+        })
+        .collect())
+}
+
+/// Replaces all occurrences in string of substring from with substring to.
+pub fn replace<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    let from_array: &GenericStringArray<T> = args[1]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    let to_array: &GenericStringArray<T> = args[2]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    Ok(string_array
+        .iter()
+        .enumerate()
+        .map(|(i, x)| {
+            if from_array.is_null(i) || to_array.is_null(i) {
+                None
+            } else {
+                x.map(|x: &str| x.replace(from_array.value(i), to_array.value(i)))
+            }
+        })
+        .collect())
+}
+
+// used to replace POSIX capture groups (like \1) with Rust Regex group (like ${1})
+fn regex_replace_posix_groups(replacement: &str) -> String {
+    lazy_static! {
+        static ref CAPTURE_GROUPS_RE: Regex = Regex::new("(\\\\)(\\d*)").unwrap();
     }
-    Ok(builder.finish())
+    CAPTURE_GROUPS_RE
+        .replace_all(replacement, "$${$2}")
+        .into_owned()
+}
+
+/// Replaces substring(s) matching a POSIX regular expression
+pub fn regexp_replace<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    // creating Regex is expensive so create hashmap for memoization
+    let mut patterns: HashMap<String, Regex> = HashMap::new();
+
+    match args.len() {
+        3 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let pattern_array: &StringArray = args[1]
+                .as_any()
+                .downcast_ref::<StringArray>()
+                .unwrap();
+
+            let replacement_array: &StringArray = args[2]
+                .as_any()
+                .downcast_ref::<StringArray>()
+                .unwrap();
+
+            string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if pattern_array.is_null(i) || replacement_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let pattern = pattern_array.value(i).to_string();
+                            let replacement = regex_replace_posix_groups(replacement_array.value(i));
+                            let re = match patterns.get(pattern_array.value(i)) {
+                                Some(re) => Ok(re.clone()),
+                                None => {
+                                    match Regex::new(pattern.as_str()) {
+                                        Ok(re) => {
+                                            patterns.insert(pattern, re.clone());
+                                            Ok(re)
+                                        },
+                                        Err(err) => Err(DataFusionError::Execution(err.to_string())),
+                                    }
+                                }
+                            };
+                            re.map(|re| re.replace(x, replacement.as_str()))
+                        })
+                    }.transpose()
+                })
+                .collect()
+        }
+        4 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let pattern_array: &StringArray = args[1]
+                .as_any()
+                .downcast_ref::<StringArray>()
+                .unwrap();
+
+            let replacement_array: &StringArray = args[2]
+                .as_any()
+                .downcast_ref::<StringArray>()
+                .unwrap();
+
+            let flags_array: &StringArray = args[3]
+                .as_any()
+                .downcast_ref::<StringArray>()
+                .unwrap();
+
+            string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if pattern_array.is_null(i) || replacement_array.is_null(i) || flags_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let replacement = regex_replace_posix_groups(replacement_array.value(i));
+
+                            let flags = flags_array.value(i);
+                            let (pattern, replace_all) = if flags == "g" {
+                                (pattern_array.value(i).to_string(), true)
+                            } else if flags.contains('g') {
+                                (format!("(?{}){}", flags.to_string().replace("g", ""), pattern_array.value(i)), true)
+                            } else {
+                                (format!("(?{}){}", flags, pattern_array.value(i)), false)
+                            };
+
+                            let re = match patterns.get(pattern_array.value(i)) {
+                                Some(re) => Ok(re.clone()),
+                                None => {
+                                    match Regex::new(pattern.as_str()) {
+                                        Ok(re) => {
+                                            patterns.insert(pattern, re.clone());
+                                            Ok(re)
+                                        },
+                                        Err(err) => Err(DataFusionError::Execution(err.to_string())),
+                                    }
+                                }
+                            };
+
+                            re.map(|re| {
+                                if replace_all {
+                                    re.replace_all(x, replacement.as_str())
+                                } else {
+                                    re.replace(x, replacement.as_str())
+                                }
+                            })
+                        })
+                    }.transpose()
+                })
+                .collect()
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "regexp_replace was called with {} arguments. It requires at least 3 and at most 4.",
+            other
+        ))),
+    }
+}
+
+/// Reverses the order of the characters in the string.
+pub fn reverse<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).rev().collect::<String>()))
+        .collect())
+}
+
+/// Returns last n characters in the string, or when n is negative, returns all but first |n| characters.
+pub fn right<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .ok_or_else(|| {
+            DataFusionError::Internal("could not cast string to StringArray".to_string())
+        })?;
+
+    let n_array: &Int64Array =
+        args[1]
+            .as_any()
+            .downcast_ref::<Int64Array>()
+            .ok_or_else(|| {
+                DataFusionError::Internal("could not cast n to Int64Array".to_string())
+            })?;
+
+    Ok(string_array
+        .iter()
+        .enumerate()
+        .map(|(i, x)| {
+            if n_array.is_null(i) {
+                None
+            } else {
+                x.map(|x: &str| {
+                    let n: i64 = n_array.value(i);
+                    match n.cmp(&0) {
+                        Ordering::Equal => "",
+                        Ordering::Greater => x
+                            .char_indices()
+                            .nth(n as usize)
+                            .map_or(x, |(i, _)| &x[i + 1..]),
+                        Ordering::Less => x
+                            .char_indices()
+                            .rev()
+                            .nth(n.abs() as usize)
+                            .map_or("", |(i, _)| &x[i..]),
+                    }
+                })
+            }
+        })
+        .collect())
+}
+
+/// Extends the string to length length by appending the characters fill (a space by default). If the string is already longer than length then it is truncated.
+pub fn rpad<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    match args.len() {
+        2 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let length_array: &Int64Array = args[1]
+                .as_any()
+                .downcast_ref::<Int64Array>()
+                .ok_or_else(|| {
+                    DataFusionError::Internal(
+                        "could not cast length to Int64Array".to_string(),
+                    )
+                })?;
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if length_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let length = length_array.value(i) as usize;
+                            if length == 0 {
+                                "".to_string()
+                            } else if length < x.len() {
+                                x[..length].to_string()
+                            } else {
+                                let mut s = x.to_string();
+                                s.push_str(" ".repeat(length - x.len()).as_str());
+                                s
+                            }
+                        })
+                    }
+                })
+                .collect())
+        }
+        3 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let length_array: &Int64Array =
+                args[1].as_any().downcast_ref::<Int64Array>().unwrap();
+
+            let fill_array: &GenericStringArray<T> = args[2]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if length_array.is_null(i) || fill_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let length = length_array.value(i) as usize;
+                            let fill_chars =
+                                fill_array.value(i).chars().collect::<Vec<char>>();
+                            if length == 0 {
+                                "".to_string()
+                            } else if length < x.len() {
+                                x[..length].to_string()
+                            } else if fill_chars.is_empty() {
+                                x.to_string()
+                            } else {
+                                let mut s = x.to_string();
+                                let mut char_vector =
+                                    Vec::<char>::with_capacity(length - x.len());
+                                for l in 0..length - x.len() {
+                                    char_vector.push(
+                                        *fill_chars.get(l % fill_chars.len()).unwrap(),
+                                    );
+                                }
+                                s.push_str(
+                                    char_vector.iter().collect::<String>().as_str(),
+                                );
+                                s
+                            }
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "rpad was called with {} arguments. It requires at least 2 and at most 3.",
+            other
+        ))),
+    }
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the end of string.
+pub fn rtrim<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    match args.len() {
+        1 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .map(|x| x.map(|x: &str| x.trim_end()))
+                .collect())
+        }
+        2 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let characters_array: &GenericStringArray<T> = args[1]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if characters_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let chars: Vec<char> =
+                                characters_array.value(i).chars().collect();
+                            x.trim_end_matches(&chars[..])
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "rtrim was called with {} arguments. It requires at most two.",
+            other
+        ))),
+    }
+}
+
+/// Repeats string the specified number of times.
+pub fn split_part<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    let delimiter_array: &GenericStringArray<T> = args[1]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    let n_array: &Int64Array = args[2].as_any().downcast_ref::<Int64Array>().unwrap();
+
+    string_array
+        .iter()
+        .enumerate()
+        .map(|(i, x)| {
+            if delimiter_array.is_null(i) || n_array.is_null(i) {
+                Ok(None)
+            } else {
+                x.map(|x: &str| {
+                    let delimiter = delimiter_array.value(i);
+                    let n = n_array.value(i);
+                    if n <= 0 {
+                        Err(DataFusionError::Execution(
+                            "negative substring length not allowed".to_string(),
+                        ))
+                    } else {
+                        let v: Vec<&str> = x.split(delimiter).collect();
+                        match v.get(n as usize - 1) {
+                            Some(s) => Ok(*s),
+                            None => Ok(""),
+                        }
+                    }
+                })
+                .transpose()
+            }
+        })
+        .collect()
+}
+
+/// Returns true if string starts with prefix.
+pub fn starts_with<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<BooleanArray> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    let prefix_array: &GenericStringArray<T> = args[1]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    Ok(string_array
+        .iter()
+        .enumerate()
+        .map(|(i, x)| {
+            if prefix_array.is_null(i) {
+                None
+            } else {
+                x.map(|x: &str| x.starts_with(prefix_array.value(i)))
+            }
+        })
+        .collect())
+}
+
+/// Returns starting index of specified substring within string, or zero if it's not present. (Same as position(substring in string), but note the reversed argument order.)
+pub fn strpos_i32(args: &[ArrayRef]) -> Result<Int32Array> {
+    let string_array: &GenericStringArray<i32> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i32>>()
+        .ok_or_else(|| {
+            DataFusionError::Internal("could not cast string to StringArray".to_string())
+        })?;
+
+    let substring_array: &GenericStringArray<i32> = args[1]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i32>>()
+        .ok_or_else(|| {
+            DataFusionError::Internal(
+                "could not cast substring to StringArray".to_string(),
+            )
+        })?;
+
+    Ok(string_array
+        .iter()
+        .enumerate()
+        .map(|(i, x)| {
+            if substring_array.is_null(i) {
+                None
+            } else {
+                x.map(|x: &str| {
+                    let substring: &str = substring_array.value(i);
+                    // the rfind method returns the byte index which may or may not be the same as the character index due to UTF8 encoding
+                    // this method first finds the matching byte using rfind
+                    // then maps that to the character index by matching on the grapheme_index of the byte_index
+                    x.to_string().rfind(substring).map_or(0, |byte_offset| {
+                        x.grapheme_indices(true)
+                            .collect::<Vec<(usize, &str)>>()
+                            .iter()
+                            .enumerate()
+                            .filter(|(_, (offset, _))| *offset == byte_offset)
+                            .map(|(index, _)| index as i32)
+                            .collect::<Vec<i32>>()
+                            .first()
+                            .unwrap()
+                            .to_owned()
+                            + 1
+                    })
+                })
+            }
+        })
+        .collect())
+}
+
+/// Returns starting index of specified substring within string, or zero if it's not present. (Same as position(substring in string), but note the reversed argument order.)
+pub fn strpos_i64(args: &[ArrayRef]) -> Result<Int64Array> {
+    let string_array: &GenericStringArray<i64> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i64>>()
+        .ok_or_else(|| {
+            DataFusionError::Internal("could not cast string to StringArray".to_string())
+        })?;
+
+    let substring_array: &GenericStringArray<i64> = args[1]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i64>>()
+        .ok_or_else(|| {
+            DataFusionError::Internal(
+                "could not cast substring to StringArray".to_string(),
+            )
+        })?;
+
+    Ok(string_array
+        .iter()
+        .enumerate()
+        .map(|(i, x)| {
+            if substring_array.is_null(i) {
+                None
+            } else {
+                x.map(|x: &str| {
+                    let substring: &str = substring_array.value(i);
+                    // the rfind method returns the byte index which may or may not be the same as the character index due to UTF8 encoding
+                    // this method first finds the matching byte using rfind
+                    // then maps that to the character index by matching on the grapheme_index of the byte_index
+                    x.to_string().rfind(substring).map_or(0, |byte_offset| {
+                        x.grapheme_indices(true)
+                            .collect::<Vec<(usize, &str)>>()
+                            .iter()
+                            .enumerate()
+                            .filter(|(_, (offset, _))| *offset == byte_offset)
+                            .map(|(index, _)| index as i64)
+                            .collect::<Vec<i64>>()
+                            .first()
+                            .unwrap()
+                            .to_owned()
+                            + 1
+                    })
+                })
+            }
+        })
+        .collect())
+}
+
+/// Extracts the substring of string starting at the start'th character, and extending for count characters if that is specified. (Same as substring(string from start for count).)
+pub fn substr<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    match args.len() {
+        2 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .ok_or_else(|| {
+                    DataFusionError::Internal(
+                        "could not cast string to StringArray".to_string(),
+                    )
+                })?;
+
+            let start_array: &Int64Array = args[1]
+                .as_any()
+                .downcast_ref::<Int64Array>()
+                .ok_or_else(|| {
+                    DataFusionError::Internal(
+                        "could not cast start to Int64Array".to_string(),
+                    )
+                })?;
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if start_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let start: i64 = start_array.value(i);
+                            let start_usize = start as usize;
+                            if start <= 0 {
+                                x
+                            } else if x.len() < start_usize {
+                                ""
+                            } else {
+                                &x[start_usize - 1..]
+                            }
+                        })
+                    }
+                })
+                .collect())
+        }
+        3 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .ok_or_else(|| {
+                    DataFusionError::Internal(
+                        "could not cast string to StringArray".to_string(),
+                    )
+                })?;
+
+            let start_array: &Int64Array = args[1]
+                .as_any()
+                .downcast_ref::<Int64Array>()
+                .ok_or_else(|| {
+                    DataFusionError::Internal(
+                        "could not cast start to Int64Array".to_string(),
+                    )
+                })?;
+
+            let count_array: &Int64Array = args[2]
+                .as_any()
+                .downcast_ref::<Int64Array>()
+                .ok_or_else(|| {
+                    DataFusionError::Internal(
+                        "could not cast count to Int64Array".to_string(),
+                    )
+                })?;
+
+            string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if start_array.is_null(i) || count_array.is_null(i) {
+                        Ok(None)
+                    } else {
+                        x.map(|x: &str| {
+                            let start: i64 = start_array.value(i);
+                            let count = count_array.value(i);
+
+                            let start_pos = (start as usize) - 1;
+                            let count_usize = count as usize;
+
+                            if count < 0 {
+                                Err(DataFusionError::Execution(
+                                    "negative substring length not allowed".to_string(),
+                                ))
+                            } else if start <= 0 {
+                                Ok(x)
+                            } else if x.len() < start_pos {
+                                Ok("")
+                            } else if x.len() < start_pos + count_usize {
+                                Ok(&x[start_pos..])
+                            } else {
+                                Ok(&x[start_pos..start_pos + count_usize])
+                            }
+                        })
+                        .transpose()
+                    }
+                })
+                .collect()
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "substr was called with {} arguments. It requires 2 or 3.",
+            other
+        ))),
+    }
+}
+
+/// Converts the number to its equivalent hexadecimal representation.
+pub fn to_hex_i32(args: &[ArrayRef]) -> Result<GenericStringArray<i32>> {
+    let integer_array: &Int32Array =
+        args[0].as_any().downcast_ref::<Int32Array>().unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(integer_array
+        .iter()
+        .map(|x| x.map(|x| format!("{:x}", x)))
+        .collect())
+}
+
+/// Converts the number to its equivalent hexadecimal representation.
+pub fn to_hex_i64(args: &[ArrayRef]) -> Result<GenericStringArray<i64>> {
+    let integer_array: &Int64Array =
+        args[0].as_any().downcast_ref::<Int64Array>().unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(integer_array
+        .iter()
+        .map(|x| x.map(|x| format!("{:x}", x)))
+        .collect())
+}
+
+/// Replaces each character in string that matches a character in the from set with the corresponding character in the to set. If from is longer than to, occurrences of the extra characters in from are deleted.
+pub fn translate<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    let from_array: &GenericStringArray<T> = args[1]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    let to_array: &GenericStringArray<T> = args[2]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    Ok(string_array
+        .iter()
+        .enumerate()
+        .map(|(i, x)| {
+            if from_array.is_null(i) || to_array.is_null(i) {
+                None
+            } else {
+                x.map(|x: &str| {
+                    let from = from_array.value(i).chars().collect::<Vec<char>>();
+                    // create a hashmap to change from O(n) to O(1) from lookup
+                    let mut from_map: HashMap<char, usize> =
+                        HashMap::with_capacity(from.len());
+                    from.iter().enumerate().for_each(|(index, c)| {
+                        from_map.insert(c.to_owned(), index);
+                    });

Review comment:
       I don't think it really matters but I think you could  build the hashmap by `collecting` the iterator directly. Something like
   
   ```suggestion
                       let mut from_map: HashMap<char, usize> =
                         from.iter()
                           .enumerate()
                           .map(|(index, c)| (c.to_owned(), index))
                         .collect();
   ```
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#discussion_r567147558



##########
File path: rust/datafusion/src/physical_plan/type_coercion.rs
##########
@@ -69,13 +69,42 @@ pub fn data_types(
     signature: &Signature,
 ) -> Result<Vec<DataType>> {
     let valid_types = match signature {
-        Signature::Variadic(valid_types) => valid_types
+        Signature::Any(number) => {
+            if current_types.len() != *number {
+                return Err(DataFusionError::Plan(format!(
+                    "The function expected {} arguments but received {}",
+                    number,
+                    current_types.len()
+                )));
+            }
+            vec![(0..*number).map(|i| current_types[i].clone()).collect()]
+        }
+        Signature::Exact(valid_types) => vec![valid_types.clone()],
+        Signature::Uniform(valid_types) => {
+            let valid_signature = valid_types
+                .iter()
+                .filter(|x| x.len() == current_types.len())
+                .collect::<Vec<_>>();
+            if valid_signature.len() != 1 {
+                return Err(DataFusionError::Plan(format!(
+                    "The function expected {} arguments but received {}",
+                    valid_types
+                        .iter()
+                        .map(|x| x.len().to_string())
+                        .collect::<Vec<_>>()
+                        .join(" or "),
+                    current_types.len()
+                )));
+            }
+            cartesian_product(valid_signature.first().unwrap())

Review comment:
       @jorgecarleitao as above: https://github.com/seddonm1/arrow/tree/oneof-function-signature




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [Splitting to separate PRs]

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#issuecomment-783370975


   @seddonm1  just to be clear, your plan is still to merge this branch in in smaller chunks -- e.g. https://github.com/apache/arrow/pull/9509


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#issuecomment-777770582


   I think the Clippy CI check on this PR is failing due to a new stable rust being released. I am working on a fix here https://github.com/apache/arrow/pull/9476


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#discussion_r567148220



##########
File path: rust/datafusion/src/physical_plan/string_expressions.rs
##########
@@ -34,42 +35,553 @@ macro_rules! downcast_vec {
     }};
 }
 
-/// concatenate string columns together.
-pub fn concatenate(args: &[ArrayRef]) -> Result<StringArray> {
+/// Returns the numeric code of the first character of the argument.
+pub fn ascii<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<Int32Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| {
+            x.map(|x: &str| {
+                let mut chars = x.chars();
+                chars.next().map_or(0, |v| v as i32)
+            })
+        })
+        .collect())
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the start and end of string.
+pub fn btrim<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<StringArray> {
+    match args.len() {
+        0 => Err(DataFusionError::Internal(
+            "btrim was called with 0 arguments. It requires at least 1.".to_string(),
+        )),
+        1 => {
+            let string_array = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .map(|x| x.map(|x: &str| x.trim()))
+                .collect())
+        }
+        2 => {
+            let string_array = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let characters_array = args[1]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if characters_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let chars: Vec<char> =
+                                characters_array.value(i).chars().collect();
+                            x.trim_start_matches(&chars[..])
+                                .trim_end_matches(&chars[..])
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "btrim was called with {} arguments. It requires at most 2.",
+            other
+        ))),
+    }
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i32(args: &[ArrayRef]) -> Result<Int32Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i32>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i32))
+        .collect())
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i64(args: &[ArrayRef]) -> Result<Int64Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i64>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i64))
+        .collect())
+}
+
+/// Returns the character with the given code.
+pub fn chr(args: &[ArrayRef]) -> Result<StringArray> {
+    let array = args[0].as_any().downcast_ref::<Int64Array>().unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array

Review comment:
       @jorgecarleitao 
   The `FromIterator<Option<Ptr>>` for `GenericStringArray` does not work with `<Option<Result<Ptr>>>` so it is not as simple as I had hoped.
   
   Perhaps we need to implement `FromIterator<Option<Result<Ptr>>>` as well so that we can support this use case?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#issuecomment-777045974


   @andygrove @alamb @jorgecarleitao 
   Here is the big PR that I was talking about in the Arrow call. I can rebase easily enough but I guess apart from the significant number of new lines (a lot of boilerplate) the key question is (from above):
   
   I think we need this Signature::OneOf. A good example is `lpad` which is either:
   [string, int] or [string, int, string] signature. You can see my implementation here but perhaps you have a better ideas and I don't know who wrote the original code. 
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#issuecomment-764451929


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=h1) Report
   > Merging [#9243](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=desc) (a69e099) into [master](https://codecov.io/gh/apache/arrow/commit/77ae93d6ecaac8fb5f4a18ca5287b7456cd88784?el=desc) (77ae93d) will **increase** coverage by `0.38%`.
   > The diff coverage is `88.75%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9243/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9243      +/-   ##
   ==========================================
   + Coverage   82.00%   82.39%   +0.38%     
   ==========================================
     Files         230      231       +1     
     Lines       53487    55715    +2228     
   ==========================================
   + Hits        43864    45906    +2042     
   - Misses       9623     9809     +186     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow/src/util/bench\_util.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvdXRpbC9iZW5jaF91dGlsLnJz) | `0.00% <0.00%> (ø)` | |
   | [rust/datafusion/examples/simple\_udaf.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL2V4YW1wbGVzL3NpbXBsZV91ZGFmLnJz) | `0.00% <0.00%> (ø)` | |
   | [rust/datafusion/src/datasource/memory.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL21lbW9yeS5ycw==) | `80.00% <ø> (ø)` | |
   | [rust/datafusion/src/datasource/parquet.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL3BhcnF1ZXQucnM=) | `94.33% <ø> (-0.24%)` | :arrow_down: |
   | [rust/datafusion/src/logical\_plan/extension.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vZXh0ZW5zaW9uLnJz) | `0.00% <ø> (ø)` | |
   | [rust/datafusion/src/physical\_plan/group\_scalar.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2dyb3VwX3NjYWxhci5ycw==) | `67.10% <0.00%> (-0.90%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/parquet.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3BhcnF1ZXQucnM=) | `88.10% <0.00%> (-0.14%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/planner.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3BsYW5uZXIucnM=) | `79.16% <ø> (+0.37%)` | :arrow_up: |
   | [rust/datafusion/src/physical\_plan/projection.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3Byb2plY3Rpb24ucnM=) | `84.93% <ø> (ø)` | |
   | [rust/datafusion/src/physical\_plan/udaf.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3VkYWYucnM=) | `78.94% <ø> (ø)` | |
   | ... and [54 more](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=footer). Last update [3fa8f79...a69e099](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#discussion_r562383540



##########
File path: rust/datafusion/src/physical_plan/functions.rs
##########
@@ -60,10 +59,15 @@ pub enum Signature {
     // A function such as `array` is `VariadicEqual`
     // The first argument decides the type used for coercion
     VariadicEqual,
+    /// fixed number of arguments of vector of vectors of valid types
+    // A function of one argument of f64 is `Uniform(vc![vec![vec![DataType::Float64]]])`
+    // A function of one argument of f64 or f32 is `Uniform(vec![vec![vec![DataType::Float32, DataType::Float64]]])`
+    // A function of two arguments with first argument of f64 or f32 and second argument of utf8 is `Uniform(vec![vec![vec![DataType::Float32, DataType::Float64], vec![DataType::Utf8]]])`
+    Uniform(Vec<Vec<Vec<DataType>>>),

Review comment:
       This signature generalizes `UniformEqual`, so, wouldn't it be possible generalize the other instead of creating a new one (replace the existing one by the more general form)?
   
   `Signature` should be such that its variants form a complete set of options without overlaps.

##########
File path: rust/datafusion/src/physical_plan/string_expressions.rs
##########
@@ -34,42 +35,553 @@ macro_rules! downcast_vec {
     }};
 }
 
-/// concatenate string columns together.
-pub fn concatenate(args: &[ArrayRef]) -> Result<StringArray> {
+/// Returns the numeric code of the first character of the argument.
+pub fn ascii<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<Int32Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| {
+            x.map(|x: &str| {
+                let mut chars = x.chars();
+                chars.next().map_or(0, |v| v as i32)
+            })
+        })
+        .collect())
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the start and end of string.
+pub fn btrim<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<StringArray> {
+    match args.len() {
+        0 => Err(DataFusionError::Internal(
+            "btrim was called with 0 arguments. It requires at least 1.".to_string(),
+        )),
+        1 => {
+            let string_array = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .map(|x| x.map(|x: &str| x.trim()))
+                .collect())
+        }
+        2 => {
+            let string_array = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let characters_array = args[1]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if characters_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let chars: Vec<char> =
+                                characters_array.value(i).chars().collect();
+                            x.trim_start_matches(&chars[..])
+                                .trim_end_matches(&chars[..])
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "btrim was called with {} arguments. It requires at most 2.",
+            other
+        ))),
+    }
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i32(args: &[ArrayRef]) -> Result<Int32Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i32>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i32))
+        .collect())
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i64(args: &[ArrayRef]) -> Result<Int64Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i64>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i64))
+        .collect())
+}
+
+/// Returns the character with the given code.
+pub fn chr(args: &[ArrayRef]) -> Result<StringArray> {
+    let array = args[0].as_any().downcast_ref::<Int64Array>().unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x: Option<i64>| {
+            x.map(|x| {
+                if x == 0 {
+                    Err(DataFusionError::Internal(
+                        "null character not permitted.".to_string(),
+                    ))
+                } else {
+                    match core::char::from_u32(x as u32) {
+                        Some(x) => Ok(x.to_string()),
+                        None => Err(DataFusionError::Internal(
+                            "requested character too large for encoding.".to_string(),
+                        )),
+                    }
+                }
+                .unwrap()
+            })
+        })
+        .collect())
+}
+
+/// Concatenates the text representations of all the arguments. NULL arguments are ignored.
+pub fn concat(args: &[ArrayRef]) -> Result<StringArray> {
     // downcast all arguments to strings
     let args = downcast_vec!(args, StringArray).collect::<Result<Vec<&StringArray>>>()?;
     // do not accept 0 arguments.
     if args.is_empty() {
         return Err(DataFusionError::Internal(
-            "Concatenate was called with 0 arguments. It requires at least one."
-                .to_string(),
+            "concat was called with 0 arguments. It requires at least 2.".to_string(),
         ));
     }
 
     let mut builder = StringBuilder::new(args.len());
     // for each entry in the array
     for index in 0..args[0].len() {
         let mut owned_string: String = "".to_owned();
-
-        // if any is null, the result is null
-        let mut is_null = false;
         for arg in &args {
-            if arg.is_null(index) {
-                is_null = true;
-                break; // short-circuit as we already know the result
-            } else {
+            if arg.is_valid(index) {
                 owned_string.push_str(&arg.value(index));
             }
         }
-        if is_null {
+        builder.append_value(&owned_string)?;
+    }
+    Ok(builder.finish())
+}
+
+/// Concatenates all but the first argument, with separators. The first argument is used as the separator string, and should not be NULL. Other NULL arguments are ignored.
+pub fn concat_ws(args: &[ArrayRef]) -> Result<StringArray> {
+    // downcast all arguments to strings
+    let args = downcast_vec!(args, StringArray).collect::<Result<Vec<&StringArray>>>()?;
+    // do not accept 0 or 1 arguments.
+    if args.len() < 2 {
+        return Err(DataFusionError::Internal(format!(
+            "concat_ws was called with {} arguments. It requires at least 2.",
+            args.len()
+        )));
+    }
+
+    let mut builder = StringBuilder::new(args.len());
+    // for each entry in the array
+    for index in 0..args[0].len() {
+        let mut owned_string: String = "".to_owned();
+        if args[0].is_null(index) {
             builder.append_null()?;
         } else {
+            let sep = args[0].value(index);
+            for arg_index in 1..args.len() {
+                let arg = &args[arg_index];
+                if !arg.is_null(index) {

Review comment:
       [optional: This can be simplified, generalized and become more performant by using `collect`.]

##########
File path: rust/datafusion/src/physical_plan/functions.rs
##########
@@ -499,20 +692,42 @@ fn signature(fun: &BuiltinScalarFunction) -> Signature {
     // for now, the list is small, as we do not have many built-in functions.

Review comment:
       this can go now xD

##########
File path: rust/datafusion/src/physical_plan/type_coercion.rs
##########
@@ -69,13 +69,42 @@ pub fn data_types(
     signature: &Signature,
 ) -> Result<Vec<DataType>> {
     let valid_types = match signature {
-        Signature::Variadic(valid_types) => valid_types
+        Signature::Any(number) => {
+            if current_types.len() != *number {
+                return Err(DataFusionError::Plan(format!(
+                    "The function expected {} arguments but received {}",
+                    number,
+                    current_types.len()
+                )));
+            }
+            vec![(0..*number).map(|i| current_types[i].clone()).collect()]
+        }
+        Signature::Exact(valid_types) => vec![valid_types.clone()],
+        Signature::Uniform(valid_types) => {
+            let valid_signature = valid_types
+                .iter()
+                .filter(|x| x.len() == current_types.len())
+                .collect::<Vec<_>>();
+            if valid_signature.len() != 1 {
+                return Err(DataFusionError::Plan(format!(
+                    "The function expected {} arguments but received {}",
+                    valid_types
+                        .iter()
+                        .map(|x| x.len().to_string())
+                        .collect::<Vec<_>>()
+                        .join(" or "),
+                    current_types.len()
+                )));
+            }
+            cartesian_product(valid_signature.first().unwrap())

Review comment:
       Wont this coerce any type to the first variant, even if the latter variant is accepted?
   
   I.e. if we use
   
   ```
   Uniform(vec![
       vec![vec![A]],
       vec![vec![B]],
   ])
   ```
   
   and pass arg types `vec![B]`, I would expect that no coercion would happen, but I suspect that this will coerce `B` to `A`, because the first entry with the same number of arguments is `vec![vec![A]]`.
   

##########
File path: rust/datafusion/src/physical_plan/string_expressions.rs
##########
@@ -34,38 +34,340 @@ macro_rules! downcast_vec {
     }};
 }
 
-/// concatenate string columns together.
-pub fn concatenate(args: &[ArrayRef]) -> Result<StringArray> {
+/// Returns the numeric code of the first character of the argument.
+pub fn ascii<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<Int32Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| {
+            x.map(|x: &str| {
+                let mut chars = x.chars();
+                chars.next().map_or(0, |v| v as i32)
+            })
+        })
+        .collect())
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the start and end of string.
+pub fn btrim<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<StringArray> {

Review comment:
       I think that this could be `Result<GenericStringArray<T>>` so that it supports both String and LargeString.

##########
File path: rust/datafusion/src/physical_plan/string_expressions.rs
##########
@@ -34,42 +35,553 @@ macro_rules! downcast_vec {
     }};
 }
 
-/// concatenate string columns together.
-pub fn concatenate(args: &[ArrayRef]) -> Result<StringArray> {
+/// Returns the numeric code of the first character of the argument.
+pub fn ascii<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<Int32Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| {
+            x.map(|x: &str| {
+                let mut chars = x.chars();
+                chars.next().map_or(0, |v| v as i32)
+            })
+        })
+        .collect())
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the start and end of string.
+pub fn btrim<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<StringArray> {
+    match args.len() {
+        0 => Err(DataFusionError::Internal(
+            "btrim was called with 0 arguments. It requires at least 1.".to_string(),
+        )),
+        1 => {
+            let string_array = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .map(|x| x.map(|x: &str| x.trim()))
+                .collect())
+        }
+        2 => {
+            let string_array = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let characters_array = args[1]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if characters_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let chars: Vec<char> =
+                                characters_array.value(i).chars().collect();
+                            x.trim_start_matches(&chars[..])
+                                .trim_end_matches(&chars[..])
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "btrim was called with {} arguments. It requires at most 2.",
+            other
+        ))),
+    }
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i32(args: &[ArrayRef]) -> Result<Int32Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i32>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i32))
+        .collect())
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i64(args: &[ArrayRef]) -> Result<Int64Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i64>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i64))
+        .collect())
+}
+
+/// Returns the character with the given code.
+pub fn chr(args: &[ArrayRef]) -> Result<StringArray> {
+    let array = args[0].as_any().downcast_ref::<Int64Array>().unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x: Option<i64>| {
+            x.map(|x| {
+                if x == 0 {
+                    Err(DataFusionError::Internal(
+                        "null character not permitted.".to_string(),
+                    ))
+                } else {
+                    match core::char::from_u32(x as u32) {
+                        Some(x) => Ok(x.to_string()),
+                        None => Err(DataFusionError::Internal(
+                            "requested character too large for encoding.".to_string(),
+                        )),
+                    }
+                }
+                .unwrap()
+            })
+        })
+        .collect())
+}
+
+/// Concatenates the text representations of all the arguments. NULL arguments are ignored.
+pub fn concat(args: &[ArrayRef]) -> Result<StringArray> {
     // downcast all arguments to strings
     let args = downcast_vec!(args, StringArray).collect::<Result<Vec<&StringArray>>>()?;
     // do not accept 0 arguments.
     if args.is_empty() {
         return Err(DataFusionError::Internal(
-            "Concatenate was called with 0 arguments. It requires at least one."
-                .to_string(),
+            "concat was called with 0 arguments. It requires at least 2.".to_string(),
         ));
     }
 
     let mut builder = StringBuilder::new(args.len());
     // for each entry in the array
     for index in 0..args[0].len() {
         let mut owned_string: String = "".to_owned();
-
-        // if any is null, the result is null
-        let mut is_null = false;
         for arg in &args {
-            if arg.is_null(index) {
-                is_null = true;
-                break; // short-circuit as we already know the result
-            } else {
+            if arg.is_valid(index) {
                 owned_string.push_str(&arg.value(index));
             }
         }
-        if is_null {
+        builder.append_value(&owned_string)?;
+    }
+    Ok(builder.finish())
+}
+
+/// Concatenates all but the first argument, with separators. The first argument is used as the separator string, and should not be NULL. Other NULL arguments are ignored.
+pub fn concat_ws(args: &[ArrayRef]) -> Result<StringArray> {
+    // downcast all arguments to strings
+    let args = downcast_vec!(args, StringArray).collect::<Result<Vec<&StringArray>>>()?;
+    // do not accept 0 or 1 arguments.
+    if args.len() < 2 {
+        return Err(DataFusionError::Internal(format!(
+            "concat_ws was called with {} arguments. It requires at least 2.",
+            args.len()
+        )));
+    }
+
+    let mut builder = StringBuilder::new(args.len());
+    // for each entry in the array
+    for index in 0..args[0].len() {
+        let mut owned_string: String = "".to_owned();
+        if args[0].is_null(index) {
             builder.append_null()?;
         } else {
+            let sep = args[0].value(index);
+            for arg_index in 1..args.len() {
+                let arg = &args[arg_index];
+                if !arg.is_null(index) {
+                    owned_string.push_str(&arg.value(index));
+                    // if not last push separator
+                    if arg_index != args.len() - 1 {
+                        owned_string.push_str(&sep);
+                    }
+                }
+            }
             builder.append_value(&owned_string)?;
-        }
+        };
     }
     Ok(builder.finish())
 }
 
+/// Converts the first letter of each word to upper case and the rest to lower case. Words are sequences of alphanumeric characters separated by non-alphanumeric characters.
+pub fn initcap<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<StringArray> {

Review comment:
       Same here: `Result<GeneticStringArray<T>>` generalizes this :)

##########
File path: rust/datafusion/src/physical_plan/string_expressions.rs
##########
@@ -34,42 +34,446 @@ macro_rules! downcast_vec {
     }};
 }
 
-/// concatenate string columns together.
-pub fn concatenate(args: &[ArrayRef]) -> Result<StringArray> {
+/// Returns the numeric code of the first character of the argument.
+pub fn ascii<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<Int32Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| {
+            x.map(|x: &str| {
+                let mut chars = x.chars();
+                chars.next().map_or(0, |v| v as i32)
+            })
+        })
+        .collect())
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the start and end of string.
+pub fn btrim<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<StringArray> {
+    match args.len() {
+        0 => Err(DataFusionError::Internal(
+            "btrim was called with 0 arguments. It requires at least one.".to_string(),
+        )),
+        1 => {
+            let string_array = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .map(|x| x.map(|x: &str| x.trim()))
+                .collect())
+        }
+        2 => {
+            let string_array = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let characters_array = args[1]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if characters_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let chars: Vec<char> =
+                                characters_array.value(i).chars().collect();
+                            x.trim_start_matches(&chars[..])
+                                .trim_end_matches(&chars[..])
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "btrim was called with {} arguments. It requires at most two.",
+            other
+        ))),
+    }
+}
+
+/// Returns the character with the given code.
+pub fn chr(args: &[ArrayRef]) -> Result<StringArray> {
+    let array = args[0].as_any().downcast_ref::<Int64Array>().unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x: Option<i64>| {
+            x.map(|x| {
+                if x == 0 {
+                    Err(DataFusionError::Internal(
+                        "null character not permitted.".to_string(),
+                    ))
+                } else {
+                    match core::char::from_u32(x as u32) {
+                        Some(x) => Ok(x.to_string()),
+                        None => Err(DataFusionError::Internal(
+                            "requested character too large for encoding.".to_string(),
+                        )),
+                    }
+                }
+                .unwrap()

Review comment:
       Why not error? If we remove that unwrap, the code should compile.

##########
File path: rust/datafusion/src/physical_plan/type_coercion.rs
##########
@@ -69,13 +69,42 @@ pub fn data_types(
     signature: &Signature,
 ) -> Result<Vec<DataType>> {
     let valid_types = match signature {
-        Signature::Variadic(valid_types) => valid_types
+        Signature::Any(number) => {
+            if current_types.len() != *number {
+                return Err(DataFusionError::Plan(format!(
+                    "The function expected {} arguments but received {}",
+                    number,
+                    current_types.len()
+                )));
+            }
+            vec![(0..*number).map(|i| current_types[i].clone()).collect()]
+        }
+        Signature::Exact(valid_types) => vec![valid_types.clone()],
+        Signature::Uniform(valid_types) => {
+            let valid_signature = valid_types
+                .iter()
+                .filter(|x| x.len() == current_types.len())
+                .collect::<Vec<_>>();
+            if valid_signature.len() != 1 {
+                return Err(DataFusionError::Plan(format!(
+                    "The function expected {} arguments but received {}",
+                    valid_types
+                        .iter()
+                        .map(|x| x.len().to_string())
+                        .collect::<Vec<_>>()
+                        .join(" or "),
+                    current_types.len()
+                )));
+            }
+            cartesian_product(valid_signature.first().unwrap())

Review comment:
       I suggest that we PR this separately with a single function that requires this type of signature, as we need to get this requires much more care than the other parts of this PR as it affects all future functions that use it.

##########
File path: rust/datafusion/src/physical_plan/string_expressions.rs
##########
@@ -34,42 +35,553 @@ macro_rules! downcast_vec {
     }};
 }
 
-/// concatenate string columns together.
-pub fn concatenate(args: &[ArrayRef]) -> Result<StringArray> {
+/// Returns the numeric code of the first character of the argument.
+pub fn ascii<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<Int32Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| {
+            x.map(|x: &str| {
+                let mut chars = x.chars();
+                chars.next().map_or(0, |v| v as i32)
+            })
+        })
+        .collect())
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the start and end of string.
+pub fn btrim<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<StringArray> {
+    match args.len() {
+        0 => Err(DataFusionError::Internal(
+            "btrim was called with 0 arguments. It requires at least 1.".to_string(),
+        )),
+        1 => {
+            let string_array = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .map(|x| x.map(|x: &str| x.trim()))
+                .collect())
+        }
+        2 => {
+            let string_array = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let characters_array = args[1]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if characters_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let chars: Vec<char> =
+                                characters_array.value(i).chars().collect();
+                            x.trim_start_matches(&chars[..])
+                                .trim_end_matches(&chars[..])
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "btrim was called with {} arguments. It requires at most 2.",
+            other
+        ))),
+    }
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i32(args: &[ArrayRef]) -> Result<Int32Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i32>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i32))
+        .collect())
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i64(args: &[ArrayRef]) -> Result<Int64Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i64>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i64))
+        .collect())
+}
+
+/// Returns the character with the given code.
+pub fn chr(args: &[ArrayRef]) -> Result<StringArray> {
+    let array = args[0].as_any().downcast_ref::<Int64Array>().unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array

Review comment:
       *and remove this `Ok` from here, so that `collect` is implicitly treated as `collect<Result<_>>` instead of `collect<_>`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#discussion_r575745239



##########
File path: rust/datafusion/src/physical_plan/string_expressions.rs
##########
@@ -34,40 +38,1167 @@ macro_rules! downcast_vec {
     }};
 }
 
-/// concatenate string columns together.
-pub fn concatenate(args: &[ArrayRef]) -> Result<StringArray> {
+/// Returns the numeric code of the first character of the argument.
+pub fn ascii<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<Int32Array> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| {
+            x.map(|x: &str| {
+                let mut chars = x.chars();
+                chars.next().map_or(0, |v| v as i32)
+            })
+        })
+        .collect())
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the start and end of string.
+pub fn btrim<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    match args.len() {
+        1 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .map(|x| x.map(|x: &str| x.trim()))
+                .collect())
+        }
+        2 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let characters_array: &GenericStringArray<T> = args[1]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if characters_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let chars: Vec<char> =
+                                characters_array.value(i).chars().collect();
+                            x.trim_start_matches(&chars[..])
+                                .trim_end_matches(&chars[..])
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "btrim was called with {} arguments. It requires at most 2.",
+            other
+        ))),
+    }
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i32(args: &[ArrayRef]) -> Result<Int32Array> {
+    let string_array: &GenericStringArray<i32> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i32>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i32))
+        .collect())
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i64(args: &[ArrayRef]) -> Result<Int64Array> {
+    let string_array: &GenericStringArray<i64> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i64>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i64))
+        .collect())
+}
+
+/// Returns the character with the given code.
+pub fn chr(args: &[ArrayRef]) -> Result<StringArray> {
+    let integer_array: &Int64Array =
+        args[0].as_any().downcast_ref::<Int64Array>().unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    integer_array
+        .iter()
+        .map(|x: Option<i64>| {
+            x.map(|x| {
+                if x == 0 {
+                    Err(DataFusionError::Internal(
+                        "null character not permitted.".to_string(),
+                    ))
+                } else {
+                    match core::char::from_u32(x as u32) {
+                        Some(x) => Ok(x.to_string()),
+                        None => Err(DataFusionError::Internal(
+                            "requested character too large for encoding.".to_string(),
+                        )),
+                    }
+                }
+            })
+            .transpose()
+        })
+        .collect()
+}
+
+/// Concatenates the text representations of all the arguments. NULL arguments are ignored.
+pub fn concat(args: &[ArrayRef]) -> Result<StringArray> {
     // downcast all arguments to strings
     let args = downcast_vec!(args, StringArray).collect::<Result<Vec<&StringArray>>>()?;
     // do not accept 0 arguments.
     if args.is_empty() {
         return Err(DataFusionError::Internal(
-            "Concatenate was called with 0 arguments. It requires at least one."
-                .to_string(),
+            "concat was called with 0 arguments. It requires at least 2.".to_string(),
         ));
     }
 
-    let mut builder = StringBuilder::new(args.len());
-    // for each entry in the array
-    for index in 0..args[0].len() {
-        let mut owned_string: String = "".to_owned();
-
-        // if any is null, the result is null
-        let mut is_null = false;
-        for arg in &args {
-            if arg.is_null(index) {
-                is_null = true;
-                break; // short-circuit as we already know the result
+    Ok((0..args[0].len())
+        .map(|index| {
+            let mut owned_string: String = "".to_owned();
+            for arg in &args {
+                if arg.is_valid(index) {
+                    owned_string.push_str(&arg.value(index));
+                };
+            }
+            Some(owned_string)
+        })
+        .collect())
+}
+
+/// Concatenates all but the first argument, with separators. The first argument is used as the separator string, and should not be NULL. Other NULL arguments are ignored.
+pub fn concat_ws(args: &[ArrayRef]) -> Result<StringArray> {
+    // downcast all arguments to strings
+    let args = downcast_vec!(args, StringArray).collect::<Result<Vec<&StringArray>>>()?;
+    // do not accept 0 or 1 arguments.
+    if args.len() < 2 {
+        return Err(DataFusionError::Internal(format!(
+            "concat_ws was called with {} arguments. It requires at least 2.",
+            args.len()
+        )));
+    }
+
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(args[0]
+        .iter()
+        .enumerate()
+        .map(|(index, x)| {
+            x.map(|sep: &str| {
+                let mut owned_string: String = "".to_owned();
+                for arg_index in 1..args.len() {
+                    let arg = &args[arg_index];
+                    if !arg.is_null(index) {
+                        owned_string.push_str(&arg.value(index));
+                        // if not last push separator
+                        if arg_index != args.len() - 1 {
+                            owned_string.push_str(&sep);
+                        }
+                    }
+                }
+                owned_string
+            })
+        })
+        .collect())
+}
+
+/// Converts the first letter of each word to upper case and the rest to lower case. Words are sequences of alphanumeric characters separated by non-alphanumeric characters.
+pub fn initcap<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| {
+            x.map(|x: &str| {
+                let mut char_vector = Vec::<char>::new();
+                let mut wasalnum = false;
+                for c in x.chars() {
+                    if wasalnum {
+                        char_vector.push(c.to_ascii_lowercase());
+                    } else {
+                        char_vector.push(c.to_ascii_uppercase());
+                    }
+                    wasalnum = ('A'..='Z').contains(&c)
+                        || ('a'..='z').contains(&c)
+                        || ('0'..='9').contains(&c);
+                }
+                char_vector.iter().collect::<String>()
+            })
+        })
+        .collect())
+}
+
+/// Returns first n characters in the string, or when n is negative, returns all but last |n| characters.
+pub fn left<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .ok_or_else(|| {
+            DataFusionError::Internal("could not cast string to StringArray".to_string())
+        })?;
+
+    let n_array: &Int64Array =
+        args[1]
+            .as_any()
+            .downcast_ref::<Int64Array>()
+            .ok_or_else(|| {
+                DataFusionError::Internal("could not cast n to Int64Array".to_string())
+            })?;
+
+    Ok(string_array
+        .iter()
+        .enumerate()
+        .map(|(i, x)| {
+            if n_array.is_null(i) {
+                None
             } else {
-                owned_string.push_str(&arg.value(index));
+                x.map(|x: &str| {
+                    let n: i64 = n_array.value(i);
+                    match n.cmp(&0) {
+                        Ordering::Equal => "",
+                        Ordering::Greater => {
+                            x.char_indices().nth(n as usize).map_or(x, |(i, _)| &x[..i])
+                        }
+                        Ordering::Less => x
+                            .char_indices()
+                            .rev()
+                            .nth(n.abs() as usize)
+                            .map_or("", |(i, _)| &x[..i + 1]),
+                    }
+                })
             }
+        })
+        .collect())
+}
+
+/// Extends the string to length length by prepending the characters fill (a space by default). If the string is already longer than length then it is truncated (on the right).
+pub fn lpad<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    match args.len() {
+        2 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let length_array: &Int64Array = args[1]
+                .as_any()
+                .downcast_ref::<Int64Array>()
+                .ok_or_else(|| {
+                    DataFusionError::Internal(
+                        "could not cast length to Int64Array".to_string(),
+                    )
+                })?;
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if length_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let length = length_array.value(i) as usize;
+                            if length == 0 {
+                                "".to_string()
+                            } else if length < x.len() {
+                                x[..length].to_string()
+                            } else {
+                                let mut s = x.to_string();
+                                s.insert_str(0, " ".repeat(length - x.len()).as_str());
+                                s
+                            }
+                        })
+                    }
+                })
+                .collect())
         }
-        if is_null {
-            builder.append_null()?;
-        } else {
-            builder.append_value(&owned_string)?;
+        3 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let length_array: &Int64Array =
+                args[1].as_any().downcast_ref::<Int64Array>().unwrap();
+
+            let fill_array: &GenericStringArray<T> = args[2]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if length_array.is_null(i) || fill_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let length = length_array.value(i) as usize;
+                            let fill_chars =
+                                fill_array.value(i).chars().collect::<Vec<char>>();
+                            if length == 0 {
+                                "".to_string()
+                            } else if length < x.len() {
+                                x[..length].to_string()
+                            } else if fill_chars.is_empty() {
+                                x.to_string()
+                            } else {
+                                let mut s = x.to_string();
+                                let mut char_vector =
+                                    Vec::<char>::with_capacity(length - x.len());
+                                for l in 0..length - x.len() {
+                                    char_vector.push(
+                                        *fill_chars.get(l % fill_chars.len()).unwrap(),
+                                    );
+                                }
+                                s.insert_str(
+                                    0,
+                                    char_vector.iter().collect::<String>().as_str(),
+                                );
+                                s
+                            }
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "lpad was called with {} arguments. It requires at least 2 and at most 3.",
+            other
+        ))),
+    }
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the start of string.
+pub fn ltrim<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    match args.len() {
+        1 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .map(|x| x.map(|x: &str| x.trim_start()))
+                .collect())
+        }
+        2 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let characters_array: &GenericStringArray<T> = args[1]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if characters_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let chars: Vec<char> =
+                                characters_array.value(i).chars().collect();
+                            x.trim_start_matches(&chars[..])
+                        })
+                    }
+                })
+                .collect())
         }
+        other => Err(DataFusionError::Internal(format!(
+            "ltrim was called with {} arguments. It requires at most 2.",
+            other
+        ))),
+    }
+}
+
+/// Repeats string the specified number of times.
+pub fn repeat<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    let number_array: &Int64Array =
+        args[1].as_any().downcast_ref::<Int64Array>().unwrap();
+
+    Ok(string_array
+        .iter()
+        .enumerate()
+        .map(|(i, x)| {
+            if number_array.is_null(i) {
+                None
+            } else {
+                x.map(|x: &str| x.repeat(number_array.value(i) as usize))
+            }
+        })
+        .collect())
+}
+
+/// Replaces all occurrences in string of substring from with substring to.
+pub fn replace<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    let from_array: &GenericStringArray<T> = args[1]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    let to_array: &GenericStringArray<T> = args[2]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    Ok(string_array
+        .iter()
+        .enumerate()
+        .map(|(i, x)| {
+            if from_array.is_null(i) || to_array.is_null(i) {
+                None
+            } else {
+                x.map(|x: &str| x.replace(from_array.value(i), to_array.value(i)))
+            }
+        })
+        .collect())
+}
+
+// used to replace POSIX capture groups (like \1) with Rust Regex group (like ${1})
+fn regex_replace_posix_groups(replacement: &str) -> String {
+    lazy_static! {
+        static ref CAPTURE_GROUPS_RE: Regex = Regex::new("(\\\\)(\\d*)").unwrap();
     }
-    Ok(builder.finish())
+    CAPTURE_GROUPS_RE
+        .replace_all(replacement, "$${$2}")
+        .into_owned()
+}
+
+/// Replaces substring(s) matching a POSIX regular expression
+pub fn regexp_replace<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    // creating Regex is expensive so create hashmap for memoization
+    let mut patterns: HashMap<String, Regex> = HashMap::new();
+
+    match args.len() {
+        3 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let pattern_array: &StringArray = args[1]
+                .as_any()
+                .downcast_ref::<StringArray>()
+                .unwrap();
+
+            let replacement_array: &StringArray = args[2]
+                .as_any()
+                .downcast_ref::<StringArray>()
+                .unwrap();
+
+            string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if pattern_array.is_null(i) || replacement_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let pattern = pattern_array.value(i).to_string();
+                            let replacement = regex_replace_posix_groups(replacement_array.value(i));
+                            let re = match patterns.get(pattern_array.value(i)) {
+                                Some(re) => Ok(re.clone()),
+                                None => {
+                                    match Regex::new(pattern.as_str()) {
+                                        Ok(re) => {
+                                            patterns.insert(pattern, re.clone());
+                                            Ok(re)
+                                        },
+                                        Err(err) => Err(DataFusionError::Execution(err.to_string())),
+                                    }
+                                }
+                            };
+                            re.map(|re| re.replace(x, replacement.as_str()))
+                        })
+                    }.transpose()
+                })
+                .collect()
+        }
+        4 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let pattern_array: &StringArray = args[1]
+                .as_any()
+                .downcast_ref::<StringArray>()
+                .unwrap();
+
+            let replacement_array: &StringArray = args[2]
+                .as_any()
+                .downcast_ref::<StringArray>()
+                .unwrap();
+
+            let flags_array: &StringArray = args[3]
+                .as_any()
+                .downcast_ref::<StringArray>()
+                .unwrap();
+
+            string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if pattern_array.is_null(i) || replacement_array.is_null(i) || flags_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let replacement = regex_replace_posix_groups(replacement_array.value(i));
+
+                            let flags = flags_array.value(i);
+                            let (pattern, replace_all) = if flags == "g" {
+                                (pattern_array.value(i).to_string(), true)
+                            } else if flags.contains('g') {
+                                (format!("(?{}){}", flags.to_string().replace("g", ""), pattern_array.value(i)), true)
+                            } else {
+                                (format!("(?{}){}", flags, pattern_array.value(i)), false)
+                            };
+
+                            let re = match patterns.get(pattern_array.value(i)) {
+                                Some(re) => Ok(re.clone()),
+                                None => {
+                                    match Regex::new(pattern.as_str()) {
+                                        Ok(re) => {
+                                            patterns.insert(pattern, re.clone());
+                                            Ok(re)
+                                        },
+                                        Err(err) => Err(DataFusionError::Execution(err.to_string())),
+                                    }
+                                }
+                            };
+
+                            re.map(|re| {
+                                if replace_all {
+                                    re.replace_all(x, replacement.as_str())
+                                } else {
+                                    re.replace(x, replacement.as_str())
+                                }
+                            })
+                        })
+                    }.transpose()
+                })
+                .collect()
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "regexp_replace was called with {} arguments. It requires at least 3 and at most 4.",
+            other
+        ))),
+    }
+}
+
+/// Reverses the order of the characters in the string.
+pub fn reverse<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).rev().collect::<String>()))
+        .collect())
+}
+
+/// Returns last n characters in the string, or when n is negative, returns all but first |n| characters.
+pub fn right<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .ok_or_else(|| {
+            DataFusionError::Internal("could not cast string to StringArray".to_string())
+        })?;
+
+    let n_array: &Int64Array =
+        args[1]
+            .as_any()
+            .downcast_ref::<Int64Array>()
+            .ok_or_else(|| {
+                DataFusionError::Internal("could not cast n to Int64Array".to_string())
+            })?;
+
+    Ok(string_array
+        .iter()
+        .enumerate()
+        .map(|(i, x)| {
+            if n_array.is_null(i) {
+                None
+            } else {
+                x.map(|x: &str| {
+                    let n: i64 = n_array.value(i);
+                    match n.cmp(&0) {
+                        Ordering::Equal => "",
+                        Ordering::Greater => x
+                            .char_indices()
+                            .nth(n as usize)
+                            .map_or(x, |(i, _)| &x[i + 1..]),
+                        Ordering::Less => x
+                            .char_indices()
+                            .rev()
+                            .nth(n.abs() as usize)
+                            .map_or("", |(i, _)| &x[i..]),
+                    }
+                })
+            }
+        })
+        .collect())
+}
+
+/// Extends the string to length length by appending the characters fill (a space by default). If the string is already longer than length then it is truncated.
+pub fn rpad<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    match args.len() {
+        2 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let length_array: &Int64Array = args[1]
+                .as_any()
+                .downcast_ref::<Int64Array>()
+                .ok_or_else(|| {
+                    DataFusionError::Internal(
+                        "could not cast length to Int64Array".to_string(),
+                    )
+                })?;
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if length_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let length = length_array.value(i) as usize;
+                            if length == 0 {
+                                "".to_string()
+                            } else if length < x.len() {
+                                x[..length].to_string()
+                            } else {
+                                let mut s = x.to_string();
+                                s.push_str(" ".repeat(length - x.len()).as_str());
+                                s
+                            }
+                        })
+                    }
+                })
+                .collect())
+        }
+        3 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let length_array: &Int64Array =
+                args[1].as_any().downcast_ref::<Int64Array>().unwrap();
+
+            let fill_array: &GenericStringArray<T> = args[2]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if length_array.is_null(i) || fill_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let length = length_array.value(i) as usize;
+                            let fill_chars =
+                                fill_array.value(i).chars().collect::<Vec<char>>();
+                            if length == 0 {
+                                "".to_string()
+                            } else if length < x.len() {
+                                x[..length].to_string()
+                            } else if fill_chars.is_empty() {
+                                x.to_string()
+                            } else {
+                                let mut s = x.to_string();
+                                let mut char_vector =
+                                    Vec::<char>::with_capacity(length - x.len());
+                                for l in 0..length - x.len() {
+                                    char_vector.push(
+                                        *fill_chars.get(l % fill_chars.len()).unwrap(),
+                                    );
+                                }
+                                s.push_str(
+                                    char_vector.iter().collect::<String>().as_str(),
+                                );
+                                s
+                            }
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "rpad was called with {} arguments. It requires at least 2 and at most 3.",
+            other
+        ))),
+    }
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the end of string.
+pub fn rtrim<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    match args.len() {
+        1 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .map(|x| x.map(|x: &str| x.trim_end()))
+                .collect())
+        }
+        2 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let characters_array: &GenericStringArray<T> = args[1]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if characters_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let chars: Vec<char> =
+                                characters_array.value(i).chars().collect();
+                            x.trim_end_matches(&chars[..])
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "rtrim was called with {} arguments. It requires at most two.",
+            other
+        ))),
+    }
+}
+
+/// Repeats string the specified number of times.
+pub fn split_part<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    let delimiter_array: &GenericStringArray<T> = args[1]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    let n_array: &Int64Array = args[2].as_any().downcast_ref::<Int64Array>().unwrap();
+
+    string_array
+        .iter()
+        .enumerate()
+        .map(|(i, x)| {
+            if delimiter_array.is_null(i) || n_array.is_null(i) {
+                Ok(None)
+            } else {
+                x.map(|x: &str| {
+                    let delimiter = delimiter_array.value(i);
+                    let n = n_array.value(i);
+                    if n <= 0 {
+                        Err(DataFusionError::Execution(
+                            "negative substring length not allowed".to_string(),
+                        ))
+                    } else {
+                        let v: Vec<&str> = x.split(delimiter).collect();
+                        match v.get(n as usize - 1) {
+                            Some(s) => Ok(*s),
+                            None => Ok(""),
+                        }
+                    }
+                })
+                .transpose()
+            }
+        })
+        .collect()
+}
+
+/// Returns true if string starts with prefix.
+pub fn starts_with<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<BooleanArray> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    let prefix_array: &GenericStringArray<T> = args[1]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    Ok(string_array
+        .iter()
+        .enumerate()
+        .map(|(i, x)| {
+            if prefix_array.is_null(i) {
+                None
+            } else {
+                x.map(|x: &str| x.starts_with(prefix_array.value(i)))
+            }
+        })
+        .collect())
+}
+
+/// Returns starting index of specified substring within string, or zero if it's not present. (Same as position(substring in string), but note the reversed argument order.)
+pub fn strpos_i32(args: &[ArrayRef]) -> Result<Int32Array> {
+    let string_array: &GenericStringArray<i32> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i32>>()
+        .ok_or_else(|| {
+            DataFusionError::Internal("could not cast string to StringArray".to_string())
+        })?;
+
+    let substring_array: &GenericStringArray<i32> = args[1]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i32>>()
+        .ok_or_else(|| {
+            DataFusionError::Internal(
+                "could not cast substring to StringArray".to_string(),
+            )
+        })?;
+
+    Ok(string_array
+        .iter()
+        .enumerate()
+        .map(|(i, x)| {
+            if substring_array.is_null(i) {
+                None
+            } else {
+                x.map(|x: &str| {
+                    let substring: &str = substring_array.value(i);
+                    // the rfind method returns the byte index which may or may not be the same as the character index due to UTF8 encoding
+                    // this method first finds the matching byte using rfind
+                    // then maps that to the character index by matching on the grapheme_index of the byte_index
+                    x.to_string().rfind(substring).map_or(0, |byte_offset| {
+                        x.grapheme_indices(true)
+                            .collect::<Vec<(usize, &str)>>()
+                            .iter()
+                            .enumerate()
+                            .filter(|(_, (offset, _))| *offset == byte_offset)
+                            .map(|(index, _)| index as i32)
+                            .collect::<Vec<i32>>()
+                            .first()
+                            .unwrap()
+                            .to_owned()
+                            + 1
+                    })
+                })
+            }
+        })
+        .collect())
+}
+
+/// Returns starting index of specified substring within string, or zero if it's not present. (Same as position(substring in string), but note the reversed argument order.)
+pub fn strpos_i64(args: &[ArrayRef]) -> Result<Int64Array> {
+    let string_array: &GenericStringArray<i64> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i64>>()
+        .ok_or_else(|| {
+            DataFusionError::Internal("could not cast string to StringArray".to_string())
+        })?;
+
+    let substring_array: &GenericStringArray<i64> = args[1]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i64>>()
+        .ok_or_else(|| {
+            DataFusionError::Internal(
+                "could not cast substring to StringArray".to_string(),
+            )
+        })?;
+
+    Ok(string_array
+        .iter()
+        .enumerate()
+        .map(|(i, x)| {
+            if substring_array.is_null(i) {
+                None
+            } else {
+                x.map(|x: &str| {
+                    let substring: &str = substring_array.value(i);
+                    // the rfind method returns the byte index which may or may not be the same as the character index due to UTF8 encoding
+                    // this method first finds the matching byte using rfind
+                    // then maps that to the character index by matching on the grapheme_index of the byte_index
+                    x.to_string().rfind(substring).map_or(0, |byte_offset| {
+                        x.grapheme_indices(true)
+                            .collect::<Vec<(usize, &str)>>()
+                            .iter()
+                            .enumerate()
+                            .filter(|(_, (offset, _))| *offset == byte_offset)
+                            .map(|(index, _)| index as i64)
+                            .collect::<Vec<i64>>()
+                            .first()
+                            .unwrap()
+                            .to_owned()
+                            + 1
+                    })
+                })
+            }
+        })
+        .collect())
+}
+
+/// Extracts the substring of string starting at the start'th character, and extending for count characters if that is specified. (Same as substring(string from start for count).)
+pub fn substr<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    match args.len() {
+        2 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .ok_or_else(|| {
+                    DataFusionError::Internal(
+                        "could not cast string to StringArray".to_string(),
+                    )
+                })?;
+
+            let start_array: &Int64Array = args[1]
+                .as_any()
+                .downcast_ref::<Int64Array>()
+                .ok_or_else(|| {
+                    DataFusionError::Internal(
+                        "could not cast start to Int64Array".to_string(),
+                    )
+                })?;
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if start_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let start: i64 = start_array.value(i);
+                            let start_usize = start as usize;
+                            if start <= 0 {
+                                x
+                            } else if x.len() < start_usize {
+                                ""
+                            } else {
+                                &x[start_usize - 1..]
+                            }
+                        })
+                    }
+                })
+                .collect())
+        }
+        3 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .ok_or_else(|| {
+                    DataFusionError::Internal(
+                        "could not cast string to StringArray".to_string(),
+                    )
+                })?;
+
+            let start_array: &Int64Array = args[1]
+                .as_any()
+                .downcast_ref::<Int64Array>()
+                .ok_or_else(|| {
+                    DataFusionError::Internal(
+                        "could not cast start to Int64Array".to_string(),
+                    )
+                })?;
+
+            let count_array: &Int64Array = args[2]
+                .as_any()
+                .downcast_ref::<Int64Array>()
+                .ok_or_else(|| {
+                    DataFusionError::Internal(
+                        "could not cast count to Int64Array".to_string(),
+                    )
+                })?;
+
+            string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if start_array.is_null(i) || count_array.is_null(i) {
+                        Ok(None)
+                    } else {
+                        x.map(|x: &str| {
+                            let start: i64 = start_array.value(i);
+                            let count = count_array.value(i);
+
+                            let start_pos = (start as usize) - 1;
+                            let count_usize = count as usize;
+
+                            if count < 0 {
+                                Err(DataFusionError::Execution(
+                                    "negative substring length not allowed".to_string(),
+                                ))
+                            } else if start <= 0 {
+                                Ok(x)
+                            } else if x.len() < start_pos {
+                                Ok("")
+                            } else if x.len() < start_pos + count_usize {
+                                Ok(&x[start_pos..])
+                            } else {
+                                Ok(&x[start_pos..start_pos + count_usize])
+                            }
+                        })
+                        .transpose()
+                    }
+                })
+                .collect()
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "substr was called with {} arguments. It requires 2 or 3.",
+            other
+        ))),
+    }
+}
+
+/// Converts the number to its equivalent hexadecimal representation.
+pub fn to_hex_i32(args: &[ArrayRef]) -> Result<GenericStringArray<i32>> {
+    let integer_array: &Int32Array =
+        args[0].as_any().downcast_ref::<Int32Array>().unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(integer_array
+        .iter()
+        .map(|x| x.map(|x| format!("{:x}", x)))
+        .collect())
+}
+
+/// Converts the number to its equivalent hexadecimal representation.
+pub fn to_hex_i64(args: &[ArrayRef]) -> Result<GenericStringArray<i64>> {
+    let integer_array: &Int64Array =
+        args[0].as_any().downcast_ref::<Int64Array>().unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(integer_array
+        .iter()
+        .map(|x| x.map(|x| format!("{:x}", x)))
+        .collect())
+}
+
+/// Replaces each character in string that matches a character in the from set with the corresponding character in the to set. If from is longer than to, occurrences of the extra characters in from are deleted.
+pub fn translate<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    let from_array: &GenericStringArray<T> = args[1]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    let to_array: &GenericStringArray<T> = args[2]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    Ok(string_array
+        .iter()
+        .enumerate()
+        .map(|(i, x)| {
+            if from_array.is_null(i) || to_array.is_null(i) {
+                None
+            } else {
+                x.map(|x: &str| {
+                    let from = from_array.value(i).chars().collect::<Vec<char>>();
+                    // create a hashmap to change from O(n) to O(1) from lookup
+                    let mut from_map: HashMap<char, usize> =
+                        HashMap::with_capacity(from.len());
+                    from.iter().enumerate().for_each(|(index, c)| {
+                        from_map.insert(c.to_owned(), index);
+                    });

Review comment:
       this is a good idea as it probably lets the compiler optimize better. Thanks :+1:




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#discussion_r575718838



##########
File path: rust/datafusion/src/physical_plan/functions.rs
##########
@@ -702,14 +1197,912 @@ mod tests {
         let result = result.as_any().downcast_ref::<StringArray>().unwrap();
 
         // value is correct
-        assert_eq!(result.value(0).to_string(), expected);
+        match expected {
+            Some(v) => assert_eq!(result.value(0), v),
+            None => assert!(result.is_null(0)),
+        };
 
         Ok(())
     }
 
     #[test]
-    fn test_concat_utf8() -> Result<()> {
-        test_concat(ScalarValue::Utf8(Some("aa".to_string())), "aaaa")
+    fn test_string_functions() -> Result<()> {
+        generic_string_function(
+            BuiltinScalarFunction::Btrim,
+            vec![
+                lit(ScalarValue::Utf8(Some("xyxtrimyyx".to_string()))),
+                lit(ScalarValue::Utf8(Some("xyz".to_string()))),
+            ],
+            Some("trim"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Btrim,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("xyz".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Btrim,
+            vec![
+                lit(ScalarValue::Utf8(Some("xyxtrimyyx".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Chr,
+            vec![lit(ScalarValue::Int64(Some(128175)))],
+            Some("💯"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Chr,
+            vec![lit(ScalarValue::Int64(None))],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Chr,
+            vec![lit(ScalarValue::Int64(Some(120)))],
+            Some("x"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Chr,
+            vec![lit(ScalarValue::Int64(Some(128175)))],
+            Some("💯"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Chr,
+            vec![lit(ScalarValue::Int64(None))],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Concat,
+            vec![
+                lit(ScalarValue::Utf8(Some("aa".to_string()))),
+                lit(ScalarValue::Utf8(Some("bb".to_string()))),
+                lit(ScalarValue::Utf8(Some("cc".to_string()))),
+            ],
+            Some("aabbcc"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Concat,
+            vec![
+                lit(ScalarValue::Utf8(Some("aa".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("cc".to_string()))),
+            ],
+            Some("aacc"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Concat,
+            vec![lit(ScalarValue::Utf8(None))],
+            Some(""),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::ConcatWithSeparator,
+            vec![
+                lit(ScalarValue::Utf8(Some("|".to_string()))),
+                lit(ScalarValue::Utf8(Some("aa".to_string()))),
+                lit(ScalarValue::Utf8(Some("bb".to_string()))),
+                lit(ScalarValue::Utf8(Some("cc".to_string()))),
+            ],
+            Some("aa|bb|cc"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::ConcatWithSeparator,
+            vec![
+                lit(ScalarValue::Utf8(Some("|".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+            ],
+            Some(""),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::ConcatWithSeparator,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("aa".to_string()))),
+                lit(ScalarValue::Utf8(Some("bb".to_string()))),
+                lit(ScalarValue::Utf8(Some("cc".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::ConcatWithSeparator,
+            vec![
+                lit(ScalarValue::Utf8(Some("|".to_string()))),
+                lit(ScalarValue::Utf8(Some("aa".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("cc".to_string()))),
+            ],
+            Some("aa|cc"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::InitCap,
+            vec![lit(ScalarValue::Utf8(Some("hi THOMAS".to_string())))],
+            Some("Hi Thomas"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::InitCap,
+            vec![lit(ScalarValue::Utf8(Some("".to_string())))],
+            Some(""),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::InitCap,
+            vec![lit(ScalarValue::Utf8(None))],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Left,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int8(Some(2))),
+            ],
+            Some("ab"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Left,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int64(Some(200))),
+            ],
+            Some("abcde"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Left,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int64(Some(-2))),
+            ],
+            Some("abc"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Left,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int64(Some(-200))),
+            ],
+            Some(""),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Left,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int64(Some(0))),
+            ],
+            Some(""),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Left,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Int64(Some(2))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Left,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int64(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(5))),
+            ],
+            Some("   hi"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(0))),
+            ],
+            Some(""),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Int64(Some(5))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(5))),
+                lit(ScalarValue::Utf8(Some("xy".to_string()))),
+            ],
+            Some("xyxhi"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(21))),
+                lit(ScalarValue::Utf8(Some("abcdef".to_string()))),
+            ],
+            Some("abcdefabcdefabcdefahi"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(5))),
+                lit(ScalarValue::Utf8(Some(" ".to_string()))),
+            ],
+            Some("   hi"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(5))),
+                lit(ScalarValue::Utf8(Some("".to_string()))),
+            ],
+            Some("hi"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Int64(Some(5))),
+                lit(ScalarValue::Utf8(Some("xy".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(None)),
+                lit(ScalarValue::Utf8(Some("xy".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(5))),
+                lit(ScalarValue::Utf8(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Utf8(Some("5".to_string()))),
+                lit(ScalarValue::Utf8(Some("xy".to_string()))),
+            ],
+            Some("xyxhi"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lower,
+            vec![lit(ScalarValue::Utf8(Some("LOWER".to_string())))],
+            Some("lower"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lower,
+            vec![lit(ScalarValue::Utf8(Some("lower".to_string())))],
+            Some("lower"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lower,
+            vec![lit(ScalarValue::Utf8(None))],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Ltrim,
+            vec![lit(ScalarValue::Utf8(Some(" trim".to_string())))],
+            Some("trim"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Ltrim,
+            vec![lit(ScalarValue::Utf8(Some("trim ".to_string())))],
+            Some("trim "),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Ltrim,
+            vec![lit(ScalarValue::Utf8(Some("trim".to_string())))],
+            Some("trim"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Ltrim,
+            vec![lit(ScalarValue::Utf8(None))],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::RegexpReplace,
+            vec![
+                lit(ScalarValue::Utf8(Some("Thomas".to_string()))),
+                lit(ScalarValue::Utf8(Some(".[mN]a.".to_string()))),
+                lit(ScalarValue::Utf8(Some("M".to_string()))),
+            ],
+            Some("ThM"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::RegexpReplace,
+            vec![
+                lit(ScalarValue::Utf8(Some("foobarbaz".to_string()))),
+                lit(ScalarValue::Utf8(Some("b..".to_string()))),
+                lit(ScalarValue::Utf8(Some("X".to_string()))),
+            ],
+            Some("fooXbaz"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::RegexpReplace,
+            vec![
+                lit(ScalarValue::Utf8(Some("foobarbaz".to_string()))),
+                lit(ScalarValue::Utf8(Some("b..".to_string()))),
+                lit(ScalarValue::Utf8(Some("X".to_string()))),
+                lit(ScalarValue::Utf8(Some("g".to_string()))),
+            ],
+            Some("fooXX"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::RegexpReplace,
+            vec![
+                lit(ScalarValue::Utf8(Some("foobarbaz".to_string()))),
+                lit(ScalarValue::Utf8(Some("b(..)".to_string()))),
+                lit(ScalarValue::Utf8(Some("X\\1Y".to_string()))),
+                lit(ScalarValue::Utf8(Some("g".to_string()))),
+            ],
+            Some("fooXarYXazY"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::RegexpReplace,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("b(..)".to_string()))),
+                lit(ScalarValue::Utf8(Some("X\\1Y".to_string()))),
+                lit(ScalarValue::Utf8(Some("g".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::RegexpReplace,
+            vec![
+                lit(ScalarValue::Utf8(Some("foobarbaz".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("X\\1Y".to_string()))),
+                lit(ScalarValue::Utf8(Some("g".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::RegexpReplace,
+            vec![
+                lit(ScalarValue::Utf8(Some("foobarbaz".to_string()))),
+                lit(ScalarValue::Utf8(Some("b(..)".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("g".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::RegexpReplace,
+            vec![
+                lit(ScalarValue::Utf8(Some("foobarbaz".to_string()))),
+                lit(ScalarValue::Utf8(Some("b(..)".to_string()))),
+                lit(ScalarValue::Utf8(Some("X\\1Y".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::RegexpReplace,
+            vec![
+                lit(ScalarValue::Utf8(Some("ABCabcABC".to_string()))),
+                lit(ScalarValue::Utf8(Some("(abc)".to_string()))),
+                lit(ScalarValue::Utf8(Some("X".to_string()))),
+                lit(ScalarValue::Utf8(Some("gi".to_string()))),
+            ],
+            Some("XXX"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::RegexpReplace,
+            vec![
+                lit(ScalarValue::Utf8(Some("ABCabcABC".to_string()))),
+                lit(ScalarValue::Utf8(Some("(abc)".to_string()))),
+                lit(ScalarValue::Utf8(Some("X".to_string()))),
+                lit(ScalarValue::Utf8(Some("i".to_string()))),
+            ],
+            Some("XabcABC"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Repeat,
+            vec![
+                lit(ScalarValue::Utf8(Some("Pg".to_string()))),
+                lit(ScalarValue::Int64(Some(4))),
+            ],
+            Some("PgPgPgPg"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Repeat,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Int64(Some(4))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Repeat,
+            vec![
+                lit(ScalarValue::Utf8(Some("Pg".to_string()))),
+                lit(ScalarValue::Int64(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Replace,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcdefabcdef".to_string()))),
+                lit(ScalarValue::Utf8(Some("cd".to_string()))),
+                lit(ScalarValue::Utf8(Some("XX".to_string()))),
+            ],
+            Some("abXXefabXXef"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Replace,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcdefabcdef".to_string()))),
+                lit(ScalarValue::Utf8(Some("notmatch".to_string()))),
+                lit(ScalarValue::Utf8(Some("XX".to_string()))),
+            ],
+            Some("abcdefabcdef"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Replace,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("cd".to_string()))),
+                lit(ScalarValue::Utf8(Some("XX".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Replace,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcdefabcdef".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("XX".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Replace,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcdefabcdef".to_string()))),
+                lit(ScalarValue::Utf8(Some("cd".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Reverse,
+            vec![lit(ScalarValue::Utf8(Some("abcde".to_string())))],
+            Some("edcba"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Reverse,
+            vec![lit(ScalarValue::Utf8(Some("loẅks".to_string())))],
+            Some("skẅol"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Reverse,
+            vec![lit(ScalarValue::Utf8(None))],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Right,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int8(Some(2))),
+            ],
+            Some("de"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Right,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int64(Some(200))),
+            ],
+            Some("abcde"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Right,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int64(Some(-2))),
+            ],
+            Some("cde"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Right,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int64(Some(-200))),
+            ],
+            Some(""),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Right,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int64(Some(0))),
+            ],
+            Some(""),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Right,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Int64(Some(2))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Right,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int64(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(5))),
+            ],
+            Some("hi   "),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(0))),
+            ],
+            Some(""),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Int64(Some(5))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(5))),
+                lit(ScalarValue::Utf8(Some("xy".to_string()))),
+            ],
+            Some("hixyx"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(21))),
+                lit(ScalarValue::Utf8(Some("abcdef".to_string()))),
+            ],
+            Some("hiabcdefabcdefabcdefa"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(5))),
+                lit(ScalarValue::Utf8(Some(" ".to_string()))),
+            ],
+            Some("hi   "),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(5))),
+                lit(ScalarValue::Utf8(Some("".to_string()))),
+            ],
+            Some("hi"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Int64(Some(5))),
+                lit(ScalarValue::Utf8(Some("xy".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(None)),
+                lit(ScalarValue::Utf8(Some("xy".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(5))),
+                lit(ScalarValue::Utf8(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rtrim,
+            vec![lit(ScalarValue::Utf8(Some("trim ".to_string())))],
+            Some("trim"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rtrim,
+            vec![lit(ScalarValue::Utf8(Some(" trim".to_string())))],
+            Some(" trim"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rtrim,
+            vec![lit(ScalarValue::Utf8(Some("trim".to_string())))],
+            Some("trim"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rtrim,
+            vec![lit(ScalarValue::Utf8(None))],
+            None,
+        )?;
+        generic_string_function(

Review comment:
       Good catch and this was indeed incorrectly implemented. I have fixed the functions and added some tests. Thanks :+1:




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#issuecomment-778716368


   @alamb Thanks for your extreme attention to detail and yes it is absolutely IMPLEMENT ALL THE FUNCTIONS 😆 
   
   I have addressed and resolved most of the comments you have made. The remaining unresolved comments do require further discussion.
   
   I am happy to do the split based on your suggestions and I'm ok to raise the tickets:
   
   - `bit_length` kernels + `length` comments
   - `Signature::OneOf`
   - Length functions (BitLength, etc)
   - Ascii/unicode functions
   - Regex functions
   - Pad/trim functions
   
   Obviously this is a lot of work but this should allow us to split up the reviews more fairly. I will start the PR-mageddon.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#issuecomment-777770237


   I plan to try and review this probably this weekend. I wonder if we should update the title to remove the `"WIP"`


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#discussion_r562390994



##########
File path: rust/datafusion/src/physical_plan/string_expressions.rs
##########
@@ -34,42 +35,553 @@ macro_rules! downcast_vec {
     }};
 }
 
-/// concatenate string columns together.
-pub fn concatenate(args: &[ArrayRef]) -> Result<StringArray> {
+/// Returns the numeric code of the first character of the argument.
+pub fn ascii<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<Int32Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| {
+            x.map(|x: &str| {
+                let mut chars = x.chars();
+                chars.next().map_or(0, |v| v as i32)
+            })
+        })
+        .collect())
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the start and end of string.
+pub fn btrim<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<StringArray> {
+    match args.len() {
+        0 => Err(DataFusionError::Internal(
+            "btrim was called with 0 arguments. It requires at least 1.".to_string(),
+        )),
+        1 => {
+            let string_array = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .map(|x| x.map(|x: &str| x.trim()))
+                .collect())
+        }
+        2 => {
+            let string_array = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let characters_array = args[1]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if characters_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let chars: Vec<char> =
+                                characters_array.value(i).chars().collect();
+                            x.trim_start_matches(&chars[..])
+                                .trim_end_matches(&chars[..])
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "btrim was called with {} arguments. It requires at most 2.",
+            other
+        ))),
+    }
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i32(args: &[ArrayRef]) -> Result<Int32Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i32>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i32))
+        .collect())
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i64(args: &[ArrayRef]) -> Result<Int64Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i64>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i64))
+        .collect())
+}
+
+/// Returns the character with the given code.
+pub fn chr(args: &[ArrayRef]) -> Result<StringArray> {
+    let array = args[0].as_any().downcast_ref::<Int64Array>().unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array

Review comment:
       *and remove this `Ok` from here, so that `collect` is implicitly treated as `collect<Result<_>>` instead of `collect<_>`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io commented on pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]

Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#issuecomment-764451929


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=h1) Report
   > Merging [#9243](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=desc) (64abbc5) into [master](https://codecov.io/gh/apache/arrow/commit/1393188e1aa1b3d59993ce7d4ade7f7ac8570959?el=desc) (1393188) will **increase** coverage by `0.18%`.
   > The diff coverage is `93.35%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9243/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9243      +/-   ##
   ==========================================
   + Coverage   81.61%   81.79%   +0.18%     
   ==========================================
     Files         215      215              
     Lines       51867    53138    +1271     
   ==========================================
   + Hits        42329    43466    +1137     
   - Misses       9538     9672     +134     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/datafusion/src/logical\_plan/expr.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vZXhwci5ycw==) | `78.94% <ø> (+1.81%)` | :arrow_up: |
   | [...datafusion/src/physical\_plan/string\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3N0cmluZ19leHByZXNzaW9ucy5ycw==) | `85.56% <85.39%> (-1.39%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/type\_coercion.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3R5cGVfY29lcmNpb24ucnM=) | `94.44% <92.42%> (-4.10%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/functions.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2Z1bmN0aW9ucy5ycw==) | `84.27% <94.47%> (+11.97%)` | :arrow_up: |
   | [rust/datafusion/src/physical\_plan/aggregates.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2FnZ3JlZ2F0ZXMucnM=) | `91.13% <100.00%> (ø)` | |
   | [rust/datafusion/tests/sql.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3Rlc3RzL3NxbC5ycw==) | `99.84% <100.00%> (+<0.01%)` | :arrow_up: |
   | [rust/parquet/src/arrow/schema.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9hcnJvdy9zY2hlbWEucnM=) | `91.67% <100.00%> (+0.17%)` | :arrow_up: |
   | [rust/arrow/src/array/array\_list.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfbGlzdC5ycw==) | `83.21% <0.00%> (-9.89%)` | :arrow_down: |
   | [rust/datafusion/src/datasource/memory.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL21lbW9yeS5ycw==) | `79.75% <0.00%> (-6.52%)` | :arrow_down: |
   | [rust/benchmarks/src/bin/tpch.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9iZW5jaG1hcmtzL3NyYy9iaW4vdHBjaC5ycw==) | `6.97% <0.00%> (-5.22%)` | :arrow_down: |
   | ... and [94 more](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=footer). Last update [1401359...64abbc5](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#discussion_r562391477



##########
File path: rust/datafusion/src/physical_plan/string_expressions.rs
##########
@@ -34,42 +35,553 @@ macro_rules! downcast_vec {
     }};
 }
 
-/// concatenate string columns together.
-pub fn concatenate(args: &[ArrayRef]) -> Result<StringArray> {
+/// Returns the numeric code of the first character of the argument.
+pub fn ascii<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<Int32Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| {
+            x.map(|x: &str| {
+                let mut chars = x.chars();
+                chars.next().map_or(0, |v| v as i32)
+            })
+        })
+        .collect())
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the start and end of string.
+pub fn btrim<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<StringArray> {
+    match args.len() {
+        0 => Err(DataFusionError::Internal(
+            "btrim was called with 0 arguments. It requires at least 1.".to_string(),
+        )),
+        1 => {
+            let string_array = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .map(|x| x.map(|x: &str| x.trim()))
+                .collect())
+        }
+        2 => {
+            let string_array = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let characters_array = args[1]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if characters_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let chars: Vec<char> =
+                                characters_array.value(i).chars().collect();
+                            x.trim_start_matches(&chars[..])
+                                .trim_end_matches(&chars[..])
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "btrim was called with {} arguments. It requires at most 2.",
+            other
+        ))),
+    }
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i32(args: &[ArrayRef]) -> Result<Int32Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i32>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i32))
+        .collect())
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i64(args: &[ArrayRef]) -> Result<Int64Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i64>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i64))
+        .collect())
+}
+
+/// Returns the character with the given code.
+pub fn chr(args: &[ArrayRef]) -> Result<StringArray> {
+    let array = args[0].as_any().downcast_ref::<Int64Array>().unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array

Review comment:
       Ah of course 🤦 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#issuecomment-764451929


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=h1) Report
   > Merging [#9243](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=desc) (ff5a8df) into [master](https://codecov.io/gh/apache/arrow/commit/ab5fc979c69ccc5dde07e1bc1467b02951b4b7e9?el=desc) (ab5fc97) will **increase** coverage by `0.22%`.
   > The diff coverage is `92.60%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9243/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9243      +/-   ##
   ==========================================
   + Coverage   81.89%   82.11%   +0.22%     
   ==========================================
     Files         215      216       +1     
     Lines       52988    54162    +1174     
   ==========================================
   + Hits        43392    44477    +1085     
   - Misses       9596     9685      +89     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/datafusion/src/logical\_plan/expr.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vZXhwci5ycw==) | `80.76% <ø> (+1.00%)` | :arrow_up: |
   | [...datafusion/src/physical\_plan/string\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3N0cmluZ19leHByZXNzaW9ucy5ycw==) | `87.78% <88.26%> (+0.28%)` | :arrow_up: |
   | [rust/datafusion/src/physical\_plan/type\_coercion.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3R5cGVfY29lcmNpb24ucnM=) | `94.38% <91.52%> (-4.16%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/functions.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2Z1bmN0aW9ucy5ycw==) | `84.96% <92.24%> (+12.66%)` | :arrow_up: |
   | [rust/arrow/src/compute/kernels/octet\_length.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL29jdGV0X2xlbmd0aC5ycw==) | `100.00% <100.00%> (ø)` | |
   | [rust/datafusion/tests/sql.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3Rlc3RzL3NxbC5ycw==) | `99.85% <100.00%> (+0.02%)` | :arrow_up: |
   | [rust/datafusion/src/datasource/csv.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL2Nzdi5ycw==) | `60.46% <0.00%> (-4.54%)` | :arrow_down: |
   | [rust/arrow/src/array/equal/boolean.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvZXF1YWwvYm9vbGVhbi5ycw==) | `97.56% <0.00%> (-2.44%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/csv.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2Nzdi5ycw==) | `73.04% <0.00%> (-1.22%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/projection.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3Byb2plY3Rpb24ucnM=) | `84.93% <0.00%> (-0.99%)` | :arrow_down: |
   | ... and [67 more](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=footer). Last update [ab5fc97...ff5a8df](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#issuecomment-773613814


   @alamb @jorgecarleitao @andygrove 
   
   I think these are mostly implemented now. Not sure how we want to do the merge given this change is so large.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io commented on pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]

Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#issuecomment-764451929


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=h1) Report
   > Merging [#9243](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=desc) (64abbc5) into [master](https://codecov.io/gh/apache/arrow/commit/1393188e1aa1b3d59993ce7d4ade7f7ac8570959?el=desc) (1393188) will **increase** coverage by `0.18%`.
   > The diff coverage is `93.35%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9243/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9243      +/-   ##
   ==========================================
   + Coverage   81.61%   81.79%   +0.18%     
   ==========================================
     Files         215      215              
     Lines       51867    53138    +1271     
   ==========================================
   + Hits        42329    43466    +1137     
   - Misses       9538     9672     +134     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/datafusion/src/logical\_plan/expr.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vZXhwci5ycw==) | `78.94% <ø> (+1.81%)` | :arrow_up: |
   | [...datafusion/src/physical\_plan/string\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3N0cmluZ19leHByZXNzaW9ucy5ycw==) | `85.56% <85.39%> (-1.39%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/type\_coercion.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3R5cGVfY29lcmNpb24ucnM=) | `94.44% <92.42%> (-4.10%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/functions.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2Z1bmN0aW9ucy5ycw==) | `84.27% <94.47%> (+11.97%)` | :arrow_up: |
   | [rust/datafusion/src/physical\_plan/aggregates.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2FnZ3JlZ2F0ZXMucnM=) | `91.13% <100.00%> (ø)` | |
   | [rust/datafusion/tests/sql.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3Rlc3RzL3NxbC5ycw==) | `99.84% <100.00%> (+<0.01%)` | :arrow_up: |
   | [rust/parquet/src/arrow/schema.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9hcnJvdy9zY2hlbWEucnM=) | `91.67% <100.00%> (+0.17%)` | :arrow_up: |
   | [rust/arrow/src/array/array\_list.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfbGlzdC5ycw==) | `83.21% <0.00%> (-9.89%)` | :arrow_down: |
   | [rust/datafusion/src/datasource/memory.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL21lbW9yeS5ycw==) | `79.75% <0.00%> (-6.52%)` | :arrow_down: |
   | [rust/benchmarks/src/bin/tpch.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9iZW5jaG1hcmtzL3NyYy9iaW4vdHBjaC5ycw==) | `6.97% <0.00%> (-5.22%)` | :arrow_down: |
   | ... and [94 more](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=footer). Last update [1401359...64abbc5](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#issuecomment-778761707


   @seddonm1  -- I merged  https://github.com/apache/arrow/pull/9376, which, as you predicated, causes a bunch of conflicts.
   
   Given this PR probably needs a bunch of rework now anyways, if splitting it up into pieces while doing so might not be that much extra work


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#issuecomment-777776398


   Thanks @alamb . I know the prospect of doing a review like this is not something to look forward to. I will rebase and push soon.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#discussion_r575832238



##########
File path: rust/datafusion/Cargo.toml
##########
@@ -64,6 +64,9 @@ log = "^0.4"
 md-5 = "^0.9.1"
 sha2 = "^0.9.1"
 ordered-float = "2.0"
+unicode-segmentation = "^1.7.1"

Review comment:
       FWIW, `regex` / `lazy_static` are already non-optional dependencies of `arrow`, so I think not that much can be gained there, unless we make it optional in Arrow as well.
   
   I think it is a good idea to make some features optional, to reduce compile times whenever you are not working on them.
   
   Another thing we can do to split benchmarks / examples / etc. out of the crate to make compile times a bit shorter, which I started doing hereL https://github.com/apache/arrow/pull/9494 and https://github.com/apache/arrow/pull/9493
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 edited a comment on pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions

Posted by GitBox <gi...@apache.org>.
seddonm1 edited a comment on pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#issuecomment-777045974


   @andygrove @alamb @jorgecarleitao 
   Here is the big PR that I was talking about in the Arrow call. I can rebase easily enough but I guess apart from the significant number of new lines (a lot of boilerplate) the key question is (from above):
   
   I think we need this Signature::OneOf. A good example is `lpad` which is either:
   [[utf8, largeutf8], int] or [[utf8, largeutf8], int, [utf8, largeutf8]] signature. You can see my implementation here but perhaps you have a better ideas and I don't know who wrote the original code. 
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#issuecomment-764451929


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=h1) Report
   > Merging [#9243](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=desc) (77e7222) into [master](https://codecov.io/gh/apache/arrow/commit/5e3fcfabf471fd3790e114b2245690c9c08ff743?el=desc) (5e3fcfa) will **increase** coverage by `0.25%`.
   > The diff coverage is `91.91%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9243/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9243      +/-   ##
   ==========================================
   + Coverage   82.32%   82.57%   +0.25%     
   ==========================================
     Files         233      235       +2     
     Lines       54446    56294    +1848     
   ==========================================
   + Hits        44823    46487    +1664     
   - Misses       9623     9807     +184     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/datafusion/src/logical\_plan/expr.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vZXhwci5ycw==) | `80.99% <ø> (+0.76%)` | :arrow_up: |
   | [...datafusion/src/physical\_plan/string\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3N0cmluZ19leHByZXNzaW9ucy5ycw==) | `84.53% <84.69%> (-2.43%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/type\_coercion.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3R5cGVfY29lcmNpb24ucnM=) | `94.35% <91.22%> (-4.19%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/functions.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2Z1bmN0aW9ucy5ycw==) | `87.67% <92.55%> (+15.37%)` | :arrow_up: |
   | [rust/arrow/src/compute/kernels/bit\_length.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL2JpdF9sZW5ndGgucnM=) | `100.00% <100.00%> (ø)` | |
   | [rust/arrow/src/compute/kernels/length.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL2xlbmd0aC5ycw==) | `100.00% <100.00%> (ø)` | |
   | [rust/datafusion/tests/sql.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3Rlc3RzL3NxbC5ycw==) | `99.87% <100.00%> (+0.01%)` | :arrow_up: |
   | [rust/arrow/src/array/array.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXkucnM=) | `75.90% <0.00%> (-12.63%)` | :arrow_down: |
   | [rust/arrow/src/array/null.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvbnVsbC5ycw==) | `87.50% <0.00%> (-5.10%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/group\_scalar.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2dyb3VwX3NjYWxhci5ycw==) | `65.27% <0.00%> (-1.83%)` | :arrow_down: |
   | ... and [49 more](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=footer). Last update [5e3fcfa...77e7222](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#discussion_r575745153



##########
File path: rust/datafusion/src/physical_plan/string_expressions.rs
##########
@@ -34,40 +38,1167 @@ macro_rules! downcast_vec {
     }};
 }
 
-/// concatenate string columns together.
-pub fn concatenate(args: &[ArrayRef]) -> Result<StringArray> {
+/// Returns the numeric code of the first character of the argument.
+pub fn ascii<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<Int32Array> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| {
+            x.map(|x: &str| {
+                let mut chars = x.chars();
+                chars.next().map_or(0, |v| v as i32)
+            })
+        })
+        .collect())
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the start and end of string.
+pub fn btrim<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    match args.len() {
+        1 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .map(|x| x.map(|x: &str| x.trim()))
+                .collect())
+        }
+        2 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let characters_array: &GenericStringArray<T> = args[1]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if characters_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let chars: Vec<char> =
+                                characters_array.value(i).chars().collect();
+                            x.trim_start_matches(&chars[..])
+                                .trim_end_matches(&chars[..])
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "btrim was called with {} arguments. It requires at most 2.",
+            other
+        ))),
+    }
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i32(args: &[ArrayRef]) -> Result<Int32Array> {
+    let string_array: &GenericStringArray<i32> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i32>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i32))
+        .collect())
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i64(args: &[ArrayRef]) -> Result<Int64Array> {
+    let string_array: &GenericStringArray<i64> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i64>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i64))
+        .collect())
+}
+
+/// Returns the character with the given code.
+pub fn chr(args: &[ArrayRef]) -> Result<StringArray> {
+    let integer_array: &Int64Array =
+        args[0].as_any().downcast_ref::<Int64Array>().unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    integer_array
+        .iter()
+        .map(|x: Option<i64>| {
+            x.map(|x| {
+                if x == 0 {
+                    Err(DataFusionError::Internal(
+                        "null character not permitted.".to_string(),
+                    ))
+                } else {
+                    match core::char::from_u32(x as u32) {
+                        Some(x) => Ok(x.to_string()),
+                        None => Err(DataFusionError::Internal(
+                            "requested character too large for encoding.".to_string(),
+                        )),
+                    }
+                }
+            })
+            .transpose()
+        })
+        .collect()
+}
+
+/// Concatenates the text representations of all the arguments. NULL arguments are ignored.
+pub fn concat(args: &[ArrayRef]) -> Result<StringArray> {
     // downcast all arguments to strings
     let args = downcast_vec!(args, StringArray).collect::<Result<Vec<&StringArray>>>()?;
     // do not accept 0 arguments.
     if args.is_empty() {
         return Err(DataFusionError::Internal(
-            "Concatenate was called with 0 arguments. It requires at least one."
-                .to_string(),
+            "concat was called with 0 arguments. It requires at least 2.".to_string(),
         ));
     }
 
-    let mut builder = StringBuilder::new(args.len());
-    // for each entry in the array
-    for index in 0..args[0].len() {
-        let mut owned_string: String = "".to_owned();
-
-        // if any is null, the result is null
-        let mut is_null = false;
-        for arg in &args {
-            if arg.is_null(index) {
-                is_null = true;
-                break; // short-circuit as we already know the result
+    Ok((0..args[0].len())
+        .map(|index| {
+            let mut owned_string: String = "".to_owned();
+            for arg in &args {
+                if arg.is_valid(index) {
+                    owned_string.push_str(&arg.value(index));
+                };
+            }
+            Some(owned_string)
+        })
+        .collect())
+}
+
+/// Concatenates all but the first argument, with separators. The first argument is used as the separator string, and should not be NULL. Other NULL arguments are ignored.
+pub fn concat_ws(args: &[ArrayRef]) -> Result<StringArray> {
+    // downcast all arguments to strings
+    let args = downcast_vec!(args, StringArray).collect::<Result<Vec<&StringArray>>>()?;
+    // do not accept 0 or 1 arguments.
+    if args.len() < 2 {
+        return Err(DataFusionError::Internal(format!(
+            "concat_ws was called with {} arguments. It requires at least 2.",
+            args.len()
+        )));
+    }
+
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(args[0]
+        .iter()
+        .enumerate()
+        .map(|(index, x)| {
+            x.map(|sep: &str| {
+                let mut owned_string: String = "".to_owned();
+                for arg_index in 1..args.len() {
+                    let arg = &args[arg_index];
+                    if !arg.is_null(index) {
+                        owned_string.push_str(&arg.value(index));
+                        // if not last push separator
+                        if arg_index != args.len() - 1 {
+                            owned_string.push_str(&sep);
+                        }
+                    }
+                }
+                owned_string
+            })
+        })
+        .collect())
+}
+
+/// Converts the first letter of each word to upper case and the rest to lower case. Words are sequences of alphanumeric characters separated by non-alphanumeric characters.
+pub fn initcap<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| {
+            x.map(|x: &str| {
+                let mut char_vector = Vec::<char>::new();
+                let mut wasalnum = false;
+                for c in x.chars() {
+                    if wasalnum {
+                        char_vector.push(c.to_ascii_lowercase());
+                    } else {
+                        char_vector.push(c.to_ascii_uppercase());
+                    }
+                    wasalnum = ('A'..='Z').contains(&c)
+                        || ('a'..='z').contains(&c)
+                        || ('0'..='9').contains(&c);
+                }
+                char_vector.iter().collect::<String>()
+            })
+        })
+        .collect())
+}
+
+/// Returns first n characters in the string, or when n is negative, returns all but last |n| characters.
+pub fn left<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .ok_or_else(|| {
+            DataFusionError::Internal("could not cast string to StringArray".to_string())
+        })?;
+
+    let n_array: &Int64Array =
+        args[1]
+            .as_any()
+            .downcast_ref::<Int64Array>()
+            .ok_or_else(|| {
+                DataFusionError::Internal("could not cast n to Int64Array".to_string())
+            })?;
+
+    Ok(string_array
+        .iter()
+        .enumerate()
+        .map(|(i, x)| {
+            if n_array.is_null(i) {
+                None
             } else {
-                owned_string.push_str(&arg.value(index));
+                x.map(|x: &str| {
+                    let n: i64 = n_array.value(i);
+                    match n.cmp(&0) {
+                        Ordering::Equal => "",
+                        Ordering::Greater => {
+                            x.char_indices().nth(n as usize).map_or(x, |(i, _)| &x[..i])
+                        }
+                        Ordering::Less => x
+                            .char_indices()
+                            .rev()
+                            .nth(n.abs() as usize)
+                            .map_or("", |(i, _)| &x[..i + 1]),
+                    }
+                })
             }
+        })
+        .collect())
+}
+
+/// Extends the string to length length by prepending the characters fill (a space by default). If the string is already longer than length then it is truncated (on the right).
+pub fn lpad<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    match args.len() {
+        2 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let length_array: &Int64Array = args[1]
+                .as_any()
+                .downcast_ref::<Int64Array>()
+                .ok_or_else(|| {
+                    DataFusionError::Internal(
+                        "could not cast length to Int64Array".to_string(),
+                    )
+                })?;
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if length_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let length = length_array.value(i) as usize;
+                            if length == 0 {
+                                "".to_string()
+                            } else if length < x.len() {
+                                x[..length].to_string()
+                            } else {
+                                let mut s = x.to_string();
+                                s.insert_str(0, " ".repeat(length - x.len()).as_str());
+                                s
+                            }
+                        })
+                    }
+                })
+                .collect())
         }
-        if is_null {
-            builder.append_null()?;
-        } else {
-            builder.append_value(&owned_string)?;
+        3 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let length_array: &Int64Array =
+                args[1].as_any().downcast_ref::<Int64Array>().unwrap();
+
+            let fill_array: &GenericStringArray<T> = args[2]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if length_array.is_null(i) || fill_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let length = length_array.value(i) as usize;
+                            let fill_chars =
+                                fill_array.value(i).chars().collect::<Vec<char>>();
+                            if length == 0 {
+                                "".to_string()
+                            } else if length < x.len() {
+                                x[..length].to_string()
+                            } else if fill_chars.is_empty() {
+                                x.to_string()
+                            } else {
+                                let mut s = x.to_string();
+                                let mut char_vector =
+                                    Vec::<char>::with_capacity(length - x.len());
+                                for l in 0..length - x.len() {
+                                    char_vector.push(
+                                        *fill_chars.get(l % fill_chars.len()).unwrap(),
+                                    );
+                                }
+                                s.insert_str(
+                                    0,
+                                    char_vector.iter().collect::<String>().as_str(),
+                                );
+                                s
+                            }
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "lpad was called with {} arguments. It requires at least 2 and at most 3.",
+            other
+        ))),
+    }
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the start of string.
+pub fn ltrim<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    match args.len() {
+        1 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .map(|x| x.map(|x: &str| x.trim_start()))
+                .collect())
+        }
+        2 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let characters_array: &GenericStringArray<T> = args[1]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if characters_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let chars: Vec<char> =
+                                characters_array.value(i).chars().collect();
+                            x.trim_start_matches(&chars[..])
+                        })
+                    }
+                })
+                .collect())
         }
+        other => Err(DataFusionError::Internal(format!(
+            "ltrim was called with {} arguments. It requires at most 2.",
+            other
+        ))),
+    }
+}
+
+/// Repeats string the specified number of times.
+pub fn repeat<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    let number_array: &Int64Array =
+        args[1].as_any().downcast_ref::<Int64Array>().unwrap();
+
+    Ok(string_array
+        .iter()
+        .enumerate()
+        .map(|(i, x)| {
+            if number_array.is_null(i) {
+                None
+            } else {
+                x.map(|x: &str| x.repeat(number_array.value(i) as usize))
+            }
+        })
+        .collect())
+}
+
+/// Replaces all occurrences in string of substring from with substring to.
+pub fn replace<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    let from_array: &GenericStringArray<T> = args[1]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    let to_array: &GenericStringArray<T> = args[2]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    Ok(string_array
+        .iter()
+        .enumerate()
+        .map(|(i, x)| {
+            if from_array.is_null(i) || to_array.is_null(i) {
+                None
+            } else {
+                x.map(|x: &str| x.replace(from_array.value(i), to_array.value(i)))
+            }
+        })
+        .collect())
+}
+
+// used to replace POSIX capture groups (like \1) with Rust Regex group (like ${1})
+fn regex_replace_posix_groups(replacement: &str) -> String {
+    lazy_static! {
+        static ref CAPTURE_GROUPS_RE: Regex = Regex::new("(\\\\)(\\d*)").unwrap();
     }
-    Ok(builder.finish())
+    CAPTURE_GROUPS_RE
+        .replace_all(replacement, "$${$2}")
+        .into_owned()
+}
+
+/// Replaces substring(s) matching a POSIX regular expression
+pub fn regexp_replace<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    // creating Regex is expensive so create hashmap for memoization
+    let mut patterns: HashMap<String, Regex> = HashMap::new();
+
+    match args.len() {
+        3 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let pattern_array: &StringArray = args[1]
+                .as_any()
+                .downcast_ref::<StringArray>()
+                .unwrap();
+
+            let replacement_array: &StringArray = args[2]
+                .as_any()
+                .downcast_ref::<StringArray>()
+                .unwrap();
+
+            string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if pattern_array.is_null(i) || replacement_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let pattern = pattern_array.value(i).to_string();
+                            let replacement = regex_replace_posix_groups(replacement_array.value(i));
+                            let re = match patterns.get(pattern_array.value(i)) {
+                                Some(re) => Ok(re.clone()),
+                                None => {
+                                    match Regex::new(pattern.as_str()) {
+                                        Ok(re) => {
+                                            patterns.insert(pattern, re.clone());
+                                            Ok(re)
+                                        },
+                                        Err(err) => Err(DataFusionError::Execution(err.to_string())),
+                                    }
+                                }
+                            };
+                            re.map(|re| re.replace(x, replacement.as_str()))
+                        })
+                    }.transpose()
+                })
+                .collect()
+        }
+        4 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let pattern_array: &StringArray = args[1]
+                .as_any()
+                .downcast_ref::<StringArray>()
+                .unwrap();
+
+            let replacement_array: &StringArray = args[2]
+                .as_any()
+                .downcast_ref::<StringArray>()
+                .unwrap();
+
+            let flags_array: &StringArray = args[3]
+                .as_any()
+                .downcast_ref::<StringArray>()
+                .unwrap();
+
+            string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if pattern_array.is_null(i) || replacement_array.is_null(i) || flags_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let replacement = regex_replace_posix_groups(replacement_array.value(i));
+
+                            let flags = flags_array.value(i);
+                            let (pattern, replace_all) = if flags == "g" {
+                                (pattern_array.value(i).to_string(), true)
+                            } else if flags.contains('g') {
+                                (format!("(?{}){}", flags.to_string().replace("g", ""), pattern_array.value(i)), true)
+                            } else {
+                                (format!("(?{}){}", flags, pattern_array.value(i)), false)
+                            };
+
+                            let re = match patterns.get(pattern_array.value(i)) {
+                                Some(re) => Ok(re.clone()),
+                                None => {
+                                    match Regex::new(pattern.as_str()) {
+                                        Ok(re) => {
+                                            patterns.insert(pattern, re.clone());
+                                            Ok(re)
+                                        },
+                                        Err(err) => Err(DataFusionError::Execution(err.to_string())),
+                                    }
+                                }
+                            };
+
+                            re.map(|re| {
+                                if replace_all {
+                                    re.replace_all(x, replacement.as_str())
+                                } else {
+                                    re.replace(x, replacement.as_str())
+                                }
+                            })
+                        })
+                    }.transpose()
+                })
+                .collect()
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "regexp_replace was called with {} arguments. It requires at least 3 and at most 4.",
+            other
+        ))),
+    }
+}
+
+/// Reverses the order of the characters in the string.
+pub fn reverse<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).rev().collect::<String>()))
+        .collect())
+}
+
+/// Returns last n characters in the string, or when n is negative, returns all but first |n| characters.
+pub fn right<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .ok_or_else(|| {
+            DataFusionError::Internal("could not cast string to StringArray".to_string())
+        })?;
+
+    let n_array: &Int64Array =
+        args[1]
+            .as_any()
+            .downcast_ref::<Int64Array>()
+            .ok_or_else(|| {
+                DataFusionError::Internal("could not cast n to Int64Array".to_string())
+            })?;
+
+    Ok(string_array
+        .iter()
+        .enumerate()
+        .map(|(i, x)| {
+            if n_array.is_null(i) {
+                None
+            } else {
+                x.map(|x: &str| {
+                    let n: i64 = n_array.value(i);
+                    match n.cmp(&0) {
+                        Ordering::Equal => "",
+                        Ordering::Greater => x
+                            .char_indices()
+                            .nth(n as usize)
+                            .map_or(x, |(i, _)| &x[i + 1..]),
+                        Ordering::Less => x
+                            .char_indices()
+                            .rev()
+                            .nth(n.abs() as usize)
+                            .map_or("", |(i, _)| &x[i..]),
+                    }
+                })
+            }
+        })
+        .collect())
+}
+
+/// Extends the string to length length by appending the characters fill (a space by default). If the string is already longer than length then it is truncated.
+pub fn rpad<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    match args.len() {
+        2 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let length_array: &Int64Array = args[1]
+                .as_any()
+                .downcast_ref::<Int64Array>()
+                .ok_or_else(|| {
+                    DataFusionError::Internal(
+                        "could not cast length to Int64Array".to_string(),
+                    )
+                })?;
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if length_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let length = length_array.value(i) as usize;
+                            if length == 0 {
+                                "".to_string()
+                            } else if length < x.len() {
+                                x[..length].to_string()
+                            } else {
+                                let mut s = x.to_string();
+                                s.push_str(" ".repeat(length - x.len()).as_str());
+                                s
+                            }
+                        })
+                    }
+                })
+                .collect())
+        }
+        3 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let length_array: &Int64Array =
+                args[1].as_any().downcast_ref::<Int64Array>().unwrap();
+
+            let fill_array: &GenericStringArray<T> = args[2]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if length_array.is_null(i) || fill_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let length = length_array.value(i) as usize;
+                            let fill_chars =
+                                fill_array.value(i).chars().collect::<Vec<char>>();
+                            if length == 0 {
+                                "".to_string()
+                            } else if length < x.len() {
+                                x[..length].to_string()
+                            } else if fill_chars.is_empty() {
+                                x.to_string()
+                            } else {
+                                let mut s = x.to_string();
+                                let mut char_vector =
+                                    Vec::<char>::with_capacity(length - x.len());
+                                for l in 0..length - x.len() {
+                                    char_vector.push(
+                                        *fill_chars.get(l % fill_chars.len()).unwrap(),
+                                    );
+                                }
+                                s.push_str(
+                                    char_vector.iter().collect::<String>().as_str(),
+                                );
+                                s
+                            }
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "rpad was called with {} arguments. It requires at least 2 and at most 3.",
+            other
+        ))),
+    }
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the end of string.
+pub fn rtrim<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    match args.len() {
+        1 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .map(|x| x.map(|x: &str| x.trim_end()))
+                .collect())
+        }
+        2 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let characters_array: &GenericStringArray<T> = args[1]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if characters_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let chars: Vec<char> =
+                                characters_array.value(i).chars().collect();
+                            x.trim_end_matches(&chars[..])
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "rtrim was called with {} arguments. It requires at most two.",
+            other
+        ))),
+    }
+}
+
+/// Repeats string the specified number of times.
+pub fn split_part<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    let delimiter_array: &GenericStringArray<T> = args[1]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    let n_array: &Int64Array = args[2].as_any().downcast_ref::<Int64Array>().unwrap();
+
+    string_array
+        .iter()
+        .enumerate()
+        .map(|(i, x)| {
+            if delimiter_array.is_null(i) || n_array.is_null(i) {
+                Ok(None)
+            } else {
+                x.map(|x: &str| {
+                    let delimiter = delimiter_array.value(i);
+                    let n = n_array.value(i);
+                    if n <= 0 {
+                        Err(DataFusionError::Execution(
+                            "negative substring length not allowed".to_string(),

Review comment:
       thanks for finding this. I have fixed to be the same error message as Postgres




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#issuecomment-764451929


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=h1) Report
   > Merging [#9243](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=desc) (5a90cf8) into [master](https://codecov.io/gh/apache/arrow/commit/ab5fc979c69ccc5dde07e1bc1467b02951b4b7e9?el=desc) (ab5fc97) will **increase** coverage by `0.16%`.
   > The diff coverage is `92.41%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9243/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9243      +/-   ##
   ==========================================
   + Coverage   81.89%   82.05%   +0.16%     
   ==========================================
     Files         215      215              
     Lines       52988    53787     +799     
   ==========================================
   + Hits        43392    44136     +744     
   - Misses       9596     9651      +55     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/datafusion/src/logical\_plan/expr.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vZXhwci5ycw==) | `80.57% <ø> (+0.81%)` | :arrow_up: |
   | [...datafusion/src/physical\_plan/string\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3N0cmluZ19leHByZXNzaW9ucy5ycw==) | `86.89% <86.85%> (-0.61%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/type\_coercion.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3R5cGVfY29lcmNpb24ucnM=) | `94.44% <92.42%> (-4.10%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/functions.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2Z1bmN0aW9ucy5ycw==) | `84.84% <92.61%> (+12.54%)` | :arrow_up: |
   | [rust/datafusion/src/physical\_plan/aggregates.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2FnZ3JlZ2F0ZXMucnM=) | `91.13% <100.00%> (ø)` | |
   | [rust/datafusion/tests/sql.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3Rlc3RzL3NxbC5ycw==) | `99.84% <100.00%> (+0.01%)` | :arrow_up: |
   | [rust/parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9lbmNvZGluZ3MvZW5jb2RpbmcucnM=) | `95.24% <0.00%> (-0.20%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/expressions.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2V4cHJlc3Npb25zLnJz) | `81.19% <0.00%> (+0.11%)` | :arrow_up: |
   | [rust/arrow/src/compute/kernels/cast.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL2Nhc3QucnM=) | `97.11% <0.00%> (+0.12%)` | :arrow_up: |
   | ... and [2 more](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=footer). Last update [ab5fc97...5a90cf8](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#discussion_r562391973



##########
File path: rust/datafusion/src/physical_plan/functions.rs
##########
@@ -60,10 +59,15 @@ pub enum Signature {
     // A function such as `array` is `VariadicEqual`
     // The first argument decides the type used for coercion
     VariadicEqual,
+    /// fixed number of arguments of vector of vectors of valid types
+    // A function of one argument of f64 is `Uniform(vc![vec![vec![DataType::Float64]]])`
+    // A function of one argument of f64 or f32 is `Uniform(vec![vec![vec![DataType::Float32, DataType::Float64]]])`
+    // A function of two arguments with first argument of f64 or f32 and second argument of utf8 is `Uniform(vec![vec![vec![DataType::Float32, DataType::Float64], vec![DataType::Utf8]]])`
+    Uniform(Vec<Vec<Vec<DataType>>>),

Review comment:
       Yes. Agree. The existing code clearly took some thought so wanted to leave it until we can agree correct course of action.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#issuecomment-764451929


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=h1) Report
   > Merging [#9243](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=desc) (f950bc6) into [master](https://codecov.io/gh/apache/arrow/commit/5e3fcfabf471fd3790e114b2245690c9c08ff743?el=desc) (5e3fcfa) will **increase** coverage by `0.26%`.
   > The diff coverage is `92.01%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9243/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9243      +/-   ##
   ==========================================
   + Coverage   82.32%   82.58%   +0.26%     
   ==========================================
     Files         233      235       +2     
     Lines       54446    56316    +1870     
   ==========================================
   + Hits        44823    46510    +1687     
   - Misses       9623     9806     +183     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/datafusion/src/logical\_plan/expr.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vZXhwci5ycw==) | `80.99% <ø> (+0.76%)` | :arrow_up: |
   | [...datafusion/src/physical\_plan/string\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3N0cmluZ19leHByZXNzaW9ucy5ycw==) | `84.59% <84.76%> (-2.36%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/type\_coercion.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3R5cGVfY29lcmNpb24ucnM=) | `94.35% <91.22%> (-4.19%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/functions.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2Z1bmN0aW9ucy5ycw==) | `87.87% <92.70%> (+15.57%)` | :arrow_up: |
   | [rust/arrow/src/compute/kernels/bit\_length.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL2JpdF9sZW5ndGgucnM=) | `100.00% <100.00%> (ø)` | |
   | [rust/arrow/src/compute/kernels/length.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL2xlbmd0aC5ycw==) | `100.00% <100.00%> (ø)` | |
   | [rust/datafusion/tests/sql.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3Rlc3RzL3NxbC5ycw==) | `99.87% <100.00%> (+0.01%)` | :arrow_up: |
   | [rust/arrow/src/array/array.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXkucnM=) | `75.90% <0.00%> (-12.63%)` | :arrow_down: |
   | [rust/arrow/src/array/null.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvbnVsbC5ycw==) | `87.50% <0.00%> (-5.10%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/group\_scalar.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2dyb3VwX3NjYWxhci5ycw==) | `65.27% <0.00%> (-1.83%)` | :arrow_down: |
   | ... and [49 more](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=footer). Last update [5e3fcfa...f950bc6](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 closed pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [Splitting to separate PRs]

Posted by GitBox <gi...@apache.org>.
seddonm1 closed pull request #9243:
URL: https://github.com/apache/arrow/pull/9243


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#discussion_r567200430



##########
File path: rust/datafusion/src/physical_plan/string_expressions.rs
##########
@@ -34,42 +35,553 @@ macro_rules! downcast_vec {
     }};
 }
 
-/// concatenate string columns together.
-pub fn concatenate(args: &[ArrayRef]) -> Result<StringArray> {
+/// Returns the numeric code of the first character of the argument.
+pub fn ascii<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<Int32Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| {
+            x.map(|x: &str| {
+                let mut chars = x.chars();
+                chars.next().map_or(0, |v| v as i32)
+            })
+        })
+        .collect())
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the start and end of string.
+pub fn btrim<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<StringArray> {
+    match args.len() {
+        0 => Err(DataFusionError::Internal(
+            "btrim was called with 0 arguments. It requires at least 1.".to_string(),
+        )),
+        1 => {
+            let string_array = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .map(|x| x.map(|x: &str| x.trim()))
+                .collect())
+        }
+        2 => {
+            let string_array = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let characters_array = args[1]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if characters_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let chars: Vec<char> =
+                                characters_array.value(i).chars().collect();
+                            x.trim_start_matches(&chars[..])
+                                .trim_end_matches(&chars[..])
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "btrim was called with {} arguments. It requires at most 2.",
+            other
+        ))),
+    }
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i32(args: &[ArrayRef]) -> Result<Int32Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i32>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i32))
+        .collect())
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i64(args: &[ArrayRef]) -> Result<Int64Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i64>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i64))
+        .collect())
+}
+
+/// Returns the character with the given code.
+pub fn chr(args: &[ArrayRef]) -> Result<StringArray> {
+    let array = args[0].as_any().downcast_ref::<Int64Array>().unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array

Review comment:
       Can't we `.transpose()`? `Option<Result<Ptr>>::transpose() -> Result<Option<Ptr>>`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#issuecomment-778762535


   @seddonm1, If you need hands, just ping me with function names you would like me to work on and I will pick them up and PR then to this branch.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [Splitting to separate PRs]

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#issuecomment-764451929


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=h1) Report
   > Merging [#9243](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=desc) (fa02182) into [master](https://codecov.io/gh/apache/arrow/commit/924449eba36acda22ccb319e8de8921c090a4cd2?el=desc) (924449e) will **increase** coverage by `0.35%`.
   > The diff coverage is `92.03%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9243/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9243      +/-   ##
   ==========================================
   + Coverage   82.29%   82.64%   +0.35%     
   ==========================================
     Files         244      245       +1     
     Lines       55616    57408    +1792     
   ==========================================
   + Hits        45767    47443    +1676     
   - Misses       9849     9965     +116     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/datafusion/src/logical\_plan/expr.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vZXhwci5ycw==) | `81.56% <ø> (+0.42%)` | :arrow_up: |
   | [...datafusion/src/physical\_plan/string\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3N0cmluZ19leHByZXNzaW9ucy5ycw==) | `83.21% <85.35%> (+13.59%)` | :arrow_up: |
   | [rust/datafusion/src/physical\_plan/functions.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2Z1bmN0aW9ucy5ycw==) | `89.43% <92.49%> (+15.61%)` | :arrow_up: |
   | [rust/datafusion/src/physical\_plan/type\_coercion.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3R5cGVfY29lcmNpb24ucnM=) | `96.91% <97.05%> (-1.71%)` | :arrow_down: |
   | [rust/arrow/src/compute/kernels/bit\_length.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL2JpdF9sZW5ndGgucnM=) | `100.00% <100.00%> (ø)` | |
   | [rust/datafusion/tests/sql.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3Rlc3RzL3NxbC5ycw==) | `99.93% <100.00%> (+<0.01%)` | :arrow_up: |
   | [rust/parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9wYXJxdWV0L3NyYy9lbmNvZGluZ3MvZW5jb2RpbmcucnM=) | `94.86% <0.00%> (-0.20%)` | :arrow_down: |
   | [rust/arrow/src/compute/kernels/cast.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL2Nhc3QucnM=) | `97.40% <0.00%> (+0.12%)` | :arrow_up: |
   | ... and [7 more](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=footer). Last update [924449e...fa02182](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#issuecomment-764451929


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=h1) Report
   > Merging [#9243](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=desc) (e80da7f) into [master](https://codecov.io/gh/apache/arrow/commit/5e3fcfabf471fd3790e114b2245690c9c08ff743?el=desc) (5e3fcfa) will **increase** coverage by `0.26%`.
   > The diff coverage is `92.10%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9243/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9243      +/-   ##
   ==========================================
   + Coverage   82.32%   82.59%   +0.26%     
   ==========================================
     Files         233      235       +2     
     Lines       54446    56338    +1892     
   ==========================================
   + Hits        44823    46532    +1709     
   - Misses       9623     9806     +183     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/datafusion/src/logical\_plan/expr.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vZXhwci5ycw==) | `80.99% <ø> (+0.76%)` | :arrow_up: |
   | [...datafusion/src/physical\_plan/string\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3N0cmluZ19leHByZXNzaW9ucy5ycw==) | `84.75% <84.92%> (-2.20%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/type\_coercion.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3R5cGVfY29lcmNpb24ucnM=) | `94.35% <91.22%> (-4.19%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/functions.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2Z1bmN0aW9ucy5ycw==) | `88.03% <92.82%> (+15.74%)` | :arrow_up: |
   | [rust/arrow/src/compute/kernels/bit\_length.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL2JpdF9sZW5ndGgucnM=) | `100.00% <100.00%> (ø)` | |
   | [rust/arrow/src/compute/kernels/length.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL2xlbmd0aC5ycw==) | `100.00% <100.00%> (ø)` | |
   | [rust/datafusion/tests/sql.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3Rlc3RzL3NxbC5ycw==) | `99.87% <100.00%> (+0.01%)` | :arrow_up: |
   | [rust/arrow/src/array/array.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXkucnM=) | `75.90% <0.00%> (-12.63%)` | :arrow_down: |
   | [rust/arrow/src/array/null.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvbnVsbC5ycw==) | `87.50% <0.00%> (-5.10%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/group\_scalar.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2dyb3VwX3NjYWxhci5ycw==) | `65.27% <0.00%> (-1.83%)` | :arrow_down: |
   | ... and [49 more](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=footer). Last update [5e3fcfa...e80da7f](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#discussion_r575744933



##########
File path: rust/datafusion/src/physical_plan/string_expressions.rs
##########
@@ -34,40 +38,1167 @@ macro_rules! downcast_vec {
     }};
 }
 
-/// concatenate string columns together.
-pub fn concatenate(args: &[ArrayRef]) -> Result<StringArray> {
+/// Returns the numeric code of the first character of the argument.
+pub fn ascii<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<Int32Array> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| {
+            x.map(|x: &str| {
+                let mut chars = x.chars();
+                chars.next().map_or(0, |v| v as i32)
+            })
+        })
+        .collect())
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the start and end of string.
+pub fn btrim<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    match args.len() {
+        1 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .map(|x| x.map(|x: &str| x.trim()))
+                .collect())
+        }
+        2 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let characters_array: &GenericStringArray<T> = args[1]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if characters_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let chars: Vec<char> =
+                                characters_array.value(i).chars().collect();
+                            x.trim_start_matches(&chars[..])
+                                .trim_end_matches(&chars[..])
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "btrim was called with {} arguments. It requires at most 2.",
+            other
+        ))),
+    }
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i32(args: &[ArrayRef]) -> Result<Int32Array> {
+    let string_array: &GenericStringArray<i32> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i32>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i32))
+        .collect())
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i64(args: &[ArrayRef]) -> Result<Int64Array> {
+    let string_array: &GenericStringArray<i64> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i64>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i64))
+        .collect())
+}
+
+/// Returns the character with the given code.
+pub fn chr(args: &[ArrayRef]) -> Result<StringArray> {
+    let integer_array: &Int64Array =
+        args[0].as_any().downcast_ref::<Int64Array>().unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    integer_array
+        .iter()
+        .map(|x: Option<i64>| {
+            x.map(|x| {
+                if x == 0 {
+                    Err(DataFusionError::Internal(
+                        "null character not permitted.".to_string(),
+                    ))
+                } else {
+                    match core::char::from_u32(x as u32) {
+                        Some(x) => Ok(x.to_string()),
+                        None => Err(DataFusionError::Internal(
+                            "requested character too large for encoding.".to_string(),
+                        )),
+                    }
+                }
+            })
+            .transpose()
+        })
+        .collect()
+}
+
+/// Concatenates the text representations of all the arguments. NULL arguments are ignored.
+pub fn concat(args: &[ArrayRef]) -> Result<StringArray> {
     // downcast all arguments to strings
     let args = downcast_vec!(args, StringArray).collect::<Result<Vec<&StringArray>>>()?;
     // do not accept 0 arguments.
     if args.is_empty() {
         return Err(DataFusionError::Internal(
-            "Concatenate was called with 0 arguments. It requires at least one."
-                .to_string(),
+            "concat was called with 0 arguments. It requires at least 2.".to_string(),
         ));
     }
 
-    let mut builder = StringBuilder::new(args.len());
-    // for each entry in the array
-    for index in 0..args[0].len() {
-        let mut owned_string: String = "".to_owned();
-
-        // if any is null, the result is null
-        let mut is_null = false;
-        for arg in &args {
-            if arg.is_null(index) {
-                is_null = true;
-                break; // short-circuit as we already know the result
+    Ok((0..args[0].len())
+        .map(|index| {
+            let mut owned_string: String = "".to_owned();
+            for arg in &args {
+                if arg.is_valid(index) {
+                    owned_string.push_str(&arg.value(index));
+                };
+            }
+            Some(owned_string)
+        })
+        .collect())
+}
+
+/// Concatenates all but the first argument, with separators. The first argument is used as the separator string, and should not be NULL. Other NULL arguments are ignored.
+pub fn concat_ws(args: &[ArrayRef]) -> Result<StringArray> {
+    // downcast all arguments to strings
+    let args = downcast_vec!(args, StringArray).collect::<Result<Vec<&StringArray>>>()?;
+    // do not accept 0 or 1 arguments.
+    if args.len() < 2 {
+        return Err(DataFusionError::Internal(format!(
+            "concat_ws was called with {} arguments. It requires at least 2.",
+            args.len()
+        )));
+    }
+
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(args[0]
+        .iter()
+        .enumerate()
+        .map(|(index, x)| {
+            x.map(|sep: &str| {
+                let mut owned_string: String = "".to_owned();
+                for arg_index in 1..args.len() {
+                    let arg = &args[arg_index];
+                    if !arg.is_null(index) {
+                        owned_string.push_str(&arg.value(index));
+                        // if not last push separator
+                        if arg_index != args.len() - 1 {
+                            owned_string.push_str(&sep);
+                        }
+                    }
+                }
+                owned_string
+            })
+        })
+        .collect())
+}
+
+/// Converts the first letter of each word to upper case and the rest to lower case. Words are sequences of alphanumeric characters separated by non-alphanumeric characters.
+pub fn initcap<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| {
+            x.map(|x: &str| {
+                let mut char_vector = Vec::<char>::new();
+                let mut wasalnum = false;
+                for c in x.chars() {
+                    if wasalnum {
+                        char_vector.push(c.to_ascii_lowercase());

Review comment:
       This is 100% the same as the Postgres implementation (I had to look at their source code) including the `wasalnum` variable name. It is not perfect but does align.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#issuecomment-779486975


   @alamb @jorgecarleitao I have applied the new API (mostly using the `make_scalar_function` helper) to all the functions.
   
   My final question before doing the separate PRs is regarding the Postgres license (https://www.postgresql.org/about/licence/). While doing this PR and aiming for Postgres compatibility I have looked at the Postgres source code, used their documentation examples as test cases and used their documentation descriptions (as they are very clear and well written). I think this license is required as a byproduct of adopting the Postgres SQL standard and ensuring compatability.
   
   E.g.
   
   Postgres Documentation (https://www.postgresql.org/docs/13/functions-string.html):
   ```
   Converts the string to all upper case, according to the rules of the database's locale.
   upper('tom') → TOM
   ```
   
   You can see how Materialize have addressed it: https://github.com/MaterializeInc/materialize/blob/main/LICENSE#L327.
   
   Comment:
   ```
   /// Converts the string to all upper case.
   /// upper('tom') = 'TOM'
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#discussion_r562391162



##########
File path: rust/datafusion/src/physical_plan/type_coercion.rs
##########
@@ -69,13 +69,42 @@ pub fn data_types(
     signature: &Signature,
 ) -> Result<Vec<DataType>> {
     let valid_types = match signature {
-        Signature::Variadic(valid_types) => valid_types
+        Signature::Any(number) => {
+            if current_types.len() != *number {
+                return Err(DataFusionError::Plan(format!(
+                    "The function expected {} arguments but received {}",
+                    number,
+                    current_types.len()
+                )));
+            }
+            vec![(0..*number).map(|i| current_types[i].clone()).collect()]
+        }
+        Signature::Exact(valid_types) => vec![valid_types.clone()],
+        Signature::Uniform(valid_types) => {
+            let valid_signature = valid_types
+                .iter()
+                .filter(|x| x.len() == current_types.len())
+                .collect::<Vec<_>>();
+            if valid_signature.len() != 1 {
+                return Err(DataFusionError::Plan(format!(
+                    "The function expected {} arguments but received {}",
+                    valid_types
+                        .iter()
+                        .map(|x| x.len().to_string())
+                        .collect::<Vec<_>>()
+                        .join(" or "),
+                    current_types.len()
+                )));
+            }
+            cartesian_product(valid_signature.first().unwrap())

Review comment:
       Thanks @jorgecarleitao . Yes I will split this out. 
   
   A good example is lpad which is either:
   [string, int] or [string, int, string]. I am away a couple of days but will split this out so we can work throught methodically.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#discussion_r575716593



##########
File path: rust/arrow/src/compute/kernels/bit_length.rs
##########
@@ -0,0 +1,210 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Defines kernel for length of a string array
+
+use crate::{array::*, buffer::Buffer};
+use crate::{
+    datatypes::DataType,
+    error::{ArrowError, Result},
+};
+use std::sync::Arc;
+
+fn bit_length_string<OffsetSize>(array: &Array, data_type: DataType) -> Result<ArrayRef>
+where
+    OffsetSize: OffsetSizeTrait,
+{
+    // note: offsets are stored as u8, but they can be interpreted as OffsetSize
+    let offsets = &array.data_ref().buffers()[0];
+    // this is a 30% improvement over iterating over u8s and building OffsetSize, which
+    // justifies the usage of `unsafe`.
+    let slice: &[OffsetSize] =
+        &unsafe { offsets.typed_data::<OffsetSize>() }[array.offset()..];
+
+    let bit_size = OffsetSize::from_usize(8).unwrap();
+    let lengths = slice
+        .windows(2)
+        .map(|offset| (offset[1] - offset[0]) * bit_size);
+
+    // JUSTIFICATION
+    //  Benefit
+    //      ~60% speedup
+    //  Soundness
+    //      `values` is an iterator with a known size.
+    let buffer = unsafe { Buffer::from_trusted_len_iter(lengths) };
+
+    let null_bit_buffer = array
+        .data_ref()
+        .null_bitmap()
+        .as_ref()
+        .map(|b| b.bits.clone());
+
+    let data = ArrayData::new(
+        data_type,
+        array.len(),
+        None,
+        null_bit_buffer,
+        0,
+        vec![buffer],
+        vec![],
+    );
+    Ok(make_array(Arc::new(data)))
+}
+
+/// Returns an array of Int32/Int64 denoting the number of bits in each string in the array.
+///
+/// * this only accepts StringArray/Utf8 and LargeString/LargeUtf8
+/// * bit_length of null is null.
+/// * bit_length is in number of bits
+pub fn bit_length(array: &Array) -> Result<ArrayRef> {
+    match array.data_type() {
+        DataType::Utf8 => bit_length_string::<i32>(array, DataType::Int32),
+        DataType::LargeUtf8 => bit_length_string::<i64>(array, DataType::Int64),
+        _ => Err(ArrowError::ComputeError(format!(
+            "bit_length not supported for {:?}",
+            array.data_type()
+        ))),
+    }
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    fn cases() -> Vec<(Vec<&'static str>, usize, Vec<i32>)> {
+        fn double_vec<T: Clone>(v: Vec<T>) -> Vec<T> {
+            [&v[..], &v[..]].concat()
+        }
+
+        // a large array
+        let mut values = vec!["one", "on", "o", ""];
+        let mut expected = vec![24, 16, 8, 0];
+        for _ in 0..10 {
+            values = double_vec(values);
+            expected = double_vec(expected);
+        }
+
+        vec![
+            (vec!["hello", " ", "world"], 3, vec![40, 8, 40]),

Review comment:
       This was purely copied from the `length` kernel but updated with the correct values for number of bits not bytes. I have removed both.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#discussion_r567146994



##########
File path: rust/datafusion/src/physical_plan/functions.rs
##########
@@ -60,10 +59,15 @@ pub enum Signature {
     // A function such as `array` is `VariadicEqual`
     // The first argument decides the type used for coercion
     VariadicEqual,
+    /// fixed number of arguments of vector of vectors of valid types
+    // A function of one argument of f64 is `Uniform(vc![vec![vec![DataType::Float64]]])`
+    // A function of one argument of f64 or f32 is `Uniform(vec![vec![vec![DataType::Float32, DataType::Float64]]])`
+    // A function of two arguments with first argument of f64 or f32 and second argument of utf8 is `Uniform(vec![vec![vec![DataType::Float32, DataType::Float64], vec![DataType::Utf8]]])`
+    Uniform(Vec<Vec<Vec<DataType>>>),

Review comment:
       @jorgecarleitao 
   I have split this code out (renamed to `OneOf` with the `lpad` function to demonstrate its purpose) here: https://github.com/seddonm1/arrow/tree/oneof-function-signature
   
   I would appreciate some of your brain time to help resolve this.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#issuecomment-762028087


   https://issues.apache.org/jira/browse/ARROW-11298


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [Splitting to separate PRs]

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#issuecomment-782937444


   @alamb FYI i have just rebased against master. @Dandandan has already added the `Signature::OneOf` functionality so I have rebased against that.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#discussion_r575745308



##########
File path: rust/datafusion/src/physical_plan/functions.rs
##########
@@ -702,14 +1197,912 @@ mod tests {
         let result = result.as_any().downcast_ref::<StringArray>().unwrap();
 
         // value is correct
-        assert_eq!(result.value(0).to_string(), expected);
+        match expected {
+            Some(v) => assert_eq!(result.value(0), v),
+            None => assert!(result.is_null(0)),
+        };
 
         Ok(())
     }
 
     #[test]
-    fn test_concat_utf8() -> Result<()> {
-        test_concat(ScalarValue::Utf8(Some("aa".to_string())), "aaaa")
+    fn test_string_functions() -> Result<()> {
+        generic_string_function(
+            BuiltinScalarFunction::Btrim,
+            vec![
+                lit(ScalarValue::Utf8(Some("xyxtrimyyx".to_string()))),
+                lit(ScalarValue::Utf8(Some("xyz".to_string()))),
+            ],
+            Some("trim"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Btrim,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("xyz".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Btrim,
+            vec![
+                lit(ScalarValue::Utf8(Some("xyxtrimyyx".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Chr,
+            vec![lit(ScalarValue::Int64(Some(128175)))],
+            Some("💯"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Chr,
+            vec![lit(ScalarValue::Int64(None))],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Chr,
+            vec![lit(ScalarValue::Int64(Some(120)))],
+            Some("x"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Chr,
+            vec![lit(ScalarValue::Int64(Some(128175)))],
+            Some("💯"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Chr,
+            vec![lit(ScalarValue::Int64(None))],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Concat,
+            vec![
+                lit(ScalarValue::Utf8(Some("aa".to_string()))),
+                lit(ScalarValue::Utf8(Some("bb".to_string()))),
+                lit(ScalarValue::Utf8(Some("cc".to_string()))),
+            ],
+            Some("aabbcc"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Concat,
+            vec![
+                lit(ScalarValue::Utf8(Some("aa".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("cc".to_string()))),
+            ],
+            Some("aacc"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Concat,
+            vec![lit(ScalarValue::Utf8(None))],
+            Some(""),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::ConcatWithSeparator,
+            vec![
+                lit(ScalarValue::Utf8(Some("|".to_string()))),
+                lit(ScalarValue::Utf8(Some("aa".to_string()))),
+                lit(ScalarValue::Utf8(Some("bb".to_string()))),
+                lit(ScalarValue::Utf8(Some("cc".to_string()))),
+            ],
+            Some("aa|bb|cc"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::ConcatWithSeparator,
+            vec![
+                lit(ScalarValue::Utf8(Some("|".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+            ],
+            Some(""),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::ConcatWithSeparator,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("aa".to_string()))),
+                lit(ScalarValue::Utf8(Some("bb".to_string()))),
+                lit(ScalarValue::Utf8(Some("cc".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::ConcatWithSeparator,
+            vec![
+                lit(ScalarValue::Utf8(Some("|".to_string()))),
+                lit(ScalarValue::Utf8(Some("aa".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("cc".to_string()))),
+            ],
+            Some("aa|cc"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::InitCap,
+            vec![lit(ScalarValue::Utf8(Some("hi THOMAS".to_string())))],
+            Some("Hi Thomas"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::InitCap,
+            vec![lit(ScalarValue::Utf8(Some("".to_string())))],
+            Some(""),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::InitCap,
+            vec![lit(ScalarValue::Utf8(None))],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Left,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int8(Some(2))),
+            ],
+            Some("ab"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Left,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int64(Some(200))),
+            ],
+            Some("abcde"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Left,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int64(Some(-2))),
+            ],
+            Some("abc"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Left,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int64(Some(-200))),
+            ],
+            Some(""),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Left,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int64(Some(0))),
+            ],
+            Some(""),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Left,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Int64(Some(2))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Left,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int64(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(5))),
+            ],
+            Some("   hi"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(0))),
+            ],
+            Some(""),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Int64(Some(5))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(5))),
+                lit(ScalarValue::Utf8(Some("xy".to_string()))),
+            ],
+            Some("xyxhi"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(21))),
+                lit(ScalarValue::Utf8(Some("abcdef".to_string()))),
+            ],
+            Some("abcdefabcdefabcdefahi"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(5))),
+                lit(ScalarValue::Utf8(Some(" ".to_string()))),
+            ],
+            Some("   hi"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(5))),
+                lit(ScalarValue::Utf8(Some("".to_string()))),
+            ],
+            Some("hi"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Int64(Some(5))),
+                lit(ScalarValue::Utf8(Some("xy".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(None)),
+                lit(ScalarValue::Utf8(Some("xy".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(5))),
+                lit(ScalarValue::Utf8(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Utf8(Some("5".to_string()))),
+                lit(ScalarValue::Utf8(Some("xy".to_string()))),
+            ],
+            Some("xyxhi"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lower,
+            vec![lit(ScalarValue::Utf8(Some("LOWER".to_string())))],
+            Some("lower"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lower,
+            vec![lit(ScalarValue::Utf8(Some("lower".to_string())))],
+            Some("lower"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Lower,
+            vec![lit(ScalarValue::Utf8(None))],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Ltrim,
+            vec![lit(ScalarValue::Utf8(Some(" trim".to_string())))],
+            Some("trim"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Ltrim,
+            vec![lit(ScalarValue::Utf8(Some("trim ".to_string())))],
+            Some("trim "),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Ltrim,
+            vec![lit(ScalarValue::Utf8(Some("trim".to_string())))],
+            Some("trim"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Ltrim,
+            vec![lit(ScalarValue::Utf8(None))],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::RegexpReplace,
+            vec![
+                lit(ScalarValue::Utf8(Some("Thomas".to_string()))),
+                lit(ScalarValue::Utf8(Some(".[mN]a.".to_string()))),
+                lit(ScalarValue::Utf8(Some("M".to_string()))),
+            ],
+            Some("ThM"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::RegexpReplace,
+            vec![
+                lit(ScalarValue::Utf8(Some("foobarbaz".to_string()))),
+                lit(ScalarValue::Utf8(Some("b..".to_string()))),
+                lit(ScalarValue::Utf8(Some("X".to_string()))),
+            ],
+            Some("fooXbaz"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::RegexpReplace,
+            vec![
+                lit(ScalarValue::Utf8(Some("foobarbaz".to_string()))),
+                lit(ScalarValue::Utf8(Some("b..".to_string()))),
+                lit(ScalarValue::Utf8(Some("X".to_string()))),
+                lit(ScalarValue::Utf8(Some("g".to_string()))),
+            ],
+            Some("fooXX"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::RegexpReplace,
+            vec![
+                lit(ScalarValue::Utf8(Some("foobarbaz".to_string()))),
+                lit(ScalarValue::Utf8(Some("b(..)".to_string()))),
+                lit(ScalarValue::Utf8(Some("X\\1Y".to_string()))),
+                lit(ScalarValue::Utf8(Some("g".to_string()))),
+            ],
+            Some("fooXarYXazY"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::RegexpReplace,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("b(..)".to_string()))),
+                lit(ScalarValue::Utf8(Some("X\\1Y".to_string()))),
+                lit(ScalarValue::Utf8(Some("g".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::RegexpReplace,
+            vec![
+                lit(ScalarValue::Utf8(Some("foobarbaz".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("X\\1Y".to_string()))),
+                lit(ScalarValue::Utf8(Some("g".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::RegexpReplace,
+            vec![
+                lit(ScalarValue::Utf8(Some("foobarbaz".to_string()))),
+                lit(ScalarValue::Utf8(Some("b(..)".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("g".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::RegexpReplace,
+            vec![
+                lit(ScalarValue::Utf8(Some("foobarbaz".to_string()))),
+                lit(ScalarValue::Utf8(Some("b(..)".to_string()))),
+                lit(ScalarValue::Utf8(Some("X\\1Y".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::RegexpReplace,
+            vec![
+                lit(ScalarValue::Utf8(Some("ABCabcABC".to_string()))),
+                lit(ScalarValue::Utf8(Some("(abc)".to_string()))),
+                lit(ScalarValue::Utf8(Some("X".to_string()))),
+                lit(ScalarValue::Utf8(Some("gi".to_string()))),
+            ],
+            Some("XXX"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::RegexpReplace,
+            vec![
+                lit(ScalarValue::Utf8(Some("ABCabcABC".to_string()))),
+                lit(ScalarValue::Utf8(Some("(abc)".to_string()))),
+                lit(ScalarValue::Utf8(Some("X".to_string()))),
+                lit(ScalarValue::Utf8(Some("i".to_string()))),
+            ],
+            Some("XabcABC"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Repeat,
+            vec![
+                lit(ScalarValue::Utf8(Some("Pg".to_string()))),
+                lit(ScalarValue::Int64(Some(4))),
+            ],
+            Some("PgPgPgPg"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Repeat,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Int64(Some(4))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Repeat,
+            vec![
+                lit(ScalarValue::Utf8(Some("Pg".to_string()))),
+                lit(ScalarValue::Int64(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Replace,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcdefabcdef".to_string()))),
+                lit(ScalarValue::Utf8(Some("cd".to_string()))),
+                lit(ScalarValue::Utf8(Some("XX".to_string()))),
+            ],
+            Some("abXXefabXXef"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Replace,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcdefabcdef".to_string()))),
+                lit(ScalarValue::Utf8(Some("notmatch".to_string()))),
+                lit(ScalarValue::Utf8(Some("XX".to_string()))),
+            ],
+            Some("abcdefabcdef"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Replace,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("cd".to_string()))),
+                lit(ScalarValue::Utf8(Some("XX".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Replace,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcdefabcdef".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("XX".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Replace,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcdefabcdef".to_string()))),
+                lit(ScalarValue::Utf8(Some("cd".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Reverse,
+            vec![lit(ScalarValue::Utf8(Some("abcde".to_string())))],
+            Some("edcba"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Reverse,
+            vec![lit(ScalarValue::Utf8(Some("loẅks".to_string())))],
+            Some("skẅol"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Reverse,
+            vec![lit(ScalarValue::Utf8(None))],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Right,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int8(Some(2))),
+            ],
+            Some("de"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Right,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int64(Some(200))),
+            ],
+            Some("abcde"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Right,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int64(Some(-2))),
+            ],
+            Some("cde"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Right,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int64(Some(-200))),
+            ],
+            Some(""),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Right,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int64(Some(0))),
+            ],
+            Some(""),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Right,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Int64(Some(2))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Right,
+            vec![
+                lit(ScalarValue::Utf8(Some("abcde".to_string()))),
+                lit(ScalarValue::Int64(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(5))),
+            ],
+            Some("hi   "),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(0))),
+            ],
+            Some(""),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Int64(Some(5))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(5))),
+                lit(ScalarValue::Utf8(Some("xy".to_string()))),
+            ],
+            Some("hixyx"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(21))),
+                lit(ScalarValue::Utf8(Some("abcdef".to_string()))),
+            ],
+            Some("hiabcdefabcdefabcdefa"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(5))),
+                lit(ScalarValue::Utf8(Some(" ".to_string()))),
+            ],
+            Some("hi   "),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(5))),
+                lit(ScalarValue::Utf8(Some("".to_string()))),
+            ],
+            Some("hi"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Int64(Some(5))),
+                lit(ScalarValue::Utf8(Some("xy".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(None)),
+                lit(ScalarValue::Utf8(Some("xy".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rpad,
+            vec![
+                lit(ScalarValue::Utf8(Some("hi".to_string()))),
+                lit(ScalarValue::Int64(Some(5))),
+                lit(ScalarValue::Utf8(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rtrim,
+            vec![lit(ScalarValue::Utf8(Some("trim ".to_string())))],
+            Some("trim"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rtrim,
+            vec![lit(ScalarValue::Utf8(Some(" trim".to_string())))],
+            Some(" trim"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rtrim,
+            vec![lit(ScalarValue::Utf8(Some("trim".to_string())))],
+            Some("trim"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Rtrim,
+            vec![lit(ScalarValue::Utf8(None))],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Trim,
+            vec![lit(ScalarValue::Utf8(Some(" trim ".to_string())))],
+            Some("trim"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Trim,
+            vec![lit(ScalarValue::Utf8(Some("trim ".to_string())))],
+            Some("trim"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Trim,
+            vec![lit(ScalarValue::Utf8(Some(" trim".to_string())))],
+            Some("trim"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Trim,
+            vec![lit(ScalarValue::Utf8(None))],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::SplitPart,
+            vec![
+                lit(ScalarValue::Utf8(Some("abc~@~def~@~ghi".to_string()))),
+                lit(ScalarValue::Utf8(Some("~@~".to_string()))),
+                lit(ScalarValue::Int64(Some(2))),
+            ],
+            Some("def"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::SplitPart,
+            vec![
+                lit(ScalarValue::Utf8(Some("abc~@~def~@~ghi".to_string()))),
+                lit(ScalarValue::Utf8(Some("~@~".to_string()))),
+                lit(ScalarValue::Int64(Some(20))),
+            ],
+            Some(""),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Substr,
+            vec![
+                lit(ScalarValue::Utf8(Some("alphabet".to_string()))),
+                lit(ScalarValue::Int64(Some(0))),
+            ],
+            Some("alphabet"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Substr,
+            vec![
+                lit(ScalarValue::Utf8(Some("alphabet".to_string()))),
+                lit(ScalarValue::Int64(Some(1))),
+            ],
+            Some("alphabet"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Substr,
+            vec![
+                lit(ScalarValue::Utf8(Some("alphabet".to_string()))),
+                lit(ScalarValue::Int64(Some(2))),
+            ],
+            Some("lphabet"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Substr,
+            vec![
+                lit(ScalarValue::Utf8(Some("alphabet".to_string()))),
+                lit(ScalarValue::Int64(Some(3))),
+            ],
+            Some("phabet"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Substr,
+            vec![
+                lit(ScalarValue::Utf8(Some("alphabet".to_string()))),
+                lit(ScalarValue::Int64(Some(-3))),
+            ],
+            Some("alphabet"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Substr,
+            vec![
+                lit(ScalarValue::Utf8(Some("alphabet".to_string()))),
+                lit(ScalarValue::Int64(Some(30))),
+            ],
+            Some(""),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Substr,
+            vec![
+                lit(ScalarValue::Utf8(Some("alphabet".to_string()))),
+                lit(ScalarValue::Int64(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Substr,
+            vec![
+                lit(ScalarValue::Utf8(Some("alphabet".to_string()))),
+                lit(ScalarValue::Int64(Some(3))),
+                lit(ScalarValue::Int64(Some(2))),
+            ],
+            Some("ph"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Substr,
+            vec![
+                lit(ScalarValue::Utf8(Some("alphabet".to_string()))),
+                lit(ScalarValue::Int64(Some(3))),
+                lit(ScalarValue::Int64(Some(20))),
+            ],
+            Some("phabet"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Substr,
+            vec![
+                lit(ScalarValue::Utf8(Some("alphabet".to_string()))),
+                lit(ScalarValue::Int64(None)),
+                lit(ScalarValue::Int64(Some(20))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Substr,
+            vec![
+                lit(ScalarValue::Utf8(Some("alphabet".to_string()))),
+                lit(ScalarValue::Int64(Some(3))),
+                lit(ScalarValue::Int64(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Translate,
+            vec![
+                lit(ScalarValue::Utf8(Some("12345".to_string()))),
+                lit(ScalarValue::Utf8(Some("143".to_string()))),
+                lit(ScalarValue::Utf8(Some("ax".to_string()))),
+            ],
+            Some("a2x5"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Translate,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("143".to_string()))),
+                lit(ScalarValue::Utf8(Some("ax".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Translate,
+            vec![
+                lit(ScalarValue::Utf8(Some("12345".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("ax".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Translate,
+            vec![
+                lit(ScalarValue::Utf8(Some("12345".to_string()))),
+                lit(ScalarValue::Utf8(Some("143".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Upper,
+            vec![lit(ScalarValue::Utf8(Some("upper".to_string())))],
+            Some("UPPER"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Upper,
+            vec![lit(ScalarValue::Utf8(Some("UPPER".to_string())))],
+            Some("UPPER"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Upper,
+            vec![lit(ScalarValue::Utf8(None))],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::ToHex,
+            vec![lit(ScalarValue::Int32(Some(2147483647)))],
+            Some("7fffffff"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::ToHex,
+            vec![lit(ScalarValue::Int32(None))],
+            None,
+        )?;
+
+        Ok(())
+    }
+
+    fn generic_string_i32_function(

Review comment:
       I have done a major update since you reviewed so all functions in `functions.rs` now use a single macro which also checks expected errors. I had planned to do this but never got to it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#discussion_r575716759



##########
File path: rust/datafusion/Cargo.toml
##########
@@ -64,6 +64,9 @@ log = "^0.4"
 md-5 = "^0.9.1"
 sha2 = "^0.9.1"
 ordered-float = "2.0"
+unicode-segmentation = "^1.7.1"

Review comment:
       Yes I was nervous about additional dependencies. Perhaps this topic can be raised at the next Arrow Rust call to agree some sort of assessment criteria.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#discussion_r575716963



##########
File path: rust/datafusion/src/physical_plan/functions.rs
##########
@@ -702,14 +1197,912 @@ mod tests {
         let result = result.as_any().downcast_ref::<StringArray>().unwrap();
 
         // value is correct
-        assert_eq!(result.value(0).to_string(), expected);
+        match expected {
+            Some(v) => assert_eq!(result.value(0), v),
+            None => assert!(result.is_null(0)),
+        };
 
         Ok(())
     }
 
     #[test]
-    fn test_concat_utf8() -> Result<()> {
-        test_concat(ScalarValue::Utf8(Some("aa".to_string())), "aaaa")
+    fn test_string_functions() -> Result<()> {
+        generic_string_function(
+            BuiltinScalarFunction::Btrim,
+            vec![
+                lit(ScalarValue::Utf8(Some("xyxtrimyyx".to_string()))),
+                lit(ScalarValue::Utf8(Some("xyz".to_string()))),
+            ],
+            Some("trim"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Btrim,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("xyz".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Btrim,
+            vec![
+                lit(ScalarValue::Utf8(Some("xyxtrimyyx".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Chr,
+            vec![lit(ScalarValue::Int64(Some(128175)))],
+            Some("💯"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Chr,
+            vec![lit(ScalarValue::Int64(None))],
+            None,
+        )?;
+        generic_string_function(

Review comment:
       This is me just copying Go's table driven testing of which I am a big fan.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#discussion_r575745351



##########
File path: rust/datafusion/src/physical_plan/functions.rs
##########
@@ -496,22 +919,77 @@ pub fn create_physical_expr(
 fn signature(fun: &BuiltinScalarFunction) -> Signature {
     // note: the physical expression must accept the type returned by this function or the execution panics.
 
-    // for now, the list is small, as we do not have many built-in functions.

Review comment:
       I will leave this optimisation for now.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#discussion_r575717131



##########
File path: rust/datafusion/src/physical_plan/functions.rs
##########
@@ -702,14 +1197,912 @@ mod tests {
         let result = result.as_any().downcast_ref::<StringArray>().unwrap();
 
         // value is correct
-        assert_eq!(result.value(0).to_string(), expected);
+        match expected {
+            Some(v) => assert_eq!(result.value(0), v),
+            None => assert!(result.is_null(0)),
+        };
 
         Ok(())
     }
 
     #[test]
-    fn test_concat_utf8() -> Result<()> {
-        test_concat(ScalarValue::Utf8(Some("aa".to_string())), "aaaa")
+    fn test_string_functions() -> Result<()> {
+        generic_string_function(
+            BuiltinScalarFunction::Btrim,
+            vec![
+                lit(ScalarValue::Utf8(Some("xyxtrimyyx".to_string()))),
+                lit(ScalarValue::Utf8(Some("xyz".to_string()))),
+            ],
+            Some("trim"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Btrim,
+            vec![
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("xyz".to_string()))),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Btrim,
+            vec![
+                lit(ScalarValue::Utf8(Some("xyxtrimyyx".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+            ],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Chr,
+            vec![lit(ScalarValue::Int64(Some(128175)))],
+            Some("💯"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Chr,
+            vec![lit(ScalarValue::Int64(None))],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Chr,
+            vec![lit(ScalarValue::Int64(Some(120)))],
+            Some("x"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Chr,
+            vec![lit(ScalarValue::Int64(Some(128175)))],
+            Some("💯"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Chr,
+            vec![lit(ScalarValue::Int64(None))],
+            None,
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Concat,
+            vec![
+                lit(ScalarValue::Utf8(Some("aa".to_string()))),
+                lit(ScalarValue::Utf8(Some("bb".to_string()))),
+                lit(ScalarValue::Utf8(Some("cc".to_string()))),
+            ],
+            Some("aabbcc"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Concat,
+            vec![
+                lit(ScalarValue::Utf8(Some("aa".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+                lit(ScalarValue::Utf8(Some("cc".to_string()))),
+            ],
+            Some("aacc"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::Concat,
+            vec![lit(ScalarValue::Utf8(None))],
+            Some(""),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::ConcatWithSeparator,
+            vec![
+                lit(ScalarValue::Utf8(Some("|".to_string()))),
+                lit(ScalarValue::Utf8(Some("aa".to_string()))),
+                lit(ScalarValue::Utf8(Some("bb".to_string()))),
+                lit(ScalarValue::Utf8(Some("cc".to_string()))),
+            ],
+            Some("aa|bb|cc"),
+        )?;
+        generic_string_function(
+            BuiltinScalarFunction::ConcatWithSeparator,
+            vec![
+                lit(ScalarValue::Utf8(Some("|".to_string()))),
+                lit(ScalarValue::Utf8(None)),
+            ],
+            Some(""),

Review comment:
       Yes. During development I had a Postgres 13 docker container running and used it to understand their implementation. This one is definitely odd given the behavior of almost all the rest of the functions.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#issuecomment-779029428


   @jorgecarleitao OK I have had a good look at the new API  which I believe is on the whole good (and will allow some impressive optimisations) but has led to some very big functions trying to pattern match all the cases: https://github.com/seddonm1/arrow/blob/aaf76f5a7d7a871a2cdb040839376ac1aac29c0b/rust/datafusion/src/physical_plan/string_expressions.rs#L125
   
   So `ltrim` as two different signatures: [`utf8`/`largeutf8`] or [`utf8`/`largeutf8`, `utf8`/`largeutf8`]. Once you expand this out for these signatures and the fact that each argument could be either `array` or `scalar` I am getting 16 unique combinations. I found when trying to actually implement the string functions that the generic case is actually not that common as most functions have different signatures. 
   
   Could you have a look and advise?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#discussion_r561653478



##########
File path: rust/datafusion/src/physical_plan/string_expressions.rs
##########
@@ -34,42 +34,446 @@ macro_rules! downcast_vec {
     }};
 }
 
-/// concatenate string columns together.
-pub fn concatenate(args: &[ArrayRef]) -> Result<StringArray> {
+/// Returns the numeric code of the first character of the argument.
+pub fn ascii<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<Int32Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| {
+            x.map(|x: &str| {
+                let mut chars = x.chars();
+                chars.next().map_or(0, |v| v as i32)
+            })
+        })
+        .collect())
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the start and end of string.
+pub fn btrim<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<StringArray> {
+    match args.len() {
+        0 => Err(DataFusionError::Internal(
+            "btrim was called with 0 arguments. It requires at least one.".to_string(),
+        )),
+        1 => {
+            let string_array = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .map(|x| x.map(|x: &str| x.trim()))
+                .collect())
+        }
+        2 => {
+            let string_array = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let characters_array = args[1]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if characters_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let chars: Vec<char> =
+                                characters_array.value(i).chars().collect();
+                            x.trim_start_matches(&chars[..])
+                                .trim_end_matches(&chars[..])
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "btrim was called with {} arguments. It requires at most two.",
+            other
+        ))),
+    }
+}
+
+/// Returns the character with the given code.
+pub fn chr(args: &[ArrayRef]) -> Result<StringArray> {
+    let array = args[0].as_any().downcast_ref::<Int64Array>().unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x: Option<i64>| {
+            x.map(|x| {
+                if x == 0 {
+                    Err(DataFusionError::Internal(
+                        "null character not permitted.".to_string(),
+                    ))
+                } else {
+                    match core::char::from_u32(x as u32) {
+                        Some(x) => Ok(x.to_string()),
+                        None => Err(DataFusionError::Internal(
+                            "requested character too large for encoding.".to_string(),
+                        )),
+                    }
+                }
+                .unwrap()

Review comment:
       I'm not sure if we should be panicing if these characters appear

##########
File path: rust/datafusion/src/physical_plan/type_coercion.rs
##########
@@ -69,13 +69,42 @@ pub fn data_types(
     signature: &Signature,
 ) -> Result<Vec<DataType>> {
     let valid_types = match signature {
-        Signature::Variadic(valid_types) => valid_types
+        Signature::Any(number) => {
+            if current_types.len() != *number {
+                return Err(DataFusionError::Plan(format!(
+                    "The function expected {} arguments but received {}",
+                    number,
+                    current_types.len()
+                )));
+            }
+            vec![(0..*number).map(|i| current_types[i].clone()).collect()]
+        }
+        Signature::Exact(valid_types) => vec![valid_types.clone()],
+        Signature::Uniform(valid_types) => {
+            let valid_signature = valid_types
+                .iter()
+                .filter(|x| x.len() == current_types.len())
+                .collect::<Vec<_>>();
+            if valid_signature.len() != 1 {
+                return Err(DataFusionError::Plan(format!(
+                    "The function expected {} arguments but received {}",
+                    valid_types
+                        .iter()
+                        .map(|x| x.len().to_string())
+                        .collect::<Vec<_>>()
+                        .join(" or "),
+                    current_types.len()
+                )));
+            }
+            cartesian_product(valid_signature.first().unwrap())

Review comment:
       Thanks @jorgecarleitao . Yes I will split this out. 
   
   A good example is lpad which is either:
   [string, int] or [string, int, string]. I am away a couple of days but will split this out so we can work throught methodically.

##########
File path: rust/datafusion/src/physical_plan/string_expressions.rs
##########
@@ -34,42 +35,553 @@ macro_rules! downcast_vec {
     }};
 }
 
-/// concatenate string columns together.
-pub fn concatenate(args: &[ArrayRef]) -> Result<StringArray> {
+/// Returns the numeric code of the first character of the argument.
+pub fn ascii<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<Int32Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| {
+            x.map(|x: &str| {
+                let mut chars = x.chars();
+                chars.next().map_or(0, |v| v as i32)
+            })
+        })
+        .collect())
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the start and end of string.
+pub fn btrim<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<StringArray> {
+    match args.len() {
+        0 => Err(DataFusionError::Internal(
+            "btrim was called with 0 arguments. It requires at least 1.".to_string(),
+        )),
+        1 => {
+            let string_array = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .map(|x| x.map(|x: &str| x.trim()))
+                .collect())
+        }
+        2 => {
+            let string_array = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let characters_array = args[1]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if characters_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let chars: Vec<char> =
+                                characters_array.value(i).chars().collect();
+                            x.trim_start_matches(&chars[..])
+                                .trim_end_matches(&chars[..])
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "btrim was called with {} arguments. It requires at most 2.",
+            other
+        ))),
+    }
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i32(args: &[ArrayRef]) -> Result<Int32Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i32>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i32))
+        .collect())
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i64(args: &[ArrayRef]) -> Result<Int64Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i64>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i64))
+        .collect())
+}
+
+/// Returns the character with the given code.
+pub fn chr(args: &[ArrayRef]) -> Result<StringArray> {
+    let array = args[0].as_any().downcast_ref::<Int64Array>().unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array

Review comment:
       Ah of course 🤦 

##########
File path: rust/datafusion/src/physical_plan/functions.rs
##########
@@ -60,10 +59,15 @@ pub enum Signature {
     // A function such as `array` is `VariadicEqual`
     // The first argument decides the type used for coercion
     VariadicEqual,
+    /// fixed number of arguments of vector of vectors of valid types
+    // A function of one argument of f64 is `Uniform(vc![vec![vec![DataType::Float64]]])`
+    // A function of one argument of f64 or f32 is `Uniform(vec![vec![vec![DataType::Float32, DataType::Float64]]])`
+    // A function of two arguments with first argument of f64 or f32 and second argument of utf8 is `Uniform(vec![vec![vec![DataType::Float32, DataType::Float64], vec![DataType::Utf8]]])`
+    Uniform(Vec<Vec<Vec<DataType>>>),

Review comment:
       Yes. Agree. The existing code clearly took some thought so wanted to leave it until we can agree correct course of action.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#discussion_r575744849



##########
File path: rust/datafusion/src/physical_plan/string_expressions.rs
##########
@@ -34,40 +38,1167 @@ macro_rules! downcast_vec {
     }};
 }
 
-/// concatenate string columns together.
-pub fn concatenate(args: &[ArrayRef]) -> Result<StringArray> {
+/// Returns the numeric code of the first character of the argument.
+pub fn ascii<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<Int32Array> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| {
+            x.map(|x: &str| {
+                let mut chars = x.chars();
+                chars.next().map_or(0, |v| v as i32)
+            })
+        })
+        .collect())
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the start and end of string.
+pub fn btrim<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    match args.len() {
+        1 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .map(|x| x.map(|x: &str| x.trim()))
+                .collect())
+        }
+        2 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let characters_array: &GenericStringArray<T> = args[1]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if characters_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let chars: Vec<char> =
+                                characters_array.value(i).chars().collect();
+                            x.trim_start_matches(&chars[..])
+                                .trim_end_matches(&chars[..])
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "btrim was called with {} arguments. It requires at most 2.",
+            other
+        ))),
+    }
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i32(args: &[ArrayRef]) -> Result<Int32Array> {
+    let string_array: &GenericStringArray<i32> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i32>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i32))
+        .collect())
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i64(args: &[ArrayRef]) -> Result<Int64Array> {
+    let string_array: &GenericStringArray<i64> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i64>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i64))
+        .collect())
+}
+
+/// Returns the character with the given code.
+pub fn chr(args: &[ArrayRef]) -> Result<StringArray> {
+    let integer_array: &Int64Array =
+        args[0].as_any().downcast_ref::<Int64Array>().unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    integer_array
+        .iter()
+        .map(|x: Option<i64>| {
+            x.map(|x| {
+                if x == 0 {
+                    Err(DataFusionError::Internal(
+                        "null character not permitted.".to_string(),
+                    ))
+                } else {
+                    match core::char::from_u32(x as u32) {
+                        Some(x) => Ok(x.to_string()),
+                        None => Err(DataFusionError::Internal(
+                            "requested character too large for encoding.".to_string(),
+                        )),
+                    }
+                }
+            })
+            .transpose()
+        })
+        .collect()
+}
+
+/// Concatenates the text representations of all the arguments. NULL arguments are ignored.
+pub fn concat(args: &[ArrayRef]) -> Result<StringArray> {
     // downcast all arguments to strings
     let args = downcast_vec!(args, StringArray).collect::<Result<Vec<&StringArray>>>()?;
     // do not accept 0 arguments.
     if args.is_empty() {
         return Err(DataFusionError::Internal(
-            "Concatenate was called with 0 arguments. It requires at least one."
-                .to_string(),
+            "concat was called with 0 arguments. It requires at least 2.".to_string(),

Review comment:
       good catch




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#discussion_r575716599



##########
File path: rust/arrow/src/compute/kernels/bit_length.rs
##########
@@ -0,0 +1,210 @@
+// Licensed to the Apache Software Foundation (ASF) under one

Review comment:
       Agreed. Will do.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#issuecomment-764451929


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=h1) Report
   > Merging [#9243](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=desc) (a69e099) into [master](https://codecov.io/gh/apache/arrow/commit/77ae93d6ecaac8fb5f4a18ca5287b7456cd88784?el=desc) (77ae93d) will **increase** coverage by `0.38%`.
   > The diff coverage is `88.75%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9243/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9243      +/-   ##
   ==========================================
   + Coverage   82.00%   82.39%   +0.38%     
   ==========================================
     Files         230      231       +1     
     Lines       53487    55715    +2228     
   ==========================================
   + Hits        43864    45906    +2042     
   - Misses       9623     9809     +186     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow/src/util/bench\_util.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvdXRpbC9iZW5jaF91dGlsLnJz) | `0.00% <0.00%> (ø)` | |
   | [rust/datafusion/examples/simple\_udaf.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL2V4YW1wbGVzL3NpbXBsZV91ZGFmLnJz) | `0.00% <0.00%> (ø)` | |
   | [rust/datafusion/src/datasource/memory.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL21lbW9yeS5ycw==) | `80.00% <ø> (ø)` | |
   | [rust/datafusion/src/datasource/parquet.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL3BhcnF1ZXQucnM=) | `94.33% <ø> (-0.24%)` | :arrow_down: |
   | [rust/datafusion/src/logical\_plan/extension.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vZXh0ZW5zaW9uLnJz) | `0.00% <ø> (ø)` | |
   | [rust/datafusion/src/physical\_plan/group\_scalar.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2dyb3VwX3NjYWxhci5ycw==) | `67.10% <0.00%> (-0.90%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/parquet.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3BhcnF1ZXQucnM=) | `88.10% <0.00%> (-0.14%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/planner.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3BsYW5uZXIucnM=) | `79.16% <ø> (+0.37%)` | :arrow_up: |
   | [rust/datafusion/src/physical\_plan/projection.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3Byb2plY3Rpb24ucnM=) | `84.93% <ø> (ø)` | |
   | [rust/datafusion/src/physical\_plan/udaf.rs](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3VkYWYucnM=) | `78.94% <ø> (ø)` | |
   | ... and [54 more](https://codecov.io/gh/apache/arrow/pull/9243/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=footer). Last update [3fa8f79...a69e099](https://codecov.io/gh/apache/arrow/pull/9243?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 edited a comment on pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions

Posted by GitBox <gi...@apache.org>.
seddonm1 edited a comment on pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#issuecomment-779486975


   @alamb @jorgecarleitao I have applied the new API (mostly using the `make_scalar_function` helper) to all the functions.
   
   My final question before doing the separate PRs is regarding the Postgres license (https://www.postgresql.org/about/licence/). While doing this PR and aiming for Postgres compatibility I have looked at the Postgres source code, used their documentation examples as test cases and used their documentation descriptions (as they are very clear and well written). I think this license is required as a byproduct of adopting the Postgres SQL standard and ensuring compatability.
   
   E.g.
   
   Postgres Documentation (https://www.postgresql.org/docs/13/functions-string.html):
   ```
   Converts the string to all upper case, according to the rules of the database's locale.
   upper('tom') → TOM
   ```
   
   DataFusion Comment:
   ```
   /// Converts the string to all upper case.
   /// upper('tom') = 'TOM'
   ```
   
   You can see how Materialize have addressed it: https://github.com/MaterializeInc/materialize/blob/main/LICENSE#L327.''
   
   Thoughts?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#discussion_r562389680



##########
File path: rust/datafusion/src/physical_plan/type_coercion.rs
##########
@@ -69,13 +69,42 @@ pub fn data_types(
     signature: &Signature,
 ) -> Result<Vec<DataType>> {
     let valid_types = match signature {
-        Signature::Variadic(valid_types) => valid_types
+        Signature::Any(number) => {
+            if current_types.len() != *number {
+                return Err(DataFusionError::Plan(format!(
+                    "The function expected {} arguments but received {}",
+                    number,
+                    current_types.len()
+                )));
+            }
+            vec![(0..*number).map(|i| current_types[i].clone()).collect()]
+        }
+        Signature::Exact(valid_types) => vec![valid_types.clone()],
+        Signature::Uniform(valid_types) => {
+            let valid_signature = valid_types
+                .iter()
+                .filter(|x| x.len() == current_types.len())
+                .collect::<Vec<_>>();
+            if valid_signature.len() != 1 {
+                return Err(DataFusionError::Plan(format!(
+                    "The function expected {} arguments but received {}",
+                    valid_types
+                        .iter()
+                        .map(|x| x.len().to_string())
+                        .collect::<Vec<_>>()
+                        .join(" or "),
+                    current_types.len()
+                )));
+            }
+            cartesian_product(valid_signature.first().unwrap())

Review comment:
       I suggest that we PR this separately with a single function that requires this type of signature, as we need to get this requires much more care than the other parts of this PR as it affects all future functions that use it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [Splitting to separate PRs]

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#issuecomment-790849774


   switching to draft so it is clear this is not being merged as is and instead is being broken up


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#issuecomment-773613814


   @alamb @jorgecarleitao @andygrove 
   
   I think these are mostly implemented now. Not sure how we want to do the merge given this change is so large.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#discussion_r567147472



##########
File path: rust/datafusion/src/physical_plan/string_expressions.rs
##########
@@ -34,42 +35,553 @@ macro_rules! downcast_vec {
     }};
 }
 
-/// concatenate string columns together.
-pub fn concatenate(args: &[ArrayRef]) -> Result<StringArray> {
+/// Returns the numeric code of the first character of the argument.
+pub fn ascii<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<Int32Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| {
+            x.map(|x: &str| {
+                let mut chars = x.chars();
+                chars.next().map_or(0, |v| v as i32)
+            })
+        })
+        .collect())
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the start and end of string.
+pub fn btrim<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<StringArray> {
+    match args.len() {
+        0 => Err(DataFusionError::Internal(
+            "btrim was called with 0 arguments. It requires at least 1.".to_string(),
+        )),
+        1 => {
+            let string_array = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .map(|x| x.map(|x: &str| x.trim()))
+                .collect())
+        }
+        2 => {
+            let string_array = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let characters_array = args[1]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if characters_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let chars: Vec<char> =
+                                characters_array.value(i).chars().collect();
+                            x.trim_start_matches(&chars[..])
+                                .trim_end_matches(&chars[..])
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "btrim was called with {} arguments. It requires at most 2.",
+            other
+        ))),
+    }
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i32(args: &[ArrayRef]) -> Result<Int32Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i32>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i32))
+        .collect())
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i64(args: &[ArrayRef]) -> Result<Int64Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i64>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i64))
+        .collect())
+}
+
+/// Returns the character with the given code.
+pub fn chr(args: &[ArrayRef]) -> Result<StringArray> {
+    let array = args[0].as_any().downcast_ref::<Int64Array>().unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x: Option<i64>| {
+            x.map(|x| {
+                if x == 0 {
+                    Err(DataFusionError::Internal(
+                        "null character not permitted.".to_string(),
+                    ))
+                } else {
+                    match core::char::from_u32(x as u32) {
+                        Some(x) => Ok(x.to_string()),
+                        None => Err(DataFusionError::Internal(
+                            "requested character too large for encoding.".to_string(),
+                        )),
+                    }
+                }
+                .unwrap()
+            })
+        })
+        .collect())
+}
+
+/// Concatenates the text representations of all the arguments. NULL arguments are ignored.
+pub fn concat(args: &[ArrayRef]) -> Result<StringArray> {
     // downcast all arguments to strings
     let args = downcast_vec!(args, StringArray).collect::<Result<Vec<&StringArray>>>()?;
     // do not accept 0 arguments.
     if args.is_empty() {
         return Err(DataFusionError::Internal(
-            "Concatenate was called with 0 arguments. It requires at least one."
-                .to_string(),
+            "concat was called with 0 arguments. It requires at least 2.".to_string(),
         ));
     }
 
     let mut builder = StringBuilder::new(args.len());
     // for each entry in the array
     for index in 0..args[0].len() {
         let mut owned_string: String = "".to_owned();
-
-        // if any is null, the result is null
-        let mut is_null = false;
         for arg in &args {
-            if arg.is_null(index) {
-                is_null = true;
-                break; // short-circuit as we already know the result
-            } else {
+            if arg.is_valid(index) {
                 owned_string.push_str(&arg.value(index));
             }
         }
-        if is_null {
+        builder.append_value(&owned_string)?;
+    }
+    Ok(builder.finish())
+}
+
+/// Concatenates all but the first argument, with separators. The first argument is used as the separator string, and should not be NULL. Other NULL arguments are ignored.
+pub fn concat_ws(args: &[ArrayRef]) -> Result<StringArray> {
+    // downcast all arguments to strings
+    let args = downcast_vec!(args, StringArray).collect::<Result<Vec<&StringArray>>>()?;
+    // do not accept 0 or 1 arguments.
+    if args.len() < 2 {
+        return Err(DataFusionError::Internal(format!(
+            "concat_ws was called with {} arguments. It requires at least 2.",
+            args.len()
+        )));
+    }
+
+    let mut builder = StringBuilder::new(args.len());
+    // for each entry in the array
+    for index in 0..args[0].len() {
+        let mut owned_string: String = "".to_owned();
+        if args[0].is_null(index) {
             builder.append_null()?;
         } else {
+            let sep = args[0].value(index);
+            for arg_index in 1..args.len() {
+                let arg = &args[arg_index];
+                if !arg.is_null(index) {

Review comment:
       thanks i will have a look at this today
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#discussion_r567202992



##########
File path: rust/datafusion/src/physical_plan/string_expressions.rs
##########
@@ -34,42 +35,553 @@ macro_rules! downcast_vec {
     }};
 }
 
-/// concatenate string columns together.
-pub fn concatenate(args: &[ArrayRef]) -> Result<StringArray> {
+/// Returns the numeric code of the first character of the argument.
+pub fn ascii<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<Int32Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| {
+            x.map(|x: &str| {
+                let mut chars = x.chars();
+                chars.next().map_or(0, |v| v as i32)
+            })
+        })
+        .collect())
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the start and end of string.
+pub fn btrim<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<StringArray> {
+    match args.len() {
+        0 => Err(DataFusionError::Internal(
+            "btrim was called with 0 arguments. It requires at least 1.".to_string(),
+        )),
+        1 => {
+            let string_array = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .map(|x| x.map(|x: &str| x.trim()))
+                .collect())
+        }
+        2 => {
+            let string_array = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let characters_array = args[1]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if characters_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let chars: Vec<char> =
+                                characters_array.value(i).chars().collect();
+                            x.trim_start_matches(&chars[..])
+                                .trim_end_matches(&chars[..])
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "btrim was called with {} arguments. It requires at most 2.",
+            other
+        ))),
+    }
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i32(args: &[ArrayRef]) -> Result<Int32Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i32>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i32))
+        .collect())
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i64(args: &[ArrayRef]) -> Result<Int64Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i64>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i64))
+        .collect())
+}
+
+/// Returns the character with the given code.
+pub fn chr(args: &[ArrayRef]) -> Result<StringArray> {
+    let array = args[0].as_any().downcast_ref::<Int64Array>().unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array

Review comment:
       > Can't we `.transpose()`? `Option<Result<Ptr>>::transpose() -> Result<Option<Ptr>>`
   
   Genius! I was unaware of that function and is perfect.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#discussion_r561653478



##########
File path: rust/datafusion/src/physical_plan/string_expressions.rs
##########
@@ -34,42 +34,446 @@ macro_rules! downcast_vec {
     }};
 }
 
-/// concatenate string columns together.
-pub fn concatenate(args: &[ArrayRef]) -> Result<StringArray> {
+/// Returns the numeric code of the first character of the argument.
+pub fn ascii<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<Int32Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| {
+            x.map(|x: &str| {
+                let mut chars = x.chars();
+                chars.next().map_or(0, |v| v as i32)
+            })
+        })
+        .collect())
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the start and end of string.
+pub fn btrim<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<StringArray> {
+    match args.len() {
+        0 => Err(DataFusionError::Internal(
+            "btrim was called with 0 arguments. It requires at least one.".to_string(),
+        )),
+        1 => {
+            let string_array = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .map(|x| x.map(|x: &str| x.trim()))
+                .collect())
+        }
+        2 => {
+            let string_array = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let characters_array = args[1]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if characters_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let chars: Vec<char> =
+                                characters_array.value(i).chars().collect();
+                            x.trim_start_matches(&chars[..])
+                                .trim_end_matches(&chars[..])
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "btrim was called with {} arguments. It requires at most two.",
+            other
+        ))),
+    }
+}
+
+/// Returns the character with the given code.
+pub fn chr(args: &[ArrayRef]) -> Result<StringArray> {
+    let array = args[0].as_any().downcast_ref::<Int64Array>().unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x: Option<i64>| {
+            x.map(|x| {
+                if x == 0 {
+                    Err(DataFusionError::Internal(
+                        "null character not permitted.".to_string(),
+                    ))
+                } else {
+                    match core::char::from_u32(x as u32) {
+                        Some(x) => Ok(x.to_string()),
+                        None => Err(DataFusionError::Internal(
+                            "requested character too large for encoding.".to_string(),
+                        )),
+                    }
+                }
+                .unwrap()

Review comment:
       I'm not sure if we should be panicing if these characters appear




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#discussion_r562383540



##########
File path: rust/datafusion/src/physical_plan/functions.rs
##########
@@ -60,10 +59,15 @@ pub enum Signature {
     // A function such as `array` is `VariadicEqual`
     // The first argument decides the type used for coercion
     VariadicEqual,
+    /// fixed number of arguments of vector of vectors of valid types
+    // A function of one argument of f64 is `Uniform(vc![vec![vec![DataType::Float64]]])`
+    // A function of one argument of f64 or f32 is `Uniform(vec![vec![vec![DataType::Float32, DataType::Float64]]])`
+    // A function of two arguments with first argument of f64 or f32 and second argument of utf8 is `Uniform(vec![vec![vec![DataType::Float32, DataType::Float64], vec![DataType::Utf8]]])`
+    Uniform(Vec<Vec<Vec<DataType>>>),

Review comment:
       This signature generalizes `UniformEqual`, so, wouldn't it be possible generalize the other instead of creating a new one (replace the existing one by the more general form)?
   
   `Signature` should be such that its variants form a complete set of options without overlaps.

##########
File path: rust/datafusion/src/physical_plan/string_expressions.rs
##########
@@ -34,42 +35,553 @@ macro_rules! downcast_vec {
     }};
 }
 
-/// concatenate string columns together.
-pub fn concatenate(args: &[ArrayRef]) -> Result<StringArray> {
+/// Returns the numeric code of the first character of the argument.
+pub fn ascii<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<Int32Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| {
+            x.map(|x: &str| {
+                let mut chars = x.chars();
+                chars.next().map_or(0, |v| v as i32)
+            })
+        })
+        .collect())
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the start and end of string.
+pub fn btrim<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<StringArray> {
+    match args.len() {
+        0 => Err(DataFusionError::Internal(
+            "btrim was called with 0 arguments. It requires at least 1.".to_string(),
+        )),
+        1 => {
+            let string_array = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .map(|x| x.map(|x: &str| x.trim()))
+                .collect())
+        }
+        2 => {
+            let string_array = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let characters_array = args[1]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if characters_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let chars: Vec<char> =
+                                characters_array.value(i).chars().collect();
+                            x.trim_start_matches(&chars[..])
+                                .trim_end_matches(&chars[..])
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "btrim was called with {} arguments. It requires at most 2.",
+            other
+        ))),
+    }
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i32(args: &[ArrayRef]) -> Result<Int32Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i32>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i32))
+        .collect())
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i64(args: &[ArrayRef]) -> Result<Int64Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i64>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i64))
+        .collect())
+}
+
+/// Returns the character with the given code.
+pub fn chr(args: &[ArrayRef]) -> Result<StringArray> {
+    let array = args[0].as_any().downcast_ref::<Int64Array>().unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x: Option<i64>| {
+            x.map(|x| {
+                if x == 0 {
+                    Err(DataFusionError::Internal(
+                        "null character not permitted.".to_string(),
+                    ))
+                } else {
+                    match core::char::from_u32(x as u32) {
+                        Some(x) => Ok(x.to_string()),
+                        None => Err(DataFusionError::Internal(
+                            "requested character too large for encoding.".to_string(),
+                        )),
+                    }
+                }
+                .unwrap()
+            })
+        })
+        .collect())
+}
+
+/// Concatenates the text representations of all the arguments. NULL arguments are ignored.
+pub fn concat(args: &[ArrayRef]) -> Result<StringArray> {
     // downcast all arguments to strings
     let args = downcast_vec!(args, StringArray).collect::<Result<Vec<&StringArray>>>()?;
     // do not accept 0 arguments.
     if args.is_empty() {
         return Err(DataFusionError::Internal(
-            "Concatenate was called with 0 arguments. It requires at least one."
-                .to_string(),
+            "concat was called with 0 arguments. It requires at least 2.".to_string(),
         ));
     }
 
     let mut builder = StringBuilder::new(args.len());
     // for each entry in the array
     for index in 0..args[0].len() {
         let mut owned_string: String = "".to_owned();
-
-        // if any is null, the result is null
-        let mut is_null = false;
         for arg in &args {
-            if arg.is_null(index) {
-                is_null = true;
-                break; // short-circuit as we already know the result
-            } else {
+            if arg.is_valid(index) {
                 owned_string.push_str(&arg.value(index));
             }
         }
-        if is_null {
+        builder.append_value(&owned_string)?;
+    }
+    Ok(builder.finish())
+}
+
+/// Concatenates all but the first argument, with separators. The first argument is used as the separator string, and should not be NULL. Other NULL arguments are ignored.
+pub fn concat_ws(args: &[ArrayRef]) -> Result<StringArray> {
+    // downcast all arguments to strings
+    let args = downcast_vec!(args, StringArray).collect::<Result<Vec<&StringArray>>>()?;
+    // do not accept 0 or 1 arguments.
+    if args.len() < 2 {
+        return Err(DataFusionError::Internal(format!(
+            "concat_ws was called with {} arguments. It requires at least 2.",
+            args.len()
+        )));
+    }
+
+    let mut builder = StringBuilder::new(args.len());
+    // for each entry in the array
+    for index in 0..args[0].len() {
+        let mut owned_string: String = "".to_owned();
+        if args[0].is_null(index) {
             builder.append_null()?;
         } else {
+            let sep = args[0].value(index);
+            for arg_index in 1..args.len() {
+                let arg = &args[arg_index];
+                if !arg.is_null(index) {

Review comment:
       [optional: This can be simplified, generalized and become more performant by using `collect`.]

##########
File path: rust/datafusion/src/physical_plan/functions.rs
##########
@@ -499,20 +692,42 @@ fn signature(fun: &BuiltinScalarFunction) -> Signature {
     // for now, the list is small, as we do not have many built-in functions.

Review comment:
       this can go now xD

##########
File path: rust/datafusion/src/physical_plan/type_coercion.rs
##########
@@ -69,13 +69,42 @@ pub fn data_types(
     signature: &Signature,
 ) -> Result<Vec<DataType>> {
     let valid_types = match signature {
-        Signature::Variadic(valid_types) => valid_types
+        Signature::Any(number) => {
+            if current_types.len() != *number {
+                return Err(DataFusionError::Plan(format!(
+                    "The function expected {} arguments but received {}",
+                    number,
+                    current_types.len()
+                )));
+            }
+            vec![(0..*number).map(|i| current_types[i].clone()).collect()]
+        }
+        Signature::Exact(valid_types) => vec![valid_types.clone()],
+        Signature::Uniform(valid_types) => {
+            let valid_signature = valid_types
+                .iter()
+                .filter(|x| x.len() == current_types.len())
+                .collect::<Vec<_>>();
+            if valid_signature.len() != 1 {
+                return Err(DataFusionError::Plan(format!(
+                    "The function expected {} arguments but received {}",
+                    valid_types
+                        .iter()
+                        .map(|x| x.len().to_string())
+                        .collect::<Vec<_>>()
+                        .join(" or "),
+                    current_types.len()
+                )));
+            }
+            cartesian_product(valid_signature.first().unwrap())

Review comment:
       Wont this coerce any type to the first variant, even if the latter variant is accepted?
   
   I.e. if we use
   
   ```
   Uniform(vec![
       vec![vec![A]],
       vec![vec![B]],
   ])
   ```
   
   and pass arg types `vec![B]`, I would expect that no coercion would happen, but I suspect that this will coerce `B` to `A`, because the first entry with the same number of arguments is `vec![vec![A]]`.
   

##########
File path: rust/datafusion/src/physical_plan/string_expressions.rs
##########
@@ -34,38 +34,340 @@ macro_rules! downcast_vec {
     }};
 }
 
-/// concatenate string columns together.
-pub fn concatenate(args: &[ArrayRef]) -> Result<StringArray> {
+/// Returns the numeric code of the first character of the argument.
+pub fn ascii<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<Int32Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| {
+            x.map(|x: &str| {
+                let mut chars = x.chars();
+                chars.next().map_or(0, |v| v as i32)
+            })
+        })
+        .collect())
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the start and end of string.
+pub fn btrim<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<StringArray> {

Review comment:
       I think that this could be `Result<GenericStringArray<T>>` so that it supports both String and LargeString.

##########
File path: rust/datafusion/src/physical_plan/string_expressions.rs
##########
@@ -34,42 +35,553 @@ macro_rules! downcast_vec {
     }};
 }
 
-/// concatenate string columns together.
-pub fn concatenate(args: &[ArrayRef]) -> Result<StringArray> {
+/// Returns the numeric code of the first character of the argument.
+pub fn ascii<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<Int32Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| {
+            x.map(|x: &str| {
+                let mut chars = x.chars();
+                chars.next().map_or(0, |v| v as i32)
+            })
+        })
+        .collect())
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the start and end of string.
+pub fn btrim<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<StringArray> {
+    match args.len() {
+        0 => Err(DataFusionError::Internal(
+            "btrim was called with 0 arguments. It requires at least 1.".to_string(),
+        )),
+        1 => {
+            let string_array = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .map(|x| x.map(|x: &str| x.trim()))
+                .collect())
+        }
+        2 => {
+            let string_array = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let characters_array = args[1]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if characters_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let chars: Vec<char> =
+                                characters_array.value(i).chars().collect();
+                            x.trim_start_matches(&chars[..])
+                                .trim_end_matches(&chars[..])
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "btrim was called with {} arguments. It requires at most 2.",
+            other
+        ))),
+    }
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i32(args: &[ArrayRef]) -> Result<Int32Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i32>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i32))
+        .collect())
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i64(args: &[ArrayRef]) -> Result<Int64Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i64>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i64))
+        .collect())
+}
+
+/// Returns the character with the given code.
+pub fn chr(args: &[ArrayRef]) -> Result<StringArray> {
+    let array = args[0].as_any().downcast_ref::<Int64Array>().unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x: Option<i64>| {
+            x.map(|x| {
+                if x == 0 {
+                    Err(DataFusionError::Internal(
+                        "null character not permitted.".to_string(),
+                    ))
+                } else {
+                    match core::char::from_u32(x as u32) {
+                        Some(x) => Ok(x.to_string()),
+                        None => Err(DataFusionError::Internal(
+                            "requested character too large for encoding.".to_string(),
+                        )),
+                    }
+                }
+                .unwrap()
+            })
+        })
+        .collect())
+}
+
+/// Concatenates the text representations of all the arguments. NULL arguments are ignored.
+pub fn concat(args: &[ArrayRef]) -> Result<StringArray> {
     // downcast all arguments to strings
     let args = downcast_vec!(args, StringArray).collect::<Result<Vec<&StringArray>>>()?;
     // do not accept 0 arguments.
     if args.is_empty() {
         return Err(DataFusionError::Internal(
-            "Concatenate was called with 0 arguments. It requires at least one."
-                .to_string(),
+            "concat was called with 0 arguments. It requires at least 2.".to_string(),
         ));
     }
 
     let mut builder = StringBuilder::new(args.len());
     // for each entry in the array
     for index in 0..args[0].len() {
         let mut owned_string: String = "".to_owned();
-
-        // if any is null, the result is null
-        let mut is_null = false;
         for arg in &args {
-            if arg.is_null(index) {
-                is_null = true;
-                break; // short-circuit as we already know the result
-            } else {
+            if arg.is_valid(index) {
                 owned_string.push_str(&arg.value(index));
             }
         }
-        if is_null {
+        builder.append_value(&owned_string)?;
+    }
+    Ok(builder.finish())
+}
+
+/// Concatenates all but the first argument, with separators. The first argument is used as the separator string, and should not be NULL. Other NULL arguments are ignored.
+pub fn concat_ws(args: &[ArrayRef]) -> Result<StringArray> {
+    // downcast all arguments to strings
+    let args = downcast_vec!(args, StringArray).collect::<Result<Vec<&StringArray>>>()?;
+    // do not accept 0 or 1 arguments.
+    if args.len() < 2 {
+        return Err(DataFusionError::Internal(format!(
+            "concat_ws was called with {} arguments. It requires at least 2.",
+            args.len()
+        )));
+    }
+
+    let mut builder = StringBuilder::new(args.len());
+    // for each entry in the array
+    for index in 0..args[0].len() {
+        let mut owned_string: String = "".to_owned();
+        if args[0].is_null(index) {
             builder.append_null()?;
         } else {
+            let sep = args[0].value(index);
+            for arg_index in 1..args.len() {
+                let arg = &args[arg_index];
+                if !arg.is_null(index) {
+                    owned_string.push_str(&arg.value(index));
+                    // if not last push separator
+                    if arg_index != args.len() - 1 {
+                        owned_string.push_str(&sep);
+                    }
+                }
+            }
             builder.append_value(&owned_string)?;
-        }
+        };
     }
     Ok(builder.finish())
 }
 
+/// Converts the first letter of each word to upper case and the rest to lower case. Words are sequences of alphanumeric characters separated by non-alphanumeric characters.
+pub fn initcap<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<StringArray> {

Review comment:
       Same here: `Result<GeneticStringArray<T>>` generalizes this :)

##########
File path: rust/datafusion/src/physical_plan/string_expressions.rs
##########
@@ -34,42 +34,446 @@ macro_rules! downcast_vec {
     }};
 }
 
-/// concatenate string columns together.
-pub fn concatenate(args: &[ArrayRef]) -> Result<StringArray> {
+/// Returns the numeric code of the first character of the argument.
+pub fn ascii<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<Int32Array> {
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x| {
+            x.map(|x: &str| {
+                let mut chars = x.chars();
+                chars.next().map_or(0, |v| v as i32)
+            })
+        })
+        .collect())
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the start and end of string.
+pub fn btrim<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<StringArray> {
+    match args.len() {
+        0 => Err(DataFusionError::Internal(
+            "btrim was called with 0 arguments. It requires at least one.".to_string(),
+        )),
+        1 => {
+            let string_array = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .map(|x| x.map(|x: &str| x.trim()))
+                .collect())
+        }
+        2 => {
+            let string_array = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let characters_array = args[1]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if characters_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let chars: Vec<char> =
+                                characters_array.value(i).chars().collect();
+                            x.trim_start_matches(&chars[..])
+                                .trim_end_matches(&chars[..])
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "btrim was called with {} arguments. It requires at most two.",
+            other
+        ))),
+    }
+}
+
+/// Returns the character with the given code.
+pub fn chr(args: &[ArrayRef]) -> Result<StringArray> {
+    let array = args[0].as_any().downcast_ref::<Int64Array>().unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array
+        .iter()
+        .map(|x: Option<i64>| {
+            x.map(|x| {
+                if x == 0 {
+                    Err(DataFusionError::Internal(
+                        "null character not permitted.".to_string(),
+                    ))
+                } else {
+                    match core::char::from_u32(x as u32) {
+                        Some(x) => Ok(x.to_string()),
+                        None => Err(DataFusionError::Internal(
+                            "requested character too large for encoding.".to_string(),
+                        )),
+                    }
+                }
+                .unwrap()

Review comment:
       Why not error? If we remove that unwrap, the code should compile.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#discussion_r575745375



##########
File path: rust/datafusion/src/physical_plan/functions.rs
##########
@@ -259,6 +392,76 @@ pub fn return_type(
                 ));
             }
         }),
+        BuiltinScalarFunction::OctetLength => Ok(match arg_types[0] {
+            DataType::LargeUtf8 => DataType::Int64,
+            DataType::Utf8 => DataType::Int32,
+            _ => {
+                // this error is internal as `data_types` should have captured this.
+                return Err(DataFusionError::Internal(
+                    "The length function can only accept strings.".to_string(),

Review comment:
       fixed




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb edited a comment on pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [Splitting to separate PRs]

Posted by GitBox <gi...@apache.org>.
alamb edited a comment on pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#issuecomment-783370975


   @seddonm1  just to be clear, your plan is still to merge this branch in in smaller chunks -- e.g. https://github.com/apache/arrow/pull/9509?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#issuecomment-780094015


   > Postgres Documentation (https://www.postgresql.org/docs/13/functions-string.html):
   
   @seddonm1  -- I am not a lawyer but it seems proper given your description to add the the PostgresSQL license into https://github.com/apache/arrow/blob/master/LICENSE.txt (along with the large and distinguished list of licenses there)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#discussion_r575766659



##########
File path: rust/datafusion/src/physical_plan/string_expressions.rs
##########
@@ -34,40 +38,1167 @@ macro_rules! downcast_vec {
     }};
 }
 
-/// concatenate string columns together.
-pub fn concatenate(args: &[ArrayRef]) -> Result<StringArray> {
+/// Returns the numeric code of the first character of the argument.
+pub fn ascii<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<Int32Array> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| {
+            x.map(|x: &str| {
+                let mut chars = x.chars();
+                chars.next().map_or(0, |v| v as i32)
+            })
+        })
+        .collect())
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the start and end of string.
+pub fn btrim<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    match args.len() {
+        1 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .map(|x| x.map(|x: &str| x.trim()))
+                .collect())
+        }
+        2 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let characters_array: &GenericStringArray<T> = args[1]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if characters_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let chars: Vec<char> =
+                                characters_array.value(i).chars().collect();
+                            x.trim_start_matches(&chars[..])
+                                .trim_end_matches(&chars[..])
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "btrim was called with {} arguments. It requires at most 2.",
+            other
+        ))),
+    }
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i32(args: &[ArrayRef]) -> Result<Int32Array> {
+    let string_array: &GenericStringArray<i32> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i32>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i32))

Review comment:
       Ok i think I have updated all the relevant functions to deal with unicode via `graphemes` (with tests). I wanted to do this before opening all the PRs so should be good to go now :+1:




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#discussion_r575745664



##########
File path: rust/datafusion/src/physical_plan/string_expressions.rs
##########
@@ -34,40 +38,1167 @@ macro_rules! downcast_vec {
     }};
 }
 
-/// concatenate string columns together.
-pub fn concatenate(args: &[ArrayRef]) -> Result<StringArray> {
+/// Returns the numeric code of the first character of the argument.
+pub fn ascii<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<Int32Array> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| {
+            x.map(|x: &str| {
+                let mut chars = x.chars();
+                chars.next().map_or(0, |v| v as i32)
+            })
+        })
+        .collect())
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the start and end of string.
+pub fn btrim<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    match args.len() {
+        1 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .map(|x| x.map(|x: &str| x.trim()))
+                .collect())
+        }
+        2 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let characters_array: &GenericStringArray<T> = args[1]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if characters_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let chars: Vec<char> =
+                                characters_array.value(i).chars().collect();
+                            x.trim_start_matches(&chars[..])
+                                .trim_end_matches(&chars[..])
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "btrim was called with {} arguments. It requires at most 2.",
+            other
+        ))),
+    }
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i32(args: &[ArrayRef]) -> Result<Int32Array> {
+    let string_array: &GenericStringArray<i32> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i32>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i32))

Review comment:
       Thanks Andrew. I will review this code again and ensure that we are being safe when it comes to UTF-8 encoding (DataFusion should be international). I think the safe way will include `graphemes` so we should plan for that import.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#discussion_r575745074



##########
File path: rust/datafusion/src/physical_plan/string_expressions.rs
##########
@@ -34,40 +38,1167 @@ macro_rules! downcast_vec {
     }};
 }
 
-/// concatenate string columns together.
-pub fn concatenate(args: &[ArrayRef]) -> Result<StringArray> {
+/// Returns the numeric code of the first character of the argument.
+pub fn ascii<T: StringOffsetSizeTrait>(args: &[ArrayRef]) -> Result<Int32Array> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| {
+            x.map(|x: &str| {
+                let mut chars = x.chars();
+                chars.next().map_or(0, |v| v as i32)
+            })
+        })
+        .collect())
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the start and end of string.
+pub fn btrim<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    match args.len() {
+        1 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .map(|x| x.map(|x: &str| x.trim()))
+                .collect())
+        }
+        2 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let characters_array: &GenericStringArray<T> = args[1]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if characters_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let chars: Vec<char> =
+                                characters_array.value(i).chars().collect();
+                            x.trim_start_matches(&chars[..])
+                                .trim_end_matches(&chars[..])
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "btrim was called with {} arguments. It requires at most 2.",
+            other
+        ))),
+    }
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i32(args: &[ArrayRef]) -> Result<Int32Array> {
+    let string_array: &GenericStringArray<i32> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i32>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i32))
+        .collect())
+}
+
+/// Returns number of characters in the string.
+pub fn character_length_i64(args: &[ArrayRef]) -> Result<Int64Array> {
+    let string_array: &GenericStringArray<i64> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<i64>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).count() as i64))
+        .collect())
+}
+
+/// Returns the character with the given code.
+pub fn chr(args: &[ArrayRef]) -> Result<StringArray> {
+    let integer_array: &Int64Array =
+        args[0].as_any().downcast_ref::<Int64Array>().unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    integer_array
+        .iter()
+        .map(|x: Option<i64>| {
+            x.map(|x| {
+                if x == 0 {
+                    Err(DataFusionError::Internal(
+                        "null character not permitted.".to_string(),
+                    ))
+                } else {
+                    match core::char::from_u32(x as u32) {
+                        Some(x) => Ok(x.to_string()),
+                        None => Err(DataFusionError::Internal(
+                            "requested character too large for encoding.".to_string(),
+                        )),
+                    }
+                }
+            })
+            .transpose()
+        })
+        .collect()
+}
+
+/// Concatenates the text representations of all the arguments. NULL arguments are ignored.
+pub fn concat(args: &[ArrayRef]) -> Result<StringArray> {
     // downcast all arguments to strings
     let args = downcast_vec!(args, StringArray).collect::<Result<Vec<&StringArray>>>()?;
     // do not accept 0 arguments.
     if args.is_empty() {
         return Err(DataFusionError::Internal(
-            "Concatenate was called with 0 arguments. It requires at least one."
-                .to_string(),
+            "concat was called with 0 arguments. It requires at least 2.".to_string(),
         ));
     }
 
-    let mut builder = StringBuilder::new(args.len());
-    // for each entry in the array
-    for index in 0..args[0].len() {
-        let mut owned_string: String = "".to_owned();
-
-        // if any is null, the result is null
-        let mut is_null = false;
-        for arg in &args {
-            if arg.is_null(index) {
-                is_null = true;
-                break; // short-circuit as we already know the result
+    Ok((0..args[0].len())
+        .map(|index| {
+            let mut owned_string: String = "".to_owned();
+            for arg in &args {
+                if arg.is_valid(index) {
+                    owned_string.push_str(&arg.value(index));
+                };
+            }
+            Some(owned_string)
+        })
+        .collect())
+}
+
+/// Concatenates all but the first argument, with separators. The first argument is used as the separator string, and should not be NULL. Other NULL arguments are ignored.
+pub fn concat_ws(args: &[ArrayRef]) -> Result<StringArray> {
+    // downcast all arguments to strings
+    let args = downcast_vec!(args, StringArray).collect::<Result<Vec<&StringArray>>>()?;
+    // do not accept 0 or 1 arguments.
+    if args.len() < 2 {
+        return Err(DataFusionError::Internal(format!(
+            "concat_ws was called with {} arguments. It requires at least 2.",
+            args.len()
+        )));
+    }
+
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(args[0]
+        .iter()
+        .enumerate()
+        .map(|(index, x)| {
+            x.map(|sep: &str| {
+                let mut owned_string: String = "".to_owned();
+                for arg_index in 1..args.len() {
+                    let arg = &args[arg_index];
+                    if !arg.is_null(index) {
+                        owned_string.push_str(&arg.value(index));
+                        // if not last push separator
+                        if arg_index != args.len() - 1 {
+                            owned_string.push_str(&sep);
+                        }
+                    }
+                }
+                owned_string
+            })
+        })
+        .collect())
+}
+
+/// Converts the first letter of each word to upper case and the rest to lower case. Words are sequences of alphanumeric characters separated by non-alphanumeric characters.
+pub fn initcap<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| {
+            x.map(|x: &str| {
+                let mut char_vector = Vec::<char>::new();
+                let mut wasalnum = false;
+                for c in x.chars() {
+                    if wasalnum {
+                        char_vector.push(c.to_ascii_lowercase());
+                    } else {
+                        char_vector.push(c.to_ascii_uppercase());
+                    }
+                    wasalnum = ('A'..='Z').contains(&c)
+                        || ('a'..='z').contains(&c)
+                        || ('0'..='9').contains(&c);
+                }
+                char_vector.iter().collect::<String>()
+            })
+        })
+        .collect())
+}
+
+/// Returns first n characters in the string, or when n is negative, returns all but last |n| characters.
+pub fn left<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .ok_or_else(|| {
+            DataFusionError::Internal("could not cast string to StringArray".to_string())
+        })?;
+
+    let n_array: &Int64Array =
+        args[1]
+            .as_any()
+            .downcast_ref::<Int64Array>()
+            .ok_or_else(|| {
+                DataFusionError::Internal("could not cast n to Int64Array".to_string())
+            })?;
+
+    Ok(string_array
+        .iter()
+        .enumerate()
+        .map(|(i, x)| {
+            if n_array.is_null(i) {
+                None
             } else {
-                owned_string.push_str(&arg.value(index));
+                x.map(|x: &str| {
+                    let n: i64 = n_array.value(i);
+                    match n.cmp(&0) {
+                        Ordering::Equal => "",
+                        Ordering::Greater => {
+                            x.char_indices().nth(n as usize).map_or(x, |(i, _)| &x[..i])
+                        }
+                        Ordering::Less => x
+                            .char_indices()
+                            .rev()
+                            .nth(n.abs() as usize)
+                            .map_or("", |(i, _)| &x[..i + 1]),
+                    }
+                })
             }
+        })
+        .collect())
+}
+
+/// Extends the string to length length by prepending the characters fill (a space by default). If the string is already longer than length then it is truncated (on the right).
+pub fn lpad<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    match args.len() {
+        2 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let length_array: &Int64Array = args[1]
+                .as_any()
+                .downcast_ref::<Int64Array>()
+                .ok_or_else(|| {
+                    DataFusionError::Internal(
+                        "could not cast length to Int64Array".to_string(),
+                    )
+                })?;
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if length_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let length = length_array.value(i) as usize;
+                            if length == 0 {
+                                "".to_string()
+                            } else if length < x.len() {
+                                x[..length].to_string()
+                            } else {
+                                let mut s = x.to_string();
+                                s.insert_str(0, " ".repeat(length - x.len()).as_str());
+                                s
+                            }
+                        })
+                    }
+                })
+                .collect())
         }
-        if is_null {
-            builder.append_null()?;
-        } else {
-            builder.append_value(&owned_string)?;
+        3 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let length_array: &Int64Array =
+                args[1].as_any().downcast_ref::<Int64Array>().unwrap();
+
+            let fill_array: &GenericStringArray<T> = args[2]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if length_array.is_null(i) || fill_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let length = length_array.value(i) as usize;
+                            let fill_chars =
+                                fill_array.value(i).chars().collect::<Vec<char>>();
+                            if length == 0 {
+                                "".to_string()
+                            } else if length < x.len() {
+                                x[..length].to_string()
+                            } else if fill_chars.is_empty() {
+                                x.to_string()
+                            } else {
+                                let mut s = x.to_string();
+                                let mut char_vector =
+                                    Vec::<char>::with_capacity(length - x.len());
+                                for l in 0..length - x.len() {
+                                    char_vector.push(
+                                        *fill_chars.get(l % fill_chars.len()).unwrap(),
+                                    );
+                                }
+                                s.insert_str(
+                                    0,
+                                    char_vector.iter().collect::<String>().as_str(),
+                                );
+                                s
+                            }
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "lpad was called with {} arguments. It requires at least 2 and at most 3.",
+            other
+        ))),
+    }
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the start of string.
+pub fn ltrim<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    match args.len() {
+        1 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .map(|x| x.map(|x: &str| x.trim_start()))
+                .collect())
+        }
+        2 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let characters_array: &GenericStringArray<T> = args[1]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if characters_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let chars: Vec<char> =
+                                characters_array.value(i).chars().collect();
+                            x.trim_start_matches(&chars[..])
+                        })
+                    }
+                })
+                .collect())
         }
+        other => Err(DataFusionError::Internal(format!(
+            "ltrim was called with {} arguments. It requires at most 2.",
+            other
+        ))),
+    }
+}
+
+/// Repeats string the specified number of times.
+pub fn repeat<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    let number_array: &Int64Array =
+        args[1].as_any().downcast_ref::<Int64Array>().unwrap();
+
+    Ok(string_array
+        .iter()
+        .enumerate()
+        .map(|(i, x)| {
+            if number_array.is_null(i) {
+                None
+            } else {
+                x.map(|x: &str| x.repeat(number_array.value(i) as usize))
+            }
+        })
+        .collect())
+}
+
+/// Replaces all occurrences in string of substring from with substring to.
+pub fn replace<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    let from_array: &GenericStringArray<T> = args[1]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    let to_array: &GenericStringArray<T> = args[2]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    Ok(string_array
+        .iter()
+        .enumerate()
+        .map(|(i, x)| {
+            if from_array.is_null(i) || to_array.is_null(i) {
+                None
+            } else {
+                x.map(|x: &str| x.replace(from_array.value(i), to_array.value(i)))
+            }
+        })
+        .collect())
+}
+
+// used to replace POSIX capture groups (like \1) with Rust Regex group (like ${1})
+fn regex_replace_posix_groups(replacement: &str) -> String {
+    lazy_static! {
+        static ref CAPTURE_GROUPS_RE: Regex = Regex::new("(\\\\)(\\d*)").unwrap();
     }
-    Ok(builder.finish())
+    CAPTURE_GROUPS_RE
+        .replace_all(replacement, "$${$2}")
+        .into_owned()
+}
+
+/// Replaces substring(s) matching a POSIX regular expression
+pub fn regexp_replace<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    // creating Regex is expensive so create hashmap for memoization
+    let mut patterns: HashMap<String, Regex> = HashMap::new();
+
+    match args.len() {
+        3 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let pattern_array: &StringArray = args[1]
+                .as_any()
+                .downcast_ref::<StringArray>()
+                .unwrap();
+
+            let replacement_array: &StringArray = args[2]
+                .as_any()
+                .downcast_ref::<StringArray>()
+                .unwrap();
+
+            string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if pattern_array.is_null(i) || replacement_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let pattern = pattern_array.value(i).to_string();
+                            let replacement = regex_replace_posix_groups(replacement_array.value(i));
+                            let re = match patterns.get(pattern_array.value(i)) {
+                                Some(re) => Ok(re.clone()),
+                                None => {
+                                    match Regex::new(pattern.as_str()) {
+                                        Ok(re) => {
+                                            patterns.insert(pattern, re.clone());
+                                            Ok(re)
+                                        },
+                                        Err(err) => Err(DataFusionError::Execution(err.to_string())),
+                                    }
+                                }
+                            };
+                            re.map(|re| re.replace(x, replacement.as_str()))
+                        })
+                    }.transpose()
+                })
+                .collect()
+        }
+        4 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let pattern_array: &StringArray = args[1]
+                .as_any()
+                .downcast_ref::<StringArray>()
+                .unwrap();
+
+            let replacement_array: &StringArray = args[2]
+                .as_any()
+                .downcast_ref::<StringArray>()
+                .unwrap();
+
+            let flags_array: &StringArray = args[3]
+                .as_any()
+                .downcast_ref::<StringArray>()
+                .unwrap();
+
+            string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if pattern_array.is_null(i) || replacement_array.is_null(i) || flags_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let replacement = regex_replace_posix_groups(replacement_array.value(i));
+
+                            let flags = flags_array.value(i);
+                            let (pattern, replace_all) = if flags == "g" {
+                                (pattern_array.value(i).to_string(), true)
+                            } else if flags.contains('g') {
+                                (format!("(?{}){}", flags.to_string().replace("g", ""), pattern_array.value(i)), true)
+                            } else {
+                                (format!("(?{}){}", flags, pattern_array.value(i)), false)
+                            };
+
+                            let re = match patterns.get(pattern_array.value(i)) {
+                                Some(re) => Ok(re.clone()),
+                                None => {
+                                    match Regex::new(pattern.as_str()) {
+                                        Ok(re) => {
+                                            patterns.insert(pattern, re.clone());
+                                            Ok(re)
+                                        },
+                                        Err(err) => Err(DataFusionError::Execution(err.to_string())),
+                                    }
+                                }
+                            };
+
+                            re.map(|re| {
+                                if replace_all {
+                                    re.replace_all(x, replacement.as_str())
+                                } else {
+                                    re.replace(x, replacement.as_str())
+                                }
+                            })
+                        })
+                    }.transpose()
+                })
+                .collect()
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "regexp_replace was called with {} arguments. It requires at least 3 and at most 4.",
+            other
+        ))),
+    }
+}
+
+/// Reverses the order of the characters in the string.
+pub fn reverse<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(string_array
+        .iter()
+        .map(|x| x.map(|x: &str| x.graphemes(true).rev().collect::<String>()))
+        .collect())
+}
+
+/// Returns last n characters in the string, or when n is negative, returns all but first |n| characters.
+pub fn right<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    let string_array: &GenericStringArray<T> = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .ok_or_else(|| {
+            DataFusionError::Internal("could not cast string to StringArray".to_string())
+        })?;
+
+    let n_array: &Int64Array =
+        args[1]
+            .as_any()
+            .downcast_ref::<Int64Array>()
+            .ok_or_else(|| {
+                DataFusionError::Internal("could not cast n to Int64Array".to_string())
+            })?;
+
+    Ok(string_array
+        .iter()
+        .enumerate()
+        .map(|(i, x)| {
+            if n_array.is_null(i) {
+                None
+            } else {
+                x.map(|x: &str| {
+                    let n: i64 = n_array.value(i);
+                    match n.cmp(&0) {
+                        Ordering::Equal => "",
+                        Ordering::Greater => x
+                            .char_indices()
+                            .nth(n as usize)
+                            .map_or(x, |(i, _)| &x[i + 1..]),
+                        Ordering::Less => x
+                            .char_indices()
+                            .rev()
+                            .nth(n.abs() as usize)
+                            .map_or("", |(i, _)| &x[i..]),
+                    }
+                })
+            }
+        })
+        .collect())
+}
+
+/// Extends the string to length length by appending the characters fill (a space by default). If the string is already longer than length then it is truncated.
+pub fn rpad<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    match args.len() {
+        2 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let length_array: &Int64Array = args[1]
+                .as_any()
+                .downcast_ref::<Int64Array>()
+                .ok_or_else(|| {
+                    DataFusionError::Internal(
+                        "could not cast length to Int64Array".to_string(),
+                    )
+                })?;
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if length_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let length = length_array.value(i) as usize;
+                            if length == 0 {
+                                "".to_string()
+                            } else if length < x.len() {
+                                x[..length].to_string()
+                            } else {
+                                let mut s = x.to_string();
+                                s.push_str(" ".repeat(length - x.len()).as_str());
+                                s
+                            }
+                        })
+                    }
+                })
+                .collect())
+        }
+        3 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let length_array: &Int64Array =
+                args[1].as_any().downcast_ref::<Int64Array>().unwrap();
+
+            let fill_array: &GenericStringArray<T> = args[2]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if length_array.is_null(i) || fill_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let length = length_array.value(i) as usize;
+                            let fill_chars =
+                                fill_array.value(i).chars().collect::<Vec<char>>();
+                            if length == 0 {
+                                "".to_string()
+                            } else if length < x.len() {
+                                x[..length].to_string()
+                            } else if fill_chars.is_empty() {
+                                x.to_string()
+                            } else {
+                                let mut s = x.to_string();
+                                let mut char_vector =
+                                    Vec::<char>::with_capacity(length - x.len());
+                                for l in 0..length - x.len() {
+                                    char_vector.push(
+                                        *fill_chars.get(l % fill_chars.len()).unwrap(),
+                                    );
+                                }
+                                s.push_str(
+                                    char_vector.iter().collect::<String>().as_str(),
+                                );
+                                s
+                            }
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "rpad was called with {} arguments. It requires at least 2 and at most 3.",
+            other
+        ))),
+    }
+}
+
+/// Removes the longest string containing only characters in characters (a space by default) from the end of string.
+pub fn rtrim<T: StringOffsetSizeTrait>(
+    args: &[ArrayRef],
+) -> Result<GenericStringArray<T>> {
+    match args.len() {
+        1 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .map(|x| x.map(|x: &str| x.trim_end()))
+                .collect())
+        }
+        2 => {
+            let string_array: &GenericStringArray<T> = args[0]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            let characters_array: &GenericStringArray<T> = args[1]
+                .as_any()
+                .downcast_ref::<GenericStringArray<T>>()
+                .unwrap();
+
+            Ok(string_array
+                .iter()
+                .enumerate()
+                .map(|(i, x)| {
+                    if characters_array.is_null(i) {
+                        None
+                    } else {
+                        x.map(|x: &str| {
+                            let chars: Vec<char> =
+                                characters_array.value(i).chars().collect();
+                            x.trim_end_matches(&chars[..])
+                        })
+                    }
+                })
+                .collect())
+        }
+        other => Err(DataFusionError::Internal(format!(
+            "rtrim was called with {} arguments. It requires at most two.",
+            other
+        ))),
+    }
+}
+
+/// Repeats string the specified number of times.

Review comment:
       thanks




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on a change in pull request #9243: ARROW-11298: [Rust][DataFusion] Implement Postgres String Functions [WIP]

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on a change in pull request #9243:
URL: https://github.com/apache/arrow/pull/9243#discussion_r567146994



##########
File path: rust/datafusion/src/physical_plan/functions.rs
##########
@@ -60,10 +59,15 @@ pub enum Signature {
     // A function such as `array` is `VariadicEqual`
     // The first argument decides the type used for coercion
     VariadicEqual,
+    /// fixed number of arguments of vector of vectors of valid types
+    // A function of one argument of f64 is `Uniform(vc![vec![vec![DataType::Float64]]])`
+    // A function of one argument of f64 or f32 is `Uniform(vec![vec![vec![DataType::Float32, DataType::Float64]]])`
+    // A function of two arguments with first argument of f64 or f32 and second argument of utf8 is `Uniform(vec![vec![vec![DataType::Float32, DataType::Float64], vec![DataType::Utf8]]])`
+    Uniform(Vec<Vec<Vec<DataType>>>),

Review comment:
       I have split this code out (renamed to `OneOf` with the `lpad` function to demonstrate its purpose) here: https://github.com/seddonm1/arrow/tree/oneof-function-signature
   
   I would appreciate some of your brain time to help resolve this.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org