You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/12/19 07:16:56 UTC

[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8966: ARROW-10969: [Rust][DataFusion] Implement basic String ANSI SQL Functions

jorgecarleitao commented on a change in pull request #8966:
URL: https://github.com/apache/arrow/pull/8966#discussion_r546203216



##########
File path: rust/datafusion/src/physical_plan/string_expressions.rs
##########
@@ -66,3 +71,73 @@ pub fn concatenate(args: &[ArrayRef]) -> Result<StringArray> {
     }
     Ok(builder.finish())
 }
+
+/// character_length returns number of characters in the string
+/// character_length('josé') = 4
+pub fn character_length(args: &[ArrayRef]) -> Result<Int32Array> {
+    let num_rows = args[0].len();
+    let string_args =
+        &args[0]
+            .as_any()
+            .downcast_ref::<StringArray>()
+            .ok_or_else(|| {
+                DataFusionError::Internal(
+                    "could not cast input to StringArray".to_string(),
+                )
+            })?;
+
+    let result = (0..num_rows)
+        .map(|i| {
+            if string_args.is_null(i) {
+                // NB: Since we use the same null bitset as the input,
+                // the output for this value will be ignored, but we
+                // need some value in the array we are building.
+                Ok(0)
+            } else {
+                Ok(string_args.value(i).chars().count() as i32)
+            }
+        })
+        .collect::<Result<Vec<_>>>()?;
+
+    let data = ArrayData::new(
+        DataType::Int32,
+        num_rows,
+        Some(string_args.null_count()),
+        string_args.data().null_buffer().cloned(),
+        0,
+        vec![Buffer::from(result.to_byte_slice())],
+        vec![],
+    );
+
+    Ok(Int32Array::from(Arc::new(data)))
+}
+
+macro_rules! string_unary_function {
+    ($NAME:ident, $FUNC:ident) => {
+        /// string function that accepts utf8 and returns utf8
+        pub fn $NAME(args: &[ArrayRef]) -> Result<StringArray> {
+            let string_args = &args[0]
+                .as_any()
+                .downcast_ref::<StringArray>()
+                .ok_or_else(|| {
+                    DataFusionError::Internal(
+                        "could not cast input to StringArray".to_string(),
+                    )
+                })?;
+
+            let mut builder = StringBuilder::new(args.len());
+            for index in 0..args[0].len() {
+                if string_args.is_null(index) {
+                    builder.append_null()?;
+                } else {
+                    builder.append_value(&string_args.value(index).$FUNC())?;
+                }
+            }
+            Ok(builder.finish())

Review comment:
       The Arrow crate implements efficient `IntoIter` and `FromIter` that generally make the code simpler to read and more performant because it performs less bound checks. I.e. something like
   
   ```rust
   string_args.iter().map(|x| x.map(|x| x.$FUNC())).collect()
   // (first map is the iterator, second is for the `Option<_>`
   ```
   
   will probably work.

##########
File path: rust/datafusion/tests/sql.rs
##########
@@ -1826,3 +1826,21 @@ async fn csv_between_expr_negated() -> Result<()> {
     assert_eq!(expected, actual);
     Ok(())
 }
+
+#[tokio::test]
+async fn string_expressions() -> Result<()> {
+    let mut ctx = ExecutionContext::new();
+    register_aggregate_csv(&mut ctx)?;
+    let sql = "SELECT
+        char_length('josé') AS char_length
+        ,character_length('josé') AS character_length
+        ,lower('TOM') AS lower
+        ,upper('tom') AS upper
+        ,trim(' tom ') AS trim

Review comment:
       I think that this does not cover the null cases.

##########
File path: rust/datafusion/src/physical_plan/functions.rs
##########
@@ -203,6 +216,10 @@ pub fn return_type(
             }
         }),
         BuiltinScalarFunction::Concat => Ok(DataType::Utf8),
+        BuiltinScalarFunction::CharacterLength => Ok(DataType::Int32),
+        BuiltinScalarFunction::Lower => Ok(DataType::Utf8),
+        BuiltinScalarFunction::Upper => Ok(DataType::Utf8),
+        BuiltinScalarFunction::Trim => Ok(DataType::Utf8),

Review comment:
       Could you add functions to `logical_plan::expr` and `prelude` to expose these new functions also there, so that they can also be used in the DataFrame API?

##########
File path: rust/datafusion/tests/sql.rs
##########
@@ -1826,3 +1826,21 @@ async fn csv_between_expr_negated() -> Result<()> {
     assert_eq!(expected, actual);
     Ok(())
 }
+
+#[tokio::test]
+async fn string_expressions() -> Result<()> {
+    let mut ctx = ExecutionContext::new();
+    register_aggregate_csv(&mut ctx)?;

Review comment:
       Do we need to register this CSV?

##########
File path: rust/datafusion/src/physical_plan/functions.rs
##########
@@ -280,6 +309,18 @@ fn signature(fun: &BuiltinScalarFunction) -> Signature {
             Signature::Uniform(1, vec![DataType::Utf8, DataType::LargeUtf8])
         }
         BuiltinScalarFunction::Concat => Signature::Variadic(vec![DataType::Utf8]),
+        BuiltinScalarFunction::CharacterLength => {
+            Signature::Uniform(1, vec![DataType::Utf8, DataType::LargeUtf8])

Review comment:
       The signature states that `LargeUtf8` is supported, but the implementation only supports `Utf8`.
   
   If we only use `DataType::Utf8` here, the planner will coerce any `LargeUtf8` to `Utf8`. :)

##########
File path: rust/datafusion/src/physical_plan/string_expressions.rs
##########
@@ -66,3 +71,73 @@ pub fn concatenate(args: &[ArrayRef]) -> Result<StringArray> {
     }
     Ok(builder.finish())
 }
+
+/// character_length returns number of characters in the string
+/// character_length('josé') = 4
+pub fn character_length(args: &[ArrayRef]) -> Result<Int32Array> {
+    let num_rows = args[0].len();
+    let string_args =
+        &args[0]
+            .as_any()
+            .downcast_ref::<StringArray>()
+            .ok_or_else(|| {
+                DataFusionError::Internal(
+                    "could not cast input to StringArray".to_string(),
+                )
+            })?;
+
+    let result = (0..num_rows)

Review comment:
       I would try to use the from and to iterator here also. It should make this code simpler also.

##########
File path: rust/datafusion/src/physical_plan/string_expressions.rs
##########
@@ -66,3 +71,73 @@ pub fn concatenate(args: &[ArrayRef]) -> Result<StringArray> {
     }
     Ok(builder.finish())
 }
+
+/// character_length returns number of characters in the string
+/// character_length('josé') = 4
+pub fn character_length(args: &[ArrayRef]) -> Result<Int32Array> {

Review comment:
       Any reason to use `Int32` instead of `UInt32`?

##########
File path: rust/datafusion/src/physical_plan/functions.rs
##########
@@ -118,6 +118,14 @@ pub enum BuiltinScalarFunction {
     Length,
     /// concat
     Concat,
+    /// character_length
+    CharacterLength,
+    /// lower
+    Lower,
+    /// upper
+    Upper,
+    /// ltrim

Review comment:
       Is this `ltrim` or trim? The test seems to indicate `trim`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org