You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/30 19:28:42 UTC

[GitHub] [arrow] jorgecarleitao opened a new pull request #9376: WIP: [DataFusion] Added support for scalarValue in Builtin functions.

jorgecarleitao opened a new pull request #9376:
URL: https://github.com/apache/arrow/pull/9376


   This is a work in progress to add support for `ScalarValue` to all builtin-functions and UDFs from DataFusion.
   
   This allows to execute scalar functions without having to create arrays and pass then to the kernels.
   
   It is WIP because I am still migrating some kernels (`null_if` and temporal kernels). I am sharing this here because I can see a lot of activity in adding new kernels, and it would be good to have this work aligned together.
   
   With this change, creating a kernel is a bit more time consuming, as it is necessary to cater for both the scalar and vector case. OTOH, this leads to a major performance improvement as we only need to perform 1 vs N operations.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#discussion_r568342573



##########
File path: rust/arrow/src/conversions.rs
##########
@@ -0,0 +1,127 @@
+// Licensed to the Apache Software Foundation (ASF) under one

Review comment:
       It was renamed on a different PR to `temporal_conversions`. I rebased this PR so that we no longer have these changes.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] velvia commented on a change in pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
velvia commented on a change in pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#discussion_r575728461



##########
File path: rust/datafusion/src/physical_plan/expressions/nullif.rs
##########
@@ -0,0 +1,188 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::sync::Arc;
+
+use super::ColumnarValue;
+use crate::error::{DataFusionError, Result};
+use crate::scalar::ScalarValue;
+use arrow::array::Array;
+use arrow::array::{
+    ArrayRef, BooleanArray, Date32Array, Date64Array, Float32Array, Float64Array,
+    Int16Array, Int32Array, Int64Array, Int8Array, StringArray, TimestampNanosecondArray,
+    UInt16Array, UInt32Array, UInt64Array, UInt8Array,
+};
+use arrow::compute::kernels::boolean::nullif;
+use arrow::compute::kernels::comparison::{eq, eq_scalar, eq_utf8, eq_utf8_scalar};
+use arrow::datatypes::{DataType, TimeUnit};
+
+/// Invoke a compute kernel on a primitive array and a Boolean Array
+macro_rules! compute_bool_array_op {
+    ($LEFT:expr, $RIGHT:expr, $OP:ident, $DT:ident) => {{
+        let ll = $LEFT
+            .as_any()
+            .downcast_ref::<$DT>()
+            .expect("compute_op failed to downcast array");
+        let rr = $RIGHT
+            .as_any()
+            .downcast_ref::<BooleanArray>()
+            .expect("compute_op failed to downcast array");
+        Ok(Arc::new($OP(&ll, &rr)?) as ArrayRef)
+    }};
+}
+
+/// Binary op between primitive and boolean arrays
+macro_rules! primitive_bool_array_op {
+    ($LEFT:expr, $RIGHT:expr, $OP:ident) => {{
+        match $LEFT.data_type() {
+            DataType::Int8 => compute_bool_array_op!($LEFT, $RIGHT, $OP, Int8Array),
+            DataType::Int16 => compute_bool_array_op!($LEFT, $RIGHT, $OP, Int16Array),
+            DataType::Int32 => compute_bool_array_op!($LEFT, $RIGHT, $OP, Int32Array),
+            DataType::Int64 => compute_bool_array_op!($LEFT, $RIGHT, $OP, Int64Array),
+            DataType::UInt8 => compute_bool_array_op!($LEFT, $RIGHT, $OP, UInt8Array),
+            DataType::UInt16 => compute_bool_array_op!($LEFT, $RIGHT, $OP, UInt16Array),
+            DataType::UInt32 => compute_bool_array_op!($LEFT, $RIGHT, $OP, UInt32Array),
+            DataType::UInt64 => compute_bool_array_op!($LEFT, $RIGHT, $OP, UInt64Array),
+            DataType::Float32 => compute_bool_array_op!($LEFT, $RIGHT, $OP, Float32Array),
+            DataType::Float64 => compute_bool_array_op!($LEFT, $RIGHT, $OP, Float64Array),
+            other => Err(DataFusionError::Internal(format!(
+                "Unsupported data type {:?} for NULLIF/primitive/boolean operator",
+                other
+            ))),
+        }
+    }};
+}
+
+/// Implements NULLIF(expr1, expr2)
+/// Args: 0 - left expr is any array
+///       1 - if the left is equal to this expr2, then the result is NULL, otherwise left value is passed.
+///
+pub fn nullif_func(args: &[ColumnarValue]) -> Result<ColumnarValue> {
+    if args.len() != 2 {
+        return Err(DataFusionError::Internal(format!(
+            "{:?} args were supplied but NULLIF takes exactly two args",
+            args.len(),
+        )));
+    }
+
+    let (lhs, rhs) = (&args[0], &args[1]);
+
+    match (lhs, rhs) {
+        (ColumnarValue::Array(lhs), ColumnarValue::Scalar(rhs)) => {
+            let cond_array = binary_array_op_scalar!(lhs, rhs.clone(), eq).unwrap()?;

Review comment:
       This is definitely one of the functions I would expect to benefit the most from this change, given that NULLIF most of the time the right hand argument is likely to be a scalar.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#discussion_r568348305



##########
File path: rust/datafusion/src/physical_plan/datetime_expressions.rs
##########
@@ -167,152 +175,166 @@ fn naive_datetime_to_timestamp(s: &str, datetime: NaiveDateTime) -> Result<i64>
     }
 }
 
-/// convert an array of strings into `Timestamp(Nanosecond, None)`
-pub fn to_timestamp(args: &[ArrayRef]) -> Result<TimestampNanosecondArray> {
-    let num_rows = args[0].len();
-    let string_args =
-        &args[0]
-            .as_any()
-            .downcast_ref::<StringArray>()
-            .ok_or_else(|| {
-                DataFusionError::Internal(
-                    "could not cast to_timestamp input to StringArray".to_string(),
-                )
-            })?;
-
-    let result = (0..num_rows)
-        .map(|i| {
-            if string_args.is_null(i) {
-                // NB: Since we use the same null bitset as the input,
-                // the output for this value will be ignored, but we
-                // need some value in the array we are building.
-                Ok(0)
-            } else {
-                string_to_timestamp_nanos(string_args.value(i))
+pub(crate) fn unary_string_to_primitive_function<'a, T, O, F>(
+    args: &[&'a dyn Array],
+    op: F,
+    name: &str,
+) -> Result<PrimitiveArray<O>>
+where
+    O: ArrowPrimitiveType,
+    T: StringOffsetSizeTrait,
+    F: Fn(&'a str) -> Result<O::Native>,
+{
+    if args.len() != 1 {
+        return Err(DataFusionError::Internal(format!(
+            "{:?} args were supplied but {} takes exactly one argument",
+            args.len(),
+            name,
+        )));
+    }
+
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    // first map is the iterator, second is for the `Option<_>`
+    array.iter().map(|x| x.map(|x| op(x)).transpose()).collect()
+}
+
+fn handle<'a, O, F, S>(

Review comment:
       The pattern depends on the input and output of the function. I.e. when input is `&str`, then `Utf8/LargeUtf8`. When output is a `Native`, then the output is `PrimitiveArray<O::Native>`. In general this construct depends on the what the user is trying to achieve (wrt to input and output types).
   
   I placed this here because it allows to decouple the pattern (of handling Scalar and Array) from the implementation of the logic (`string_to_timestamp_nanos` in this case).
   
   In crypto_expressions we have a similar pattern, but in this case the function is `&str -> AsRef<[u8]>`, which allowed to write all cripto `sha` in a succinct manner. However, in that case, the output type is always `Binary` instead of `LargeBinary` for `LargeStringArray`, because the hashes are always smaller than `i32::MAX`.
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#discussion_r574621673



##########
File path: rust/datafusion/src/physical_plan/crypto_expressions.rs
##########
@@ -49,58 +58,142 @@ fn sha_process<D: SHA2Digest + Default>(input: &str) -> SHA2DigestOutput<D> {
     digest.finalize()
 }
 
-macro_rules! crypto_unary_string_function {
-    ($NAME:ident, $FUNC:expr) => {
-        /// crypto function that accepts Utf8 or LargeUtf8 and returns Utf8 string
-        pub fn $NAME<T: StringOffsetSizeTrait>(
-            args: &[ArrayRef],
-        ) -> Result<GenericStringArray<i32>> {
-            if args.len() != 1 {
-                return Err(DataFusionError::Internal(format!(
-                    "{:?} args were supplied but {} takes exactly one argument",
-                    args.len(),
-                    String::from(stringify!($NAME)),
-                )));
-            }
+/// # Errors
+/// This function errors when:
+/// * the number of arguments is not 1
+/// * the first argument is not castable to a `GenericStringArray`
+fn unary_binary_function<T, R, F>(
+    args: &[&dyn Array],
+    op: F,
+    name: &str,
+) -> Result<BinaryArray>
+where
+    R: AsRef<[u8]>,
+    T: StringOffsetSizeTrait,
+    F: Fn(&str) -> R,
+{
+    if args.len() != 1 {
+        return Err(DataFusionError::Internal(format!(
+            "{:?} args were supplied but {} takes exactly one argument",
+            args.len(),
+            name,
+        )));
+    }
+
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .ok_or_else(|| {
+            DataFusionError::Internal("failed to downcast to string".to_string())
+        })?;
 
-            let array = args[0]
-                .as_any()
-                .downcast_ref::<GenericStringArray<T>>()
-                .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array.iter().map(|x| x.map(|x| op(x))).collect())

Review comment:
       Good idea, though, I also though it would work. However, because the functions have different signatures, a deref is needed and thus we need to write it explicitly. Same for `md5_process`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#issuecomment-778382279


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9376?src=pr&el=h1) Report
   > Merging [#9376](https://codecov.io/gh/apache/arrow/pull/9376?src=pr&el=desc) (4391523) into [master](https://codecov.io/gh/apache/arrow/commit/7660a22090fcf1d0230ca1e700a4b98f647b0c48?el=desc) (7660a22) will **decrease** coverage by `0.16%`.
   > The diff coverage is `64.89%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9376/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9376?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9376      +/-   ##
   ==========================================
   - Coverage   82.31%   82.15%   -0.17%     
   ==========================================
     Files         234      235       +1     
     Lines       54482    54603     +121     
   ==========================================
   + Hits        44849    44860      +11     
   - Misses       9633     9743     +110     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9376?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/datafusion/src/execution/dataframe\_impl.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9leGVjdXRpb24vZGF0YWZyYW1lX2ltcGwucnM=) | `93.83% <ø> (ø)` | |
   | [...datafusion/src/physical\_plan/expressions/binary.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2V4cHJlc3Npb25zL2JpbmFyeS5ycw==) | `85.53% <ø> (-0.32%)` | :arrow_down: |
   | [...st/datafusion/src/physical\_plan/expressions/mod.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2V4cHJlc3Npb25zL21vZC5ycw==) | `71.42% <ø> (-18.58%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/mod.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL21vZC5ycw==) | `86.00% <ø> (ø)` | |
   | [rust/datafusion/src/scalar.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9zY2FsYXIucnM=) | `50.80% <19.14%> (-4.89%)` | :arrow_down: |
   | [...datafusion/src/physical\_plan/crypto\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2NyeXB0b19leHByZXNzaW9ucy5ycw==) | `52.45% <43.13%> (-22.55%)` | :arrow_down: |
   | [...tafusion/src/physical\_plan/datetime\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2RhdGV0aW1lX2V4cHJlc3Npb25zLnJz) | `69.40% <64.28%> (-23.89%)` | :arrow_down: |
   | [...datafusion/src/physical\_plan/string\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3N0cmluZ19leHByZXNzaW9ucy5ycw==) | `67.12% <68.57%> (-19.84%)` | :arrow_down: |
   | [.../datafusion/src/physical\_plan/array\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2FycmF5X2V4cHJlc3Npb25zLnJz) | `43.33% <77.77%> (+11.51%)` | :arrow_up: |
   | [...datafusion/src/physical\_plan/expressions/nullif.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2V4cHJlc3Npb25zL251bGxpZi5ycw==) | `86.88% <86.88%> (ø)` | |
   | ... and [10 more](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9376?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9376?src=pr&el=footer). Last update [7660a22...4391523](https://codecov.io/gh/apache/arrow/pull/9376?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#discussion_r571419247



##########
File path: rust/datafusion/src/physical_plan/crypto_expressions.rs
##########
@@ -49,58 +58,142 @@ fn sha_process<D: SHA2Digest + Default>(input: &str) -> SHA2DigestOutput<D> {
     digest.finalize()
 }
 
-macro_rules! crypto_unary_string_function {
-    ($NAME:ident, $FUNC:expr) => {
-        /// crypto function that accepts Utf8 or LargeUtf8 and returns Utf8 string
-        pub fn $NAME<T: StringOffsetSizeTrait>(
-            args: &[ArrayRef],
-        ) -> Result<GenericStringArray<i32>> {
-            if args.len() != 1 {
-                return Err(DataFusionError::Internal(format!(
-                    "{:?} args were supplied but {} takes exactly one argument",
-                    args.len(),
-                    String::from(stringify!($NAME)),
-                )));
-            }
+/// # Errors
+/// This function errors when:
+/// * the number of arguments is not 1
+/// * the first argument is not castable to a `GenericStringArray`
+fn unary_binary_function<T, R, F>(
+    args: &[&dyn Array],
+    op: F,
+    name: &str,
+) -> Result<BinaryArray>
+where
+    R: AsRef<[u8]>,
+    T: StringOffsetSizeTrait,
+    F: Fn(&str) -> R,
+{
+    if args.len() != 1 {
+        return Err(DataFusionError::Internal(format!(
+            "{:?} args were supplied but {} takes exactly one argument",
+            args.len(),
+            name,
+        )));
+    }
+
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .ok_or_else(|| {
+            DataFusionError::Internal("failed to downcast to string".to_string())
+        })?;
 
-            let array = args[0]
-                .as_any()
-                .downcast_ref::<GenericStringArray<T>>()
-                .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array.iter().map(|x| x.map(|x| op(x))).collect())
+}
+
+fn handle<F, R>(args: &[ColumnarValue], op: F, name: &str) -> Result<ColumnarValue>
+where
+    R: AsRef<[u8]>,
+    F: Fn(&str) -> R,
+{
+    match &args[0] {
+        ColumnarValue::Array(a) => match a.data_type() {
+            DataType::Utf8 => {
+                Ok(ColumnarValue::Array(Arc::new(unary_binary_function::<
+                    i32,
+                    _,
+                    _,
+                >(
+                    &[a.as_ref()], op, name
+                )?)))
+            }
+            DataType::LargeUtf8 => {
+                Ok(ColumnarValue::Array(Arc::new(unary_binary_function::<
+                    i64,
+                    _,
+                    _,
+                >(
+                    &[a.as_ref()], op, name
+                )?)))
+            }
+            other => Err(DataFusionError::Internal(format!(
+                "Unsupported data type {:?} for function {}",
+                other, name,
+            ))),
+        },
+        ColumnarValue::Scalar(scalar) => match scalar {
+            ScalarValue::Utf8(a) => {
+                let result = a.as_ref().map(|x| (op)(x).as_ref().to_vec());

Review comment:
       ```suggestion
                   let result = a.as_ref().map(|x| op(x).as_ref().to_vec());
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#discussion_r568535264



##########
File path: rust/datafusion/examples/simple_udf.rs
##########
@@ -58,77 +54,54 @@ fn create_context() -> Result<ExecutionContext> {
     Ok(ctx)
 }
 
-// a small utility function to compute pow(base, exponent)
-fn maybe_pow(base: &Option<f64>, exponent: &Option<f64>) -> Option<f64> {
-    match (base, exponent) {
-        // in arrow, any value can be null.
-        // Here we decide to make our UDF to return null when either base or exponent is null.
-        (Some(base), Some(exponent)) => Some(base.powf(*exponent)),
-        _ => None,
-    }
-}
-
-fn pow_array(base: &dyn Array, exponent: &dyn Array) -> Result<ArrayRef> {
-    // 1. cast both arguments to f64. These casts MUST be aligned with the signature or this function panics!
-    let base = base
-        .as_any()
-        .downcast_ref::<Float64Array>()
-        .expect("cast failed");
-    let exponent = exponent
-        .as_any()
-        .downcast_ref::<Float64Array>()
-        .expect("cast failed");
-
-    // this is guaranteed by DataFusion. We place it just to make it obvious.
-    assert_eq!(exponent.len(), base.len());
-
-    // 2. perform the computation
-    let array = base
-        .iter()
-        .zip(exponent.iter())
-        .map(|(base, exponent)| maybe_pow(&base, &exponent))
-        .collect::<Float64Array>();
-
-    // `Ok` because no error occurred during the calculation (we should add one if exponent was [0, 1[ and the base < 0 because that panics!)
-    // `Arc` because arrays are immutable, thread-safe, trait objects.
-    Ok(Arc::new(array))
-}
-
 /// In this example we will declare a single-type, single return type UDF that exponentiates f64, a^b
 #[tokio::main]
 async fn main() -> Result<()> {
     let mut ctx = create_context()?;
 
     // First, declare the actual implementation of the calculation
-    let pow: ScalarFunctionImplementation = Arc::new(|args: &[ColumnarValue]| {
-        // in DataFusion, all `args` and output are `ColumnarValue`, an enum of either a scalar or a dynamically-typed array.
-        // we can cater for both, or document that the UDF only supports some variants.
-        // here we will assume that al
+    let pow = |args: &[ArrayRef]| {
+        // in DataFusion, all `args` and output are dynamically-typed arrays, which means that we need to:
         // 1. cast the values to the type we want
         // 2. perform the computation for every element in the array (using a loop or SIMD) and construct the result
 
         // this is guaranteed by DataFusion based on the function's signature.
         assert_eq!(args.len(), 2);
 
-        let (base, exponent) = (&args[0], &args[1]);
-
-        let result = match (base, exponent) {
-            (
-                ColumnarValue::Scalar(ScalarValue::Float64(base)),
-                ColumnarValue::Scalar(ScalarValue::Float64(exponent)),
-            ) => ColumnarValue::Scalar(ScalarValue::Float64(maybe_pow(base, exponent))),
-            (ColumnarValue::Array(base), ColumnarValue::Array(exponent)) => {
-                let array = pow_array(base.as_ref(), exponent.as_ref())?;
-                ColumnarValue::Array(array)
-            }
-            _ => {
-                return Err(DataFusionError::Execution(
-                    "This UDF only supports f64".to_string(),
-                ))
-            }
-        };
-        Ok(result)
-    });
+        // 1. cast both arguments to f64. These casts MUST be aligned with the signature or this function panics!
+        let base = &args[0]
+            .as_any()
+            .downcast_ref::<Float64Array>()
+            .expect("cast failed");
+        let exponent = &args[1]
+            .as_any()
+            .downcast_ref::<Float64Array>()
+            .expect("cast failed");
+
+        // this is guaranteed by DataFusion. We place it just to make it obvious.
+        assert_eq!(exponent.len(), base.len());
+
+        // 2. perform the computation
+        let array = base
+            .iter()
+            .zip(exponent.iter())
+            .map(|(base, exponent)| {
+                match (base, exponent) {
+                    // in arrow, any value can be null.
+                    // Here we decide to make our UDF to return null when either base or exponent is null.
+                    (Some(base), Some(exponent)) => Some(base.powf(exponent)),
+                    _ => None,
+                }
+            })
+            .collect::<Float64Array>();
+
+        // `Ok` because no error occurred during the calculation (we should add one if exponent was [0, 1[ and the base < 0 because that panics!)
+        // `Arc` because arrays are immutable, thread-safe, trait objects.
+        Ok(Arc::new(array) as ArrayRef)
+    };
+    // the function above expects an `ArrayRef`, but DataFusion may pass a scalar to a UDF.
+    // thus, we use `make_scalar_function` to decorare the closure so that it can handle both Arrays and Scalar values.

Review comment:
       👍 

##########
File path: rust/datafusion/examples/simple_udf.rs
##########
@@ -54,50 +58,76 @@ fn create_context() -> Result<ExecutionContext> {
     Ok(ctx)
 }
 
+// a small utility function to compute pow(base, exponent)
+fn maybe_pow(base: &Option<f64>, exponent: &Option<f64>) -> Option<f64> {
+    match (base, exponent) {
+        // in arrow, any value can be null.
+        // Here we decide to make our UDF to return null when either base or exponent is null.
+        (Some(base), Some(exponent)) => Some(base.powf(*exponent)),
+        _ => None,
+    }
+}
+
+fn pow_array(base: &dyn Array, exponent: &dyn Array) -> Result<ArrayRef> {
+    // 1. cast both arguments to f64. These casts MUST be aligned with the signature or this function panics!
+    let base = base
+        .as_any()
+        .downcast_ref::<Float64Array>()
+        .expect("cast failed");
+    let exponent = exponent
+        .as_any()
+        .downcast_ref::<Float64Array>()
+        .expect("cast failed");
+
+    // this is guaranteed by DataFusion. We place it just to make it obvious.
+    assert_eq!(exponent.len(), base.len());
+
+    // 2. perform the computation
+    let array = base
+        .iter()
+        .zip(exponent.iter())
+        .map(|(base, exponent)| maybe_pow(&base, &exponent))
+        .collect::<Float64Array>();
+
+    // `Ok` because no error occurred during the calculation (we should add one if exponent was [0, 1[ and the base < 0 because that panics!)

Review comment:
       cool -- sorry I guess I am used to seeing an open interval  using a `)` --so in this case something like `[0, 1)` to represent `0 <= exponent < 1` (e.g. [here](https://en.wikipedia.org/wiki/Interval_(mathematics)#Terminology)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#discussion_r568342573



##########
File path: rust/arrow/src/conversions.rs
##########
@@ -0,0 +1,127 @@
+// Licensed to the Apache Software Foundation (ASF) under one

Review comment:
       It was renamed on a different PR to `temporal_conversions`. I rebased this PR so that we no longer have these changes.

##########
File path: rust/datafusion/examples/simple_udf.rs
##########
@@ -54,50 +58,76 @@ fn create_context() -> Result<ExecutionContext> {
     Ok(ctx)
 }
 
+// a small utility function to compute pow(base, exponent)
+fn maybe_pow(base: &Option<f64>, exponent: &Option<f64>) -> Option<f64> {
+    match (base, exponent) {
+        // in arrow, any value can be null.
+        // Here we decide to make our UDF to return null when either base or exponent is null.
+        (Some(base), Some(exponent)) => Some(base.powf(*exponent)),
+        _ => None,
+    }
+}
+
+fn pow_array(base: &dyn Array, exponent: &dyn Array) -> Result<ArrayRef> {
+    // 1. cast both arguments to f64. These casts MUST be aligned with the signature or this function panics!
+    let base = base
+        .as_any()
+        .downcast_ref::<Float64Array>()
+        .expect("cast failed");
+    let exponent = exponent
+        .as_any()
+        .downcast_ref::<Float64Array>()
+        .expect("cast failed");
+
+    // this is guaranteed by DataFusion. We place it just to make it obvious.
+    assert_eq!(exponent.len(), base.len());
+
+    // 2. perform the computation
+    let array = base
+        .iter()
+        .zip(exponent.iter())
+        .map(|(base, exponent)| maybe_pow(&base, &exponent))
+        .collect::<Float64Array>();
+
+    // `Ok` because no error occurred during the calculation (we should add one if exponent was [0, 1[ and the base < 0 because that panics!)

Review comment:
       This represents an open interval: `0 <= exponent < 1` because `(x)^1 = x` for all x (including negative ones).

##########
File path: rust/datafusion/src/physical_plan/datetime_expressions.rs
##########
@@ -167,152 +175,166 @@ fn naive_datetime_to_timestamp(s: &str, datetime: NaiveDateTime) -> Result<i64>
     }
 }
 
-/// convert an array of strings into `Timestamp(Nanosecond, None)`
-pub fn to_timestamp(args: &[ArrayRef]) -> Result<TimestampNanosecondArray> {
-    let num_rows = args[0].len();
-    let string_args =
-        &args[0]
-            .as_any()
-            .downcast_ref::<StringArray>()
-            .ok_or_else(|| {
-                DataFusionError::Internal(
-                    "could not cast to_timestamp input to StringArray".to_string(),
-                )
-            })?;
-
-    let result = (0..num_rows)
-        .map(|i| {
-            if string_args.is_null(i) {
-                // NB: Since we use the same null bitset as the input,
-                // the output for this value will be ignored, but we
-                // need some value in the array we are building.
-                Ok(0)
-            } else {
-                string_to_timestamp_nanos(string_args.value(i))
+pub(crate) fn unary_string_to_primitive_function<'a, T, O, F>(
+    args: &[&'a dyn Array],
+    op: F,
+    name: &str,
+) -> Result<PrimitiveArray<O>>
+where
+    O: ArrowPrimitiveType,
+    T: StringOffsetSizeTrait,
+    F: Fn(&'a str) -> Result<O::Native>,
+{
+    if args.len() != 1 {
+        return Err(DataFusionError::Internal(format!(
+            "{:?} args were supplied but {} takes exactly one argument",
+            args.len(),
+            name,
+        )));
+    }
+
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    // first map is the iterator, second is for the `Option<_>`
+    array.iter().map(|x| x.map(|x| op(x)).transpose()).collect()
+}
+
+fn handle<'a, O, F, S>(

Review comment:
       The pattern depends on the input and output of the function. I.e. when input is `&str`, then `Utf8/LargeUtf8`. When output is a `Native`, then the output is `PrimitiveArray<O::Native>`. In general this construct depends on the what the user is trying to achieve (wrt to input and output types).
   
   I placed this here because it allows to decouple the pattern (of handling Scalar and Array) from the implementation of the logic (`string_to_timestamp_nanos` in this case).
   
   In crypto_expressions we have a similar pattern, but in this case the function is `&str -> AsRef<[u8]>`, which allowed to write all cripto `sha` in a succinct manner. However, in that case, the output type is always `Binary` instead of `LargeBinary` for `LargeStringArray`, because the hashes are always smaller than `i32::MAX`.
   
   

##########
File path: rust/datafusion/src/physical_plan/datetime_expressions.rs
##########
@@ -167,152 +175,166 @@ fn naive_datetime_to_timestamp(s: &str, datetime: NaiveDateTime) -> Result<i64>
     }
 }
 
-/// convert an array of strings into `Timestamp(Nanosecond, None)`
-pub fn to_timestamp(args: &[ArrayRef]) -> Result<TimestampNanosecondArray> {
-    let num_rows = args[0].len();
-    let string_args =
-        &args[0]
-            .as_any()
-            .downcast_ref::<StringArray>()
-            .ok_or_else(|| {
-                DataFusionError::Internal(
-                    "could not cast to_timestamp input to StringArray".to_string(),
-                )
-            })?;
-
-    let result = (0..num_rows)
-        .map(|i| {
-            if string_args.is_null(i) {
-                // NB: Since we use the same null bitset as the input,
-                // the output for this value will be ignored, but we
-                // need some value in the array we are building.
-                Ok(0)
-            } else {
-                string_to_timestamp_nanos(string_args.value(i))
+pub(crate) fn unary_string_to_primitive_function<'a, T, O, F>(
+    args: &[&'a dyn Array],
+    op: F,
+    name: &str,
+) -> Result<PrimitiveArray<O>>
+where
+    O: ArrowPrimitiveType,
+    T: StringOffsetSizeTrait,
+    F: Fn(&'a str) -> Result<O::Native>,
+{
+    if args.len() != 1 {
+        return Err(DataFusionError::Internal(format!(
+            "{:?} args were supplied but {} takes exactly one argument",
+            args.len(),
+            name,
+        )));
+    }
+
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    // first map is the iterator, second is for the `Option<_>`
+    array.iter().map(|x| x.map(|x| op(x)).transpose()).collect()
+}
+
+fn handle<'a, O, F, S>(

Review comment:
       The pattern depends on the input and output of the function. I.e. when input is `&str`, then `Utf8/LargeUtf8`. When output is a `Native`, then the output is `PrimitiveArray<O::Native>`. In general this construct depends on the what the user is trying to achieve (wrt to input and output types).
   
   I placed this here because it allows to decouple the pattern (of handling Scalar and Array) from the implementation of the logic (`string_to_timestamp_nanos` in this case).
   
   In `crypto_expressions` we have a similar pattern, but in this case the function is `&str -> AsRef<[u8]>`, which allowed to write all cripto `sha` in a succinct manner. However, in that case, the output type is always `Binary` instead of `LargeBinary` for `LargeStringArray`, because the hashes are always smaller than `i32::MAX`. All of this was already written, I just expanded it for the two variants (scalar and vector).
   
   Note that `crypto_expressions::handle` and `crypto_expressions::md5` are very similar, but their return types are different: `unary_binary_function` receives a `GenericStringArray`, but returns a `BinaryArray`. This is because `MD5`'s signature is `string -> string`, while `sha` is `string -> binary`.
   

##########
File path: rust/datafusion/src/physical_plan/datetime_expressions.rs
##########
@@ -167,152 +175,166 @@ fn naive_datetime_to_timestamp(s: &str, datetime: NaiveDateTime) -> Result<i64>
     }
 }
 
-/// convert an array of strings into `Timestamp(Nanosecond, None)`
-pub fn to_timestamp(args: &[ArrayRef]) -> Result<TimestampNanosecondArray> {
-    let num_rows = args[0].len();
-    let string_args =
-        &args[0]
-            .as_any()
-            .downcast_ref::<StringArray>()
-            .ok_or_else(|| {
-                DataFusionError::Internal(
-                    "could not cast to_timestamp input to StringArray".to_string(),
-                )
-            })?;
-
-    let result = (0..num_rows)
-        .map(|i| {
-            if string_args.is_null(i) {
-                // NB: Since we use the same null bitset as the input,
-                // the output for this value will be ignored, but we
-                // need some value in the array we are building.
-                Ok(0)
-            } else {
-                string_to_timestamp_nanos(string_args.value(i))
+pub(crate) fn unary_string_to_primitive_function<'a, T, O, F>(
+    args: &[&'a dyn Array],
+    op: F,
+    name: &str,
+) -> Result<PrimitiveArray<O>>
+where
+    O: ArrowPrimitiveType,
+    T: StringOffsetSizeTrait,
+    F: Fn(&'a str) -> Result<O::Native>,
+{
+    if args.len() != 1 {
+        return Err(DataFusionError::Internal(format!(
+            "{:?} args were supplied but {} takes exactly one argument",
+            args.len(),
+            name,
+        )));
+    }
+
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    // first map is the iterator, second is for the `Option<_>`
+    array.iter().map(|x| x.map(|x| op(x)).transpose()).collect()
+}
+
+fn handle<'a, O, F, S>(

Review comment:
       The pattern depends on the input and output of the function. I.e. when input is `&str`, then `Utf8/LargeUtf8`. When output is a `Native`, then the output is `PrimitiveArray<O::Native>`. In general this construct depends on the what the user is trying to achieve (wrt to input and output types).
   
   I placed this here because it allows to decouple the pattern (of handling Scalar and Array) from the implementation of the logic (`string_to_timestamp_nanos` in this case).
   
   In `crypto_expressions` we have a similar pattern, but in this case the function is `&str -> AsRef<[u8]>`, which allowed to write all cripto `sha` in a succinct manner. However, in that case, the output type is always `Binary` instead of `LargeBinary` for `LargeStringArray`, because the hashes are always smaller than `i32::MAX`. All of this was already written, I just expanded it for the two variants (scalar and vector).
   
   Note that `crypto_expressions::handle` and `crypto_expressions::md5` are very similar, but their return types are different: `handle` receives a `GenericStringArray`, but returns a `BinaryArray`. This is because `MD5`'s signature is `string -> string`, while `sha` is `string -> binary`.
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#issuecomment-770464465


   > @jorgecarleitao -- I agree with @Dandandan -- think `date_trunc(scalar, array)` and `date_trun(scalar, scalar) are the really important cases. I have no use case for `date_trunc(array, array)`
   
   Having been trying to implement a lot of the Postgres logic recently they all logically support the use of arrays for each of the parameters (i.e. `date_trunc(array, array)`) I have no idea how common the use of this functionality really is - and I suspect it is low - but if we are building the API for Postgres compatibility it would be good to solve.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#discussion_r568168403



##########
File path: rust/datafusion/examples/simple_udf.rs
##########
@@ -54,50 +58,76 @@ fn create_context() -> Result<ExecutionContext> {
     Ok(ctx)
 }
 
+// a small utility function to compute pow(base, exponent)
+fn maybe_pow(base: &Option<f64>, exponent: &Option<f64>) -> Option<f64> {
+    match (base, exponent) {
+        // in arrow, any value can be null.
+        // Here we decide to make our UDF to return null when either base or exponent is null.
+        (Some(base), Some(exponent)) => Some(base.powf(*exponent)),
+        _ => None,
+    }
+}
+
+fn pow_array(base: &dyn Array, exponent: &dyn Array) -> Result<ArrayRef> {
+    // 1. cast both arguments to f64. These casts MUST be aligned with the signature or this function panics!
+    let base = base
+        .as_any()
+        .downcast_ref::<Float64Array>()
+        .expect("cast failed");
+    let exponent = exponent
+        .as_any()
+        .downcast_ref::<Float64Array>()
+        .expect("cast failed");
+
+    // this is guaranteed by DataFusion. We place it just to make it obvious.
+    assert_eq!(exponent.len(), base.len());
+
+    // 2. perform the computation
+    let array = base
+        .iter()
+        .zip(exponent.iter())
+        .map(|(base, exponent)| maybe_pow(&base, &exponent))
+        .collect::<Float64Array>();
+
+    // `Ok` because no error occurred during the calculation (we should add one if exponent was [0, 1[ and the base < 0 because that panics!)

Review comment:
       ```suggestion
       // `Ok` because no error occurred during the calculation (we should add one if exponent was [0, 1] and the base < 0 because that panics!)
   ```

##########
File path: rust/datafusion/src/physical_plan/datetime_expressions.rs
##########
@@ -167,152 +175,166 @@ fn naive_datetime_to_timestamp(s: &str, datetime: NaiveDateTime) -> Result<i64>
     }
 }
 
-/// convert an array of strings into `Timestamp(Nanosecond, None)`
-pub fn to_timestamp(args: &[ArrayRef]) -> Result<TimestampNanosecondArray> {
-    let num_rows = args[0].len();
-    let string_args =
-        &args[0]
-            .as_any()
-            .downcast_ref::<StringArray>()
-            .ok_or_else(|| {
-                DataFusionError::Internal(
-                    "could not cast to_timestamp input to StringArray".to_string(),
-                )
-            })?;
-
-    let result = (0..num_rows)
-        .map(|i| {
-            if string_args.is_null(i) {
-                // NB: Since we use the same null bitset as the input,
-                // the output for this value will be ignored, but we
-                // need some value in the array we are building.
-                Ok(0)
-            } else {
-                string_to_timestamp_nanos(string_args.value(i))
+pub(crate) fn unary_string_to_primitive_function<'a, T, O, F>(
+    args: &[&'a dyn Array],
+    op: F,
+    name: &str,
+) -> Result<PrimitiveArray<O>>
+where
+    O: ArrowPrimitiveType,
+    T: StringOffsetSizeTrait,
+    F: Fn(&'a str) -> Result<O::Native>,
+{
+    if args.len() != 1 {
+        return Err(DataFusionError::Internal(format!(
+            "{:?} args were supplied but {} takes exactly one argument",
+            args.len(),
+            name,
+        )));
+    }
+
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    // first map is the iterator, second is for the `Option<_>`
+    array.iter().map(|x| x.map(|x| op(x)).transpose()).collect()
+}
+
+fn handle<'a, O, F, S>(

Review comment:
       I may be missing it -- but this function only seems to be invoked once with the same set of type arguments -- does it need to be generic? Or more broadly, can we hoist this seemingly repeated `handle` pattern somewhere it can be  reused (and reduce the cognitive load on people writing new functions?)

##########
File path: rust/datafusion/src/physical_plan/datetime_expressions.rs
##########
@@ -167,152 +175,166 @@ fn naive_datetime_to_timestamp(s: &str, datetime: NaiveDateTime) -> Result<i64>
     }
 }
 
-/// convert an array of strings into `Timestamp(Nanosecond, None)`
-pub fn to_timestamp(args: &[ArrayRef]) -> Result<TimestampNanosecondArray> {
-    let num_rows = args[0].len();
-    let string_args =
-        &args[0]
-            .as_any()
-            .downcast_ref::<StringArray>()
-            .ok_or_else(|| {
-                DataFusionError::Internal(
-                    "could not cast to_timestamp input to StringArray".to_string(),
-                )
-            })?;
-
-    let result = (0..num_rows)
-        .map(|i| {
-            if string_args.is_null(i) {
-                // NB: Since we use the same null bitset as the input,
-                // the output for this value will be ignored, but we
-                // need some value in the array we are building.
-                Ok(0)
-            } else {
-                string_to_timestamp_nanos(string_args.value(i))
+pub(crate) fn unary_string_to_primitive_function<'a, T, O, F>(

Review comment:
       I think adding an example of using this pattern for implementing UDfs might be really helpful

##########
File path: rust/arrow/src/conversions.rs
##########
@@ -0,0 +1,127 @@
+// Licensed to the Apache Software Foundation (ASF) under one

Review comment:
       this module might be more specifically named `datetime_conversions.rs`

##########
File path: rust/datafusion/src/physical_plan/string_expressions.rs
##########
@@ -17,27 +17,100 @@
 
 //! String expressions
 
-use crate::error::{DataFusionError, Result};
-use arrow::array::{
-    Array, ArrayRef, GenericStringArray, StringArray, StringBuilder,
-    StringOffsetSizeTrait,
+use std::sync::Arc;
+
+use crate::{
+    error::{DataFusionError, Result},
+    scalar::ScalarValue,
+};
+use arrow::{
+    array::{Array, GenericStringArray, StringArray, StringOffsetSizeTrait},
+    datatypes::DataType,
 };
 
-macro_rules! downcast_vec {
-    ($ARGS:expr, $ARRAY_TYPE:ident) => {{
-        $ARGS
-            .iter()
-            .map(|e| match e.as_any().downcast_ref::<$ARRAY_TYPE>() {
-                Some(array) => Ok(array),
-                _ => Err(DataFusionError::Internal("failed to downcast".to_string())),
-            })
-    }};
+use super::ColumnarValue;
+
+pub(crate) fn unary_string_function<'a, T, O, F, R>(

Review comment:
       this function is effectively applying a `str->str` function `op` to all values in `args`? Some comments might be helpful




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9376: WIP: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#issuecomment-770365424


   @jorgecarleitao  -- I agree with @Dandandan  -- think `date_trunc(scalar, array)`  and `date_trun(scalar, scalar) are the really important cases. I have no use case for `date_trunc(array, array)`


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#issuecomment-778126649


   @seddonm1  -- what do you think about merge order of this PR and #9243 ? (which will conflict)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] removed a comment on pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
github-actions[bot] removed a comment on pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#issuecomment-770268175


   <!--
     Licensed to the Apache Software Foundation (ASF) under one
     or more contributor license agreements.  See the NOTICE file
     distributed with this work for additional information
     regarding copyright ownership.  The ASF licenses this file
     to you under the Apache License, Version 2.0 (the
     "License"); you may not use this file except in compliance
     with the License.  You may obtain a copy of the License at
   
       http://www.apache.org/licenses/LICENSE-2.0
   
     Unless required by applicable law or agreed to in writing,
     software distributed under the License is distributed on an
     "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
     KIND, either express or implied.  See the License for the
     specific language governing permissions and limitations
     under the License.
   -->
   
   Thanks for opening a pull request!
   
   Could you open an issue for this pull request on JIRA?
   https://issues.apache.org/jira/browse/ARROW
   
   Then could you also rename pull request title in the following format?
   
       ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
     * [Other pull requests](https://github.com/apache/arrow/pulls/)
     * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#discussion_r568532324



##########
File path: rust/datafusion/src/physical_plan/datetime_expressions.rs
##########
@@ -167,152 +175,166 @@ fn naive_datetime_to_timestamp(s: &str, datetime: NaiveDateTime) -> Result<i64>
     }
 }
 
-/// convert an array of strings into `Timestamp(Nanosecond, None)`
-pub fn to_timestamp(args: &[ArrayRef]) -> Result<TimestampNanosecondArray> {
-    let num_rows = args[0].len();
-    let string_args =
-        &args[0]
-            .as_any()
-            .downcast_ref::<StringArray>()
-            .ok_or_else(|| {
-                DataFusionError::Internal(
-                    "could not cast to_timestamp input to StringArray".to_string(),
-                )
-            })?;
-
-    let result = (0..num_rows)
-        .map(|i| {
-            if string_args.is_null(i) {
-                // NB: Since we use the same null bitset as the input,
-                // the output for this value will be ignored, but we
-                // need some value in the array we are building.
-                Ok(0)
-            } else {
-                string_to_timestamp_nanos(string_args.value(i))
+pub(crate) fn unary_string_to_primitive_function<'a, T, O, F>(
+    args: &[&'a dyn Array],
+    op: F,
+    name: &str,
+) -> Result<PrimitiveArray<O>>
+where
+    O: ArrowPrimitiveType,
+    T: StringOffsetSizeTrait,
+    F: Fn(&'a str) -> Result<O::Native>,
+{
+    if args.len() != 1 {
+        return Err(DataFusionError::Internal(format!(
+            "{:?} args were supplied but {} takes exactly one argument",
+            args.len(),
+            name,
+        )));
+    }
+
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    // first map is the iterator, second is for the `Option<_>`
+    array.iter().map(|x| x.map(|x| op(x)).transpose()).collect()
+}
+
+fn handle<'a, O, F, S>(

Review comment:
       makes sense -- thank you for the explination

##########
File path: rust/datafusion/src/physical_plan/datetime_expressions.rs
##########
@@ -167,152 +175,166 @@ fn naive_datetime_to_timestamp(s: &str, datetime: NaiveDateTime) -> Result<i64>
     }
 }
 
-/// convert an array of strings into `Timestamp(Nanosecond, None)`
-pub fn to_timestamp(args: &[ArrayRef]) -> Result<TimestampNanosecondArray> {
-    let num_rows = args[0].len();
-    let string_args =
-        &args[0]
-            .as_any()
-            .downcast_ref::<StringArray>()
-            .ok_or_else(|| {
-                DataFusionError::Internal(
-                    "could not cast to_timestamp input to StringArray".to_string(),
-                )
-            })?;
-
-    let result = (0..num_rows)
-        .map(|i| {
-            if string_args.is_null(i) {
-                // NB: Since we use the same null bitset as the input,
-                // the output for this value will be ignored, but we
-                // need some value in the array we are building.
-                Ok(0)
-            } else {
-                string_to_timestamp_nanos(string_args.value(i))
+pub(crate) fn unary_string_to_primitive_function<'a, T, O, F>(
+    args: &[&'a dyn Array],
+    op: F,
+    name: &str,
+) -> Result<PrimitiveArray<O>>
+where
+    O: ArrowPrimitiveType,
+    T: StringOffsetSizeTrait,
+    F: Fn(&'a str) -> Result<O::Native>,
+{
+    if args.len() != 1 {
+        return Err(DataFusionError::Internal(format!(
+            "{:?} args were supplied but {} takes exactly one argument",
+            args.len(),
+            name,
+        )));
+    }
+
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    // first map is the iterator, second is for the `Option<_>`
+    array.iter().map(|x| x.map(|x| op(x)).transpose()).collect()
+}
+
+fn handle<'a, O, F, S>(

Review comment:
       makes sense -- thank you for the explanation




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#issuecomment-778133305


   Unfortunately (for me) this logically does go first as being able to identify ScalarValue would give a huge performance advantage. 
   
   I am happy to rework the other one after this is merged.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #9376: WIP: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#issuecomment-770344479


   FWIW, PostgreSQL also doesn't support the non-scalar granularity for `date_trunc`


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#issuecomment-774456936


   @andygrove  / @Dandandan / @seddonm1  / @maxburke / @jhorstmann  -- what are your thoughts on this one?
   
   It is a significant enough change I think someone using DataFusion in their projects, other than myself should weigh in. I like it a lot -- and I think it could serve as the basis for constant folding (e.g. transform `A + (5 + 7)` --> `A + 12` at plan time)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#issuecomment-777602556


   Rebased. 💦


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#discussion_r571419483



##########
File path: rust/datafusion/src/physical_plan/string_expressions.rs
##########
@@ -17,27 +17,108 @@
 
 //! String expressions
 
-use crate::error::{DataFusionError, Result};
-use arrow::array::{
-    Array, ArrayRef, GenericStringArray, StringArray, StringBuilder,
-    StringOffsetSizeTrait,
+use std::sync::Arc;
+
+use crate::{
+    error::{DataFusionError, Result},
+    scalar::ScalarValue,
+};
+use arrow::{
+    array::{Array, GenericStringArray, StringArray, StringOffsetSizeTrait},
+    datatypes::DataType,
 };
 
-macro_rules! downcast_vec {
-    ($ARGS:expr, $ARRAY_TYPE:ident) => {{
-        $ARGS
-            .iter()
-            .map(|e| match e.as_any().downcast_ref::<$ARRAY_TYPE>() {
-                Some(array) => Ok(array),
-                _ => Err(DataFusionError::Internal("failed to downcast".to_string())),
-            })
-    }};
+use super::ColumnarValue;
+
+/// applies a unary expression to `args[0]` that is expected to be downcastable to
+/// a `GenericStringArray` and returns a `GenericStringArray` (which may have a different offset)
+/// # Errors
+/// This function errors when:
+/// * the number of arguments is not 1
+/// * the first argument is not castable to a `GenericStringArray`
+pub(crate) fn unary_string_function<'a, T, O, F, R>(
+    args: &[&'a dyn Array],
+    op: F,
+    name: &str,
+) -> Result<GenericStringArray<O>>
+where
+    R: AsRef<str>,
+    O: StringOffsetSizeTrait,
+    T: StringOffsetSizeTrait,
+    F: Fn(&'a str) -> R,
+{
+    if args.len() != 1 {
+        return Err(DataFusionError::Internal(format!(
+            "{:?} args were supplied but {} takes exactly one argument",
+            args.len(),
+            name,
+        )));
+    }
+
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .ok_or_else(|| {
+            DataFusionError::Internal("failed to downcast to string".to_string())
+        })?;
+
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array.iter().map(|x| x.map(|x| op(x))).collect())

Review comment:
       ```suggestion
       Ok(array.iter().map(|x| x.map(op).collect())
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #9376: WIP: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#issuecomment-770268294


   cc  @alamb @andygrove @Dandandan @seddonm1 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#issuecomment-777796721


   This is now ready to merge. This will collide with an also large PR, #9243. We have a function to enable compatibility, but it is still some work :(


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb closed pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
alamb closed pull request #9376:
URL: https://github.com/apache/arrow/pull/9376


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] velvia edited a comment on pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
velvia edited a comment on pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#issuecomment-778693797


   Commenting since @maxburke pinged me.
   
   On the surface I think this is a great change from the performance perspective.  I totally agree that being able to deal with scalars instead of just arrays adds huge room for optimization.   I have always thought that always needing intermediate arrays slowed down the processing of DataFusion significantly, for certain cases, and is also cache unfriendly.
   
   On the other hand, I hear what @alamb and others are saying that it adds complexity to what is already nontrivial.   I agree with that.
   
   I wonder if it is possible to get the best of both worlds, by extending the `Array` trait slightly and having a subclass of `Array` which denotes scalars, like `ScalarArray` which do not need `Buffer` storage and just represents constant scalars.   This way, functions would only need to deal with Array, but can recognize this subclass `ScalarArray` and do optimizations that way.   This is a very half-formed thought at the moment.   The train of thought is just what if the `Array` was not strictly a buffer-based representation but just a way to access columnar data, and in certain cases represents scalars.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io commented on pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#issuecomment-778382279


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9376?src=pr&el=h1) Report
   > Merging [#9376](https://codecov.io/gh/apache/arrow/pull/9376?src=pr&el=desc) (41e8f26) into [master](https://codecov.io/gh/apache/arrow/commit/5e3fcfabf471fd3790e114b2245690c9c08ff743?el=desc) (5e3fcfa) will **decrease** coverage by `0.17%`.
   > The diff coverage is `74.50%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9376/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9376?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9376      +/-   ##
   ==========================================
   - Coverage   82.32%   82.14%   -0.18%     
   ==========================================
     Files         233      235       +2     
     Lines       54446    54582     +136     
   ==========================================
   + Hits        44823    44838      +15     
   - Misses       9623     9744     +121     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9376?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow/src/array/builder.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYnVpbGRlci5ycw==) | `84.93% <ø> (ø)` | |
   | [rust/arrow/src/compute/kernels/sort.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL3NvcnQucnM=) | `93.56% <ø> (ø)` | |
   | [rust/arrow/src/compute/kernels/substring.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS9rZXJuZWxzL3N1YnN0cmluZy5ycw==) | `98.29% <ø> (ø)` | |
   | [rust/arrow/src/compute/util.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvY29tcHV0ZS91dGlsLnJz) | `98.92% <ø> (ø)` | |
   | [rust/arrow/src/json/reader.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvanNvbi9yZWFkZXIucnM=) | `82.82% <ø> (ø)` | |
   | [rust/benchmarks/src/bin/tpch.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9iZW5jaG1hcmtzL3NyYy9iaW4vdHBjaC5ycw==) | `38.33% <0.00%> (+0.18%)` | :arrow_up: |
   | [rust/datafusion/src/datasource/memory.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL21lbW9yeS5ycw==) | `80.00% <ø> (ø)` | |
   | [rust/datafusion/src/execution/dataframe\_impl.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9leGVjdXRpb24vZGF0YWZyYW1lX2ltcGwucnM=) | `93.83% <ø> (ø)` | |
   | [rust/datafusion/src/logical\_plan/dfschema.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vZGZzY2hlbWEucnM=) | `85.52% <ø> (ø)` | |
   | [...datafusion/src/physical\_plan/expressions/binary.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2V4cHJlc3Npb25zL2JpbmFyeS5ycw==) | `85.53% <ø> (-0.32%)` | :arrow_down: |
   | ... and [55 more](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9376?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9376?src=pr&el=footer). Last update [34e7671...cd79e6e](https://codecov.io/gh/apache/arrow/pull/9376?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#discussion_r571419585



##########
File path: rust/datafusion/src/physical_plan/crypto_expressions.rs
##########
@@ -49,58 +58,142 @@ fn sha_process<D: SHA2Digest + Default>(input: &str) -> SHA2DigestOutput<D> {
     digest.finalize()
 }
 
-macro_rules! crypto_unary_string_function {
-    ($NAME:ident, $FUNC:expr) => {
-        /// crypto function that accepts Utf8 or LargeUtf8 and returns Utf8 string
-        pub fn $NAME<T: StringOffsetSizeTrait>(
-            args: &[ArrayRef],
-        ) -> Result<GenericStringArray<i32>> {
-            if args.len() != 1 {
-                return Err(DataFusionError::Internal(format!(
-                    "{:?} args were supplied but {} takes exactly one argument",
-                    args.len(),
-                    String::from(stringify!($NAME)),
-                )));
-            }
+/// # Errors
+/// This function errors when:
+/// * the number of arguments is not 1
+/// * the first argument is not castable to a `GenericStringArray`
+fn unary_binary_function<T, R, F>(
+    args: &[&dyn Array],
+    op: F,
+    name: &str,
+) -> Result<BinaryArray>
+where
+    R: AsRef<[u8]>,
+    T: StringOffsetSizeTrait,
+    F: Fn(&str) -> R,
+{
+    if args.len() != 1 {
+        return Err(DataFusionError::Internal(format!(
+            "{:?} args were supplied but {} takes exactly one argument",
+            args.len(),
+            name,
+        )));
+    }
+
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .ok_or_else(|| {
+            DataFusionError::Internal("failed to downcast to string".to_string())
+        })?;
 
-            let array = args[0]
-                .as_any()
-                .downcast_ref::<GenericStringArray<T>>()
-                .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array.iter().map(|x| x.map(|x| op(x))).collect())
+}
+
+fn handle<F, R>(args: &[ColumnarValue], op: F, name: &str) -> Result<ColumnarValue>
+where
+    R: AsRef<[u8]>,
+    F: Fn(&str) -> R,
+{
+    match &args[0] {
+        ColumnarValue::Array(a) => match a.data_type() {
+            DataType::Utf8 => {
+                Ok(ColumnarValue::Array(Arc::new(unary_binary_function::<
+                    i32,
+                    _,
+                    _,
+                >(
+                    &[a.as_ref()], op, name
+                )?)))
+            }
+            DataType::LargeUtf8 => {
+                Ok(ColumnarValue::Array(Arc::new(unary_binary_function::<
+                    i64,
+                    _,
+                    _,
+                >(
+                    &[a.as_ref()], op, name
+                )?)))
+            }
+            other => Err(DataFusionError::Internal(format!(
+                "Unsupported data type {:?} for function {}",
+                other, name,
+            ))),
+        },
+        ColumnarValue::Scalar(scalar) => match scalar {
+            ScalarValue::Utf8(a) => {
+                let result = a.as_ref().map(|x| (op)(x).as_ref().to_vec());
+                Ok(ColumnarValue::Scalar(ScalarValue::Binary(result)))
+            }
+            ScalarValue::LargeUtf8(a) => {
+                let result = a.as_ref().map(|x| (op)(x).as_ref().to_vec());
+                Ok(ColumnarValue::Scalar(ScalarValue::Binary(result)))
+            }
+            other => Err(DataFusionError::Internal(format!(
+                "Unsupported data type {:?} for function {}",
+                other, name,
+            ))),
+        },
+    }
+}
 
-            // first map is the iterator, second is for the `Option<_>`
-            Ok(array.iter().map(|x| x.map(|x| $FUNC(x))).collect())
-        }
-    };
+fn md5_array<T: StringOffsetSizeTrait>(
+    args: &[&dyn Array],
+) -> Result<GenericStringArray<i32>> {
+    unary_string_function::<T, i32, _, _>(args, md5_process, "md5")
 }
 
-macro_rules! crypto_unary_binary_function {
-    ($NAME:ident, $FUNC:expr) => {
-        /// crypto function that accepts Utf8 or LargeUtf8 and returns Binary
-        pub fn $NAME<T: StringOffsetSizeTrait>(
-            args: &[ArrayRef],
-        ) -> Result<GenericBinaryArray<i32>> {
-            if args.len() != 1 {
-                return Err(DataFusionError::Internal(format!(
-                    "{:?} args were supplied but {} takes exactly one argument",
-                    args.len(),
-                    String::from(stringify!($NAME)),
-                )));
+/// crypto function that accepts Utf8 or LargeUtf8 and returns a [`ColumnarValue`]
+pub fn md5(args: &[ColumnarValue]) -> Result<ColumnarValue> {
+    match &args[0] {
+        ColumnarValue::Array(a) => match a.data_type() {
+            DataType::Utf8 => Ok(ColumnarValue::Array(Arc::new(md5_array::<i32>(&[
+                a.as_ref()
+            ])?))),
+            DataType::LargeUtf8 => {
+                Ok(ColumnarValue::Array(Arc::new(md5_array::<i64>(&[
+                    a.as_ref()
+                ])?)))
             }
+            other => Err(DataFusionError::Internal(format!(
+                "Unsupported data type {:?} for function md5",
+                other,
+            ))),
+        },
+        ColumnarValue::Scalar(scalar) => match scalar {
+            ScalarValue::Utf8(a) => {
+                let result = a.as_ref().map(|x| md5_process(x));
+                Ok(ColumnarValue::Scalar(ScalarValue::Utf8(result)))
+            }
+            ScalarValue::LargeUtf8(a) => {
+                let result = a.as_ref().map(|x| md5_process(x));

Review comment:
       ```suggestion
                   let result = a.as_ref().map(md5_process);
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao edited a comment on pull request #9376: WIP: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao edited a comment on pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#issuecomment-770335798


   @alamb , when you wrote the `date_trunc`, did you had in mind having different granularities per row, or was it driven by the fact the the builtin functions only accepted arrays?
   
   If the latter, do you think it would be ok to require granularity to always be a scalar as a regression / feature of this PR?
   
   I am asking because `date_trunc(array | scalar, array | scalar)` is a "bit" more effort than `date_trunc(scalar, array | scalar)`.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] maxburke commented on pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
maxburke commented on pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#issuecomment-774526122


   > @andygrove / @Dandandan / @seddonm1 / @maxburke / @jhorstmann -- what are your thoughts on this one?
   > 
   > It is a significant enough change I think someone using DataFusion in their projects, other than myself should weigh in. I like it a lot -- and I think it could serve as the basis for constant folding (e.g. transform `A + (5 + 7)` --> `A + 12` at plan time)
   
   On the surface this looks like it'll be a fantastic change to have; I am curious to see what the measured impact on query times will be.
   
   (cc @velvia / @mcassels)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#issuecomment-771427446


   Ok, I have addressed the comments.
   
   The API change to UDFs is that people need to call `make_scalar_function` on their existing UDFs, as seen in diff of the example.
   
   Out of curiosity, did anyone run the benchmarks? I do not have a machine suitable for that, but I am obviously curious :)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] velvia commented on pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
velvia commented on pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#issuecomment-778693797


   Commenting since @maxburke pinged me.
   
   On the surface I think this is a great change from the performance perspective.  I totally agree that being able to deal with scalars instead of just arrays adds huge room for optimization.   I have always thought that always needing intermediate arrays slowed down the processing of DataFusion significantly, for certain cases, and is also cache unfriendly.
   
   On the other hand, I hear what @alamb and others are saying that it adds complexity to what is already nontrivial.   I agree with that.
   
   I wonder if it is possible to get the best of both worlds, by extending the `Array` trait slightly and having a subclass of `Array` which denotes scalars, like `ScalarArray` which do not need `Buffer` storage and just represents constant scalars.   This way, functions would only need to deal with Array, but can recognize this subclass `ScalarArray` and do optimizations that way.   This is a very half-formed thought at the moment.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#issuecomment-774459834


   I think changes here look good, and will be helpful to implement functions which typically operate on `(scalar, array)`. And this will help to make those cases (quite a bit) more efficient. I think that can be particularly helpful when having complex expressions in `WHERE` clauses / projections.
   
   @alamb I am not sure I understand why folding should be done on the physical plan level or really depending on this PR , it should be possible without changes in this PR (just by having some `Expr -> Expr`  rules)? They could share the same evaluation code though?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#issuecomment-778382279


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9376?src=pr&el=h1) Report
   > Merging [#9376](https://codecov.io/gh/apache/arrow/pull/9376?src=pr&el=desc) (27b01cf) into [master](https://codecov.io/gh/apache/arrow/commit/88e9eb852755072b948393d99d08211c3e01ce38?el=desc) (88e9eb8) will **decrease** coverage by `0.16%`.
   > The diff coverage is `64.97%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9376/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9376?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9376      +/-   ##
   ==========================================
   - Coverage   82.27%   82.11%   -0.17%     
   ==========================================
     Files         234      235       +1     
     Lines       54594    54714     +120     
   ==========================================
   + Hits        44919    44929      +10     
   - Misses       9675     9785     +110     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9376?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/datafusion/src/execution/dataframe\_impl.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9leGVjdXRpb24vZGF0YWZyYW1lX2ltcGwucnM=) | `93.83% <ø> (ø)` | |
   | [...datafusion/src/physical\_plan/expressions/binary.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2V4cHJlc3Npb25zL2JpbmFyeS5ycw==) | `85.53% <ø> (-0.32%)` | :arrow_down: |
   | [...st/datafusion/src/physical\_plan/expressions/mod.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2V4cHJlc3Npb25zL21vZC5ycw==) | `71.42% <ø> (-18.58%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/mod.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL21vZC5ycw==) | `86.00% <ø> (ø)` | |
   | [rust/datafusion/src/scalar.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9zY2FsYXIucnM=) | `51.63% <19.14%> (-5.11%)` | :arrow_down: |
   | [...datafusion/src/physical\_plan/crypto\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2NyeXB0b19leHByZXNzaW9ucy5ycw==) | `52.45% <43.13%> (-22.55%)` | :arrow_down: |
   | [...tafusion/src/physical\_plan/datetime\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2RhdGV0aW1lX2V4cHJlc3Npb25zLnJz) | `69.17% <64.60%> (-24.12%)` | :arrow_down: |
   | [...datafusion/src/physical\_plan/string\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3N0cmluZ19leHByZXNzaW9ucy5ycw==) | `67.12% <68.57%> (-19.84%)` | :arrow_down: |
   | [.../datafusion/src/physical\_plan/array\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2FycmF5X2V4cHJlc3Npb25zLnJz) | `43.33% <77.77%> (+11.51%)` | :arrow_up: |
   | [...datafusion/src/physical\_plan/expressions/nullif.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2V4cHJlc3Npb25zL251bGxpZi5ycw==) | `86.88% <86.88%> (ø)` | |
   | ... and [8 more](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9376?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9376?src=pr&el=footer). Last update [88e9eb8...27b01cf](https://codecov.io/gh/apache/arrow/pull/9376?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #9376: WIP: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#issuecomment-770268175


   <!--
     Licensed to the Apache Software Foundation (ASF) under one
     or more contributor license agreements.  See the NOTICE file
     distributed with this work for additional information
     regarding copyright ownership.  The ASF licenses this file
     to you under the Apache License, Version 2.0 (the
     "License"); you may not use this file except in compliance
     with the License.  You may obtain a copy of the License at
   
       http://www.apache.org/licenses/LICENSE-2.0
   
     Unless required by applicable law or agreed to in writing,
     software distributed under the License is distributed on an
     "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
     KIND, either express or implied.  See the License for the
     specific language governing permissions and limitations
     under the License.
   -->
   
   Thanks for opening a pull request!
   
   Could you open an issue for this pull request on JIRA?
   https://issues.apache.org/jira/browse/ARROW
   
   Then could you also rename pull request title in the following format?
   
       ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
     * [Other pull requests](https://github.com/apache/arrow/pulls/)
     * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#discussion_r568348305



##########
File path: rust/datafusion/src/physical_plan/datetime_expressions.rs
##########
@@ -167,152 +175,166 @@ fn naive_datetime_to_timestamp(s: &str, datetime: NaiveDateTime) -> Result<i64>
     }
 }
 
-/// convert an array of strings into `Timestamp(Nanosecond, None)`
-pub fn to_timestamp(args: &[ArrayRef]) -> Result<TimestampNanosecondArray> {
-    let num_rows = args[0].len();
-    let string_args =
-        &args[0]
-            .as_any()
-            .downcast_ref::<StringArray>()
-            .ok_or_else(|| {
-                DataFusionError::Internal(
-                    "could not cast to_timestamp input to StringArray".to_string(),
-                )
-            })?;
-
-    let result = (0..num_rows)
-        .map(|i| {
-            if string_args.is_null(i) {
-                // NB: Since we use the same null bitset as the input,
-                // the output for this value will be ignored, but we
-                // need some value in the array we are building.
-                Ok(0)
-            } else {
-                string_to_timestamp_nanos(string_args.value(i))
+pub(crate) fn unary_string_to_primitive_function<'a, T, O, F>(
+    args: &[&'a dyn Array],
+    op: F,
+    name: &str,
+) -> Result<PrimitiveArray<O>>
+where
+    O: ArrowPrimitiveType,
+    T: StringOffsetSizeTrait,
+    F: Fn(&'a str) -> Result<O::Native>,
+{
+    if args.len() != 1 {
+        return Err(DataFusionError::Internal(format!(
+            "{:?} args were supplied but {} takes exactly one argument",
+            args.len(),
+            name,
+        )));
+    }
+
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    // first map is the iterator, second is for the `Option<_>`
+    array.iter().map(|x| x.map(|x| op(x)).transpose()).collect()
+}
+
+fn handle<'a, O, F, S>(

Review comment:
       The pattern depends on the input and output of the function. I.e. when input is `&str`, then `Utf8/LargeUtf8`. When output is a `Native`, then the output is `PrimitiveArray<O::Native>`. In general this construct depends on the what the user is trying to achieve (wrt to input and output types).
   
   I placed this here because it allows to decouple the pattern (of handling Scalar and Array) from the implementation of the logic (`string_to_timestamp_nanos` in this case).
   
   In `crypto_expressions` we have a similar pattern, but in this case the function is `&str -> AsRef<[u8]>`, which allowed to write all cripto `sha` in a succinct manner. However, in that case, the output type is always `Binary` instead of `LargeBinary` for `LargeStringArray`, because the hashes are always smaller than `i32::MAX`. All of this was already written, I just expanded it for the two variants (scalar and vector).
   
   Note that `crypto_expressions::handle` and `crypto_expressions::md5` are very similar, but their return types are different: `handle` receives a `GenericStringArray`, but returns a `BinaryArray`. This is because `MD5`'s signature is `string -> string`, while `sha` is `string -> binary`.
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#issuecomment-770408614


   This is ready for a first take.
   
   Some notes:
   
   * This changes the UDF's API: they should now handle `ColumnarValue`, not `ArrayRef`.
   * whenever possible, I used generics instead of macros. I have been reasoning about code more easily with generics because they set the traits explicitly and are therefore IMO easier to use.
   * I had to add a new trait to bridge `ScalarValue` and `ArrowPrimitiveType`
   * I had to implement the `TryFrom` for the timestampNanoseconds.
   * I have not added test coverage to the scalar cases of our existing functions. Longer term, I think we should develop a generic to handle this without having to test the two cases. I tried to do something like this, but I still had to copy-paste some code between generics, unfortunately.
   
   Sorry for the long PR, but I did not find an easy way to split it up (besides #9378, which this PR is built on top of).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#issuecomment-771887859


   @jorgecarleitao what kind of benchmarks are you interested in? AFAIK, most benchmarks are not very depending on this, I expect it being mostly impactfully in cases where the projection itself is a lot of the time, but most benchmarks are spending most of the time on joins/aggregates.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #9376: WIP: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#issuecomment-770335798


   @alamb , when you wrote the `date_trunc`, did you had in mind having different granularities per array, or was it driven by the fact the the builtin functions only accepted arrays?
   
   If the latter, do you think it would be ok to require granularity to always be a scalar as a regression / feature of this PR?
   
   I am asking because `date_trunc(array | scalar, array | scalar)` is a "bit" more effort than `date_trunc(scalar, array | scalar)`.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#issuecomment-771381944


   Hi @alamb , thanks a lot for your review and comments. I generally agree that this makes it more difficult to write a UDF and a function implementation in general.
   
   I like that idea. I have now pushed a commit with it. I will now address the remaining ones.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#discussion_r571419354



##########
File path: rust/datafusion/src/physical_plan/crypto_expressions.rs
##########
@@ -49,58 +58,142 @@ fn sha_process<D: SHA2Digest + Default>(input: &str) -> SHA2DigestOutput<D> {
     digest.finalize()
 }
 
-macro_rules! crypto_unary_string_function {
-    ($NAME:ident, $FUNC:expr) => {
-        /// crypto function that accepts Utf8 or LargeUtf8 and returns Utf8 string
-        pub fn $NAME<T: StringOffsetSizeTrait>(
-            args: &[ArrayRef],
-        ) -> Result<GenericStringArray<i32>> {
-            if args.len() != 1 {
-                return Err(DataFusionError::Internal(format!(
-                    "{:?} args were supplied but {} takes exactly one argument",
-                    args.len(),
-                    String::from(stringify!($NAME)),
-                )));
-            }
+/// # Errors
+/// This function errors when:
+/// * the number of arguments is not 1
+/// * the first argument is not castable to a `GenericStringArray`
+fn unary_binary_function<T, R, F>(
+    args: &[&dyn Array],
+    op: F,
+    name: &str,
+) -> Result<BinaryArray>
+where
+    R: AsRef<[u8]>,
+    T: StringOffsetSizeTrait,
+    F: Fn(&str) -> R,
+{
+    if args.len() != 1 {
+        return Err(DataFusionError::Internal(format!(
+            "{:?} args were supplied but {} takes exactly one argument",
+            args.len(),
+            name,
+        )));
+    }
+
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .ok_or_else(|| {
+            DataFusionError::Internal("failed to downcast to string".to_string())
+        })?;
 
-            let array = args[0]
-                .as_any()
-                .downcast_ref::<GenericStringArray<T>>()
-                .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array.iter().map(|x| x.map(|x| op(x))).collect())
+}
+
+fn handle<F, R>(args: &[ColumnarValue], op: F, name: &str) -> Result<ColumnarValue>
+where
+    R: AsRef<[u8]>,
+    F: Fn(&str) -> R,
+{
+    match &args[0] {
+        ColumnarValue::Array(a) => match a.data_type() {
+            DataType::Utf8 => {
+                Ok(ColumnarValue::Array(Arc::new(unary_binary_function::<
+                    i32,
+                    _,
+                    _,
+                >(
+                    &[a.as_ref()], op, name
+                )?)))
+            }
+            DataType::LargeUtf8 => {
+                Ok(ColumnarValue::Array(Arc::new(unary_binary_function::<
+                    i64,
+                    _,
+                    _,
+                >(
+                    &[a.as_ref()], op, name
+                )?)))
+            }
+            other => Err(DataFusionError::Internal(format!(
+                "Unsupported data type {:?} for function {}",
+                other, name,
+            ))),
+        },
+        ColumnarValue::Scalar(scalar) => match scalar {
+            ScalarValue::Utf8(a) => {
+                let result = a.as_ref().map(|x| (op)(x).as_ref().to_vec());
+                Ok(ColumnarValue::Scalar(ScalarValue::Binary(result)))
+            }
+            ScalarValue::LargeUtf8(a) => {
+                let result = a.as_ref().map(|x| (op)(x).as_ref().to_vec());
+                Ok(ColumnarValue::Scalar(ScalarValue::Binary(result)))
+            }
+            other => Err(DataFusionError::Internal(format!(
+                "Unsupported data type {:?} for function {}",
+                other, name,
+            ))),
+        },
+    }
+}
 
-            // first map is the iterator, second is for the `Option<_>`
-            Ok(array.iter().map(|x| x.map(|x| $FUNC(x))).collect())
-        }
-    };
+fn md5_array<T: StringOffsetSizeTrait>(
+    args: &[&dyn Array],
+) -> Result<GenericStringArray<i32>> {
+    unary_string_function::<T, i32, _, _>(args, md5_process, "md5")
 }
 
-macro_rules! crypto_unary_binary_function {
-    ($NAME:ident, $FUNC:expr) => {
-        /// crypto function that accepts Utf8 or LargeUtf8 and returns Binary
-        pub fn $NAME<T: StringOffsetSizeTrait>(
-            args: &[ArrayRef],
-        ) -> Result<GenericBinaryArray<i32>> {
-            if args.len() != 1 {
-                return Err(DataFusionError::Internal(format!(
-                    "{:?} args were supplied but {} takes exactly one argument",
-                    args.len(),
-                    String::from(stringify!($NAME)),
-                )));
+/// crypto function that accepts Utf8 or LargeUtf8 and returns a [`ColumnarValue`]
+pub fn md5(args: &[ColumnarValue]) -> Result<ColumnarValue> {
+    match &args[0] {
+        ColumnarValue::Array(a) => match a.data_type() {
+            DataType::Utf8 => Ok(ColumnarValue::Array(Arc::new(md5_array::<i32>(&[
+                a.as_ref()
+            ])?))),
+            DataType::LargeUtf8 => {
+                Ok(ColumnarValue::Array(Arc::new(md5_array::<i64>(&[
+                    a.as_ref()
+                ])?)))
             }
+            other => Err(DataFusionError::Internal(format!(
+                "Unsupported data type {:?} for function md5",
+                other,
+            ))),
+        },
+        ColumnarValue::Scalar(scalar) => match scalar {
+            ScalarValue::Utf8(a) => {
+                let result = a.as_ref().map(|x| md5_process(x));

Review comment:
       ```suggestion
                   let result = a.as_ref().map(md5_process);
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#discussion_r568343902



##########
File path: rust/datafusion/examples/simple_udf.rs
##########
@@ -54,50 +58,76 @@ fn create_context() -> Result<ExecutionContext> {
     Ok(ctx)
 }
 
+// a small utility function to compute pow(base, exponent)
+fn maybe_pow(base: &Option<f64>, exponent: &Option<f64>) -> Option<f64> {
+    match (base, exponent) {
+        // in arrow, any value can be null.
+        // Here we decide to make our UDF to return null when either base or exponent is null.
+        (Some(base), Some(exponent)) => Some(base.powf(*exponent)),
+        _ => None,
+    }
+}
+
+fn pow_array(base: &dyn Array, exponent: &dyn Array) -> Result<ArrayRef> {
+    // 1. cast both arguments to f64. These casts MUST be aligned with the signature or this function panics!
+    let base = base
+        .as_any()
+        .downcast_ref::<Float64Array>()
+        .expect("cast failed");
+    let exponent = exponent
+        .as_any()
+        .downcast_ref::<Float64Array>()
+        .expect("cast failed");
+
+    // this is guaranteed by DataFusion. We place it just to make it obvious.
+    assert_eq!(exponent.len(), base.len());
+
+    // 2. perform the computation
+    let array = base
+        .iter()
+        .zip(exponent.iter())
+        .map(|(base, exponent)| maybe_pow(&base, &exponent))
+        .collect::<Float64Array>();
+
+    // `Ok` because no error occurred during the calculation (we should add one if exponent was [0, 1[ and the base < 0 because that panics!)

Review comment:
       This represents an open interval: `0 <= exponent < 1` because `(x)^1 = x` for all x (including negative ones).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#issuecomment-771186423


   > Having been trying to implement a lot of the Postgres logic recently they all logically support the use of arrays for each of the parameters (i.e. date_trunc(array, array)) I have no idea how common the use of this functionality really is - and I suspect it is low - but if we are building the API for Postgres compatibility it would be good to solve.
   
   This is a good point @seddonm1 - I am just starting to look at this pR, but I wonder if we could do something like have a default implementation for `func(scalar, scalar)`, `func(scalar, array)`, and `func(array, scalar)` that are implemented in terms of the `func(array, array)`
   
   that way we would only have to supply `func(array, array)` and the scalar input case could be handled on a case by case basis. I'll try and commend inline if I see how that might be done


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#discussion_r571419291



##########
File path: rust/datafusion/src/physical_plan/crypto_expressions.rs
##########
@@ -49,58 +58,142 @@ fn sha_process<D: SHA2Digest + Default>(input: &str) -> SHA2DigestOutput<D> {
     digest.finalize()
 }
 
-macro_rules! crypto_unary_string_function {
-    ($NAME:ident, $FUNC:expr) => {
-        /// crypto function that accepts Utf8 or LargeUtf8 and returns Utf8 string
-        pub fn $NAME<T: StringOffsetSizeTrait>(
-            args: &[ArrayRef],
-        ) -> Result<GenericStringArray<i32>> {
-            if args.len() != 1 {
-                return Err(DataFusionError::Internal(format!(
-                    "{:?} args were supplied but {} takes exactly one argument",
-                    args.len(),
-                    String::from(stringify!($NAME)),
-                )));
-            }
+/// # Errors
+/// This function errors when:
+/// * the number of arguments is not 1
+/// * the first argument is not castable to a `GenericStringArray`
+fn unary_binary_function<T, R, F>(
+    args: &[&dyn Array],
+    op: F,
+    name: &str,
+) -> Result<BinaryArray>
+where
+    R: AsRef<[u8]>,
+    T: StringOffsetSizeTrait,
+    F: Fn(&str) -> R,
+{
+    if args.len() != 1 {
+        return Err(DataFusionError::Internal(format!(
+            "{:?} args were supplied but {} takes exactly one argument",
+            args.len(),
+            name,
+        )));
+    }
+
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .ok_or_else(|| {
+            DataFusionError::Internal("failed to downcast to string".to_string())
+        })?;
 
-            let array = args[0]
-                .as_any()
-                .downcast_ref::<GenericStringArray<T>>()
-                .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array.iter().map(|x| x.map(|x| op(x))).collect())
+}
+
+fn handle<F, R>(args: &[ColumnarValue], op: F, name: &str) -> Result<ColumnarValue>
+where
+    R: AsRef<[u8]>,
+    F: Fn(&str) -> R,
+{
+    match &args[0] {
+        ColumnarValue::Array(a) => match a.data_type() {
+            DataType::Utf8 => {
+                Ok(ColumnarValue::Array(Arc::new(unary_binary_function::<
+                    i32,
+                    _,
+                    _,
+                >(
+                    &[a.as_ref()], op, name
+                )?)))
+            }
+            DataType::LargeUtf8 => {
+                Ok(ColumnarValue::Array(Arc::new(unary_binary_function::<
+                    i64,
+                    _,
+                    _,
+                >(
+                    &[a.as_ref()], op, name
+                )?)))
+            }
+            other => Err(DataFusionError::Internal(format!(
+                "Unsupported data type {:?} for function {}",
+                other, name,
+            ))),
+        },
+        ColumnarValue::Scalar(scalar) => match scalar {
+            ScalarValue::Utf8(a) => {
+                let result = a.as_ref().map(|x| (op)(x).as_ref().to_vec());
+                Ok(ColumnarValue::Scalar(ScalarValue::Binary(result)))
+            }
+            ScalarValue::LargeUtf8(a) => {
+                let result = a.as_ref().map(|x| (op)(x).as_ref().to_vec());

Review comment:
       ```suggestion
                   let result = a.as_ref().map(|x| op(x).as_ref().to_vec());
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#discussion_r571419011



##########
File path: rust/datafusion/src/physical_plan/functions.rs
##########
@@ -622,21 +560,66 @@ impl PhysicalExpr for ScalarFunctionExpr {
         let inputs = self
             .args
             .iter()
-            .map(|e| e.evaluate(batch).map(|v| v.into_array(batch.num_rows())))
+            .map(|e| e.evaluate(batch))
             .collect::<Result<Vec<_>>>()?;
 
         // evaluate the function
         let fun = self.fun.as_ref();
-        (fun)(&inputs).map(|a| ColumnarValue::Array(a))
+        (fun)(&inputs)
     }
 }
 
+/// decorates a function to handle [`ScalarValue`]s by coverting them to arrays before calling the function
+/// and vice-versa after evaluation.
+pub fn make_scalar_function<F>(inner: F) -> ScalarFunctionImplementation
+where
+    F: Fn(&[ArrayRef]) -> Result<ArrayRef> + Sync + Send + 'static,
+{
+    Arc::new(move |args: &[ColumnarValue]| {
+        // first, identify if any of the arguments is an Array. If yes, store its `len`,
+        // as any scalar will need to be converted to an array of len `len`.
+        let len = args
+            .iter()
+            .fold(Option::<usize>::None, |acc, arg| match arg {
+                ColumnarValue::Scalar(_) => acc,
+                ColumnarValue::Array(a) => Some(a.len()),
+            });
+
+        // to array
+        let args = if let Some(len) = len {
+            args.iter()
+                .map(|arg| arg.clone().into_array(len))
+                .collect::<Vec<ArrayRef>>()
+        } else {
+            args.iter()
+                .map(|arg| arg.clone().into_array(1))
+                .collect::<Vec<ArrayRef>>()
+        };
+
+        let result = (inner)(&args);

Review comment:
       ```suggestion
           let result = inner(&args);
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on pull request #9376: WIP: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#issuecomment-770292349


   @jorgecarleitao yes this looks good 👍 
   
   This is going to take some time to apply but I like the idea of passing a function to apply to the string (for example). It would be good to get this merged soon so I can apply to the big Postgres functions PR.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#issuecomment-778382279


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/9376?src=pr&el=h1) Report
   > Merging [#9376](https://codecov.io/gh/apache/arrow/pull/9376?src=pr&el=desc) (ca5bf89) into [master](https://codecov.io/gh/apache/arrow/commit/88e9eb852755072b948393d99d08211c3e01ce38?el=desc) (88e9eb8) will **decrease** coverage by `0.16%`.
   > The diff coverage is `64.89%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/9376/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/9376?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #9376      +/-   ##
   ==========================================
   - Coverage   82.27%   82.11%   -0.17%     
   ==========================================
     Files         234      235       +1     
     Lines       54594    54715     +121     
   ==========================================
   + Hits        44919    44931      +12     
   - Misses       9675     9784     +109     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/9376?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/datafusion/src/execution/dataframe\_impl.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9leGVjdXRpb24vZGF0YWZyYW1lX2ltcGwucnM=) | `93.83% <ø> (ø)` | |
   | [...datafusion/src/physical\_plan/expressions/binary.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2V4cHJlc3Npb25zL2JpbmFyeS5ycw==) | `85.53% <ø> (-0.32%)` | :arrow_down: |
   | [...st/datafusion/src/physical\_plan/expressions/mod.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2V4cHJlc3Npb25zL21vZC5ycw==) | `71.42% <ø> (-18.58%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/mod.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL21vZC5ycw==) | `86.00% <ø> (ø)` | |
   | [rust/datafusion/src/scalar.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9zY2FsYXIucnM=) | `51.63% <19.14%> (-5.11%)` | :arrow_down: |
   | [...datafusion/src/physical\_plan/crypto\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2NyeXB0b19leHByZXNzaW9ucy5ycw==) | `52.45% <43.13%> (-22.55%)` | :arrow_down: |
   | [...tafusion/src/physical\_plan/datetime\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2RhdGV0aW1lX2V4cHJlc3Npb25zLnJz) | `69.40% <64.28%> (-23.89%)` | :arrow_down: |
   | [...datafusion/src/physical\_plan/string\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3N0cmluZ19leHByZXNzaW9ucy5ycw==) | `67.12% <68.57%> (-19.84%)` | :arrow_down: |
   | [.../datafusion/src/physical\_plan/array\_expressions.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2FycmF5X2V4cHJlc3Npb25zLnJz) | `43.33% <77.77%> (+11.51%)` | :arrow_up: |
   | [...datafusion/src/physical\_plan/expressions/nullif.rs](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL2V4cHJlc3Npb25zL251bGxpZi5ycw==) | `86.88% <86.88%> (ø)` | |
   | ... and [9 more](https://codecov.io/gh/apache/arrow/pull/9376/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/9376?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/9376?src=pr&el=footer). Last update [88e9eb8...ca5bf89](https://codecov.io/gh/apache/arrow/pull/9376?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#issuecomment-778716899


   > I wonder if it is possible to get the best of both worlds, by extending the `Array` trait slightly and having a subclass of `Array` which denotes scalars, like `ScalarArray` which do not need `Buffer` storage and just represents constant scalars. This way, functions would only need to deal with Array, but can recognize this subclass `ScalarArray` and do optimizations that way. This is a very half-formed thought at the moment. The train of thought is just what if the `Array` was not strictly a buffer-based representation but just a way to access columnar data, and in certain cases represents scalars.
   
   I also have many instances where knowing the `ScalarArray` vs `Array` would provide huge performance opportunities in the big PR implement Postgres functions : https://github.com/apache/arrow/pull/9243 - especially for things like Regex matching. It would be relatively trivial to implement if it could be pattern matched.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#issuecomment-778761128


   > Note that this PR addresses @alamb 's concern by introducing a adapter that people can use if they do not want to bother implementing the scalar variants.
   
   Yes, the adapter function I think lowers the barrier to implementing initial scalar functions and should be an easier upgrade path.
   
   TLDR it is that one can call `make_scalar_function(your_old_func_implementation)` and not worry about the updates.
   
   I think this PR is ready, is a great step forward. Let's merge it and continue to iterate.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#discussion_r568348305



##########
File path: rust/datafusion/src/physical_plan/datetime_expressions.rs
##########
@@ -167,152 +175,166 @@ fn naive_datetime_to_timestamp(s: &str, datetime: NaiveDateTime) -> Result<i64>
     }
 }
 
-/// convert an array of strings into `Timestamp(Nanosecond, None)`
-pub fn to_timestamp(args: &[ArrayRef]) -> Result<TimestampNanosecondArray> {
-    let num_rows = args[0].len();
-    let string_args =
-        &args[0]
-            .as_any()
-            .downcast_ref::<StringArray>()
-            .ok_or_else(|| {
-                DataFusionError::Internal(
-                    "could not cast to_timestamp input to StringArray".to_string(),
-                )
-            })?;
-
-    let result = (0..num_rows)
-        .map(|i| {
-            if string_args.is_null(i) {
-                // NB: Since we use the same null bitset as the input,
-                // the output for this value will be ignored, but we
-                // need some value in the array we are building.
-                Ok(0)
-            } else {
-                string_to_timestamp_nanos(string_args.value(i))
+pub(crate) fn unary_string_to_primitive_function<'a, T, O, F>(
+    args: &[&'a dyn Array],
+    op: F,
+    name: &str,
+) -> Result<PrimitiveArray<O>>
+where
+    O: ArrowPrimitiveType,
+    T: StringOffsetSizeTrait,
+    F: Fn(&'a str) -> Result<O::Native>,
+{
+    if args.len() != 1 {
+        return Err(DataFusionError::Internal(format!(
+            "{:?} args were supplied but {} takes exactly one argument",
+            args.len(),
+            name,
+        )));
+    }
+
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    // first map is the iterator, second is for the `Option<_>`
+    array.iter().map(|x| x.map(|x| op(x)).transpose()).collect()
+}
+
+fn handle<'a, O, F, S>(

Review comment:
       The pattern depends on the input and output of the function. I.e. when input is `&str`, then `Utf8/LargeUtf8`. When output is a `Native`, then the output is `PrimitiveArray<O::Native>`. In general this construct depends on the what the user is trying to achieve (wrt to input and output types).
   
   I placed this here because it allows to decouple the pattern (of handling Scalar and Array) from the implementation of the logic (`string_to_timestamp_nanos` in this case).
   
   In `crypto_expressions` we have a similar pattern, but in this case the function is `&str -> AsRef<[u8]>`, which allowed to write all cripto `sha` in a succinct manner. However, in that case, the output type is always `Binary` instead of `LargeBinary` for `LargeStringArray`, because the hashes are always smaller than `i32::MAX`. All of this was already written, I just expanded it for the two variants (scalar and vector).
   
   Note that `crypto_expressions::handle` and `crypto_expressions::md5` are very similar, but their return types are different: `unary_binary_function` receives a `GenericStringArray`, but returns a `BinaryArray`. This is because `MD5`'s signature is `string -> string`, while `sha` is `string -> binary`.
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#issuecomment-770392610


   https://issues.apache.org/jira/browse/ARROW-11446


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on a change in pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#discussion_r571419564



##########
File path: rust/datafusion/src/physical_plan/crypto_expressions.rs
##########
@@ -49,58 +58,142 @@ fn sha_process<D: SHA2Digest + Default>(input: &str) -> SHA2DigestOutput<D> {
     digest.finalize()
 }
 
-macro_rules! crypto_unary_string_function {
-    ($NAME:ident, $FUNC:expr) => {
-        /// crypto function that accepts Utf8 or LargeUtf8 and returns Utf8 string
-        pub fn $NAME<T: StringOffsetSizeTrait>(
-            args: &[ArrayRef],
-        ) -> Result<GenericStringArray<i32>> {
-            if args.len() != 1 {
-                return Err(DataFusionError::Internal(format!(
-                    "{:?} args were supplied but {} takes exactly one argument",
-                    args.len(),
-                    String::from(stringify!($NAME)),
-                )));
-            }
+/// # Errors
+/// This function errors when:
+/// * the number of arguments is not 1
+/// * the first argument is not castable to a `GenericStringArray`
+fn unary_binary_function<T, R, F>(
+    args: &[&dyn Array],
+    op: F,
+    name: &str,
+) -> Result<BinaryArray>
+where
+    R: AsRef<[u8]>,
+    T: StringOffsetSizeTrait,
+    F: Fn(&str) -> R,
+{
+    if args.len() != 1 {
+        return Err(DataFusionError::Internal(format!(
+            "{:?} args were supplied but {} takes exactly one argument",
+            args.len(),
+            name,
+        )));
+    }
+
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .ok_or_else(|| {
+            DataFusionError::Internal("failed to downcast to string".to_string())
+        })?;
 
-            let array = args[0]
-                .as_any()
-                .downcast_ref::<GenericStringArray<T>>()
-                .unwrap();
+    // first map is the iterator, second is for the `Option<_>`
+    Ok(array.iter().map(|x| x.map(|x| op(x))).collect())

Review comment:
       ```suggestion
       Ok(array.iter().map(|x| x.map(op)).collect())
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#issuecomment-778137269


   Sounds good -- @jorgecarleitao  let's get it merged ! It looks like it needs another rebase and then we'll get it in


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#issuecomment-774460448


   > @alamb I am not sure I understand why folding should be done on the physical plan level or really depending on this PR , it should be possible without changes in this PR (just by having some Expr -> Expr rules)? They could share the same evaluation code though?
   
   @Dandandan  -- yes, I think constant folding looks like `Expr` --> `Expr`. But as you hint at, if the physical evaluation of functions is entirely separate from the plan time evaluation, we will end up with two parallel implementations for evaluation that need to be kept in sync (one for `Expr` and one for the physical runtime) -- using the same implementation for both I think avoids a lot of potential inconsistency (and source of bugs)
   
   We'll see if that turns out to be possible, but I think it should be


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#issuecomment-771381944






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] seddonm1 commented on pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
seddonm1 commented on pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#issuecomment-771190985


   @alamb here is a proof-of-concept for the regexp_replace Postgres function which has been built to support the possibility of passing in different function parameters for each row. If it were possible to tell whether the value was a Scalar or Array there would be major optimisation opportunities. I did some basic memoization of the Regex objects but that would not be as necessary if we knew Scalar v Array.
   
   https://github.com/apache/arrow/pull/9243/files#diff-abe8768fe7124198cca7a84ad7b2c678b3cc8e5de3d1bc867d498536a2fdddc7R542


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#discussion_r568532324



##########
File path: rust/datafusion/src/physical_plan/datetime_expressions.rs
##########
@@ -167,152 +175,166 @@ fn naive_datetime_to_timestamp(s: &str, datetime: NaiveDateTime) -> Result<i64>
     }
 }
 
-/// convert an array of strings into `Timestamp(Nanosecond, None)`
-pub fn to_timestamp(args: &[ArrayRef]) -> Result<TimestampNanosecondArray> {
-    let num_rows = args[0].len();
-    let string_args =
-        &args[0]
-            .as_any()
-            .downcast_ref::<StringArray>()
-            .ok_or_else(|| {
-                DataFusionError::Internal(
-                    "could not cast to_timestamp input to StringArray".to_string(),
-                )
-            })?;
-
-    let result = (0..num_rows)
-        .map(|i| {
-            if string_args.is_null(i) {
-                // NB: Since we use the same null bitset as the input,
-                // the output for this value will be ignored, but we
-                // need some value in the array we are building.
-                Ok(0)
-            } else {
-                string_to_timestamp_nanos(string_args.value(i))
+pub(crate) fn unary_string_to_primitive_function<'a, T, O, F>(
+    args: &[&'a dyn Array],
+    op: F,
+    name: &str,
+) -> Result<PrimitiveArray<O>>
+where
+    O: ArrowPrimitiveType,
+    T: StringOffsetSizeTrait,
+    F: Fn(&'a str) -> Result<O::Native>,
+{
+    if args.len() != 1 {
+        return Err(DataFusionError::Internal(format!(
+            "{:?} args were supplied but {} takes exactly one argument",
+            args.len(),
+            name,
+        )));
+    }
+
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    // first map is the iterator, second is for the `Option<_>`
+    array.iter().map(|x| x.map(|x| op(x)).transpose()).collect()
+}
+
+fn handle<'a, O, F, S>(

Review comment:
       makes sense -- thank you for the explination

##########
File path: rust/datafusion/src/physical_plan/datetime_expressions.rs
##########
@@ -167,152 +175,166 @@ fn naive_datetime_to_timestamp(s: &str, datetime: NaiveDateTime) -> Result<i64>
     }
 }
 
-/// convert an array of strings into `Timestamp(Nanosecond, None)`
-pub fn to_timestamp(args: &[ArrayRef]) -> Result<TimestampNanosecondArray> {
-    let num_rows = args[0].len();
-    let string_args =
-        &args[0]
-            .as_any()
-            .downcast_ref::<StringArray>()
-            .ok_or_else(|| {
-                DataFusionError::Internal(
-                    "could not cast to_timestamp input to StringArray".to_string(),
-                )
-            })?;
-
-    let result = (0..num_rows)
-        .map(|i| {
-            if string_args.is_null(i) {
-                // NB: Since we use the same null bitset as the input,
-                // the output for this value will be ignored, but we
-                // need some value in the array we are building.
-                Ok(0)
-            } else {
-                string_to_timestamp_nanos(string_args.value(i))
+pub(crate) fn unary_string_to_primitive_function<'a, T, O, F>(
+    args: &[&'a dyn Array],
+    op: F,
+    name: &str,
+) -> Result<PrimitiveArray<O>>
+where
+    O: ArrowPrimitiveType,
+    T: StringOffsetSizeTrait,
+    F: Fn(&'a str) -> Result<O::Native>,
+{
+    if args.len() != 1 {
+        return Err(DataFusionError::Internal(format!(
+            "{:?} args were supplied but {} takes exactly one argument",
+            args.len(),
+            name,
+        )));
+    }
+
+    let array = args[0]
+        .as_any()
+        .downcast_ref::<GenericStringArray<T>>()
+        .unwrap();
+
+    // first map is the iterator, second is for the `Option<_>`
+    array.iter().map(|x| x.map(|x| op(x)).transpose()).collect()
+}
+
+fn handle<'a, O, F, S>(

Review comment:
       makes sense -- thank you for the explanation

##########
File path: rust/datafusion/examples/simple_udf.rs
##########
@@ -58,77 +54,54 @@ fn create_context() -> Result<ExecutionContext> {
     Ok(ctx)
 }
 
-// a small utility function to compute pow(base, exponent)
-fn maybe_pow(base: &Option<f64>, exponent: &Option<f64>) -> Option<f64> {
-    match (base, exponent) {
-        // in arrow, any value can be null.
-        // Here we decide to make our UDF to return null when either base or exponent is null.
-        (Some(base), Some(exponent)) => Some(base.powf(*exponent)),
-        _ => None,
-    }
-}
-
-fn pow_array(base: &dyn Array, exponent: &dyn Array) -> Result<ArrayRef> {
-    // 1. cast both arguments to f64. These casts MUST be aligned with the signature or this function panics!
-    let base = base
-        .as_any()
-        .downcast_ref::<Float64Array>()
-        .expect("cast failed");
-    let exponent = exponent
-        .as_any()
-        .downcast_ref::<Float64Array>()
-        .expect("cast failed");
-
-    // this is guaranteed by DataFusion. We place it just to make it obvious.
-    assert_eq!(exponent.len(), base.len());
-
-    // 2. perform the computation
-    let array = base
-        .iter()
-        .zip(exponent.iter())
-        .map(|(base, exponent)| maybe_pow(&base, &exponent))
-        .collect::<Float64Array>();
-
-    // `Ok` because no error occurred during the calculation (we should add one if exponent was [0, 1[ and the base < 0 because that panics!)
-    // `Arc` because arrays are immutable, thread-safe, trait objects.
-    Ok(Arc::new(array))
-}
-
 /// In this example we will declare a single-type, single return type UDF that exponentiates f64, a^b
 #[tokio::main]
 async fn main() -> Result<()> {
     let mut ctx = create_context()?;
 
     // First, declare the actual implementation of the calculation
-    let pow: ScalarFunctionImplementation = Arc::new(|args: &[ColumnarValue]| {
-        // in DataFusion, all `args` and output are `ColumnarValue`, an enum of either a scalar or a dynamically-typed array.
-        // we can cater for both, or document that the UDF only supports some variants.
-        // here we will assume that al
+    let pow = |args: &[ArrayRef]| {
+        // in DataFusion, all `args` and output are dynamically-typed arrays, which means that we need to:
         // 1. cast the values to the type we want
         // 2. perform the computation for every element in the array (using a loop or SIMD) and construct the result
 
         // this is guaranteed by DataFusion based on the function's signature.
         assert_eq!(args.len(), 2);
 
-        let (base, exponent) = (&args[0], &args[1]);
-
-        let result = match (base, exponent) {
-            (
-                ColumnarValue::Scalar(ScalarValue::Float64(base)),
-                ColumnarValue::Scalar(ScalarValue::Float64(exponent)),
-            ) => ColumnarValue::Scalar(ScalarValue::Float64(maybe_pow(base, exponent))),
-            (ColumnarValue::Array(base), ColumnarValue::Array(exponent)) => {
-                let array = pow_array(base.as_ref(), exponent.as_ref())?;
-                ColumnarValue::Array(array)
-            }
-            _ => {
-                return Err(DataFusionError::Execution(
-                    "This UDF only supports f64".to_string(),
-                ))
-            }
-        };
-        Ok(result)
-    });
+        // 1. cast both arguments to f64. These casts MUST be aligned with the signature or this function panics!
+        let base = &args[0]
+            .as_any()
+            .downcast_ref::<Float64Array>()
+            .expect("cast failed");
+        let exponent = &args[1]
+            .as_any()
+            .downcast_ref::<Float64Array>()
+            .expect("cast failed");
+
+        // this is guaranteed by DataFusion. We place it just to make it obvious.
+        assert_eq!(exponent.len(), base.len());
+
+        // 2. perform the computation
+        let array = base
+            .iter()
+            .zip(exponent.iter())
+            .map(|(base, exponent)| {
+                match (base, exponent) {
+                    // in arrow, any value can be null.
+                    // Here we decide to make our UDF to return null when either base or exponent is null.
+                    (Some(base), Some(exponent)) => Some(base.powf(exponent)),
+                    _ => None,
+                }
+            })
+            .collect::<Float64Array>();
+
+        // `Ok` because no error occurred during the calculation (we should add one if exponent was [0, 1[ and the base < 0 because that panics!)
+        // `Arc` because arrays are immutable, thread-safe, trait objects.
+        Ok(Arc::new(array) as ArrayRef)
+    };
+    // the function above expects an `ArrayRef`, but DataFusion may pass a scalar to a UDF.
+    // thus, we use `make_scalar_function` to decorare the closure so that it can handle both Arrays and Scalar values.

Review comment:
       👍 

##########
File path: rust/datafusion/examples/simple_udf.rs
##########
@@ -54,50 +58,76 @@ fn create_context() -> Result<ExecutionContext> {
     Ok(ctx)
 }
 
+// a small utility function to compute pow(base, exponent)
+fn maybe_pow(base: &Option<f64>, exponent: &Option<f64>) -> Option<f64> {
+    match (base, exponent) {
+        // in arrow, any value can be null.
+        // Here we decide to make our UDF to return null when either base or exponent is null.
+        (Some(base), Some(exponent)) => Some(base.powf(*exponent)),
+        _ => None,
+    }
+}
+
+fn pow_array(base: &dyn Array, exponent: &dyn Array) -> Result<ArrayRef> {
+    // 1. cast both arguments to f64. These casts MUST be aligned with the signature or this function panics!
+    let base = base
+        .as_any()
+        .downcast_ref::<Float64Array>()
+        .expect("cast failed");
+    let exponent = exponent
+        .as_any()
+        .downcast_ref::<Float64Array>()
+        .expect("cast failed");
+
+    // this is guaranteed by DataFusion. We place it just to make it obvious.
+    assert_eq!(exponent.len(), base.len());
+
+    // 2. perform the computation
+    let array = base
+        .iter()
+        .zip(exponent.iter())
+        .map(|(base, exponent)| maybe_pow(&base, &exponent))
+        .collect::<Float64Array>();
+
+    // `Ok` because no error occurred during the calculation (we should add one if exponent was [0, 1[ and the base < 0 because that panics!)

Review comment:
       cool -- sorry I guess I am used to seeing an open interval  using a `)` --so in this case something like `[0, 1)` to represent `0 <= exponent < 1` (e.g. [here](https://en.wikipedia.org/wiki/Interval_(mathematics)#Terminology)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#issuecomment-778723732


   @velvia , @seddonm1  My opinion atm is that would not really help much from the compute side, as we would still need to write separate logic for the 4 cases `(Array, scalar)`, `(scalar, Array)`, `(scalar, scalar)`, `(array, array)`, which is exactly what this PR already proposes.
   
   For example, a subtraction of two arrays goes along the lines of
   
   ```rust
   let iter = lhs.values().zip(rhs.values()).map(|(r, l)| r - l);
   let array = unsafe { PrimitiveArray<T>::from_trusted_len_iter(iter) };
   // .values() is a slice
   ```
   
   If we had an implementation of `PrimitiveScalarArray<T>`, we could change `.values()` to be an iterator, but then we lose the benefits of contiguous regions, as Rust can no longer assume that `values` is a contiguous memory region (slices have that property, iterators do not). If both sides are scalars, we want to to create a `PrimitiveScalarArray<T>`, so, again we need to branch. In all cases, we need a match `(lhs, rhs)`. I think that DataFusion in this respect is doing things right. 
   
   To change the Array trait we basically need a change in the arrow specification so that all implementations agree on how to communicate such information via IPC, ffi, etc, so, that is mailing list material :)
   
   Note that this PR addresses @alamb 's concern by introducing a adapter that people can use if they do not want to bother implementing the scalar variants.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Dandandan commented on pull request #9376: ARROW-11446: [DataFusion] Added support for scalarValue in Builtin functions.

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9376:
URL: https://github.com/apache/arrow/pull/9376#issuecomment-771887859


   @jorgecarleitao what kind of benchmarks are you interested in? AFAIK, most benchmarks are not very depending on this, I expect it being mostly impactfully in cases where the projection itself is a lot of the time, but most benchmarks are spending most of the time on joins/aggregates.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org