Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/08/23 15:25:07 UTC

[GitHub] [arrow] jorgecarleitao opened a new pull request #8032: ARROW-9836: [Rust][DataFusion] Improve API for usage of UDFs

jorgecarleitao opened a new pull request #8032:
URL: https://github.com/apache/arrow/pull/8032


   See associated issue and document for details.
   
   The gist is that currently, users call UDFs through
   
   ``` 
   df.select(scalar_functions("sqrt", vec![col("a")], DataType::Float64))
   ```
    
   and this PR proposes a change to 
    
   ```
   let functions = df.registry()?;
   
   df.select(functions.udf("sqrt", vec![col("a")])?)
   ```
    
   so that they do not have to remember the UDF's return type when using it (and a whole lot of other things for us internally). The `df` part is still not implemented; currently it only works with `ctx.registry()?`.
   
   This PR is waiting for some changes in the `ExecutionContext` that will make it easier to integrate.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8032: ARROW-9836: [Rust][DataFusion] Improve API for usage of UDFs

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8032:
URL: https://github.com/apache/arrow/pull/8032#issuecomment-680244598


   @alamb thank you very much for your comments; I will now work on addressing them. I am still learning Arc/Box/Ref, so thank you a lot for also teaching me.
   
   @andygrove , I agree with you that built-in functions should not require access to the registry. Unfortunately, doing so required some re-work, which is the reason I retracted #7967 back to draft to focus on this one first.
   
   I pushed a new commit to this PR to address this point. Specifically, that commit adds:
   
   * a new enum with all built-in functions
   * functionality that glues the logical plan to the physical plan so that the functions' return types are invariant.
   * type coercion for built-in functions moved to the physical plane, to preserve schema invariance during planning.
   
   I am pretty happy with this PR, as IMO it has the flexibility we need to expand DataFusion's pool of built-in functions to multiple input and return types. The main features of this PR are:
   
   * users no longer have to pass the return type of the UDF when calling it (the proposal)
   * planning built-in functions continues to not require access to the registry (@andygrove 's point)
   * built-in functions now support multiple input types (e.g. `sqrt(f32)`, `sqrt(f64)`)
   * built-in functions now support multiple return types (e.g. `sqrt(f32) -> f32`, `sqrt(f64) -> f64`; a rough sketch of this is below)
   * coercion rules are no longer applied to built-in functions during SQL or physical planning, to avoid breaking schema invariance during planning
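   
   A rough sketch of what per-input-type return types could look like for a built-in such as `sqrt` (illustrative names only, not the exact code in this PR):
   
   ```rust
   use arrow::datatypes::DataType;
   
   /// Illustrative only: derive sqrt's return type from its (already coerced)
   /// input type, so that sqrt(f32) -> f32 and sqrt(f64) -> f64.
   fn sqrt_return_type(arg: &DataType) -> Option<DataType> {
       match arg {
           DataType::Float32 => Some(DataType::Float32),
           DataType::Float64 => Some(DataType::Float64),
           // anything else must first be coerced to one of the accepted types
           _ => None,
       }
   }
   ```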
   
   I have not completed the valid return types of built-in math functions as this PR was already too long.
   
   Overall, I think that this has not been a pleasant experience for you @andygrove and @alamb, as I constantly open and close PRs around functions/UDFs, and for that I am really sorry. I've been hitting one design challenge after another, which requires me to go back and forth.
   
   I am still in pursuit of my original quests:
   
   * built-in aggregate functions whose logical types are known from the physical expressions
   * type coercion on aggregate functions
   * built-in aggregate functions whose return types (e.g. `min(f32) -> f32`, `min(f64) -> f64`) are directly derived from the physical plan (there is an old fixme/todo in the code around that)
   * aggregate udfs
   * UDFs with multiple input and return types, to bring them to the same level of functionality as built-ins
   * planning a UDF without registering it (à la Spark) in the DF's API.
    
   I have code for some of this, I... just... need... to... finish... the... scalar... stuff... first...  😃 


----------------------------------------------------------------



[GitHub] [arrow] jorgecarleitao commented on pull request #8032: ARROW-9836: [Rust][DataFusion] Improve API for usage of UDFs

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8032:
URL: https://github.com/apache/arrow/pull/8032#issuecomment-679631810


   > For built-in functions like `sqrt` I would expect DataFusion to provide convenience functions to create an expression, like we do with `col` and the aggregate functions. I assume we could also do that with the design proposed here?
   > 
   > For example, I would like to be able to write:
   > 
   > ```rust
   > df.select(vec![col("foo"), sqrt(col("bar"))])?
   > ```
   
   This PR does not support this, as it treats every function (built-in or not) equally. To include that case, IMO this PR needs to add new variants to the logical `Expr`:
   
   * `Expr::ScalarFunction { name: String/Enum, args: Vec<Expr> }`, whose return type we know logically based on `name` (e.g. `"sqrt"`), exactly like `Expr::BinaryExpr`. This is mapped to a physical expression during planning. These can be built without access to the registry, as we hard-code the return type on the logical plan to be consistent with the physical one, like we do for our aggregates, binary expressions, etc.
   * `Expr::ScalarUDF { fun: ScalarFunction, args: Vec<Expr> }`, whose return type is only known after going to the registry to check what the user declared it to be (as this PR currently does).
   
   `sqrt` would return an `Expr::ScalarFunction`, which knows its own return type, and the planner converts it to the physical `ScalarFunction` via a hard-coded map, while `Expr::ScalarUDF`'s physical planning just plans `args` and passes them to `ScalarFunction`, like this PR already does.
   
   I.e. at the physical level, built-ins and UDFs are indistinguishable, but at the logical-plan level one only knows its name (built-in), while the other also knows its physical representation (`ScalarUDF`).


----------------------------------------------------------------



[GitHub] [arrow] jorgecarleitao commented on pull request #8032: ARROW-9836: [Rust][DataFusion] Improve API for usage of UDFs

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8032:
URL: https://github.com/apache/arrow/pull/8032#issuecomment-683374524


   FYI, this is what a PR would look like for the concatenate function: https://github.com/jorgecarleitao/arrow/pull/2/files using this API.
   
   I am not advocating that we follow this design; option 3 in [this comment](https://github.com/apache/arrow/pull/7967#issuecomment-683275858) would also be fine.
   
   My point is that regardless of which option we pick, we will need to have the functionality in this PR:
   
   1. function signatures
   2. methods to check and coerce types based on those signatures
   3. methods to return the return type of the function


----------------------------------------------------------------



[GitHub] [arrow] jorgecarleitao edited a comment on pull request #8032: ARROW-9836: [Rust][DataFusion] Improve API for usage of UDFs

Posted by GitBox <gi...@apache.org>.
jorgecarleitao edited a comment on pull request #8032:
URL: https://github.com/apache/arrow/pull/8032#issuecomment-679631810


   > For built-in functions like `sqrt` I would expect DataFusion to provide convenience functions to create an expression, like we do with `col` and the aggregate functions. I assume we could also do that with the design proposed here?
   > 
   > For example, I would like to be able to write:
   > 
   > ```rust
   > df.select(vec![col("foo"), sqrt(col("bar"))])?
   > ```
   
   This PR does not support this, as it treats every function (built-in or not) equally. To include that case, IMO this PR needs to add new variants to the logical `Expr`:
   
   * `Expr::ScalarFunction { name: String/Enum, args: Vec<Expr> }`, whose return type we know logically based on `name` (e.g. `"sqrt"`), exactly like `Expr::BinaryExpr`. This is mapped to a physical expression during planning. These can be built without access to the registry, as we hard-code the return type on the logical plan to be consistent with the physical one, like we do for our aggregates, binary expressions, etc.
   * `Expr::ScalarUDF { fun: ScalarFunction, args: Vec<Expr> }`, whose return type is only known after going to the registry to check what the user declared it to be (as this PR currently does).
   
   `sqrt` would return an `Expr::ScalarFunction`, which knows its own return type, and the planner converts it to the physical `ScalarFunction` via a hard-coded map, while `Expr::ScalarUDF`'s physical planning just plans `args` and passes them to `ScalarFunction`, like this PR already does.
   
   I.e. at the physical level, built-ins and UDFs are indistinguishable, but at the logical-plan level one only knows its name (built-in), while the other also knows its physical representation (`ScalarUDF`).
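   
   A rough, self-contained sketch of the two logical variants described above (the type names are stand-ins for illustration, not DataFusion's actual definitions):
   
   ```rust
   use std::sync::Arc;
   
   /// Stand-in for the enum of built-in functions.
   #[derive(Debug, Clone, Copy)]
   enum BuiltinScalarFunction { Sqrt /* , ... */ }
   
   /// Stand-in for the registered UDF object; it carries the user-declared return type.
   struct ScalarUdf { /* name, function pointer, arg types, return type, ... */ }
   
   enum Expr {
       /// Built-in: the return type follows from the variant itself (like `BinaryExpr`),
       /// so it can be planned without access to the registry.
       ScalarFunction { fun: BuiltinScalarFunction, args: Vec<Expr> },
       /// User-defined: the return type is only known from the registered object,
       /// so the registry is consulted when building this variant.
       ScalarUDF { fun: Arc<ScalarUdf>, args: Vec<Expr> },
       // ... other variants (Column, BinaryExpr, ...) elided
   }
   ```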


----------------------------------------------------------------



[GitHub] [arrow] jorgecarleitao commented on pull request #8032: ARROW-9836: [Rust][DataFusion] Improve API for usage of UDFs

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8032:
URL: https://github.com/apache/arrow/pull/8032#issuecomment-686533676


   I agree with you, @andygrove .
   
   * Built-in functions are not part of the registry as they are exposed directly from the `prelude`. So, maybe `udf_registry` could be more explicit about this.
   
   * I thought about `FunctionRegistry` being used for both UDFs and UDAFs (user-defined aggregate functions, à la [Spark](https://spark.apache.org/docs/latest/sql-ref-functions-udf-aggregate.html)), so that a user does not have to call both `df.registry()` and `df.aggregate_registry()`; they just need to remember one.
   
   * I thought about `FunctionRegistry::udf` and `FunctionRegistry::udaf` just because short names make statements easier to read. But, as always, it is a trade-off between brevity and clarity.
   
   I have no strong opinions about naming or UX here: I will implement whatever you agree upon :-). My main concern was to fix the data type thing :P
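   
   For illustration, the single-registry shape discussed above could look roughly like this (a sketch with stand-in types and the names from this discussion, not the code in the PR):
   
   ```rust
   /// Stand-in for DataFusion's logical expression type.
   struct Expr;
   type Result<T> = std::result::Result<T, String>;
   
   /// One registry for both scalar UDFs and aggregate UDFs (UDAFs),
   /// so users only have to remember a single `df.registry()`.
   trait FunctionRegistry {
       /// Build an expression that calls a registered scalar UDF by name.
       fn udf(&self, name: &str, args: Vec<Expr>) -> Result<Expr>;
       /// Build an expression that calls a registered UDAF by name.
       fn udaf(&self, name: &str, args: Vec<Expr>) -> Result<Expr>;
   }
   ```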


----------------------------------------------------------------



[GitHub] [arrow] jorgecarleitao commented on pull request #8032: ARROW-9836: [Rust][DataFusion] Improve API for usage of UDFs

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8032:
URL: https://github.com/apache/arrow/pull/8032#issuecomment-686291058


   After a long digression through the realm of built-ins, this has now been simplified and rebased against master.
   
   @andygrove and @alamb , ready for a re-review.
   
   Again, the core goal here is to allow users to use UDFs without having to worry about their return type.
   
   I've incorporated all points from @alamb and @andygrove so far:
   * `Registry` is now a trait
   * `Arc` is used consistently
   * built-ins continue to be available without access to the registry


----------------------------------------------------------------



[GitHub] [arrow] jorgecarleitao commented on pull request #8032: ARROW-9836: [Rust][DataFusion] Improve API for usage of UDFs

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8032:
URL: https://github.com/apache/arrow/pull/8032#issuecomment-683377384


   And finally, this is how we would add the `array` function, which receives an arbitrary number of uniformly-typed arguments: https://github.com/jorgecarleitao/arrow/pull/3/files
   
   To summarize:
   
   * math functions: 
       * accept 1 argument of a fixed type (f32 or f64)
       * return a fixed type (f64)
   * `concat`:
       * accepts an arbitrary number of arguments of fixed type (utf8 atm, but large is trivial to add)
       * returns a fixed type (utf8)
   * `array`:
       * accepts an arbitrary number of arguments of variable but uniform types (i.e. all arguments must be of equal type)
       * returns `FixedSizeList(input_types[0], input.len())`
   
   The API that I am proposing here addresses all these cases out of the box. The 3 PRs in my repo,
   
   * all math accepts f32: https://github.com/jorgecarleitao/arrow/pull/1/files
   * `concatenate` for utf8: https://github.com/jorgecarleitao/arrow/pull/2/files
   * `array` for utf8: https://github.com/jorgecarleitao/arrow/pull/3/files
   
   add support for each of them at the physical and logical levels.
   
   This PR also includes all the required coercion rules for this to work. E.g. `array(f32, i32, u32)` would be coerced to `array(f32, f32, f32)` (first argument dominates).
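   
   To make the signature shapes above concrete, here is a rough sketch of how they could be represented, together with the "first argument dominates" coercion just described (the names and exact shapes are illustrative assumptions, not the PR's code):
   
   ```rust
   use arrow::datatypes::DataType;
   
   /// Illustrative signature shapes matching the three cases summarized above.
   enum Signature {
       /// Exactly one argument, of one of the listed types (math functions).
       Uniform1(Vec<DataType>),
       /// Any number of arguments, all of one fixed type (`concat`).
       VariadicFixed(DataType),
       /// Any number of arguments, all of the same (but otherwise free) type (`array`).
       VariadicEqual,
   }
   
   /// "First argument dominates": coerce every argument to the type of the first,
   /// e.g. array(f32, i32, u32) -> array(f32, f32, f32). Sketch only.
   fn coerce_variadic_equal(arg_types: &[DataType]) -> Vec<DataType> {
       match arg_types.first() {
           Some(first) => vec![first.clone(); arg_types.len()],
           None => vec![],
       }
   }
   ```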


----------------------------------------------------------------



[GitHub] [arrow] alamb commented on a change in pull request #8032: ARROW-9836: [Rust][DataFusion] Improve API for usage of UDFs

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8032:
URL: https://github.com/apache/arrow/pull/8032#discussion_r482899692



##########
File path: rust/datafusion/src/logical_plan/mod.rs
##########
@@ -1089,6 +1079,15 @@ pub fn can_coerce_from(type_into: &DataType, type_from: &DataType) -> bool {
     }
 }
 
+/// A registry knows how to build logical expressions out of user-defined function' names
+pub trait Registry {

Review comment:
       I wonder if renaming this to `FunctionRegistry` might make it a little more specific, without being overly verbose. I don't feel strongly about this change
   
   ```suggestion
   pub trait FunctionRegistry {
   ```

##########
File path: rust/datafusion/src/optimizer/type_coercion.rs
##########
@@ -22,34 +22,24 @@
 
 use arrow::datatypes::Schema;
 
-use crate::error::{ExecutionError, Result};
+use crate::error::Result;
 use crate::logical_plan::Expr;
 use crate::logical_plan::LogicalPlan;
 use crate::optimizer::optimizer::OptimizerRule;
 use crate::optimizer::utils;
-use crate::physical_plan::{
-    expressions::numerical_coercion, udf::ScalarFunctionRegistry,
-};
+use crate::physical_plan::expressions::numerical_coercion;
 use utils::optimize_explain;
 
 /// Optimizer that applies coercion rules to expressions in the logical plan.
 ///
 /// This optimizer does not alter the structure of the plan, it only changes expressions on it.
-pub struct TypeCoercionRule<'a, P>
-where
-    P: ScalarFunctionRegistry,
-{
-    scalar_functions: &'a P,
-}
+pub struct TypeCoercionRule {}

Review comment:
       ❤️  (for removing the reference)

##########
File path: rust/datafusion/src/logical_plan/mod.rs
##########
@@ -1184,19 +1183,14 @@ impl LogicalPlanBuilder {
     /// Apply a projection
     pub fn project(&self, expr: Vec<Expr>) -> Result<Self> {
         let input_schema = self.plan.schema();
-        let projected_expr = if expr.contains(&Expr::Wildcard) {
-            let mut expr_vec = vec![];
-            (0..expr.len()).for_each(|i| match &expr[i] {
-                Expr::Wildcard => {
-                    (0..input_schema.fields().len())
-                        .for_each(|i| expr_vec.push(col(input_schema.field(i).name())));
-                }
-                _ => expr_vec.push(expr[i].clone()),
-            });
-            expr_vec
-        } else {
-            expr.clone()
-        };
+        let mut projected_expr = vec![];

Review comment:
       I want to verify my understanding -- this is a code cleanup that is not directly required for UDFs, right?




----------------------------------------------------------------



[GitHub] [arrow] github-actions[bot] commented on pull request #8032: ARROW-9836: [Rust][DataFusion] Improve API for usage of UDFs

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8032:
URL: https://github.com/apache/arrow/pull/8032#issuecomment-678788344


   https://issues.apache.org/jira/browse/ARROW-9836


----------------------------------------------------------------



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8032: ARROW-9836: [Rust][DataFusion] Improve API for usage of UDFs

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #8032:
URL: https://github.com/apache/arrow/pull/8032#discussion_r476706007



##########
File path: rust/datafusion/src/execution/physical_plan/functions.rs
##########
@@ -0,0 +1,208 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Declaration of built-in (scalar) functions.
+//! This module contains built-in functions' enumeration and metadata.
+
+use super::{expressions::cast, PhysicalExpr};
+use crate::error::{ExecutionError, Result};
+use crate::execution::physical_plan::math_expressions;
+use crate::execution::physical_plan::udf;
+use arrow::{
+    compute::kernels::length::length,
+    datatypes::{DataType, Schema},
+};
+use std::{fmt, str::FromStr, sync::Arc};
+use udf::ScalarUdf;
+
+/// Enum of all built-in scalar functions
+#[derive(Debug, Clone, PartialEq, Eq)]
+pub enum ScalarFunction {
+    /// sqrt
+    Sqrt,
+    /// sin
+    Sin,
+    /// cos
+    Cos,
+    /// tan
+    Tan,
+    /// asin
+    Asin,
+    /// acos
+    Acos,
+    /// atan
+    Atan,
+    /// exp
+    Exp,
+    /// log, also known as ln
+    Log,
+    /// log2
+    Log2,
+    /// log10
+    Log10,
+    /// floor
+    Floor,
+    /// ceil
+    Ceil,
+    /// round
+    Round,
+    /// trunc
+    Trunc,
+    /// abs
+    Abs,
+    /// signum
+    Signum,
+    /// length
+    Length,
+}
+
+impl fmt::Display for ScalarFunction {
+    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
+        // lowercase of the debug.
+        write!(f, "{}", format!("{:?}", self).to_lowercase())
+    }
+}
+
+impl FromStr for ScalarFunction {
+    type Err = ExecutionError;
+    fn from_str(name: &str) -> Result<ScalarFunction> {
+        Ok(match name {
+            "sqrt" => ScalarFunction::Sqrt,
+            "sin" => ScalarFunction::Sin,
+            "cos" => ScalarFunction::Cos,
+            "tan" => ScalarFunction::Tan,
+            "asin" => ScalarFunction::Asin,
+            "acos" => ScalarFunction::Acos,
+            "atan" => ScalarFunction::Atan,
+            "exp" => ScalarFunction::Exp,
+            "log" => ScalarFunction::Log,
+            "log2" => ScalarFunction::Log2,
+            "log10" => ScalarFunction::Log10,
+            "floor" => ScalarFunction::Floor,
+            "ceil" => ScalarFunction::Ceil,
+            "round" => ScalarFunction::Round,
+            "truc" => ScalarFunction::Trunc,
+            "abs" => ScalarFunction::Abs,
+            "signum" => ScalarFunction::Signum,
+            "length" => ScalarFunction::Length,
+            _ => {
+                return Err(ExecutionError::General(format!(
+                    "There is no built-in function named {}",
+                    name
+                )))
+            }
+        })
+    }
+}
+
+/// Returns the datatype of the scalar function
+pub fn return_type(fun: &ScalarFunction, arg_types: &Vec<DataType>) -> Result<DataType> {
+    if arg_types.len() != 1 {
+        // for now, every function expects a single argument, and thus this is enough
+        return Err(ExecutionError::General(format!(
+            "The function \"{}\" expected 1 argument, but received \"{}\"",
+            fun,
+            arg_types.len()
+        )));
+    }
+
+    // verify that this is a valid type for this function
+    coerce(fun, &arg_types[0])?;
+
+    match fun {
+        ScalarFunction::Length => Ok(DataType::UInt32),
+        _ => Ok(DataType::Float64),
+    }
+}
+
+/// Create a physical (function) expression.
+pub fn function(
+    fun: &ScalarFunction,
+    args: Vec<Arc<dyn PhysicalExpr>>,
+    input_schema: &Schema,
+) -> Result<Arc<dyn PhysicalExpr>> {
+    let fun_expr: ScalarUdf = Arc::new(match fun {
+        ScalarFunction::Sqrt => math_expressions::sqrt,
+        ScalarFunction::Sin => math_expressions::sin,
+        ScalarFunction::Cos => math_expressions::cos,
+        ScalarFunction::Tan => math_expressions::tan,
+        ScalarFunction::Asin => math_expressions::asin,
+        ScalarFunction::Acos => math_expressions::acos,
+        ScalarFunction::Atan => math_expressions::atan,
+        ScalarFunction::Exp => math_expressions::exp,
+        ScalarFunction::Log => math_expressions::ln,
+        ScalarFunction::Log2 => math_expressions::log2,
+        ScalarFunction::Log10 => math_expressions::log10,
+        ScalarFunction::Floor => math_expressions::floor,
+        ScalarFunction::Ceil => math_expressions::ceil,
+        ScalarFunction::Round => math_expressions::round,
+        ScalarFunction::Trunc => math_expressions::trunc,
+        ScalarFunction::Abs => math_expressions::abs,
+        ScalarFunction::Signum => math_expressions::signum,
+        ScalarFunction::Length => |args| Ok(Arc::new(length(args[0].as_ref())?)),
+    });
+    let data_types = args
+        .iter()
+        .map(|e| e.data_type(input_schema))
+        .collect::<Result<Vec<_>>>()?;
+
+    // coerce type
+    // for now, this works only for 1 argument.
+    assert!(args.len() == 1);
+    assert!(data_types.len() == 1);
+    let new_type = coerce(fun, &data_types[0])?;
+    let args = vec![cast(args[0].clone(), input_schema, new_type)?];
+
+    Ok(Arc::new(udf::ScalarFunctionExpr::new(
+        &format!("{}", fun),
+        &fun_expr,
+        args,
+        &return_type(&fun, &data_types)?,
+    )))
+}
+
+/// the set of valid types supported by the function
+pub fn valid_type(fun: &ScalarFunction) -> DataType {
+    match fun {
+        ScalarFunction::Length => DataType::Utf8,
+        _ => DataType::Float64,

Review comment:
       This is where we will be able to declare multiple valid input types, if this function changes to `Vec<Vec<DataType>>` (multiple types, multiple arguments).
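   
   For illustration, one plausible reading of that `Vec<Vec<DataType>>` shape (an assumption, not code in this PR): the outer `Vec` is the argument position and the inner `Vec` lists the types accepted at that position.
   
   ```rust
   use arrow::datatypes::DataType;
   
   /// Illustrative only: a unary math function that accepts either f32 or f64.
   fn sqrt_valid_types() -> Vec<Vec<DataType>> {
       vec![vec![DataType::Float32, DataType::Float64]]
   }
   ```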




----------------------------------------------------------------



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8032: ARROW-9836: [Rust][DataFusion] Improve API for usage of UDFs

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #8032:
URL: https://github.com/apache/arrow/pull/8032#discussion_r482988513



##########
File path: rust/datafusion/src/logical_plan/mod.rs
##########
@@ -1089,6 +1079,15 @@ pub fn can_coerce_from(type_into: &DataType, type_from: &DataType) -> bool {
     }
 }
 
+/// A registry knows how to build logical expressions out of user-defined function' names
+pub trait Registry {

Review comment:
       [Explicit is better than implicit](https://www.python.org/dev/peps/pep-0020/#id2) (done)

##########
File path: rust/datafusion/src/logical_plan/mod.rs
##########
@@ -1184,19 +1183,14 @@ impl LogicalPlanBuilder {
     /// Apply a projection
     pub fn project(&self, expr: Vec<Expr>) -> Result<Self> {
         let input_schema = self.plan.schema();
-        let projected_expr = if expr.contains(&Expr::Wildcard) {
-            let mut expr_vec = vec![];
-            (0..expr.len()).for_each(|i| match &expr[i] {
-                Expr::Wildcard => {
-                    (0..input_schema.fields().len())
-                        .for_each(|i| expr_vec.push(col(input_schema.field(i).name())));
-                }
-                _ => expr_vec.push(expr[i].clone()),
-            });
-            expr_vec
-        } else {
-            expr.clone()
-        };
+        let mut projected_expr = vec![];

Review comment:
       Indirectly, it is related: `expr.contains(&Expr::Wildcard)` requires `PartialEq`, which was dropped [in this line](https://github.com/apache/arrow/pull/8032/files/d2ff9c9b838e68d98d76b1f18c3721f2fa7e411f#diff-9922d86c805a8dc858387eb47612caaeL227) due to the addition of an `Arc<ScalarFunction>` in [this line](https://github.com/apache/arrow/pull/8032/files/d2ff9c9b838e68d98d76b1f18c3721f2fa7e411f#diff-9922d86c805a8dc858387eb47612caaeR279).
   
   However, since the `expr.contains(&Expr::Wildcard)` check was thankfully entirely optional, I dropped it ^_^
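   
   For reference, the replacement only needs pattern matching rather than `PartialEq`; a simplified, self-contained sketch of the idea (not a verbatim copy of the new `project` body):
   
   ```rust
   /// Minimal stand-in for the logical expression type.
   #[derive(Clone)]
   enum Expr {
       Wildcard,
       Column(String),
   }
   
   /// Expand `Expr::Wildcard` into one column expression per schema field,
   /// using `match` instead of `expr.contains(&Expr::Wildcard)`.
   fn expand_wildcards(exprs: &[Expr], field_names: &[&str]) -> Vec<Expr> {
       let mut projected = Vec::new();
       for e in exprs {
           match e {
               Expr::Wildcard => {
                   for name in field_names {
                       projected.push(Expr::Column(name.to_string()));
                   }
               }
               other => projected.push(other.clone()),
           }
       }
       projected
   }
   ```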




----------------------------------------------------------------



[GitHub] [arrow] alamb commented on a change in pull request #8032: ARROW-9836: [Rust][DataFusion] Improve API for usage of UDFs

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8032:
URL: https://github.com/apache/arrow/pull/8032#discussion_r478319729



##########
File path: rust/datafusion/src/logicalplan.rs
##########
@@ -1042,6 +998,35 @@ pub fn can_coerce_from(type_into: &DataType, type_from: &DataType) -> bool {
     }
 }
 
+/// A registry of functions used to plan queries programmatically

Review comment:
       I wonder if you could implement UDFFactory on `DataFrameImpl` as well -- something like (untested)
   
   ```
   impl UDFFactory for DataFrameImpl {
       fn udf(&self, name: &str, args: Vec<Expr>) -> Result<Expr> {
           self.ctx_state.lock().expect("locked the mutex").udf(name, args)
       }
   }
   ```
   
   I think this is another good example of why it would be helpful to remove the Mutex from ExecutionContextState. I'll keep working on that -- I apologize for my slowness this week. I'll be back full time next week and hopefully make faster progress.




----------------------------------------------------------------



[GitHub] [arrow] andygrove commented on a change in pull request #8032: ARROW-9836: [Rust][DataFusion] Improve API for usage of UDFs

Posted by GitBox <gi...@apache.org>.
andygrove commented on a change in pull request #8032:
URL: https://github.com/apache/arrow/pull/8032#discussion_r483001402



##########
File path: rust/datafusion/src/dataframe.rs
##########
@@ -188,4 +188,19 @@ pub trait DataFrame {
     /// # }
     /// ```
     fn explain(&self, verbose: bool) -> Result<Arc<dyn DataFrame>>;
+
+    /// Return a `FunctionRegistry` used to plan udf's calls
+    ///
+    /// ```
+    /// # use datafusion::prelude::*;
+    /// # use datafusion::error::Result;
+    /// # fn main() -> Result<()> {
+    /// let mut ctx = ExecutionContext::new();
+    /// let df = ctx.read_csv("tests/example.csv", CsvReadOptions::new())?;
+    /// let f = df.registry();
+    /// // use f.udf("name", vec![...]) to use the udf
+    /// # Ok(())
+    /// # }
+    /// ```
+    fn registry(&self) -> &dyn FunctionRegistry;

Review comment:
       Is this registry specific to scalar functions or will it also be used for aggregate functions? Perhaps we should name the method either `function_registry` or `scalar_function_registry`?




----------------------------------------------------------------



[GitHub] [arrow] jorgecarleitao commented on pull request #8032: ARROW-9836: [Rust][DataFusion] Improve API for usage of UDFs

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8032:
URL: https://github.com/apache/arrow/pull/8032#issuecomment-688499191


   I think so, @andygrove. There will probably be some renaming to do once we have UDAFs. For now, I think it is fine.


----------------------------------------------------------------



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8032: ARROW-9836: [Rust][DataFusion] Improve API for usage of UDFs

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #8032:
URL: https://github.com/apache/arrow/pull/8032#discussion_r477045153



##########
File path: rust/datafusion/src/logicalplan.rs
##########
@@ -1042,6 +998,35 @@ pub fn can_coerce_from(type_into: &DataType, type_from: &DataType) -> bool {
     }
 }
 
+/// A registry of functions used to plan queries programmatically

Review comment:
       That is an excellent idea. I am trying to implement it, but I am facing a challenge; could you help me here?
   
   I've implemented the trait and moved the implementation to the `ExecutionContextState`, which is where the scalar functions live and what the DataFrame has access to:
   
   ```
   impl UDFFactory for ExecutionContextState {
       fn udf(&self, name: &str, args: Vec<Expr>) -> Result<Expr> {
           ...
       }
   }
   ```
   
   On the dataframe, we will need to do something like 
   
   ```
   fn udfs(&self) -> &dyn UDFFactory {
       self.ctx_state ...
   }
   ```
   
   but `self.ctx_state` is under a mutex: `Arc<Mutex<ExecutionContextState>>`.
   
   This seems fair (safety wise): another thread could be trying to register a new UDF while we are trying to use one, in which case one of them should wait.
   
   However, I am struggling to write the interface to handle this.
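   
   (For anyone reading along: the difficulty is that a `&dyn UDFFactory` returned from behind the mutex cannot outlive the `MutexGuard` created inside the method. One way around it, sketched below with purely illustrative names, is to hand back an owned handle that clones the `Arc` and locks per call, instead of returning a reference.)
   
   ```rust
   use std::sync::{Arc, Mutex};
   
   /// Illustrative stand-in; would hold the registered scalar functions.
   struct ExecutionContextState;
   
   /// Owned handle returned to the caller; it keeps its own `Arc` to the state.
   struct RegistryHandle {
       state: Arc<Mutex<ExecutionContextState>>,
   }
   
   impl RegistryHandle {
       fn udf(&self, name: &str) -> String {
           // lock only for the duration of the lookup
           let _state = self.state.lock().expect("mutex poisoned");
           format!("looked up udf `{}`", name)
       }
   }
   
   struct DataFrameImpl {
       ctx_state: Arc<Mutex<ExecutionContextState>>,
   }
   
   impl DataFrameImpl {
       /// Returning an owned handle avoids returning a reference into the guard.
       fn registry(&self) -> RegistryHandle {
           RegistryHandle { state: Arc::clone(&self.ctx_state) }
       }
   }
   ```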




----------------------------------------------------------------



[GitHub] [arrow] alamb commented on pull request #8032: ARROW-9836: [Rust][DataFusion] Improve API for usage of UDFs

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #8032:
URL: https://github.com/apache/arrow/pull/8032#issuecomment-680037941


   For anyone else reading along, the associated document I think is https://docs.google.com/document/d/1Kzz642ScizeKXmVE1bBlbLvR663BKQaGqVIyy9cAscY/edit?usp=sharing


----------------------------------------------------------------



[GitHub] [arrow] andygrove commented on pull request #8032: ARROW-9836: [Rust][DataFusion] Improve API for usage of UDFs

Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #8032:
URL: https://github.com/apache/arrow/pull/8032#issuecomment-688498545


   @jorgecarleitao @alamb I'm catching up on the PRs today. It looks like this one is ready to merge?


----------------------------------------------------------------



[GitHub] [arrow] jorgecarleitao commented on pull request #8032: ARROW-9836: [Rust][DataFusion] Improve API for usage of UDFs

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8032:
URL: https://github.com/apache/arrow/pull/8032#issuecomment-683255964


   FYI, this is what a PR to add support for f32 to mathematical expressions (keeping the return type f64) looks like: 
   https://github.com/jorgecarleitao/arrow/pull/1
   
   I.e. IMO with this PR we can support almost any built-in function: fixed type, variable return type, multiple input types, etc on `Expr::ScalarFunction`, which gives a lot of flexibility to add a new function, as we do not need to fiddle with `Expr`, only with input types, return types, etc. on the physical plane.
   
   I split built-ins from the UDFs because a built-in's type is known without access to the registry, which is currently required to allow users to use them outside the `registry`.


----------------------------------------------------------------



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8032: ARROW-9836: [Rust][DataFusion] Improve API for usage of UDFs

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #8032:
URL: https://github.com/apache/arrow/pull/8032#discussion_r476703332



##########
File path: rust/datafusion/src/execution/dataframe_impl.rs
##########
@@ -232,6 +238,23 @@ mod tests {
         Ok(())
     }
 
+    #[test]
+    fn registry() -> Result<()> {
+        // build query with a UDF using DataFrame API

Review comment:
       this is now useless, as built-in functions no longer require the registry. We should register one and use it.




----------------------------------------------------------------



[GitHub] [arrow] alamb commented on a change in pull request #8032: ARROW-9836: [Rust][DataFusion] Improve API for usage of UDFs

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8032:
URL: https://github.com/apache/arrow/pull/8032#discussion_r476466084



##########
File path: rust/datafusion/src/execution/context.rs
##########
@@ -193,6 +191,17 @@ impl ExecutionContext {
         state.scalar_functions.insert(f.name.clone(), Arc::new(f));
     }
 
+    /// Get a reference to the registered scalar functions
+    pub fn scalar_functions(&self) -> Vec<String> {

Review comment:
       This is not returning a reference, it is returning a copy 

##########
File path: rust/datafusion/src/logicalplan.rs
##########
@@ -1042,6 +998,35 @@ pub fn can_coerce_from(type_into: &DataType, type_from: &DataType) -> bool {
     }
 }
 
+/// A registry of functions used to plan queries programmatically

Review comment:
       I wonder what benefit we get from having Registry be its own `struct` and copying the vec around.
   
   What do you think about something like using a trait? Something like 
   
   ```
   pub trait UDFFactory {
       fn udf(&self, name: &str, args: Vec<Expr>) -> Result<Expr>;
   }
   ```
   
   And then move the implementation of `udf` directly into ExecutionContext. 
   
   
   
   
   

##########
File path: rust/datafusion/src/logicalplan.rs
##########
@@ -260,12 +261,10 @@ pub enum Expr {
     },
     /// scalar function
     ScalarFunction {
-        /// Name of the function
-        name: String,
+        /// The function
+        fun: Box<ScalarFunction>,

Review comment:
       ```suggestion
           fun: Arc<ScalarFunction>,
   ```
   
   I think you can avoid deep copying the `ScalarFunction`  if we use `Arc` consistently

##########
File path: rust/datafusion/src/sql/planner.rs
##########
@@ -527,9 +527,8 @@ impl<'a, S: SchemaProvider> SqlToRel<'a, S> {
                             }
 
                             Ok(Expr::ScalarFunction {
-                                name: name.clone(),
+                                fun: Box::new(fm.as_ref().clone()),

Review comment:
   I think if `fun` was an `Arc`, this line would look like `fun: fm.clone()`




----------------------------------------------------------------



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8032: ARROW-9836: [Rust][DataFusion] Improve API for usage of UDFs

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #8032:
URL: https://github.com/apache/arrow/pull/8032#discussion_r476704869



##########
File path: rust/datafusion/src/execution/physical_plan/functions.rs
##########
@@ -0,0 +1,208 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Declaration of built-in (scalar) functions.
+//! This module contains built-in functions' enumeration and metadata.
+
+use super::{expressions::cast, PhysicalExpr};
+use crate::error::{ExecutionError, Result};
+use crate::execution::physical_plan::math_expressions;
+use crate::execution::physical_plan::udf;
+use arrow::{
+    compute::kernels::length::length,
+    datatypes::{DataType, Schema},
+};
+use std::{fmt, str::FromStr, sync::Arc};
+use udf::ScalarUdf;
+
+/// Enum of all built-in scalar functions
+#[derive(Debug, Clone, PartialEq, Eq)]
+pub enum ScalarFunction {
+    /// sqrt
+    Sqrt,
+    /// sin
+    Sin,
+    /// cos
+    Cos,
+    /// tan
+    Tan,
+    /// asin
+    Asin,
+    /// acos
+    Acos,
+    /// atan
+    Atan,
+    /// exp
+    Exp,
+    /// log, also known as ln
+    Log,
+    /// log2
+    Log2,
+    /// log10
+    Log10,
+    /// floor
+    Floor,
+    /// ceil
+    Ceil,
+    /// round
+    Round,
+    /// trunc
+    Trunc,
+    /// abs
+    Abs,
+    /// signum
+    Signum,
+    /// length
+    Length,
+}
+
+impl fmt::Display for ScalarFunction {
+    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
+        // lowercase of the debug.
+        write!(f, "{}", format!("{:?}", self).to_lowercase())
+    }
+}
+
+impl FromStr for ScalarFunction {
+    type Err = ExecutionError;
+    fn from_str(name: &str) -> Result<ScalarFunction> {
+        Ok(match name {
+            "sqrt" => ScalarFunction::Sqrt,
+            "sin" => ScalarFunction::Sin,
+            "cos" => ScalarFunction::Cos,
+            "tan" => ScalarFunction::Tan,
+            "asin" => ScalarFunction::Asin,
+            "acos" => ScalarFunction::Acos,
+            "atan" => ScalarFunction::Atan,
+            "exp" => ScalarFunction::Exp,
+            "log" => ScalarFunction::Log,
+            "log2" => ScalarFunction::Log2,
+            "log10" => ScalarFunction::Log10,
+            "floor" => ScalarFunction::Floor,
+            "ceil" => ScalarFunction::Ceil,
+            "round" => ScalarFunction::Round,
+            "truc" => ScalarFunction::Trunc,
+            "abs" => ScalarFunction::Abs,
+            "signum" => ScalarFunction::Signum,
+            "length" => ScalarFunction::Length,
+            _ => {
+                return Err(ExecutionError::General(format!(
+                    "There is no built-in function named {}",
+                    name
+                )))
+            }
+        })
+    }
+}
+
+/// Returns the datatype of the scalar function
+pub fn return_type(fun: &ScalarFunction, arg_types: &Vec<DataType>) -> Result<DataType> {
+    if arg_types.len() != 1 {
+        // for now, every function expects a single argument, and thus this is enough
+        return Err(ExecutionError::General(format!(
+            "The function \"{}\" expected 1 argument, but received \"{}\"",
+            fun,
+            arg_types.len()
+        )));
+    }
+
+    // verify that this is a valid type for this function
+    coerce(fun, &arg_types[0])?;
+
+    match fun {
+        ScalarFunction::Length => Ok(DataType::UInt32),
+        _ => Ok(DataType::Float64),
+    }
+}
+
+/// Create a physical (function) expression.
+pub fn function(
+    fun: &ScalarFunction,
+    args: Vec<Arc<dyn PhysicalExpr>>,
+    input_schema: &Schema,
+) -> Result<Arc<dyn PhysicalExpr>> {
+    let fun_expr: ScalarUdf = Arc::new(match fun {
+        ScalarFunction::Sqrt => math_expressions::sqrt,
+        ScalarFunction::Sin => math_expressions::sin,
+        ScalarFunction::Cos => math_expressions::cos,
+        ScalarFunction::Tan => math_expressions::tan,
+        ScalarFunction::Asin => math_expressions::asin,
+        ScalarFunction::Acos => math_expressions::acos,
+        ScalarFunction::Atan => math_expressions::atan,
+        ScalarFunction::Exp => math_expressions::exp,
+        ScalarFunction::Log => math_expressions::ln,
+        ScalarFunction::Log2 => math_expressions::log2,
+        ScalarFunction::Log10 => math_expressions::log10,
+        ScalarFunction::Floor => math_expressions::floor,
+        ScalarFunction::Ceil => math_expressions::ceil,
+        ScalarFunction::Round => math_expressions::round,
+        ScalarFunction::Trunc => math_expressions::trunc,
+        ScalarFunction::Abs => math_expressions::abs,
+        ScalarFunction::Signum => math_expressions::signum,
+        ScalarFunction::Length => |args| Ok(Arc::new(length(args[0].as_ref())?)),
+    });
+    let data_types = args
+        .iter()
+        .map(|e| e.data_type(input_schema))
+        .collect::<Result<Vec<_>>>()?;
+
+    // coerce type
+    // for now, this works only for 1 argument.
+    assert!(args.len() == 1);
+    assert!(data_types.len() == 1);
+    let new_type = coerce(fun, &data_types[0])?;
+    let args = vec![cast(args[0].clone(), input_schema, new_type)?];

Review comment:
       we have this coercion rule in another PR, which will expand this to multiple arguments.




----------------------------------------------------------------



[GitHub] [arrow] andygrove commented on pull request #8032: ARROW-9836: [Rust][DataFusion] Improve API for usage of UDFs

Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #8032:
URL: https://github.com/apache/arrow/pull/8032#issuecomment-679327189


   For built-in functions like `sqrt` I would expect DataFusion to provide convenience functions to create an expression, like we do with `col` and the aggregate functions. I assume we could also do that with the design proposed here?
   
   For example, I would like to be able to write:
   
   ```rust
   df.select(vec![col("foo"), sqrt(col("bar"))])?
   ```


----------------------------------------------------------------



[GitHub] [arrow] andygrove closed pull request #8032: ARROW-9836: [Rust][DataFusion] Improve API for usage of UDFs

Posted by GitBox <gi...@apache.org>.
andygrove closed pull request #8032:
URL: https://github.com/apache/arrow/pull/8032


   


----------------------------------------------------------------



[GitHub] [arrow] alamb commented on pull request #8032: ARROW-9836: [Rust][DataFusion] Improve API for usage of UDFs

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #8032:
URL: https://github.com/apache/arrow/pull/8032#issuecomment-681867726


   Good luck -- at this stage of a project (when the architecture is changing a bunch) I know it is hard to make small / easy-to-review PRs. I hope the comments are helpful and I am sorry I don't have more time to devote to reviews.
   


----------------------------------------------------------------



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8032: ARROW-9836: [Rust][DataFusion] Improve API for usage of UDFs

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #8032:
URL: https://github.com/apache/arrow/pull/8032#discussion_r476704464



##########
File path: rust/datafusion/src/execution/physical_plan/functions.rs
##########
@@ -0,0 +1,208 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Declaration of built-in (scalar) functions.
+//! This module contains built-in functions' enumeration and metadata.
+
+use super::{expressions::cast, PhysicalExpr};
+use crate::error::{ExecutionError, Result};
+use crate::execution::physical_plan::math_expressions;
+use crate::execution::physical_plan::udf;
+use arrow::{
+    compute::kernels::length::length,
+    datatypes::{DataType, Schema},
+};
+use std::{fmt, str::FromStr, sync::Arc};
+use udf::ScalarUdf;
+
+/// Enum of all built-in scalar functions
+#[derive(Debug, Clone, PartialEq, Eq)]
+pub enum ScalarFunction {
+    /// sqrt
+    Sqrt,
+    /// sin
+    Sin,
+    /// cos
+    Cos,
+    /// tan
+    Tan,
+    /// asin
+    Asin,
+    /// acos
+    Acos,
+    /// atan
+    Atan,
+    /// exp
+    Exp,
+    /// log, also known as ln
+    Log,
+    /// log2
+    Log2,
+    /// log10
+    Log10,
+    /// floor
+    Floor,
+    /// ceil
+    Ceil,
+    /// round
+    Round,
+    /// trunc
+    Trunc,
+    /// abs
+    Abs,
+    /// signum
+    Signum,
+    /// length
+    Length,
+}
+
+impl fmt::Display for ScalarFunction {
+    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
+        // lowercase of the debug.
+        write!(f, "{}", format!("{:?}", self).to_lowercase())
+    }
+}
+
+impl FromStr for ScalarFunction {
+    type Err = ExecutionError;
+    fn from_str(name: &str) -> Result<ScalarFunction> {
+        Ok(match name {
+            "sqrt" => ScalarFunction::Sqrt,
+            "sin" => ScalarFunction::Sin,
+            "cos" => ScalarFunction::Cos,
+            "tan" => ScalarFunction::Tan,
+            "asin" => ScalarFunction::Asin,
+            "acos" => ScalarFunction::Acos,
+            "atan" => ScalarFunction::Atan,
+            "exp" => ScalarFunction::Exp,
+            "log" => ScalarFunction::Log,
+            "log2" => ScalarFunction::Log2,
+            "log10" => ScalarFunction::Log10,
+            "floor" => ScalarFunction::Floor,
+            "ceil" => ScalarFunction::Ceil,
+            "round" => ScalarFunction::Round,
+            "truc" => ScalarFunction::Trunc,
+            "abs" => ScalarFunction::Abs,
+            "signum" => ScalarFunction::Signum,
+            "length" => ScalarFunction::Length,
+            _ => {
+                return Err(ExecutionError::General(format!(
+                    "There is no built-in function named {}",
+                    name
+                )))
+            }
+        })
+    }
+}
+
+/// Returns the datatype of the scalar function
+pub fn return_type(fun: &ScalarFunction, arg_types: &Vec<DataType>) -> Result<DataType> {
+    if arg_types.len() != 1 {
+        // for now, every function expects a single argument, and thus this is enough
+        return Err(ExecutionError::General(format!(
+            "The function \"{}\" expected 1 argument, but received \"{}\"",
+            fun,
+            arg_types.len()
+        )));
+    }
+
+    // verify that this is a valid type for this function
+    coerce(fun, &arg_types[0])?;
+
+    match fun {
+        ScalarFunction::Length => Ok(DataType::UInt32),
+        _ => Ok(DataType::Float64),

Review comment:
       For now, but this is where the beauty will happen: as long as our physical plan supports it, we can return other types here.




----------------------------------------------------------------