You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/08/16 10:39:16 UTC

[GitHub] [arrow] jorgecarleitao opened a new pull request #7974: ARROW-9756: [Rust] [DataFusion] Added support for scalar UDFs of arbitrary types

jorgecarleitao opened a new pull request #7974:
URL: https://github.com/apache/arrow/pull/7974


   This PR is done on top of #7971 , 
   
   Its current runtime consequence is that all our math functions now return float32 or float64 depending on their incoming column (and use float32 for other numeric types). Its API consequences is that it allows to register UDFs of variable return types.
   
   This PR hits both physical plans and logical plans, and abstracts a bit the typing of some of our structs. In particular, it introduces a new trait that only cares about the return datatype of an object (PhysicalExpr and LogicalExpr).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #7974: ARROW-9756: [Rust] [DataFusion] Added support for scalar UDFs of arbitrary return types

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #7974:
URL: https://github.com/apache/arrow/pull/7974#discussion_r471097263



##########
File path: rust/datafusion/src/execution/physical_plan/udf.rs
##########
@@ -87,7 +95,7 @@ impl Debug for ScalarFunctionExpr {
             .field("fun", &"<FUNC>")
             .field("name", &self.name)
             .field("args", &self.args)
-            .field("return_type", &self.return_type)
+            //.field("return_type", &self.return_type)

Review comment:
       todo




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #7974: ARROW-9756: [Rust] [DataFusion] Added support for scalar UDFs of arbitrary return types

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7974:
URL: https://github.com/apache/arrow/pull/7974#issuecomment-674510841


   https://issues.apache.org/jira/browse/ARROW-9756


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #7974: ARROW-9756: [Rust] [DataFusion] Added support for scalar UDFs of arbitrary return types

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #7974:
URL: https://github.com/apache/arrow/pull/7974#discussion_r471718884



##########
File path: rust/datafusion/src/sql/planner.rs
##########
@@ -476,70 +478,50 @@ impl<S: SchemaProvider> SqlToRel<S> {
             }
 
             SQLExpr::Function(function) => {
-                //TODO: fix this hack
-                let name: String = function.name.to_string();
-                match name.to_lowercase().as_ref() {
-                    "min" | "max" | "sum" | "avg" => {
-                        let rex_args = function
-                            .args
-                            .iter()
-                            .map(|a| self.sql_to_rex(a, schema))
-                            .collect::<Result<Vec<Expr>>>()?;
-
-                        // return type is same as the argument type for these aggregate
-                        // functions
-                        let return_type = rex_args[0].get_type(schema)?.clone();
-
-                        Ok(Expr::AggregateFunction {
-                            name: name.clone(),
-                            args: rex_args,
-                            return_type,
-                        })
-                    }
-                    "count" => {
-                        let rex_args = function
-                            .args
-                            .iter()
-                            .map(|a| match a {
-                                SQLExpr::Value(Value::Number(_)) => Ok(lit(1_u8)),
-                                SQLExpr::Wildcard => Ok(lit(1_u8)),
-                                _ => self.sql_to_rex(a, schema),
-                            })
-                            .collect::<Result<Vec<Expr>>>()?;
-
-                        Ok(Expr::AggregateFunction {
-                            name: name.clone(),
-                            args: rex_args,
-                            return_type: DataType::UInt64,
-                        })
-                    }
-                    _ => match self.schema_provider.get_function_meta(&name) {
-                        Some(fm) => {
-                            let rex_args = function
+                // make the search case-insensitive
+                let name: String = function.name.to_string().to_lowercase();
+
+                match self.schema_provider.get_function_meta(&name) {
+                    Some(fm) => {
+                        let args = if name == "count" {
+                            // optimization to avoid computing expressions

Review comment:
       This should probably be moved to another place (an optimizer?). For now I left it where it here as it was.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #7974: ARROW-9756: [Rust] [DataFusion] Added support for scalar UDFs of arbitrary return types

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #7974:
URL: https://github.com/apache/arrow/pull/7974#discussion_r471097047



##########
File path: rust/datafusion/src/logicalplan.rs
##########
@@ -1087,7 +1101,7 @@ impl LogicalPlanBuilder {
     /// Apply a projection
     pub fn project(&self, expr: Vec<Expr>) -> Result<Self> {
         let input_schema = self.plan.schema();
-        let projected_expr = if expr.contains(&Expr::Wildcard) {

Review comment:
       I had to drop partialEq for expressions, and therefore also dropped this since it was strictly not needed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #7974: ARROW-9756: [Rust] [DataFusion] Added support for scalar UDFs of arbitrary return types

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #7974:
URL: https://github.com/apache/arrow/pull/7974#discussion_r471708549



##########
File path: rust/datafusion/src/logicalplan.rs
##########
@@ -355,7 +355,7 @@ pub enum Expr {
         /// List of expressions to feed to the functions as arguments
         args: Vec<Expr>,
         /// The `DataType` the expression will yield
-        return_type: DataType,
+        return_type: ReturnType,

Review comment:
       THIS IS A MAJOR CHANGE: it makes Expr hold an `Arc<&dyn>`, which make them unserializable. I was unable to find another way, unfortunately.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #7974: ARROW-9756: [Rust] [DataFusion] Added support for scalar UDFs of arbitrary return types

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #7974:
URL: https://github.com/apache/arrow/pull/7974#issuecomment-674510486


   FYI @andygrove @alamb @houqp : this is a draft because it depends on other PRs being reviewed and accepted.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #7974: ARROW-9756: [Rust] [DataFusion] Added support for scalar UDFs of arbitrary return types

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #7974:
URL: https://github.com/apache/arrow/pull/7974#discussion_r471704229



##########
File path: rust/datafusion/src/execution/physical_plan/expressions.rs
##########
@@ -156,7 +168,7 @@ macro_rules! sum_accumulate {
 
 #[derive(Debug)]
 struct SumAccumulator {
-    sum: Option<ScalarValue>,
+    pub sum: Option<ScalarValue>,

Review comment:
       not needed?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #7974: ARROW-9756: [Rust] [DataFusion] Added support for scalar UDFs of arbitrary return types

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #7974:
URL: https://github.com/apache/arrow/pull/7974#discussion_r471097271



##########
File path: rust/datafusion/src/execution/physical_plan/udf.rs
##########
@@ -50,7 +58,7 @@ impl Debug for ScalarFunction {
         f.debug_struct("ScalarFunction")
             .field("name", &self.name)
             .field("args", &self.args)
-            .field("return_type", &self.return_type)
+            //.field("return_type", &self.return_type)

Review comment:
       todo




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #7974: ARROW-9756: [Rust] [DataFusion] Added support for scalar UDFs of arbitrary return types

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #7974:
URL: https://github.com/apache/arrow/pull/7974#discussion_r471097047



##########
File path: rust/datafusion/src/logicalplan.rs
##########
@@ -1087,7 +1101,7 @@ impl LogicalPlanBuilder {
     /// Apply a projection
     pub fn project(&self, expr: Vec<Expr>) -> Result<Self> {
         let input_schema = self.plan.schema();
-        let projected_expr = if expr.contains(&Expr::Wildcard) {

Review comment:
       I had to drop partialEq for expressions, and therefore also dropped this if that was strictly not needed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #7974: ARROW-9756: [Rust] [DataFusion] Added support for scalar UDFs of arbitrary return types

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #7974:
URL: https://github.com/apache/arrow/pull/7974#discussion_r471707581



##########
File path: rust/datafusion/src/lib.rs
##########
@@ -31,6 +31,7 @@ extern crate sqlparser;
 
 pub mod dataframe;
 pub mod datasource;
+mod datatyped;

Review comment:
       pub so that others can implement a different DataTyped?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #7974: ARROW-9756: [Rust] [DataFusion] Added support for scalar UDFs of arbitrary return types

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #7974:
URL: https://github.com/apache/arrow/pull/7974#discussion_r471708914



##########
File path: rust/datafusion/src/logicalplan.rs
##########
@@ -364,22 +364,25 @@ pub enum Expr {
         /// List of expressions to feed to the functions as arguments
         args: Vec<Expr>,
         /// The `DataType` the expression will yield
-        return_type: DataType,
+        return_type: ReturnType,
     },
     /// Wildcard
     Wildcard,
 }
 
-impl Expr {
-    /// Find the `DataType` for the expression
-    pub fn get_type(&self, schema: &Schema) -> Result<DataType> {
+impl DataTyped for Expr {
+    fn get_type(&self, schema: &Schema) -> Result<DataType> {
         match self {
             Expr::Alias(expr, _) => expr.get_type(schema),
             Expr::Column(name) => Ok(schema.field_with_name(name)?.data_type().clone()),
             Expr::Literal(l) => l.get_datatype(),
             Expr::Cast { data_type, .. } => Ok(data_type.clone()),
-            Expr::ScalarFunction { return_type, .. } => Ok(return_type.clone()),
-            Expr::AggregateFunction { return_type, .. } => Ok(return_type.clone()),
+            Expr::ScalarFunction {
+                args, return_type, ..
+            } => return_type(&args.iter().map(|x| x.as_datatyped()).collect(), schema),

Review comment:
       not very beautiful, but was unable to find a better pattern for this :(




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao closed pull request #7974: ARROW-9756: [Rust] [DataFusion] Added support for scalar UDFs of arbitrary return types

Posted by GitBox <gi...@apache.org>.
jorgecarleitao closed pull request #7974:
URL: https://github.com/apache/arrow/pull/7974


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org