You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/08/18 13:30:54 UTC

[GitHub] [arrow] andygrove opened a new pull request #7988: ARROW-9783: [Rust] [DataFusion] Remove aggregate expression data type

andygrove opened a new pull request #7988:
URL: https://github.com/apache/arrow/pull/7988


   This is a step towards cleaning up the handling of aggregate expressions and acheives the following:
   
   - It is now possible to construct a logical aggregate expression without specifying a return data type, so we can now just say `min(col("foo"))` for example.
   - Removes the aggregate function methods from the `DataFrame` trait since they are no longer needed
   - Removes duplicate logic from `expr_to_field` that was determining the data type of expressions rather than just calling `expr.get_type`
   - Removes hard-coded return types from tests
   - Adds the new min, max, avg, count, sum functions to prelude.rs


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #7988: ARROW-9783: [Rust] [DataFusion] Remove aggregate expression data type

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7988:
URL: https://github.com/apache/arrow/pull/7988#issuecomment-675480647


   https://issues.apache.org/jira/browse/ARROW-9783


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove closed pull request #7988: ARROW-9783: [Rust] [DataFusion] Remove aggregate expression data type

Posted by GitBox <gi...@apache.org>.
andygrove closed pull request #7988:
URL: https://github.com/apache/arrow/pull/7988


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on a change in pull request #7988: ARROW-9783: [Rust] [DataFusion] Remove aggregate expression data type

Posted by GitBox <gi...@apache.org>.
andygrove commented on a change in pull request #7988:
URL: https://github.com/apache/arrow/pull/7988#discussion_r472217735



##########
File path: rust/datafusion/src/logicalplan.rs
##########
@@ -379,7 +349,48 @@ impl Expr {
             Expr::Literal(l) => l.get_datatype(),
             Expr::Cast { data_type, .. } => Ok(data_type.clone()),
             Expr::ScalarFunction { return_type, .. } => Ok(return_type.clone()),
-            Expr::AggregateFunction { return_type, .. } => Ok(return_type.clone()),
+            Expr::AggregateFunction { name, args, .. } => {
+                match name.to_uppercase().as_str() {
+                    "MIN" | "MAX" => args[0].get_type(schema),
+                    "SUM" => match args[0].get_type(schema)? {
+                        DataType::Int8
+                        | DataType::Int16
+                        | DataType::Int32
+                        | DataType::Int64 => Ok(DataType::Int64),
+                        DataType::UInt8
+                        | DataType::UInt16
+                        | DataType::UInt32
+                        | DataType::UInt64 => Ok(DataType::UInt64),
+                        DataType::Float32 => Ok(DataType::Float32),

Review comment:
       I don't know. I just copied over the logic from the physical implementation of these expressions for consistency. Maybe the next step here is for me to create a PR to unify the code somehow between the logical and physical expressions.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on pull request #7988: ARROW-9783: [Rust] [DataFusion] Remove aggregate expression data type

Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #7988:
URL: https://github.com/apache/arrow/pull/7988#issuecomment-675480952


   @jorgecarleitao @alamb Could I get a review please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #7988: ARROW-9783: [Rust] [DataFusion] Remove aggregate expression data type

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #7988:
URL: https://github.com/apache/arrow/pull/7988#discussion_r472213122



##########
File path: rust/datafusion/src/logicalplan.rs
##########
@@ -363,8 +335,6 @@ pub enum Expr {
         name: String,
         /// List of expressions to feed to the functions as arguments
         args: Vec<Expr>,
-        /// The `DataType` the expression will yield
-        return_type: DataType,

Review comment:
       I think (eventually) `AggregateFunction` store its return type will be important for user defined aggregates, but for now hard coding the rules to compute the output type based on the input type for built in aggregates is a good idea.
   
   Perhaps we can special case UDAs in a follow on PR with something such as `UserDefinedAggregateFunction` rather than making `AggregateFunction` work for both built in aggregates and user defined ones 

##########
File path: rust/datafusion/src/logicalplan.rs
##########
@@ -379,7 +349,48 @@ impl Expr {
             Expr::Literal(l) => l.get_datatype(),
             Expr::Cast { data_type, .. } => Ok(data_type.clone()),
             Expr::ScalarFunction { return_type, .. } => Ok(return_type.clone()),
-            Expr::AggregateFunction { return_type, .. } => Ok(return_type.clone()),
+            Expr::AggregateFunction { name, args, .. } => {
+                match name.to_uppercase().as_str() {
+                    "MIN" | "MAX" => args[0].get_type(schema),
+                    "SUM" => match args[0].get_type(schema)? {
+                        DataType::Int8
+                        | DataType::Int16
+                        | DataType::Int32
+                        | DataType::Int64 => Ok(DataType::Int64),
+                        DataType::UInt8
+                        | DataType::UInt16
+                        | DataType::UInt32
+                        | DataType::UInt64 => Ok(DataType::UInt64),
+                        DataType::Float32 => Ok(DataType::Float32),

Review comment:
       why not Float64 here (as Int32 --> Int64 for integers)?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #7988: ARROW-9783: [Rust] [DataFusion] Remove aggregate expression data type

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #7988:
URL: https://github.com/apache/arrow/pull/7988#discussion_r472221795



##########
File path: rust/datafusion/src/logicalplan.rs
##########
@@ -363,8 +335,6 @@ pub enum Expr {
         name: String,
         /// List of expressions to feed to the functions as arguments
         args: Vec<Expr>,
-        /// The `DataType` the expression will yield
-        return_type: DataType,

Review comment:
       I agree with this: we need two interfaces: one internal that supports variable types, so that we abstract out the hard-coded return types and make them functions of the expression, and a public one for users to declared fixed-type UDFs.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org