You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "allenma (via GitHub)" <gi...@apache.org> on 2023/04/07 07:11:51 UTC

[GitHub] [arrow-datafusion] allenma opened a new pull request, #5908: Count agg support multiple expressions

allenma opened a new pull request, #5908:
URL: https://github.com/apache/arrow-datafusion/pull/5908

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #5619
   
   # Rationale for this change
   Most of other sql engines like pg, mysql, spark support count for multiple expressions like: select count(col1, col2) from table1;
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   # What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   Support count for multiple expressions, when any expression return null for each row, the row will not be counted. 
   
   # Are these changes tested?
   
   <!--
   We typically require tests for all PRs in order to:
   1. Prevent the code from being accidentally broken by subsequent changes
   2. Serve as another way to document the expected behavior of the code
   
   If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
   -->
   Yes, add new ut and e2e test
   
   # Are there any user-facing changes?
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   No
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] allenma commented on a diff in pull request #5908: Count agg support multiple expressions

Posted by "allenma (via GitHub)" <gi...@apache.org>.
allenma commented on code in PR #5908:
URL: https://github.com/apache/arrow-datafusion/pull/5908#discussion_r1161051932


##########
datafusion/physical-expr/src/aggregate/count.rs:
##########
@@ -53,11 +55,43 @@ impl Count {
     ) -> Self {
         Self {
             name: name.into(),
-            expr,
+            exprs: vec![expr],
             data_type,
             nullable: true,
         }
     }
+
+    pub fn new_with_multiple_exprs(
+        exprs: Vec<Arc<dyn PhysicalExpr>>,
+        name: impl Into<String>,
+        data_type: DataType,
+    ) -> Self {
+        Self {
+            name: name.into(),
+            exprs,
+            data_type,
+            nullable: true,
+        }
+    }
+}
+
+/// count null values for multiple columns
+/// for each row if one column value is null, then null_count + 1
+fn null_count_for_multiple_cols(values: &[ArrayRef]) -> usize {

Review Comment:
   Thanks, we can refactor it in the future if we find a better way



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] allenma commented on pull request #5908: Count agg support multiple expressions

Posted by "allenma (via GitHub)" <gi...@apache.org>.
allenma commented on PR #5908:
URL: https://github.com/apache/arrow-datafusion/pull/5908#issuecomment-1501642796

   > #5619
   
   @mingmwang Currently count distinct also don't support multiple expressions, will create another pr to address this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] yahoNanJing merged pull request #5908: Count agg support multiple expressions

Posted by "yahoNanJing (via GitHub)" <gi...@apache.org>.
yahoNanJing merged PR #5908:
URL: https://github.com/apache/arrow-datafusion/pull/5908


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] ozankabak commented on pull request #5908: Count agg support multiple expressions

Posted by "ozankabak (via GitHub)" <gi...@apache.org>.
ozankabak commented on PR #5908:
URL: https://github.com/apache/arrow-datafusion/pull/5908#issuecomment-1500784203

   LGTM


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mingmwang commented on pull request #5908: Count agg support multiple expressions

Posted by "mingmwang (via GitHub)" <gi...@apache.org>.
mingmwang commented on PR #5908:
URL: https://github.com/apache/arrow-datafusion/pull/5908#issuecomment-1501453950

   @allenma 
   Would you please add one more UT to cover the count multiple distinct cases ?  You can do this in the following PR.
   
   `select count(distinct col1, col2) from demo;`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] allenma commented on pull request #5908: Count agg support multiple expressions

Posted by "allenma (via GitHub)" <gi...@apache.org>.
allenma commented on PR #5908:
URL: https://github.com/apache/arrow-datafusion/pull/5908#issuecomment-1500772310

   Thank you @ozankabak for your careful review, have addressed all your comments, please confirm


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mingmwang commented on pull request #5908: Count agg support multiple expressions

Posted by "mingmwang (via GitHub)" <gi...@apache.org>.
mingmwang commented on PR #5908:
URL: https://github.com/apache/arrow-datafusion/pull/5908#issuecomment-1501365216

   LGTM


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] ozankabak commented on a diff in pull request #5908: Count agg support multiple expressions

Posted by "ozankabak (via GitHub)" <gi...@apache.org>.
ozankabak commented on code in PR #5908:
URL: https://github.com/apache/arrow-datafusion/pull/5908#discussion_r1160969184


##########
datafusion/expr/src/type_coercion/aggregates.rs:
##########
@@ -263,6 +263,14 @@ fn check_arg_count(
                 )));
             }
         }
+        TypeSignature::VariadicAny => {
+            if input_types.is_empty() {
+                return Err(DataFusionError::Plan(format!(
+                    "The function {:?} expects at least one arguments",

Review Comment:
   ```suggestion
                       "The function {:?} expects at least one argument",
   ```



##########
datafusion/expr/src/type_coercion/functions.rs:
##########
@@ -78,6 +78,9 @@ fn get_valid_types(
                 .map(|_| current_types[0].clone())
                 .collect()]
         }
+        TypeSignature::VariadicAny => vec![(0..current_types.len())
+            .map(|i| current_types[i].clone())
+            .collect()],

Review Comment:
   Would the more simple `vec![current_types.iter().map(|c| c.clone()).collect()]` work too?



##########
datafusion/physical-expr/src/aggregate/count.rs:
##########
@@ -53,11 +55,43 @@ impl Count {
     ) -> Self {
         Self {
             name: name.into(),
-            expr,
+            exprs: vec![expr],
             data_type,
             nullable: true,
         }
     }
+
+    pub fn new_with_multiple_exprs(
+        exprs: Vec<Arc<dyn PhysicalExpr>>,
+        name: impl Into<String>,
+        data_type: DataType,
+    ) -> Self {
+        Self {
+            name: name.into(),
+            exprs,
+            data_type,
+            nullable: true,
+        }
+    }
+}
+
+/// count null values for multiple columns
+/// for each row if one column value is null, then null_count + 1
+fn null_count_for_multiple_cols(values: &[ArrayRef]) -> usize {

Review Comment:
   I have a feeling this can be done with a simpler logic but we can always refactor later.



##########
datafusion/expr/src/signature.rs:
##########
@@ -46,6 +46,8 @@ pub enum TypeSignature {
     // A function such as `array` is `VariadicEqual`
     // The first argument decides the type used for coercion
     VariadicEqual,
+    /// arbitrary number of arguments of an arbitrary type

Review Comment:
   This means to say all arguments can have different types right? I suggest:
   ```suggestion
       /// arbitrary number of arguments with arbitrary types
   ```



##########
datafusion/core/tests/sql/errors.rs:
##########
@@ -58,7 +58,7 @@ async fn test_aggregation_with_bad_arguments() -> Result<()> {
     assert_eq!(
         err,
         DataFusionError::Plan(
-            "The function Count expects 1 arguments, but 0 were provided".to_string()
+            "The function Count expects at least one arguments".to_string()

Review Comment:
   ```suggestion
               "The function Count expects at least one argument".to_string()
   ```



##########
datafusion/expr/src/signature.rs:
##########
@@ -89,6 +91,13 @@ impl Signature {
             volatility,
         }
     }
+    /// variadic_any - Creates a variadic signature that represents an arbitrary number of arguments of the any type.

Review Comment:
   ```suggestion
       /// variadic_any - Creates a variadic signature that represents an arbitrary number of arguments of any type.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org