You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "jiangzhx (via GitHub)" <gi...@apache.org> on 2023/03/09 04:45:43 UTC

[GitHub] [arrow-datafusion] jiangzhx opened a new pull request, #5518: Cannot Count(Expr:Wildcard) with DataFrame API

jiangzhx opened a new pull request, #5518:
URL: https://github.com/apache/arrow-datafusion/pull/5518

   # Which issue does this PR close?
   Closes #5743.
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #5518: Cannot Count(Expr:Wildcard) with DataFrame API

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #5518:
URL: https://github.com/apache/arrow-datafusion/pull/5518#discussion_r1140068980


##########
datafusion/core/tests/dataframe.rs:
##########
@@ -35,6 +36,58 @@ use datafusion::{assert_batches_eq, assert_batches_sorted_eq};
 use datafusion_expr::expr::{GroupingSet, Sort};
 use datafusion_expr::{avg, col, count, lit, max, sum, Expr, ExprSchemable};
 
+#[tokio::test]
+async fn count_wildcard() -> Result<()> {
+    let ctx = SessionContext::new();
+    let testdata = datafusion::test_util::parquet_test_data();
+
+    ctx.register_parquet(
+        "alltypes_tiny_pages",
+        &format!("{testdata}/alltypes_tiny_pages.parquet"),
+        ParquetReadOptions::default(),
+    )
+    .await?;
+
+    let sql_results = ctx
+        .sql("select count(*) from alltypes_tiny_pages")
+        .await?
+        .explain(false, false)?
+        .collect()
+        .await?;
+
+    let df_results = ctx
+        .table("alltypes_tiny_pages")
+        .await?
+        .aggregate(vec![], vec![count(Expr::Wildcard)])?
+        .explain(false, false)
+        .unwrap()
+        .collect()
+        .await?;
+
+    //make sure sql plan same with df plan
+    assert_eq!(

Review Comment:
   👍  



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jiangzhx commented on pull request #5518: Cannot Count(Expr:Wildcard) with DataFrame API

Posted by "jiangzhx (via GitHub)" <gi...@apache.org>.
jiangzhx commented on PR #5518:
URL: https://github.com/apache/arrow-datafusion/pull/5518#issuecomment-1473133081

   try adding AnalyzerRule #5627 to resolve this bug 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jiangzhx commented on pull request #5518: Cannot Count(Expr:Wildcard) with DataFrame API

Posted by "jiangzhx (via GitHub)" <gi...@apache.org>.
jiangzhx commented on PR #5518:
URL: https://github.com/apache/arrow-datafusion/pull/5518#issuecomment-1467436935

   As per everyone's suggestions, I made some modifications. 
   Can @mingmwang  and @Jefffrey  take a look and help with code review when you have time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #5518: Cannot Count(Expr:Wildcard) with DataFrame API

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #5518:
URL: https://github.com/apache/arrow-datafusion/pull/5518#issuecomment-1474831821

   Closing in favor of https://github.com/apache/arrow-datafusion/pull/5518 (so it is clear which we are trying to merge)
   
   Thank you @jiangzhx  and @Jefffrey 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb closed pull request #5518: Cannot Count(Expr:Wildcard) with DataFrame API

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb closed pull request #5518: Cannot Count(Expr:Wildcard) with DataFrame API
URL: https://github.com/apache/arrow-datafusion/pull/5518


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jiangzhx commented on pull request #5518: Cannot Count(Expr:Wildcard) with DataFrame API

Posted by "jiangzhx (via GitHub)" <gi...@apache.org>.
jiangzhx commented on PR #5518:
URL: https://github.com/apache/arrow-datafusion/pull/5518#issuecomment-1469621120

   > I think such logic should be moved to the Analyzer.
   
   @mingmwang 
   I understand your architecture design now when I see your PR #5570. After the PR is merged, I will migrate the current logic into your solution.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jiangzhx commented on pull request #5518: Cannot Count(Expr:Wildcard) with DataFrame API

Posted by "jiangzhx (via GitHub)" <gi...@apache.org>.
jiangzhx commented on PR #5518:
URL: https://github.com/apache/arrow-datafusion/pull/5518#issuecomment-1466138375

   > 
   thanks @mingmwang ,  as @Jefffrey mention 
   SQL substitute function is here:
   https://github.com/apache/arrow-datafusion/blob/e9852074bacd8c891d84eba38b3417aa16a2d18c/datafusion/sql/src/expr/function.rs#L193-L218
   
   but  use dafaframe  will got error
   ```
       ctx.sql("select count(*) from alltypes_plain")
           .await?
           .explain(false, false)?
           .show()
           .await?;
   
   ```
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Jefffrey commented on a diff in pull request #5518: Cannot Count(Expr:Wildcard) with DataFrame API

Posted by "Jefffrey (via GitHub)" <gi...@apache.org>.
Jefffrey commented on code in PR #5518:
URL: https://github.com/apache/arrow-datafusion/pull/5518#discussion_r1135089474


##########
datafusion/expr/src/logical_plan/builder.rs:
##########
@@ -986,6 +994,46 @@ impl LogicalPlanBuilder {
     }
 }
 
+//handle Count(Expr:Wildcard) with DataFrame API
+pub fn handle_wildcard(exprs: Vec<Expr>) -> Result<Vec<Expr>> {
+    let exprs: Vec<Expr> = exprs
+        .iter()
+        .map(|expr| {
+            if let Expr::AggregateFunction(AggregateFunction {
+                fun,
+                args,
+                distinct,
+                filter,
+            }) = expr
+            {
+                if let aggregate_function::AggregateFunction::Count = fun {
+                    if args.len() == 1 {
+                        let arg = args.get(0).unwrap().clone();
+                        match arg {
+                            Expr::Wildcard => {
+                                Expr::AggregateFunction(AggregateFunction {
+                                    fun: fun.clone(),
+                                    args: vec![lit(ScalarValue::UInt8(Some(1)))],
+                                    distinct: *distinct,
+                                    filter: filter.clone(),
+                                })
+                            }
+                            _ => expr.clone(),
+                        }
+                    } else {
+                        expr.clone()
+                    }
+                } else {
+                    expr.clone()
+                }
+            } else {
+                expr.clone()
+            }
+        })
+        .collect();
+    Ok(exprs)
+}

Review Comment:
   the extreme nesting is not ideal, i think can simplify to this:
   
   ```suggestion
   //handle Count(Expr:Wildcard) with DataFrame API
   pub fn handle_wildcard(exprs: Vec<Expr>) -> Result<Vec<Expr>> {
       let exprs: Vec<Expr> = exprs
           .iter()
           .map(|expr| match expr {
               Expr::AggregateFunction(AggregateFunction {
                   fun: aggregate_function::AggregateFunction::Count,
                   args,
                   distinct,
                   filter,
               }) if args.len() == 1 => match args[0] {
                   Expr::Wildcard => Expr::AggregateFunction(AggregateFunction {
                       fun: aggregate_function::AggregateFunction::Count,
                       // TODO: replace with the constant
                       args: vec![lit(ScalarValue::UInt8(Some(1)))],
                       distinct: *distinct,
                       filter: filter.clone(),
                   }),
                   _ => expr.clone(),
               },
               _ => expr.clone(),
           })
           .collect();
       Ok(exprs)
   }
   ```
   
   unsure if can simplify more, feel free to explore that
   
   also check my TODO comment on the constant (i think i mention it in the original issue)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mingmwang commented on pull request #5518: Cannot Count(Expr:Wildcard) with DataFrame API

Posted by "mingmwang (via GitHub)" <gi...@apache.org>.
mingmwang commented on PR #5518:
URL: https://github.com/apache/arrow-datafusion/pull/5518#issuecomment-1465865498

   I think such logic should be moved to the Analyzer.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org