You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "jiangzhx (via GitHub)" <gi...@apache.org> on 2023/03/09 04:45:43 UTC
[GitHub] [arrow-datafusion] jiangzhx opened a new pull request, #5518: Cannot Count(Expr:Wildcard) with DataFrame API
jiangzhx opened a new pull request, #5518:
URL: https://github.com/apache/arrow-datafusion/pull/5518
# Which issue does this PR close?
Closes #5743.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #5518: Cannot Count(Expr:Wildcard) with DataFrame API
Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #5518:
URL: https://github.com/apache/arrow-datafusion/pull/5518#discussion_r1140068980
##########
datafusion/core/tests/dataframe.rs:
##########
@@ -35,6 +36,58 @@ use datafusion::{assert_batches_eq, assert_batches_sorted_eq};
use datafusion_expr::expr::{GroupingSet, Sort};
use datafusion_expr::{avg, col, count, lit, max, sum, Expr, ExprSchemable};
+#[tokio::test]
+async fn count_wildcard() -> Result<()> {
+ let ctx = SessionContext::new();
+ let testdata = datafusion::test_util::parquet_test_data();
+
+ ctx.register_parquet(
+ "alltypes_tiny_pages",
+ &format!("{testdata}/alltypes_tiny_pages.parquet"),
+ ParquetReadOptions::default(),
+ )
+ .await?;
+
+ let sql_results = ctx
+ .sql("select count(*) from alltypes_tiny_pages")
+ .await?
+ .explain(false, false)?
+ .collect()
+ .await?;
+
+ let df_results = ctx
+ .table("alltypes_tiny_pages")
+ .await?
+ .aggregate(vec![], vec![count(Expr::Wildcard)])?
+ .explain(false, false)
+ .unwrap()
+ .collect()
+ .await?;
+
+ //make sure sql plan same with df plan
+ assert_eq!(
Review Comment:
👍
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] jiangzhx commented on pull request #5518: Cannot Count(Expr:Wildcard) with DataFrame API
Posted by "jiangzhx (via GitHub)" <gi...@apache.org>.
jiangzhx commented on PR #5518:
URL: https://github.com/apache/arrow-datafusion/pull/5518#issuecomment-1473133081
try adding AnalyzerRule #5627 to resolve this bug
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] jiangzhx commented on pull request #5518: Cannot Count(Expr:Wildcard) with DataFrame API
Posted by "jiangzhx (via GitHub)" <gi...@apache.org>.
jiangzhx commented on PR #5518:
URL: https://github.com/apache/arrow-datafusion/pull/5518#issuecomment-1467436935
As per everyone's suggestions, I made some modifications.
Can @mingmwang and @Jefffrey take a look and help with code review when you have time.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on pull request #5518: Cannot Count(Expr:Wildcard) with DataFrame API
Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #5518:
URL: https://github.com/apache/arrow-datafusion/pull/5518#issuecomment-1474831821
Closing in favor of https://github.com/apache/arrow-datafusion/pull/5518 (so it is clear which we are trying to merge)
Thank you @jiangzhx and @Jefffrey
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb closed pull request #5518: Cannot Count(Expr:Wildcard) with DataFrame API
Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb closed pull request #5518: Cannot Count(Expr:Wildcard) with DataFrame API
URL: https://github.com/apache/arrow-datafusion/pull/5518
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] jiangzhx commented on pull request #5518: Cannot Count(Expr:Wildcard) with DataFrame API
Posted by "jiangzhx (via GitHub)" <gi...@apache.org>.
jiangzhx commented on PR #5518:
URL: https://github.com/apache/arrow-datafusion/pull/5518#issuecomment-1469621120
> I think such logic should be moved to the Analyzer.
@mingmwang
I understand your architecture design now when I see your PR #5570. After the PR is merged, I will migrate the current logic into your solution.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] jiangzhx commented on pull request #5518: Cannot Count(Expr:Wildcard) with DataFrame API
Posted by "jiangzhx (via GitHub)" <gi...@apache.org>.
jiangzhx commented on PR #5518:
URL: https://github.com/apache/arrow-datafusion/pull/5518#issuecomment-1466138375
>
thanks @mingmwang , as @Jefffrey mention
SQL substitute function is here:
https://github.com/apache/arrow-datafusion/blob/e9852074bacd8c891d84eba38b3417aa16a2d18c/datafusion/sql/src/expr/function.rs#L193-L218
but use dafaframe will got error
```
ctx.sql("select count(*) from alltypes_plain")
.await?
.explain(false, false)?
.show()
.await?;
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] Jefffrey commented on a diff in pull request #5518: Cannot Count(Expr:Wildcard) with DataFrame API
Posted by "Jefffrey (via GitHub)" <gi...@apache.org>.
Jefffrey commented on code in PR #5518:
URL: https://github.com/apache/arrow-datafusion/pull/5518#discussion_r1135089474
##########
datafusion/expr/src/logical_plan/builder.rs:
##########
@@ -986,6 +994,46 @@ impl LogicalPlanBuilder {
}
}
+//handle Count(Expr:Wildcard) with DataFrame API
+pub fn handle_wildcard(exprs: Vec<Expr>) -> Result<Vec<Expr>> {
+ let exprs: Vec<Expr> = exprs
+ .iter()
+ .map(|expr| {
+ if let Expr::AggregateFunction(AggregateFunction {
+ fun,
+ args,
+ distinct,
+ filter,
+ }) = expr
+ {
+ if let aggregate_function::AggregateFunction::Count = fun {
+ if args.len() == 1 {
+ let arg = args.get(0).unwrap().clone();
+ match arg {
+ Expr::Wildcard => {
+ Expr::AggregateFunction(AggregateFunction {
+ fun: fun.clone(),
+ args: vec![lit(ScalarValue::UInt8(Some(1)))],
+ distinct: *distinct,
+ filter: filter.clone(),
+ })
+ }
+ _ => expr.clone(),
+ }
+ } else {
+ expr.clone()
+ }
+ } else {
+ expr.clone()
+ }
+ } else {
+ expr.clone()
+ }
+ })
+ .collect();
+ Ok(exprs)
+}
Review Comment:
the extreme nesting is not ideal, i think can simplify to this:
```suggestion
//handle Count(Expr:Wildcard) with DataFrame API
pub fn handle_wildcard(exprs: Vec<Expr>) -> Result<Vec<Expr>> {
let exprs: Vec<Expr> = exprs
.iter()
.map(|expr| match expr {
Expr::AggregateFunction(AggregateFunction {
fun: aggregate_function::AggregateFunction::Count,
args,
distinct,
filter,
}) if args.len() == 1 => match args[0] {
Expr::Wildcard => Expr::AggregateFunction(AggregateFunction {
fun: aggregate_function::AggregateFunction::Count,
// TODO: replace with the constant
args: vec![lit(ScalarValue::UInt8(Some(1)))],
distinct: *distinct,
filter: filter.clone(),
}),
_ => expr.clone(),
},
_ => expr.clone(),
})
.collect();
Ok(exprs)
}
```
unsure if can simplify more, feel free to explore that
also check my TODO comment on the constant (i think i mention it in the original issue)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] mingmwang commented on pull request #5518: Cannot Count(Expr:Wildcard) with DataFrame API
Posted by "mingmwang (via GitHub)" <gi...@apache.org>.
mingmwang commented on PR #5518:
URL: https://github.com/apache/arrow-datafusion/pull/5518#issuecomment-1465865498
I think such logic should be moved to the Analyzer.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org