You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "jiangzhx (via GitHub)" <gi...@apache.org> on 2023/03/02 03:31:22 UTC
[GitHub] [arrow-datafusion] jiangzhx opened a new issue, #5444: Expr.alias function not work with count aggregation
jiangzhx opened a new issue, #5444:
URL: https://github.com/apache/arrow-datafusion/issues/5444
**Describe the bug**
Expr.alias function not work with count aggregation,it's should be alias column name with origin column name
`datafusion_expr::count(col(f.name())).alias(f.name())
`<img width="1796" alt="image" src="https://user-images.githubusercontent.com/494507/222323287-5e0e1978-b8a6-406b-b324-c58e04c8f684.png">
physical_plan
```
ProjectionExec: expr=[7300 as COUNT(id), 7300 as COUNT(tinyint_col), 7300 as COUNT(smallint_col), 7300 as COUNT(int_col), 7300 as COUNT(bigint_col), 7300 as COUNT(float_col), 7300 as COUNT(double_col), 7300 as COUNT(date_string_col), 7300 as COUNT(string_col), 7300 as COUNT(timestamp_col), 7300 as COUNT(year), 7300 as COUNT(month)]
EmptyExec: produce_one_row=true
```
**To Reproduce**
```
#[tokio::test]
async fn count_test() -> Result<()> {
let ctx = SessionContext::new();
let testdata = datafusion::test_util::parquet_test_data();
let filename = &format!("{testdata}/alltypes_tiny_pages.parquet");
let df = ctx
.read_parquet(filename, ParquetReadOptions::default())
.await?;
let cnt = df
.clone()
.aggregate(
vec![],
df.schema()
.fields()
.iter()
.clone()
.filter(|f| !matches!(f.data_type(), DataType::Boolean))
// wrong result,alias column with count like "count(id)"
.map(|f| datafusion_expr::count(col(f.name())).alias(f.name()))
// right result,alias column with origin table name
// .map(|f| datafusion_expr::max(col(f.name())).alias(f.name()))
// right result,alias column with origin table name
// .map(|f| datafusion_expr::count_distinct(col(f.name())).alias(f.name()))
.collect::<Vec<_>>(),
)
.unwrap();
print_batches(&cnt.clone().collect().await.unwrap()).expect("panic message");
let plan = cnt.create_physical_plan().await.unwrap();
let displayable_plan = displayable(plan.as_ref());
let plan_string = format!("{}", displayable_plan.indent());
println!("{}", plan_string);
Ok(())
}
```
**Expected behavior**
right result,alias column name with origin column name
`.map(|f| datafusion_expr::max(col(f.name())).alias(f.name()))`
<img width="1456" alt="image" src="https://user-images.githubusercontent.com/494507/222323990-fddf64de-b43c-4f9c-909b-e8b45b080832.png">
physical_plan
```
AggregateExec: mode=Final, gby=[], aggr=[id, tinyint_col, smallint_col, int_col, bigint_col, float_col, double_col, date_string_col, string_col, timestamp_col, year, month]
AggregateExec: mode=Partial, gby=[], aggr=[id, tinyint_col, smallint_col, int_col, bigint_col, float_col, double_col, date_string_col, string_col, timestamp_col, year, month]
ParquetExec: limit=None, partitions={1 group: [[Users/sylar/workspace/opensource/arrow-datafusion/parquet-testing/data/alltypes_tiny_pages.parquet]]}, projection=[id, tinyint_col, smallint_col, int_col, bigint_col, float_col, double_col, date_string_col, string_col, timestamp_col, year, month]
```
**Additional context**
Add any other context about the problem here.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] jiangzhx commented on issue #5444: Expr.alias function not work with count aggregation
Posted by "jiangzhx (via GitHub)" <gi...@apache.org>.
jiangzhx commented on issue #5444:
URL: https://github.com/apache/arrow-datafusion/issues/5444#issuecomment-1451745719
found some thing not right.
https://github.com/apache/arrow-datafusion/blob/f68214dc6553ede5f23ba54cac6150f4c1e18e8c/datafusion/core/src/execution/context.rs#L1566
disable AggregateStatistics optimizer will return right physical plan.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] jiangzhx commented on issue #5444: Expr.alias function not work with count aggregation
Posted by "jiangzhx (via GitHub)" <gi...@apache.org>.
jiangzhx commented on issue #5444:
URL: https://github.com/apache/arrow-datafusion/issues/5444#issuecomment-1451815153
more and more research.
https://github.com/apache/arrow-datafusion/blob/f68214dc6553ede5f23ba54cac6150f4c1e18e8c/datafusion/core/src/physical_optimizer/aggregate_statistics.rs#L176-L180
look like the reason is :
`alltypes_plain.parquet` data without stats.
`alltypes_plain.alltypes_tiny_pages` data with stats.
so AggregateStatistics will work with `alltypes_plain.alltypes_tiny_pages` not work with `alltypes_plain.parquet`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] jiangzhx commented on issue #5444: Expr.alias function not work with count aggregation
Posted by "jiangzhx (via GitHub)" <gi...@apache.org>.
jiangzhx commented on issue #5444:
URL: https://github.com/apache/arrow-datafusion/issues/5444#issuecomment-1451282864
more info,difference data file has difference physical plan
alltypes_plain.parquet work well.
```
AggregateExec: mode=Final, gby=[], aggr=[id, tinyint_col, smallint_col, int_col, bigint_col, float_col, double_col, date_string_col, string_col, timestamp_col]
AggregateExec: mode=Partial, gby=[], aggr=[id, tinyint_col, smallint_col, int_col, bigint_col, float_col, double_col, date_string_col, string_col, timestamp_col]
ParquetExec: limit=None, partitions={1 group: [[Users/sylar/workspace/opensource/arrow-datafusion/parquet-testing/data/alltypes_plain.parquet]]}, projection=[id, tinyint_col, smallint_col, int_col, bigint_col, float_col, double_col, date_string_col, string_col, timestamp_col]
```
alltypes_tiny_pages.parquet not right.
```
ProjectionExec: expr=[7300 as COUNT(id), 7300 as COUNT(tinyint_col), 7300 as COUNT(smallint_col), 7300 as COUNT(int_col), 7300 as COUNT(bigint_col), 7300 as COUNT(float_col), 7300 as COUNT(double_col), 7300 as COUNT(date_string_col), 7300 as COUNT(string_col), 7300 as COUNT(timestamp_col), 7300 as COUNT(year), 7300 as COUNT(month)]
EmptyExec: produce_one_row=true
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] jiangzhx commented on issue #5444: Expr.alias function not work with count aggregation
Posted by "jiangzhx (via GitHub)" <gi...@apache.org>.
jiangzhx commented on issue #5444:
URL: https://github.com/apache/arrow-datafusion/issues/5444#issuecomment-1451841791
The following code can resolve my problem.
but anyone can give some good suggestion to make code look nice.
```
#[tokio::test]
async fn count_df() -> Result<()> {
let ctx = SessionContext::new();
let testdata = datafusion::test_util::parquet_test_data();
let filename = &format!("{testdata}/alltypes_tiny_pages.parquet");
let df = ctx
.read_parquet(filename, ParquetReadOptions::default())
.await?;
let cnt = df
.clone()
.aggregate(
vec![],
vec![
datafusion_expr::count(col("id")),
datafusion_expr::count(col("bool_col")),
],
)
.unwrap();
let cnt = cnt
.clone()
.select(
cnt.schema()
.fields()
.iter()
.zip(df.schema().fields())
.map(|(count_field, orgin_field)| {
col(count_field.name()).alias(orgin_field.name())
})
.collect::<Vec<_>>(),
)
.unwrap();
let physical_plan = ctx
.state()
.create_physical_plan(&cnt.logical_plan())
.await
.unwrap();
println!("{}", displayable(physical_plan.as_ref()).indent());
let batch = collect(physical_plan, ctx.task_ctx()).await.unwrap();
print_batches(&*batch);
Ok(())
}
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb closed issue #5444: Expr.alias function not work with count aggregation
Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb closed issue #5444: Expr.alias function not work with count aggregation
URL: https://github.com/apache/arrow-datafusion/issues/5444
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] Jefffrey commented on issue #5444: Expr.alias function not work with count aggregation
Posted by "Jefffrey (via GitHub)" <gi...@apache.org>.
Jefffrey commented on issue #5444:
URL: https://github.com/apache/arrow-datafusion/issues/5444#issuecomment-1452440185
Yeah the reason is that `aggregate_statistics` physical optimizer rule sets the name of the output column like so:
https://github.com/apache/arrow-datafusion/blob/f68214dc6553ede5f23ba54cac6150f4c1e18e8c/datafusion/core/src/physical_optimizer/aggregate_statistics.rs#L192-L196
Ideally would take `casted_expr.name()` instead which would contain the alias you gave it. However what I've noticed is that it can contain the table as part of the name. So if it's switched to `casted_expr.name()` then following code:
```rust
ctx
.table("p")
.await?
.aggregate(vec![], vec![count(col("id"))])?
.show().await?;
```
Will output:
```
+-------------+
| COUNT(p.id) |
+-------------+
| 7300 |
+-------------+
```
Note the `p.id` instead of just `id`.
P.S. this affects the other aggregate functions too, e.g.
https://github.com/apache/arrow-datafusion/blob/f68214dc6553ede5f23ba54cac6150f4c1e18e8c/datafusion/core/src/physical_optimizer/aggregate_statistics.rs#L224
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org