You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by "jiangzhx (via GitHub)" <gi...@apache.org> on 2023/03/02 03:31:22 UTC

[GitHub] [arrow-datafusion] jiangzhx opened a new issue, #5444: Expr.alias function not work with count aggregation

jiangzhx opened a new issue, #5444:
URL: https://github.com/apache/arrow-datafusion/issues/5444

   **Describe the bug**
   Expr.alias function not work with count aggregation,it's should be alias column name with origin column name
   `datafusion_expr::count(col(f.name())).alias(f.name())
   `<img width="1796" alt="image" src="https://user-images.githubusercontent.com/494507/222323287-5e0e1978-b8a6-406b-b324-c58e04c8f684.png">
   
   physical_plan
   ```
   ProjectionExec: expr=[7300 as COUNT(id), 7300 as COUNT(tinyint_col), 7300 as COUNT(smallint_col), 7300 as COUNT(int_col), 7300 as COUNT(bigint_col), 7300 as COUNT(float_col), 7300 as COUNT(double_col), 7300 as COUNT(date_string_col), 7300 as COUNT(string_col), 7300 as COUNT(timestamp_col), 7300 as COUNT(year), 7300 as COUNT(month)]
     EmptyExec: produce_one_row=true
   
   ```
   
   
   **To Reproduce**
   
   ```
   #[tokio::test]
   async fn count_test() -> Result<()> {
       let ctx = SessionContext::new();
       let testdata = datafusion::test_util::parquet_test_data();
       let filename = &format!("{testdata}/alltypes_tiny_pages.parquet");
       let df = ctx
           .read_parquet(filename, ParquetReadOptions::default())
           .await?;
       let cnt = df
           .clone()
           .aggregate(
               vec![],
               df.schema()
                   .fields()
                   .iter()
                   .clone()
                   .filter(|f| !matches!(f.data_type(), DataType::Boolean))
                   // wrong result,alias column with count like "count(id)"
                   .map(|f| datafusion_expr::count(col(f.name())).alias(f.name()))
                   // right result,alias column with origin table name
                   // .map(|f| datafusion_expr::max(col(f.name())).alias(f.name()))
                   // right result,alias column with origin table name
                   // .map(|f| datafusion_expr::count_distinct(col(f.name())).alias(f.name()))
                   .collect::<Vec<_>>(),
           )
           .unwrap();
       print_batches(&cnt.clone().collect().await.unwrap()).expect("panic message");
       let plan = cnt.create_physical_plan().await.unwrap();
       let displayable_plan = displayable(plan.as_ref());
       let plan_string = format!("{}", displayable_plan.indent());
       println!("{}", plan_string);
   
       Ok(())
   }
   
   ```
   **Expected behavior**
   right result,alias column name with origin column name
   `.map(|f| datafusion_expr::max(col(f.name())).alias(f.name()))`
   <img width="1456" alt="image" src="https://user-images.githubusercontent.com/494507/222323990-fddf64de-b43c-4f9c-909b-e8b45b080832.png">
   physical_plan
   ```
   AggregateExec: mode=Final, gby=[], aggr=[id, tinyint_col, smallint_col, int_col, bigint_col, float_col, double_col, date_string_col, string_col, timestamp_col, year, month]
     AggregateExec: mode=Partial, gby=[], aggr=[id, tinyint_col, smallint_col, int_col, bigint_col, float_col, double_col, date_string_col, string_col, timestamp_col, year, month]
       ParquetExec: limit=None, partitions={1 group: [[Users/sylar/workspace/opensource/arrow-datafusion/parquet-testing/data/alltypes_tiny_pages.parquet]]}, projection=[id, tinyint_col, smallint_col, int_col, bigint_col, float_col, double_col, date_string_col, string_col, timestamp_col, year, month]
   
   ```
   
   
   
   **Additional context**
   Add any other context about the problem here.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] jiangzhx commented on issue #5444: Expr.alias function not work with count aggregation

Posted by "jiangzhx (via GitHub)" <gi...@apache.org>.

jiangzhx commented on issue #5444:
URL: https://github.com/apache/arrow-datafusion/issues/5444#issuecomment-1451745719

   found some thing not right.
   https://github.com/apache/arrow-datafusion/blob/f68214dc6553ede5f23ba54cac6150f4c1e18e8c/datafusion/core/src/execution/context.rs#L1566
   
   disable AggregateStatistics optimizer will return right physical plan.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] jiangzhx commented on issue #5444: Expr.alias function not work with count aggregation

Posted by "jiangzhx (via GitHub)" <gi...@apache.org>.

jiangzhx commented on issue #5444:
URL: https://github.com/apache/arrow-datafusion/issues/5444#issuecomment-1451815153

   more and more research.
   
   https://github.com/apache/arrow-datafusion/blob/f68214dc6553ede5f23ba54cac6150f4c1e18e8c/datafusion/core/src/physical_optimizer/aggregate_statistics.rs#L176-L180
   
   look like the reason is :
   `alltypes_plain.parquet`                          data without stats.
   `alltypes_plain.alltypes_tiny_pages`     data with stats.
   
   so AggregateStatistics will work with `alltypes_plain.alltypes_tiny_pages`  not work with `alltypes_plain.parquet`     


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] jiangzhx commented on issue #5444: Expr.alias function not work with count aggregation

Posted by "jiangzhx (via GitHub)" <gi...@apache.org>.

jiangzhx commented on issue #5444:
URL: https://github.com/apache/arrow-datafusion/issues/5444#issuecomment-1451282864

    more info,difference data file has difference physical plan
   
   alltypes_plain.parquet   work well.
   ```
   AggregateExec: mode=Final, gby=[], aggr=[id, tinyint_col, smallint_col, int_col, bigint_col, float_col, double_col, date_string_col, string_col, timestamp_col]
     AggregateExec: mode=Partial, gby=[], aggr=[id, tinyint_col, smallint_col, int_col, bigint_col, float_col, double_col, date_string_col, string_col, timestamp_col]
       ParquetExec: limit=None, partitions={1 group: [[Users/sylar/workspace/opensource/arrow-datafusion/parquet-testing/data/alltypes_plain.parquet]]}, projection=[id, tinyint_col, smallint_col, int_col, bigint_col, float_col, double_col, date_string_col, string_col, timestamp_col]
   
   ```
   
   
   alltypes_tiny_pages.parquet not right.
   ```
   ProjectionExec: expr=[7300 as COUNT(id), 7300 as COUNT(tinyint_col), 7300 as COUNT(smallint_col), 7300 as COUNT(int_col), 7300 as COUNT(bigint_col), 7300 as COUNT(float_col), 7300 as COUNT(double_col), 7300 as COUNT(date_string_col), 7300 as COUNT(string_col), 7300 as COUNT(timestamp_col), 7300 as COUNT(year), 7300 as COUNT(month)]
     EmptyExec: produce_one_row=true
   
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] jiangzhx commented on issue #5444: Expr.alias function not work with count aggregation

Posted by "jiangzhx (via GitHub)" <gi...@apache.org>.

jiangzhx commented on issue #5444:
URL: https://github.com/apache/arrow-datafusion/issues/5444#issuecomment-1451841791

   The following code can resolve my problem.
   but anyone can give some good suggestion to make code look nice.
   ```
   #[tokio::test]
   async fn count_df() -> Result<()> {
       let ctx = SessionContext::new();
       let testdata = datafusion::test_util::parquet_test_data();
       let filename = &format!("{testdata}/alltypes_tiny_pages.parquet");
       let df = ctx
           .read_parquet(filename, ParquetReadOptions::default())
           .await?;
       let cnt = df
           .clone()
           .aggregate(
               vec![],
               vec![
                   datafusion_expr::count(col("id")),
                   datafusion_expr::count(col("bool_col")),
               ],
           )
           .unwrap();
       let cnt = cnt
           .clone()
           .select(
               cnt.schema()
                   .fields()
                   .iter()
                   .zip(df.schema().fields())
                   .map(|(count_field, orgin_field)| {
                       col(count_field.name()).alias(orgin_field.name())
                   })
                   .collect::<Vec<_>>(),
           )
           .unwrap();
   
       let physical_plan = ctx
           .state()
           .create_physical_plan(&cnt.logical_plan())
           .await
           .unwrap();
       println!("{}", displayable(physical_plan.as_ref()).indent());
   
       let batch = collect(physical_plan, ctx.task_ctx()).await.unwrap();
   
       print_batches(&*batch);
       Ok(())
   }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb closed issue #5444: Expr.alias function not work with count aggregation

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb closed issue #5444: Expr.alias function not work with count aggregation
URL: https://github.com/apache/arrow-datafusion/issues/5444


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Jefffrey commented on issue #5444: Expr.alias function not work with count aggregation

Posted by "Jefffrey (via GitHub)" <gi...@apache.org>.

Jefffrey commented on issue #5444:
URL: https://github.com/apache/arrow-datafusion/issues/5444#issuecomment-1452440185

   Yeah the reason is that `aggregate_statistics` physical optimizer rule sets the name of the output column like so:
   
   https://github.com/apache/arrow-datafusion/blob/f68214dc6553ede5f23ba54cac6150f4c1e18e8c/datafusion/core/src/physical_optimizer/aggregate_statistics.rs#L192-L196
   
   Ideally would take `casted_expr.name()` instead which would contain the alias you gave it. However what I've noticed is that it can contain the table as part of the name. So if it's switched to `casted_expr.name()` then following code:
   
   ```rust
   ctx
       .table("p")
       .await?
       .aggregate(vec![], vec![count(col("id"))])?
       .show().await?;
   ```
   
   Will output:
   
   ```
   +-------------+
   | COUNT(p.id) |
   +-------------+
   | 7300        |
   +-------------+
   ```
   
   Note the `p.id` instead of just `id`.
   
   P.S. this affects the other aggregate functions too, e.g.
   
   https://github.com/apache/arrow-datafusion/blob/f68214dc6553ede5f23ba54cac6150f4c1e18e8c/datafusion/core/src/physical_optimizer/aggregate_statistics.rs#L224


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org