You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/06/02 11:55:28 UTC

[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #2674: Fix `AggregateStatistics` optimization so it doesn't change output type

tustvold commented on code in PR #2674:
URL: https://github.com/apache/arrow-datafusion/pull/2674#discussion_r887861790


##########
datafusion/core/src/physical_optimizer/aggregate_statistics.rs:
##########
@@ -293,38 +297,80 @@ mod tests {
     /// Checks that the count optimization was applied and we still get the right result
     async fn assert_count_optim_success(plan: AggregateExec, nulls: bool) -> Result<()> {
         let session_ctx = SessionContext::new();
-        let task_ctx = session_ctx.task_ctx();
         let conf = session_ctx.copied_config();
-        let optimized = AggregateStatistics::new().optimize(Arc::new(plan), &conf)?;
+        let plan = Arc::new(plan) as _;
+        let optimized = AggregateStatistics::new().optimize(Arc::clone(&plan), &conf)?;
 
         let (col, count) = match nulls {

Review Comment:
   I was very confused by what this parameter controls, should it not be something like `column: Option<&str>` instead?



##########
datafusion/core/src/physical_optimizer/aggregate_statistics.rs:
##########
@@ -293,38 +297,80 @@ mod tests {
     /// Checks that the count optimization was applied and we still get the right result
     async fn assert_count_optim_success(plan: AggregateExec, nulls: bool) -> Result<()> {
         let session_ctx = SessionContext::new();
-        let task_ctx = session_ctx.task_ctx();
         let conf = session_ctx.copied_config();
-        let optimized = AggregateStatistics::new().optimize(Arc::new(plan), &conf)?;
+        let plan = Arc::new(plan) as _;
+        let optimized = AggregateStatistics::new().optimize(Arc::clone(&plan), &conf)?;
 
         let (col, count) = match nulls {
-            false => (Field::new("COUNT(UInt8(1))", DataType::UInt64, false), 3),
-            true => (Field::new("COUNT(a)", DataType::UInt64, false), 2),
+            false => (Field::new(COUNT_STAR_NAME, DataType::Int64, false), 3),
+            true => (Field::new("COUNT(a)", DataType::Int64, false), 2),
         };
 
         // A ProjectionExec is a sign that the count optimization was applied
         assert!(optimized.as_any().is::<ProjectionExec>());
+        let task_ctx = session_ctx.task_ctx();
         let result = common::collect(optimized.execute(0, task_ctx)?).await?;
         assert_eq!(result[0].schema(), Arc::new(Schema::new(vec![col])));
         assert_eq!(
             result[0]
                 .column(0)
                 .as_any()
-                .downcast_ref::<UInt64Array>()
+                .downcast_ref::<Int64Array>()
                 .unwrap()
                 .values(),
             &[count]
         );
+
+        // Validate that the optimized plan returns the exact same
+        // answer (both schema and data) as the original plan
+        let task_ctx = session_ctx.task_ctx();
+        let plan_result = common::collect(plan.execute(0, task_ctx)?).await?;
+        assert_eq!(normalize(result), normalize(plan_result));

Review Comment:
   A couple of lines above there is
   
   ```
   assert_eq!(result[0].schema(), Arc::new(Schema::new(vec![col])));
   ```
   
   This would suggest to me that the result has a single column and a single field. Perhaps we could just do something like
   
   ```
   let expected_a_schema = ..;
   let expected_b_schema = ..;
   for (a, b) in result.iter().zip(plan_result) {
     assert_eq!(a.column(0), b.column(0);
     assert_eq!(a.schema(), expected_a_schema);
     assert_eq!(b.schema(), expected_b_schema);
   }
   ```
   
   I think the normalization logic is a little bit hard to follow...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org