You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/31 23:55:02 UTC

[GitHub] [arrow-datafusion] james727 opened a new issue #1512: Distinct aggregations work only partially

james727 opened a new issue #1512:
URL: https://github.com/apache/arrow-datafusion/issues/1512


   **Describe the bug**
   It seems like distinct aggregations (e.g. `SELECT SUM(DISTINCT <expr>) ...`, `SELECT ARRAY_AGG(DISTINCT <expr>)...`) are only partially supported by DataFusion. In specific circumstances they work as expected, and in other cases they throw errors. 
   
   Here's an example of a simple query that works as intended - to repro, add the following test case into `datafusion/tests/sql/aggregates.rs`:
   ```rust
   #[tokio::test]
   async fn csv_query_array_agg_distinct() -> Result<()> {
       let mut ctx = ExecutionContext::new();
       register_aggregate_csv(&mut ctx).await?;
       let sql =
           "SELECT array_agg(DISTINCT c2) FROM aggregate_test_100";
       let actual = execute_to_batches(&mut ctx, sql).await;
       let expected = vec![
           "+-----------------------------------------------+",
           "| ARRAYAGG(DISTINCT aggregate_test_100.c2) |",
           "+-----------------------------------------------+",
           "| [4, 2, 3, 5, 1]                                                          |",
           "+-----------------------------------------------+",
       ];
       assert_batches_eq!(expected, &actual);
       Ok(())
   }
   ```
   This works (well - it fails due to nondeterministic ordering, but it works as intended).
   
   Here are some queries that fail:
   
   - `SELECT array_agg(DISTINCT 2 * c2) FROM aggregate_test_100` - Fails with `Plan("No field named 'aggregate_test_100.c2'. Valid fields are 'Int64(5) * aggregate_test_100.c2'.")`
   
   - `SELECT array_agg(DISTINCT c2), sum(distinct 5*c2) c3 FROM aggregate_test_100` - Fails with `NotImplemented("SUM(DISTINCT) aggregations are not available")`
   
   I've narrowed the issue down to the optimization logic in `single_distinct_to_groupby.rs` - in fact, if I disable this optimization rule, even the basic cases don't work correctly. This suggests that this may be working accidentally?
   
   **To Reproduce**
   Repro steps above.
   
   **Expected behavior**
   Either this should be disabled entirely (similar to the `NotImplemented` error message for certain queries involving sum), or the queries above should be fixed.
   
   **Additional context**
   I found this when working through https://github.com/apache/arrow-datafusion/issues/1323 - I may be able to take a pass at fixing this, but it'd be helpful to get thoughts from people more familiar with the project on the appropriate path forward here first.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb closed issue #1512: Distinct aggregations work only partially

Posted by GitBox <gi...@apache.org>.
alamb closed issue #1512:
URL: https://github.com/apache/arrow-datafusion/issues/1512


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org