You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/31 02:33:00 UTC

[GitHub] [arrow-datafusion] james727 opened a new pull request #1511: Implement set_agg aggregate function

james727 opened a new pull request #1511:
URL: https://github.com/apache/arrow-datafusion/pull/1511


   👋 hi all - first PR here.
   
   # Which issue does this PR close?
   This closes https://github.com/apache/arrow-datafusion/issues/1323.
   
    # Rationale for this change
   This provides an efficient way to aggregate unique values into an array. This is beneficial for aggregating low cardinality fields where `array_agg` may require significantly more memory than `set_agg`.
   
   I mainly implemented this as a way to get familiar the codebase. Though - I'm not 100% sure merging this actually makes sense if the goal of the project is to be as Postgres-like as possible. `set_agg` is supported by Presto (as linked in the issue above) and other DBMS, but Postgres neither supports `set_agg` nor `array_distinct`. The recommended approach to something like `set_agg` in Postgres is to use `array_agg`, unnest the values, select distinct, then use `array_agg` again on the output.
   
   That said - this does seem generally useful, and the Postgres approach is less efficient (and likely unworkable for certain datasets in a distributed environment). I'm interested to hear feedback from the maintainers on the above.
   
   # What changes are included in this PR?
   This includes the implementation of `set_agg` and a couple tests. It borrows heavily from the patch that implemented `array_agg`: https://github.com/apache/arrow-datafusion/pull/1300
   
   # Open questions
   There's a couple specific points I could use feedback on:
   - Using `hashbrown::HashSet` for the accumulator instead of `std::collections::HashSet` - it seems this is preferred in the codebase.
   - Tests - this was actually the most difficult part of writing this as output ordering of `set_agg` is nondeterministic. I managed to hack it together but I'm sure there's an easier way (for both integration and unit tests).
   - Documentation - In general, what docs need to be updated? And given this is a divergence from Postgres, is there anywhere specific this should be called out?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] james727 commented on pull request #1511: Implement set_agg aggregate function

Posted by GitBox <gi...@apache.org>.
james727 commented on pull request #1511:
URL: https://github.com/apache/arrow-datafusion/pull/1511#issuecomment-1003459612


   Closing this for now as upon further review, `array_agg(distinct <expr>)` should work fine for this use case, but there are some bugs in how Datafusion handles non-column expressions in this syntax. Will open another issue when I figure out more.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] james727 edited a comment on pull request #1511: Implement set_agg aggregate function

Posted by GitBox <gi...@apache.org>.
james727 edited a comment on pull request #1511:
URL: https://github.com/apache/arrow-datafusion/pull/1511#issuecomment-1003459612


   Closing this for now as upon further review, I think `array_agg(distinct <expr>)` should work fine for this use case, but there are some bugs in how Datafusion handles non-column expressions in this syntax. Will open another issue when I figure out more.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] james727 commented on pull request #1511: Implement set_agg aggregate function

Posted by GitBox <gi...@apache.org>.
james727 commented on pull request #1511:
URL: https://github.com/apache/arrow-datafusion/pull/1511#issuecomment-1004342888


   Thanks @alamb - I think I have a solution for the strange behavior with complex expressions in distinct aggregations in https://github.com/apache/arrow-datafusion/pull/1519 - if you or someone else could have a look and let me know if this makes sense or is totally off base, it would be much appreciated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #1511: Implement set_agg aggregate function

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1511:
URL: https://github.com/apache/arrow-datafusion/pull/1511#issuecomment-1003554757


   Thanks for the contribution @james727  -- let us know if we can help out in some way 👍 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] james727 closed pull request #1511: Implement set_agg aggregate function

Posted by GitBox <gi...@apache.org>.
james727 closed pull request #1511:
URL: https://github.com/apache/arrow-datafusion/pull/1511


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org