You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/08/10 21:23:54 UTC

[GitHub] [arrow-datafusion] alamb opened a new issue #850: Optimize hash_aggregate when there are no null group keys

alamb opened a new issue #850:
URL: https://github.com/apache/arrow-datafusion/issues/850

**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
The code in hash_aggregate.rs is general and works for data with and without nulls. However there are optimizations that can be done. One such optimization is suggested by @andygrove and @Dandandan on https://github.com/apache/arrow-datafusion/pull/844#discussion_r686066032, namely add an optimized code path when there are no NULL values in the input groups that will avoid the cost of checking for null on each group.

While this might sound trivial the null check is on the hot path (done for every single row that is grouped) so removing it may improve performance by a measurable amount.

**Describe the solution you'd like**
1. A new function or parameter in `ScalarVaue::eq_array` (e.g. `ScalarValue::eq_array_non_null`) that assumes the input has no nulls and does not check `Array::is_valid`
2. A check in hash_aggregate if the null count in all group columns is 0 and invokes the specialized version of ScalarValue::eq_array_non_null if so
3. Some sort of performance benchmark results showing that it improves grouping performance (there is a list of benchmarks on #808 that might be able to inspire you)

**Describe alternatives you've considered**
The performance benefit may not be worth the additional code complexity, but we won't know until we try

**Additional context**
Add any other context or screenshots about the feature request here.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #850: Optimize hash_aggregate when there are no null group keys

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #850:
URL: https://github.com/apache/arrow-datafusion/issues/850#issuecomment-896325453


   I think this is a reasonable first issue for someone if they are interested. The trick will be finding some benchmark where the null check matters


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #850: Optimize hash_aggregate when there are no null group keys

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #850:
URL: https://github.com/apache/arrow-datafusion/issues/850#issuecomment-900185480


   @novemberkilo  assigned 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] novemberkilo commented on issue #850: Optimize hash_aggregate when there are no null group keys

Posted by GitBox <gi...@apache.org>.

novemberkilo commented on issue #850:
URL: https://github.com/apache/arrow-datafusion/issues/850#issuecomment-899002276


   > A check in hash_aggregate if the null count in all group columns is 0
   
   What is `all group columns` referring to here please? 
   
   Perhaps is this the same as checking that for each `array` that appears on https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/hash_aggregate.rs#L371 we check that `array.null_count() == 0` 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #850: Optimize hash_aggregate when there are no null group keys

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #850:
URL: https://github.com/apache/arrow-datafusion/issues/850#issuecomment-932727745


   I think given the experience of @novemberkilo  on this issue, we can conclude this is not an easy issue (and maybe not worth doing at all) so removing the label


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb edited a comment on issue #850: Optimize hash_aggregate when there are no null group keys

Posted by GitBox <gi...@apache.org>.

alamb edited a comment on issue #850:
URL: https://github.com/apache/arrow-datafusion/issues/850#issuecomment-899812406


   Thanks @novemberkilo 
   
   > Perhaps is this the same as checking that for each array that appears on https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/hash_aggregate.rs#L371 we check that array.null_count() == 0
   
   Yes, I think that is right. So rather than
   
   ```rust
               group_values
                   .iter()
                   .zip(group_state.group_by_values.iter())
                   .all(|(array, scalar)| scalar.eq_array(array, row))
   ```
   The idea would be to write something like this
   
   ```rust
               group_values
                   .iter()
                   .zip(group_state.group_by_values.iter())
                   .all(|(array, scalar)| { 
                     if array.null_count > 0 { 
                       scalar.eq_array(array, row))
                     } else  { 
                       scalar.eq_array_no_nulls(array, row))
                     } 
                   }) 
   ```
   
   But `ScalarValue::eq_array_no_nulls` does not exist yet -- you would have to write it / test it
   
   Although now on second thought I think the `if` needs to be hoisted out of the loop:
   
   ```rust
             if (array.null_count() > 0) {
               group_values
                   .iter()
                   .zip(group_state.group_by_values.iter())
                   .all(|(array, scalar)| scalar.eq_array(array, row))
            } else {
               // special case no null values
               group_values
                   .iter()
                   .zip(group_state.group_by_values.iter())
                   .all(|(array, scalar)| scalar.eq_array_no_nulls(array, row))
            }
   ```
   
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] novemberkilo commented on issue #850: Optimize hash_aggregate when there are no null group keys

Posted by GitBox <gi...@apache.org>.

novemberkilo commented on issue #850:
URL: https://github.com/apache/arrow-datafusion/issues/850#issuecomment-899849945


   Thanks - that was the direction I was headed in too. Am keen to pick this up so please assign to me as appropriate? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] novemberkilo commented on issue #850: Optimize hash_aggregate when there are no null group keys

Posted by GitBox <gi...@apache.org>.

novemberkilo commented on issue #850:
URL: https://github.com/apache/arrow-datafusion/issues/850#issuecomment-898997262


   @alamb I am interested in picking this up if this is an appropriate way to begin contributing to this project.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #850: Optimize hash_aggregate when there are no null group keys

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #850:
URL: https://github.com/apache/arrow-datafusion/issues/850#issuecomment-899812406


   Thanks @novemberkilo 
   
   > Perhaps is this the same as checking that for each array that appears on https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/hash_aggregate.rs#L371 we check that array.null_count() == 0
   
   Yes, I think that is right. So rather than
   
   ```rust
               group_values
                   .iter()
                   .zip(group_state.group_by_values.iter())
                   .all(|(array, scalar)| scalar.eq_array(array, row))
   ```
   The idea would be to write something like this
   
   ```rust
               group_values
                   .iter()
                   .zip(group_state.group_by_values.iter())
                   .all(|(array, scalar)| { 
                     if array.null_count > 0 { 
                       scalar.eq_array(array, row))
                     } else  { 
                       scalar.eq_array_no_nulls(array, row))
                     } 
   ```
   
   But `ScalarValue::eq_array_no_nulls` does not exist yet -- you would have to write it / test it


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb edited a comment on issue #850: Optimize hash_aggregate when there are no null group keys

Posted by GitBox <gi...@apache.org>.

alamb edited a comment on issue #850:
URL: https://github.com/apache/arrow-datafusion/issues/850#issuecomment-899812406


   Thanks @novemberkilo 
   
   > Perhaps is this the same as checking that for each array that appears on https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/hash_aggregate.rs#L371 we check that array.null_count() == 0
   
   Yes, I think that is right. So rather than
   
   ```rust
               group_values
                   .iter()
                   .zip(group_state.group_by_values.iter())
                   .all(|(array, scalar)| scalar.eq_array(array, row))
   ```
   The idea would be to write something like this
   
   ```rust
               group_values
                   .iter()
                   .zip(group_state.group_by_values.iter())
                   .all(|(array, scalar)| { 
                     if array.null_count > 0 { 
                       scalar.eq_array(array, row))
                     } else  { 
                       scalar.eq_array_no_nulls(array, row))
                     } 
                   }) 
   ```
   
   But `ScalarValue::eq_array_no_nulls` does not exist yet -- you would have to write it / test it


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org