Posted to github@arrow.apache.org by "andygrove (via GitHub)" <gi...@apache.org> on 2023/04/10 14:16:20 UTC

[GitHub] [arrow-datafusion] andygrove opened a new issue, #5944: Fuse grouped aggregate and filter operators for improved performance

andygrove opened a new issue, #5944:
URL: https://github.com/apache/arrow-datafusion/issues/5944

   ### Is your feature request related to a problem or challenge?
   
   When we perform a grouped aggregate on a filtered input (such as with TPC-H q1), the filter operator performs two main tasks:
   
   - Evaluate the filter predicate (usually very fast)
   - Create new batches and copy over the filtered data (very slow if the filter is not very selective, as in q1)
   
   I wonder if we would see a significant performance improvement if we could avoid creating the filtered batches in this case.
   
   One idea would be to create the filtered batches by copying the arrays and mutating the validity bitmap to hide the rows that are filtered out. This would potentially change the semantics in some cases though so we can probably only do this under certain conditions.
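   A minimal sketch of where the semantics could differ, assuming `Option<f64>` as a stand-in for an Arrow value plus its validity bit. The helper names (`mask_as_nulls`, `sum`, `count_star`) are hypothetical and this is not the Arrow API. A null-skipping aggregate such as SUM gives the same answer either way, but anything that counts rows does not:

```rust
/// Hide filtered-out rows by clearing their validity bit instead of copying
/// the surviving rows. `Option<f64>` stands in for an Arrow value plus its
/// validity bit; this is an illustration, not the Arrow API.
fn mask_as_nulls(values: &[Option<f64>], keep: &[bool]) -> Vec<Option<f64>> {
    values
        .iter()
        .zip(keep)
        .map(|(v, &k)| if k { *v } else { None })
        .collect()
}

/// SUM skips nulls, so it cannot tell a hidden row from a missing value.
fn sum(values: &[Option<f64>]) -> f64 {
    values.iter().flatten().sum()
}

/// COUNT(*) counts rows regardless of nulls, so hidden rows are still counted.
fn count_star(values: &[Option<f64>]) -> usize {
    values.len()
}

fn main() {
    let values = vec![Some(1.0), Some(2.0), None, Some(4.0)];
    let keep = vec![true, false, true, true];

    // Reference result: materialize only the rows that pass the filter.
    let filtered: Vec<Option<f64>> = values
        .iter()
        .zip(&keep)
        .filter_map(|(v, &k)| if k { Some(*v) } else { None })
        .collect();
    let masked = mask_as_nulls(&values, &keep);

    // SUM agrees between the two approaches.
    assert_eq!(sum(&filtered), sum(&masked));
    // COUNT(*) does not: 3 rows after a real filter, 4 rows after masking.
    assert_ne!(count_star(&filtered), count_star(&masked));
}
```

   So the masking trick looks safe only when every downstream consumer treats a null exactly like an absent row.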
   
   Another idea is to update the aggregate logic to perform the predicate evaluation and then use the resulting bitmap to determine which rows to accumulate.
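   Roughly what I have in mind, as a toy sketch over plain slices rather than Arrow arrays. The function names `filter_then_sum` and `fused_sum` are made up for illustration and the predicate result is assumed to be already available as a `&[bool]`:

```rust
use std::collections::HashMap;

/// Baseline: copy the surviving rows into new vectors, then aggregate them.
fn filter_then_sum(keys: &[u32], values: &[f64], keep: &[bool]) -> HashMap<u32, f64> {
    let mut kept_keys = Vec::new();
    let mut kept_values = Vec::new();
    for i in 0..keys.len() {
        if keep[i] {
            kept_keys.push(keys[i]);
            kept_values.push(values[i]);
        }
    }
    let mut sums = HashMap::new();
    for (k, v) in kept_keys.iter().zip(&kept_values) {
        *sums.entry(*k).or_insert(0.0) += *v;
    }
    sums
}

/// Fused: consult the predicate result while accumulating; nothing is copied.
fn fused_sum(keys: &[u32], values: &[f64], keep: &[bool]) -> HashMap<u32, f64> {
    let mut sums = HashMap::new();
    for ((k, v), &selected) in keys.iter().zip(values).zip(keep) {
        if selected {
            *sums.entry(*k).or_insert(0.0) += *v;
        }
    }
    sums
}

fn main() {
    let keys = [1, 2, 1, 2, 3];
    let values = [10.0, 20.0, 30.0, 40.0, 50.0];
    // Predicate result, one flag per input row.
    let keep = [true, false, true, true, false];
    // Both strategies must produce the same groups and sums.
    assert_eq!(
        filter_then_sum(&keys, &values, &keep),
        fused_sum(&keys, &values, &keep)
    );
}
```

   The fused version trades the intermediate copies for a branch per row, so whether it wins in practice would need measuring.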
   
   
   
   ### Describe the solution you'd like
   
   I am working on a small prototype of this, outside of DataFusion, that I will share once the code is less embarrassing.
   
   ### Describe alternatives you've considered
   
   It would be worth seeing how other engines handle this.
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Dandandan commented on issue #5944: Fuse grouped aggregate and filter operators for improved performance

Posted by "Dandandan (via GitHub)" <gi...@apache.org>.
Dandandan commented on issue #5944:
URL: https://github.com/apache/arrow-datafusion/issues/5944#issuecomment-1501894794

   In q1 this would remove `l_shipdate` from the list of 7 columns to be copied, so I would expect a smaller improvement here.




[GitHub] [arrow-datafusion] tustvold commented on issue #5944: Fuse grouped aggregate and filter operators for improved performance

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #5944:
URL: https://github.com/apache/arrow-datafusion/issues/5944#issuecomment-1591074835

   > This would potentially change the semantics in some cases though so we can probably only do this under certain conditions
   
   I'm curious about this, in what situation would the nullability or not of a non-selected value matter? It is just going to be discarded regardless? See https://github.com/apache/arrow-rs/issues/3620




[GitHub] [arrow-datafusion] andygrove commented on issue #5944: Fuse grouped aggregate and filter operators for improved performance

Posted by "andygrove (via GitHub)" <gi...@apache.org>.
andygrove commented on issue #5944:
URL: https://github.com/apache/arrow-datafusion/issues/5944#issuecomment-1501874052

   @Dandandan @alamb wdyt?




[GitHub] [arrow-datafusion] Dandandan commented on issue #5944: Fuse grouped aggregate and filter operators for improved performance

Posted by "Dandandan (via GitHub)" <gi...@apache.org>.
Dandandan commented on issue #5944:
URL: https://github.com/apache/arrow-datafusion/issues/5944#issuecomment-1501890128

   One source of inefficiency (which should be relatively easy to change) is that we currently output the entire `RecordBatch`, including the columns that are needed only to evaluate the filter, and then throw those columns away in the following `Projection`.
   
   See https://github.com/apache/arrow-datafusion/issues/5436
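   As a rough illustration of the idea (a toy model, not DataFusion code): `Batch` here is just a vector of `i64` columns, and `filter_with_projection` is a hypothetical helper that copies only the columns the downstream operator still needs.

```rust
/// A toy batch: each inner Vec is one column (not the Arrow `RecordBatch` type).
type Batch = Vec<Vec<i64>>;

/// Copy only the rows where `keep` is true, and only the columns listed in
/// `projection`. Columns used solely by the predicate are never copied.
fn filter_with_projection(batch: &Batch, keep: &[bool], projection: &[usize]) -> Batch {
    projection
        .iter()
        .map(|&col| {
            batch[col]
                .iter()
                .zip(keep)
                .filter_map(|(v, &k)| if k { Some(*v) } else { None })
                .collect::<Vec<i64>>()
        })
        .collect()
}

fn main() {
    // Column 0 is the filter column; columns 1 and 2 feed the aggregate.
    let batch: Batch = vec![vec![5, 15, 25], vec![1, 2, 3], vec![10, 20, 30]];
    let keep: Vec<bool> = batch[0].iter().map(|&d| d <= 15).collect();
    let out = filter_with_projection(&batch, &keep, &[1, 2]);
    // The filter column itself is not part of the output.
    assert_eq!(out, vec![vec![1, 2], vec![10, 20]]);
}
```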




[GitHub] [arrow-datafusion] Dandandan commented on issue #5944: Fuse grouped aggregate and filter operators for improved performance

Posted by "Dandandan (via GitHub)" <gi...@apache.org>.
Dandandan commented on issue #5944:
URL: https://github.com/apache/arrow-datafusion/issues/5944#issuecomment-1501896826

   I recall seeing some issues/papers about a similar approach to this before, maybe those were shared by @alamb?




[GitHub] [arrow-datafusion] alamb commented on issue #5944: Fuse grouped aggregate and filter operators for improved performance

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5944:
URL: https://github.com/apache/arrow-datafusion/issues/5944#issuecomment-1502083042

   > One idea would be to create the filtered batches by copying the arrays and mutating the validity bitmap to hide the rows that are filtered out. This would potentially change the semantics in some cases though so we can probably only do this under certain conditions.
   
   I think this basic idea is called a "selection vector" in the literature -- and as you hint at, it is not quite the same as the null mask as it has different semantics.
   
   One approach might be to add another variant to the `ColumnarValue` enum that carries an additional validity mask:
   
   https://github.com/apache/arrow-datafusion/blob/bbc71692fcd8dd9f3a9686162e59d092b37031f2/datafusion/expr/src/columnar_value.rs#L33
   
   
   After @tustvold's recent work in Arrow, I think this would just be a `BooleanBuffer` (https://docs.rs/arrow/latest/arrow/buffer/struct.BooleanBuffer.html) and should be straightforward to use.
   
   To really take advantage of a selection vector, however, the underlying compute kernels need to be updated to skip the rows that the selection vector filters out (and likely only do so when the selection is sparse).
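   A hedged sketch of what a selection-aware kernel could look like, using plain slices instead of Arrow arrays. The `Selection` type, `from_mask`, and `sum_selected` are hypothetical names, and the 10% sparsity threshold is an arbitrary placeholder rather than a measured value:

```rust
/// Which rows of a batch are still "live". Hypothetical type for illustration.
enum Selection {
    /// Dense: one flag per row; reasonable when most rows survive.
    Mask(Vec<bool>),
    /// Sparse: indices of the surviving rows; reasonable when few rows survive.
    Indices(Vec<usize>),
}

impl Selection {
    /// Pick a representation based on how many rows the mask selects.
    /// The 10% threshold is a placeholder, not a tuned value.
    fn from_mask(mask: Vec<bool>) -> Self {
        let selected = mask.iter().filter(|&&m| m).count();
        if selected * 10 < mask.len() {
            let indices = mask
                .iter()
                .enumerate()
                .filter_map(|(i, &m)| if m { Some(i) } else { None })
                .collect();
            Selection::Indices(indices)
        } else {
            Selection::Mask(mask)
        }
    }
}

/// A sum kernel that reads only the selected rows, without copying the input.
fn sum_selected(values: &[f64], selection: &Selection) -> f64 {
    match selection {
        Selection::Mask(mask) => values
            .iter()
            .zip(mask)
            .filter_map(|(v, &m)| if m { Some(*v) } else { None })
            .sum(),
        Selection::Indices(indices) => indices.iter().map(|&i| values[i]).sum(),
    }
}

fn main() {
    let values: Vec<f64> = (0..20).map(|i| i as f64).collect();

    // Only row 3 survives: sparse, so the index form is chosen.
    let sparse = Selection::from_mask((0..20).map(|i| i == 3).collect());
    assert!(matches!(sparse, Selection::Indices(_)));
    assert_eq!(sum_selected(&values, &sparse), 3.0);

    // Every row except row 0 survives: dense, so the mask is kept.
    let dense = Selection::from_mask((0..20).map(|i| i != 0).collect());
    assert!(matches!(dense, Selection::Mask(_)));
    assert_eq!(sum_selected(&values, &dense), 190.0);
}
```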
   
   
   > Another idea is to update the aggregate logic to perform the predicate evaluation and then use the resulting bitmap to determine which rows to accumulate.
   
   While not exactly the same, @yjshen has been working to add filtering to the aggregate input here, which is similar: https://github.com/apache/arrow-datafusion/pull/5868
   
   




[GitHub] [arrow-datafusion] tustvold commented on issue #5944: Fuse grouped aggregate and filter operators for improved performance

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #5944:
URL: https://github.com/apache/arrow-datafusion/issues/5944#issuecomment-1502364843

   https://github.com/apache/arrow-rs/issues/3620 may be related

