You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "rtpsw (via GitHub)" <gi...@apache.org> on 2023/04/02 15:18:04 UTC

[GitHub] [arrow] rtpsw commented on issue #34786: [C++] Output schema calculated by Substrait consumer for aggregate rel seems incorrect.

rtpsw commented on issue #34786:
URL: https://github.com/apache/arrow/issues/34786#issuecomment-1493370198

   @westonpace, I've looked into this for a while and things are confusing.
   
   First, I've found that Acero aggregation produces an output schema different than the one you expect here, and in particular one that collects measurements then keys then segment keys. Here are my observations:
   - [The `ScalarAggregateNode` class](https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L255), for which the keys are empty, gets [an output scheme](https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L377) that collects [measurements](https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L369) then [segment keys](https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L372-L373).
   - [The `GroupByNode` class](https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L533) gets [an output schema](https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L663) that collects [measurements](https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L648-L649) then [keys](https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L654) then [segment keys](https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L659).
   - A [comment in `GroupByNode`](https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L646) says: "Aggregate fields come before key fields to match the behavior of GroupBy function".
   
   Second, the [output-schema-building Arrow Substrait code](https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/engine/substrait/relation_internal.cc#L334-L357) you noted is trying to collect measurement then keys then segment keys. However, as you noted, it looks like it collects the measurements from the input schema instead of the output schema. I think the code should first construct the output schema (which involves accessing the `name` field of each `Aggregate`) and then collect from it.
   
   Third, the Substrait spec may have a different idea about the expected field order of an aggregation. Assuming so, we should probably fix Acero to use this order, especially since the `GroupBy` function mentioned in the comment [no longer exists](https://github.com/apache/arrow/issues/14866).
   
   Fourth, I suspect that a deeper reason for why we get away with such bugs is that Arrow Substrait generally doesn't care about field names, since Substrait generally doesn't (until the final output), so as long as the field type happens to be right then the test-case passes. Perhaps, besides testing after-projection, we should make the existing test-cases produce different types per field as much as possible.
   
   cc @icexelloss


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org