You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "icexelloss (via GitHub)" <gi...@apache.org> on 2023/04/02 15:43:01 UTC
[GitHub] [arrow] icexelloss commented on issue #34786: [C++] Output schema calculated by Substrait consumer for aggregate rel seems incorrect.

icexelloss commented on issue #34786:
URL: https://github.com/apache/arrow/issues/34786#issuecomment-1493375560

   Can we simply standardize the output column ordering to be segment keys +
   group by keys + measurements?
   
   On Sun, Apr 2, 2023 at 11:18 AM rtpsw ***@***.***> wrote:
   
   > @westonpace <https://github.com/westonpace>, I've looked into this for a
   > while and things are confusing.
   >
   > First, I've found that Acero aggregation produces an output schema
   > different than the one you expect here, and in particular one that collects
   > measurements then keys then segment keys. Here are my observations:
   >
   >    - The ScalarAggregateNode class
   >    <https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L255>,
   >    for which the keys are empty, gets an output scheme
   >    <https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L377>
   >    that collects measurements
   >    <https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L369>
   >    then segment keys
   >    <https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L372-L373>
   >    .
   >    - The GroupByNode class
   >    <https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L533>
   >    gets an output schema
   >    <https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L663>
   >    that collects measurements
   >    <https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L648-L649>
   >    then keys
   >    <https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L654>
   >    then segment keys
   >    <https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L659>
   >    .
   >    - A comment in GroupByNode
   >    <https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L646>
   >    says: "Aggregate fields come before key fields to match the behavior of
   >    GroupBy function".
   >
   > Second, the output-schema-building Arrow Substrait code
   > <https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/engine/substrait/relation_internal.cc#L334-L357>
   > you noted is trying to collect measurement then keys then segment keys.
   > However, as you noted, it looks like it collects the measurements from the
   > input schema instead of the output schema. I think the code should first
   > construct the output schema (which involves accessing the name field of
   > each Aggregate) and then collect from it.
   >
   > Third, the Substrait spec may have a different idea about the expected
   > field order of an aggregation. Assuming so, we should probably fix Acero to
   > use this order, especially since the GroupBy function mentioned in the
   > comment no longer exists <https://github.com/apache/arrow/issues/14866>.
   >
   > Fourth, I suspect that a deeper reason for why we get away with such bugs
   > is that Arrow Substrait generally doesn't care about field names, since
   > Substrait generally doesn't (until the final output), so as long as the
   > field type happens to be right then the test-case passes. Perhaps, besides
   > testing after-projection, we should make the existing test-cases produce
   > different types per field as much as possible.
   >
   > cc @icexelloss <https://github.com/icexelloss>
   >
   > —
   > Reply to this email directly, view it on GitHub
   > <https://github.com/apache/arrow/issues/34786#issuecomment-1493370198>,
   > or unsubscribe
   > <https://github.com/notifications/unsubscribe-auth/AAGBXLFUCGOFVFMU23ZYMQLW7GKDLANCNFSM6AAAAAAWMXR3GU>
   > .
   > You are receiving this because you were mentioned.Message ID:
   > ***@***.***>
   >
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org