You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@arrow.apache.org by "westonpace (via GitHub)" <gi...@apache.org> on 2023/03/30 04:30:38 UTC

[GitHub] [arrow] westonpace opened a new issue, #34786: [C++] Output schema calculated by Substrait for Aggregate rel seems incorrect.

westonpace opened a new issue, #34786:
URL: https://github.com/apache/arrow/issues/34786

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   The code calculating the output schema is here:
   
   ```
         FieldVector output_fields;
         output_fields.reserve(key_field_ids.size() + measure_size);
         // extract aggregate fields to output schema
         for (const auto& agg_src_fieldset : agg_src_fieldsets) {
           for (int field : agg_src_fieldset) {
             output_fields.emplace_back(input_schema->field(field));
           }
         }
         // extract key fields to output schema
         for (int key_field_id : key_field_ids) {
           output_fields.emplace_back(input_schema->field(key_field_id));
         }
   
         std::shared_ptr<Schema> aggregate_schema = schema(std::move(output_fields));
   ```
   
   This appears to have two issues:
   
    * It is inserting the key fields after the measure fields
    * It is inserting measure fields based on the function inputs and not the function outputs
   
   I suspect we are getting away with it in many cases because we are not applying projection / emit after the aggregate.  At the very least, we should add some test cases that do this so we can verify the output schema is correct.  If there is indeed an issue then we should also fix that.
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rtpsw commented on issue #34786: [C++] Output schema calculated by Substrait consumer for aggregate rel seems incorrect.

Posted by "rtpsw (via GitHub)" <gi...@apache.org>.
rtpsw commented on issue #34786:
URL: https://github.com/apache/arrow/issues/34786#issuecomment-1493719021

   > Can we simply standardize the output column ordering to be segment keys + group by keys + measurements?
   
   It's doable though not as simple as you might expect - updating the output column ordering would require at least the changing of the many expected outputs explicitly coded in `hash_aggregate_test.cc`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rtpsw commented on issue #34786: [C++] Output schema calculated by Substrait consumer for aggregate rel seems incorrect.

Posted by "rtpsw (via GitHub)" <gi...@apache.org>.
rtpsw commented on issue #34786:
URL: https://github.com/apache/arrow/issues/34786#issuecomment-1494766563

   > It's doable though not as simple as you might expect - updating the output column ordering would require at least the changing of the many expected outputs explicitly coded in `hash_aggregate_test.cc`.
   
   #34551


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace closed issue #34786: [C++] Output schema calculated by Substrait consumer for aggregate rel seems incorrect.

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace closed issue #34786: [C++] Output schema calculated by Substrait consumer for aggregate rel seems incorrect.
URL: https://github.com/apache/arrow/issues/34786


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] icexelloss commented on issue #34786: [C++] Output schema calculated by Substrait consumer for aggregate rel seems incorrect.

Posted by "icexelloss (via GitHub)" <gi...@apache.org>.
icexelloss commented on issue #34786:
URL: https://github.com/apache/arrow/issues/34786#issuecomment-1494748329

   > > Can you give an example of when does the output schema differs in "planning-time" vs "execution'time"?
   > 
   > At execution-time, one can, for example, [pass a `FunctionRegistry` to `ExecuteSerializedPlan`](https://github.com/apache/arrow/blob/0e1f5c6c2ce7504d86dd10d91ab7d29c2441d4ce/cpp/src/arrow/engine/substrait/util.h#L54-L58) that is different than the default `GetFunctionRegistry()` one. The passed registry is allowed to override functions, add kernels, or completely redefine the set of registered functions - making this registry incompatible with the default one used in planning-time for output schema purposes. Granted, this would not be a usual use-case, but it is a supported one.
   > 
   > My point is that it is a design problem when Acero supports a choice of function registry at execution-time but not at planning-time, because the two choices should be compatible for things to work.
   
   I see - Thanks for the explanation. I am not concerned then. (If someone is passing a Function Registry and changes output schema of existing functions then things certainly will not work)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rtpsw commented on issue #34786: [C++] Output schema calculated by Substrait consumer for aggregate rel seems incorrect.

Posted by "rtpsw (via GitHub)" <gi...@apache.org>.
rtpsw commented on issue #34786:
URL: https://github.com/apache/arrow/issues/34786#issuecomment-1493370198

   @westonpace, I've looked into this for a while and things are confusing.
   
   First, I've found that Acero aggregation produces an output schema different than the one you expect here, and in particular one that collects measurements then keys then segment keys. Here are my observations:
   - [The `ScalarAggregateNode` class](https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L255), for which the keys are empty, gets [an output scheme](https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L377) that collects [measurements](https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L369) then [segment keys](https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L372-L373).
   - [The `GroupByNode` class](https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L533) gets [an output schema](https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L663) that collects [measurements](https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L648-L649) then [keys](https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L654) then [segment keys](https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L659).
   - A [comment in `GroupByNode`](https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L646) says: "Aggregate fields come before key fields to match the behavior of GroupBy function".
   
   Second, the [output-schema-building Arrow Substrait code](https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/engine/substrait/relation_internal.cc#L334-L357) you noted is trying to collect measurement then keys then segment keys. However, as you noted, it looks like it collects the measurements from the input schema instead of the output schema. I think the code should first construct the output schema (which involves accessing the `name` field of each `Aggregate`) and then collect from it.
   
   Third, the Substrait spec may have a different idea about the expected field order of an aggregation. Assuming so, we should probably fix Acero to use this order, especially since the `GroupBy` function mentioned in the comment [no longer exists](https://github.com/apache/arrow/issues/14866).
   
   Fourth, I suspect that a deeper reason for why we get away with such bugs is that Arrow Substrait generally doesn't care about field names, since Substrait generally doesn't (until the final output), so as long as the field type happens to be right then the test-case passes. Perhaps, besides testing after-projection, we should make the existing test-cases produce different types per field as much as possible.
   
   cc @icexelloss


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rtpsw commented on issue #34786: [C++] Output schema calculated by Substrait consumer for aggregate rel seems incorrect.

Posted by "rtpsw (via GitHub)" <gi...@apache.org>.
rtpsw commented on issue #34786:
URL: https://github.com/apache/arrow/issues/34786#issuecomment-1493726176

   Another design issue to notice is that the output schema may (though not usually) differ between planning-time and execution-time. In particular, the output schema depends on the aggregates, which depend on their kernels, which depend on their functions, which depend on the function registry, which may differ (or at least be incompatible for output-schema purposes) between planning-time and execution-time. Looks like the best we could do is allow the function registry to be passed in at planning-time, but this would require changing the signature of a large number of Arrow Substrait functions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rtpsw commented on issue #34786: [C++] Output schema calculated by Substrait consumer for aggregate rel seems incorrect.

Posted by "rtpsw (via GitHub)" <gi...@apache.org>.
rtpsw commented on issue #34786:
URL: https://github.com/apache/arrow/issues/34786#issuecomment-1494735294

   > Can you give an example of when does the output schema differs in "planning-time" vs "execution'time"?
   
   At execution-time, one can, for example, [pass a `FunctionRegistry` to `ExecuteSerializedPlan`](https://github.com/apache/arrow/blob/0e1f5c6c2ce7504d86dd10d91ab7d29c2441d4ce/cpp/src/arrow/engine/substrait/util.h#L54-L58) that is different than the default `GetFunctionRegistry()` one. The passed registry is allowed to override functions, add kernels, or completely redefine the set of registered functions - making this registry incompatible with the default one used in planning-time for output schema purposes. Granted, this would not be a usual use-case, but it is a supported one.
   
   My point is that it is a design problem when Acero supports a choice of function registry at execution-time but not at planning-time, because the two choices should be compatible for things to work.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] icexelloss commented on issue #34786: [C++] Output schema calculated by Substrait consumer for aggregate rel seems incorrect.

Posted by "icexelloss (via GitHub)" <gi...@apache.org>.
icexelloss commented on issue #34786:
URL: https://github.com/apache/arrow/issues/34786#issuecomment-1493375560

   Can we simply standardize the output column ordering to be segment keys +
   group by keys + measurements?
   
   On Sun, Apr 2, 2023 at 11:18 AM rtpsw ***@***.***> wrote:
   
   > @westonpace <https://github.com/westonpace>, I've looked into this for a
   > while and things are confusing.
   >
   > First, I've found that Acero aggregation produces an output schema
   > different than the one you expect here, and in particular one that collects
   > measurements then keys then segment keys. Here are my observations:
   >
   >    - The ScalarAggregateNode class
   >    <https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L255>,
   >    for which the keys are empty, gets an output scheme
   >    <https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L377>
   >    that collects measurements
   >    <https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L369>
   >    then segment keys
   >    <https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L372-L373>
   >    .
   >    - The GroupByNode class
   >    <https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L533>
   >    gets an output schema
   >    <https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L663>
   >    that collects measurements
   >    <https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L648-L649>
   >    then keys
   >    <https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L654>
   >    then segment keys
   >    <https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L659>
   >    .
   >    - A comment in GroupByNode
   >    <https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/acero/aggregate_node.cc#L646>
   >    says: "Aggregate fields come before key fields to match the behavior of
   >    GroupBy function".
   >
   > Second, the output-schema-building Arrow Substrait code
   > <https://github.com/rtpsw/arrow/blob/adf33cc430101d1d6878b546e1473431ca5f280c/cpp/src/arrow/engine/substrait/relation_internal.cc#L334-L357>
   > you noted is trying to collect measurement then keys then segment keys.
   > However, as you noted, it looks like it collects the measurements from the
   > input schema instead of the output schema. I think the code should first
   > construct the output schema (which involves accessing the name field of
   > each Aggregate) and then collect from it.
   >
   > Third, the Substrait spec may have a different idea about the expected
   > field order of an aggregation. Assuming so, we should probably fix Acero to
   > use this order, especially since the GroupBy function mentioned in the
   > comment no longer exists <https://github.com/apache/arrow/issues/14866>.
   >
   > Fourth, I suspect that a deeper reason for why we get away with such bugs
   > is that Arrow Substrait generally doesn't care about field names, since
   > Substrait generally doesn't (until the final output), so as long as the
   > field type happens to be right then the test-case passes. Perhaps, besides
   > testing after-projection, we should make the existing test-cases produce
   > different types per field as much as possible.
   >
   > cc @icexelloss <https://github.com/icexelloss>
   >
   > —
   > Reply to this email directly, view it on GitHub
   > <https://github.com/apache/arrow/issues/34786#issuecomment-1493370198>,
   > or unsubscribe
   > <https://github.com/notifications/unsubscribe-auth/AAGBXLFUCGOFVFMU23ZYMQLW7GKDLANCNFSM6AAAAAAWMXR3GU>
   > .
   > You are receiving this because you were mentioned.Message ID:
   > ***@***.***>
   >
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] icexelloss commented on issue #34786: [C++] Output schema calculated by Substrait consumer for aggregate rel seems incorrect.

Posted by "icexelloss (via GitHub)" <gi...@apache.org>.
icexelloss commented on issue #34786:
URL: https://github.com/apache/arrow/issues/34786#issuecomment-1494694461

   > Another design issue to notice is that the output schema may (though not usually) differ between planning-time and execution-time. In particular, the output schema depends on the aggregates, which depend on their kernels, which depend on the function registry for function resolution as well as on the exec-context for kernel output-type resolution. Both the function registry and the exec-context, which contains it, may differ (or at least be incompatible for output-schema purposes) between planning-time and execution-time. In particular, at planning-time the current code can only assume a default exec-context. Looks like the best we could do is allow the exec-context to be passed in at planning-time, but this would require changing the signature of a large number of Arrow Substrait functions.
   
   Can you give an example of when does the output schema differs in "planning-time" vs "execution'time"? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org