You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "rtpsw (via GitHub)" <gi...@apache.org> on 2023/04/28 05:47:12 UTC

[GitHub] [arrow] rtpsw opened a new pull request, #35364: GH-35363: [C++] Minor code cleanup

rtpsw opened a new pull request, #35364:
URL: https://github.com/apache/arrow/pull/35364

   See https://github.com/apache/arrow/issues/35363


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] icexelloss commented on a diff in pull request #35364: GH-35363: [C++] Fix Substrait schema names and for segmented aggregation

Posted by "icexelloss (via GitHub)" <gi...@apache.org>.
icexelloss commented on code in PR #35364:
URL: https://github.com/apache/arrow/pull/35364#discussion_r1180823651


##########
cpp/src/arrow/engine/substrait/options.cc:
##########
@@ -205,12 +205,17 @@ class DefaultExtensionProvider : public BaseExtensionProvider {
       ARROW_ASSIGN_OR_RAISE(auto aggregate, internal::ParseAggregateMeasure(
                                                 agg_measure, ext_set, conv_opts,
                                                 /*is_hash=*/!keys.empty(), input_schema));
+      aggregate.name = aggregate.function;

Review Comment:
   @rtpsw I am happy to merge the code that fixes the the segmented aggregation bug  (didn't pass the segmented key). and the change that addes validation of number of output names However for the internal aggregation names change we would need validate that this naming is consistent with other aggregation codepath and/or refactor the naming logic to be shared (e.g., move that logic into ParseAggregateMeasure), so there is more work. I would suggest revert that change for now so we can move forward with the bug fix faster. (Since the naming change seems cosmetic)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] icexelloss commented on pull request #35364: GH-35363: [C++] Fix Substrait schema names and for segmented aggregation

Posted by "icexelloss (via GitHub)" <gi...@apache.org>.
icexelloss commented on PR #35364:
URL: https://github.com/apache/arrow/pull/35364#issuecomment-1529676411

   Thanks @rtpsw - LGTM.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] icexelloss commented on pull request #35364: GH-35363: [C++] Fix Substrait schema names and for segmented aggregation

Posted by "icexelloss (via GitHub)" <gi...@apache.org>.
icexelloss commented on PR #35364:
URL: https://github.com/apache/arrow/pull/35364#issuecomment-1528148435

   @rtpsw LGTM. Can you fix the lint error?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] icexelloss commented on a diff in pull request #35364: GH-35363: [C++] Fix Substrait schema names and for segmented aggregation

Posted by "icexelloss (via GitHub)" <gi...@apache.org>.
icexelloss commented on code in PR #35364:
URL: https://github.com/apache/arrow/pull/35364#discussion_r1180686957


##########
cpp/src/arrow/engine/substrait/options.cc:
##########
@@ -205,12 +205,17 @@ class DefaultExtensionProvider : public BaseExtensionProvider {
       ARROW_ASSIGN_OR_RAISE(auto aggregate, internal::ParseAggregateMeasure(
                                                 agg_measure, ext_set, conv_opts,
                                                 /*is_hash=*/!keys.empty(), input_schema));
+      aggregate.name = aggregate.function;
+      for (auto& field_ref : aggregate.target) {
+        ARROW_ASSIGN_OR_RAISE(auto field, field_ref.GetOne(*input_schema));
+        aggregate.name += "_" + field->name();

Review Comment:
   Does this replicate the logic for non-segmented aggregation?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] icexelloss merged pull request #35364: GH-35363: [C++] Fix Substrait schema names and for segmented aggregation

Posted by "icexelloss (via GitHub)" <gi...@apache.org>.
icexelloss merged PR #35364:
URL: https://github.com/apache/arrow/pull/35364


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] icexelloss commented on a diff in pull request #35364: GH-35363: [C++] Fix Substrait schema names and for segmented aggregation

Posted by "icexelloss (via GitHub)" <gi...@apache.org>.
icexelloss commented on code in PR #35364:
URL: https://github.com/apache/arrow/pull/35364#discussion_r1180784950


##########
cpp/src/arrow/engine/substrait/options.cc:
##########
@@ -205,12 +205,17 @@ class DefaultExtensionProvider : public BaseExtensionProvider {
       ARROW_ASSIGN_OR_RAISE(auto aggregate, internal::ParseAggregateMeasure(
                                                 agg_measure, ext_set, conv_opts,
                                                 /*is_hash=*/!keys.empty(), input_schema));
+      aggregate.name = aggregate.function;

Review Comment:
   > Without this logic, the aggregate schema ends up with an empty name for each aggregate
   
   While empty name is not great. I wonder if it actually breaks anything - substrait doesn't use names to refer to intermediate columns so IIUC the names here doesn't matter for correctness. (Although I agree meanful names are better for debug).
   
   Does any code breaks without this change?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rtpsw commented on a diff in pull request #35364: GH-35363: [C++] Fix Substrait schema names and for segmented aggregation

Posted by "rtpsw (via GitHub)" <gi...@apache.org>.
rtpsw commented on code in PR #35364:
URL: https://github.com/apache/arrow/pull/35364#discussion_r1180831335


##########
cpp/src/arrow/engine/substrait/options.cc:
##########
@@ -205,12 +205,17 @@ class DefaultExtensionProvider : public BaseExtensionProvider {
       ARROW_ASSIGN_OR_RAISE(auto aggregate, internal::ParseAggregateMeasure(
                                                 agg_measure, ext_set, conv_opts,
                                                 /*is_hash=*/!keys.empty(), input_schema));
+      aggregate.name = aggregate.function;

Review Comment:
   Done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] icexelloss commented on a diff in pull request #35364: GH-35363: [C++] Fix Substrait schema names and for segmented aggregation

Posted by "icexelloss (via GitHub)" <gi...@apache.org>.
icexelloss commented on code in PR #35364:
URL: https://github.com/apache/arrow/pull/35364#discussion_r1180823651


##########
cpp/src/arrow/engine/substrait/options.cc:
##########
@@ -205,12 +205,17 @@ class DefaultExtensionProvider : public BaseExtensionProvider {
       ARROW_ASSIGN_OR_RAISE(auto aggregate, internal::ParseAggregateMeasure(
                                                 agg_measure, ext_set, conv_opts,
                                                 /*is_hash=*/!keys.empty(), input_schema));
+      aggregate.name = aggregate.function;

Review Comment:
   @rtpsw I am happy to merge the code that fixes the the segmented aggregation bug (didn't pass the segmented key). However for the internal aggregation names change we would need validate that this naming is consistent with other aggregation codepath and/or refactor the naming logic to be shared (e.g., move that logic into ParseAggregateMeasure). I would suggest revert that change for now so we can move forward with the bug fix faster. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rtpsw commented on a diff in pull request #35364: GH-35363: [C++] Fix Substrait schema names and for segmented aggregation

Posted by "rtpsw (via GitHub)" <gi...@apache.org>.
rtpsw commented on code in PR #35364:
URL: https://github.com/apache/arrow/pull/35364#discussion_r1180815416


##########
cpp/src/arrow/engine/substrait/serde_test.cc:
##########
@@ -3937,7 +3937,7 @@ TEST(Substrait, ProjectWithMultiFieldExpressions) {
             }]
           }
         },
-        "names": ["A", "B", "C", "D"]

Review Comment:
   Right, not related to my changes herer. The invocations look like `CheckRoundTripResult` -> `AssertTablesEqualIgnoringOrder` -> `AssertTablesEqual`, and the latter (surprisingly, I guess) ignores column names for some reason.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rtpsw commented on pull request #35364: GH-35363: [C++] Fix Substrait schema names and for segmented aggregation

Posted by "rtpsw (via GitHub)" <gi...@apache.org>.
rtpsw commented on PR #35364:
URL: https://github.com/apache/arrow/pull/35364#issuecomment-1529481180

   @westonpace, @icexelloss: is this good to go? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #35364: GH-35363: [C++] Minor code cleanup

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35364:
URL: https://github.com/apache/arrow/pull/35364#issuecomment-1527021181

   :warning: GitHub issue #35363 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] icexelloss commented on a diff in pull request #35364: GH-35363: [C++] Fix Substrait schema names and for segmented aggregation

Posted by "icexelloss (via GitHub)" <gi...@apache.org>.
icexelloss commented on code in PR #35364:
URL: https://github.com/apache/arrow/pull/35364#discussion_r1180823651


##########
cpp/src/arrow/engine/substrait/options.cc:
##########
@@ -205,12 +205,17 @@ class DefaultExtensionProvider : public BaseExtensionProvider {
       ARROW_ASSIGN_OR_RAISE(auto aggregate, internal::ParseAggregateMeasure(
                                                 agg_measure, ext_set, conv_opts,
                                                 /*is_hash=*/!keys.empty(), input_schema));
+      aggregate.name = aggregate.function;

Review Comment:
   @rtpsw I am happy to merge the code that fixes the the segmented aggregation bug (didn't pass the segmented key). However for the internal aggregation names change we would need validate that this naming is consistent with other aggregation codepath and/or refactor the naming logic to be shared (e.g., move that logic into ParseAggregateMeasure). I would suggest revert that change for now so we can move forward with the bug fix faster. (Since the naming change seems cosmetic)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] icexelloss commented on a diff in pull request #35364: GH-35363: [C++] Fix Substrait schema names and for segmented aggregation

Posted by "icexelloss (via GitHub)" <gi...@apache.org>.
icexelloss commented on code in PR #35364:
URL: https://github.com/apache/arrow/pull/35364#discussion_r1180786128


##########
cpp/src/arrow/engine/substrait/serde_test.cc:
##########
@@ -3937,7 +3937,7 @@ TEST(Substrait, ProjectWithMultiFieldExpressions) {
             }]
           }
         },
-        "names": ["A", "B", "C", "D"]

Review Comment:
   These names here (A, B, C) doesn't match the “output_schema" below (A, A1, A+1). Is this expected?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rtpsw commented on a diff in pull request #35364: GH-35363: [C++] Fix Substrait schema names and for segmented aggregation

Posted by "rtpsw (via GitHub)" <gi...@apache.org>.
rtpsw commented on code in PR #35364:
URL: https://github.com/apache/arrow/pull/35364#discussion_r1180806897


##########
cpp/src/arrow/engine/substrait/options.cc:
##########
@@ -205,12 +205,17 @@ class DefaultExtensionProvider : public BaseExtensionProvider {
       ARROW_ASSIGN_OR_RAISE(auto aggregate, internal::ParseAggregateMeasure(
                                                 agg_measure, ext_set, conv_opts,
                                                 /*is_hash=*/!keys.empty(), input_schema));
+      aggregate.name = aggregate.function;

Review Comment:
   I think you're right and this is mostly useful in debugging. However, note that the schema produced by the pre-PR code is inconsistent - aggregate columns have empty names while other column have "normal" names (derived from the input). So, this change make sense at least for consistency.
   
   > Does any code breaks without this change?
   
   No. However, after adding this logic I needed to change one test case that checks the aggregate column name. There must be a reason for the existence of this test case, but I don't know it. @westonpace ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rtpsw commented on a diff in pull request #35364: GH-35363: [C++] Fix Substrait schema names and for segmented aggregation

Posted by "rtpsw (via GitHub)" <gi...@apache.org>.
rtpsw commented on code in PR #35364:
URL: https://github.com/apache/arrow/pull/35364#discussion_r1180737166


##########
cpp/src/arrow/engine/substrait/serde_test.cc:
##########
@@ -3937,7 +3937,7 @@ TEST(Substrait, ProjectWithMultiFieldExpressions) {
             }]
           }
         },
-        "names": ["A", "B", "C", "D"]

Review Comment:
   Without this change, [this check](https://github.com/apache/arrow/pull/35364/files#diff-11e93118405fd9709ff4ffb62aa7fbbc6ff8ca5bee7aa34f8a0ae951360d7a6eR158-R161) raises. IIUC, the plan has 3 outputs because its root is a projection with 3 emit indices.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] icexelloss commented on a diff in pull request #35364: GH-35363: [C++] Fix Substrait schema names and for segmented aggregation

Posted by "icexelloss (via GitHub)" <gi...@apache.org>.
icexelloss commented on code in PR #35364:
URL: https://github.com/apache/arrow/pull/35364#discussion_r1180786904


##########
cpp/src/arrow/engine/substrait/serde_test.cc:
##########
@@ -3937,7 +3937,7 @@ TEST(Substrait, ProjectWithMultiFieldExpressions) {
             }]
           }
         },
-        "names": ["A", "B", "C", "D"]

Review Comment:
   (Not that I expect your code change has anything to do with this, but looks quite confusing to me so I asked)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #35364: GH-35363: [C++] Fix Substrait schema names and for segmented aggregation

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #35364:
URL: https://github.com/apache/arrow/pull/35364#issuecomment-1530045025

   Benchmark runs are scheduled for baseline = 41adf00652bf6788581f9adc4c007af5c1de3d56 and contender = 3b48834989fdeb9cc561f11585da0f78f4ced48e. 3b48834989fdeb9cc561f11585da0f78f4ced48e is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:25.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/b2c9ffcaed13464ea088e00c70ad23a9...8c1b8bd2b326475eafe3a748eb9ee358/)
   [Failed :arrow_down:1.81% :arrow_up:0.18%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/2e705648fbb24b049e771a818795d9d8...a0acee3c8e804e61a15b1ce7b8931ae2/)
   [Failed :arrow_down:0.26% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/f1433e6901184586b77b4b189ffa4137...1cc66a355eaa4b4da6b45035963cdc97/)
   [Finished :arrow_down:1.72% :arrow_up:0.09%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/4190c6fe3df44336889856760f8569a7...c6b88fd20d8247a39e16284cc472b9a7/)
   Buildkite builds:
   [Finished] [`3b488349` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2801)
   [Failed] [`3b488349` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2835)
   [Failed] [`3b488349` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2799)
   [Finished] [`3b488349` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2826)
   [Finished] [`41adf006` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2800)
   [Failed] [`41adf006` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2834)
   [Failed] [`41adf006` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2798)
   [Finished] [`41adf006` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2825)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #35364: GH-35363: [C++] Minor code cleanup

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35364:
URL: https://github.com/apache/arrow/pull/35364#issuecomment-1527021165

   * Closes: #35363


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] icexelloss commented on a diff in pull request #35364: GH-35363: [C++] Fix Substrait schema names and for segmented aggregation

Posted by "icexelloss (via GitHub)" <gi...@apache.org>.
icexelloss commented on code in PR #35364:
URL: https://github.com/apache/arrow/pull/35364#discussion_r1180785628


##########
cpp/src/arrow/engine/substrait/serde_test.cc:
##########
@@ -3937,7 +3937,7 @@ TEST(Substrait, ProjectWithMultiFieldExpressions) {
             }]
           }
         },
-        "names": ["A", "B", "C", "D"]

Review Comment:
   I see



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rtpsw commented on a diff in pull request #35364: GH-35363: [C++] Fix Substrait schema names and for segmented aggregation

Posted by "rtpsw (via GitHub)" <gi...@apache.org>.
rtpsw commented on code in PR #35364:
URL: https://github.com/apache/arrow/pull/35364#discussion_r1180756108


##########
cpp/src/arrow/engine/substrait/options.cc:
##########
@@ -205,12 +205,17 @@ class DefaultExtensionProvider : public BaseExtensionProvider {
       ARROW_ASSIGN_OR_RAISE(auto aggregate, internal::ParseAggregateMeasure(
                                                 agg_measure, ext_set, conv_opts,
                                                 /*is_hash=*/!keys.empty(), input_schema));
+      aggregate.name = aggregate.function;
+      for (auto& field_ref : aggregate.target) {
+        ARROW_ASSIGN_OR_RAISE(auto field, field_ref.GetOne(*input_schema));
+        aggregate.name += "_" + field->name();

Review Comment:
   Good point. I'll make this consistent shortly.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rtpsw commented on a diff in pull request #35364: GH-35363: [C++] Fix Substrait schema names and for segmented aggregation

Posted by "rtpsw (via GitHub)" <gi...@apache.org>.
rtpsw commented on code in PR #35364:
URL: https://github.com/apache/arrow/pull/35364#discussion_r1180741744


##########
cpp/src/arrow/engine/substrait/options.cc:
##########
@@ -205,12 +205,17 @@ class DefaultExtensionProvider : public BaseExtensionProvider {
       ARROW_ASSIGN_OR_RAISE(auto aggregate, internal::ParseAggregateMeasure(
                                                 agg_measure, ext_set, conv_opts,
                                                 /*is_hash=*/!keys.empty(), input_schema));
+      aggregate.name = aggregate.function;

Review Comment:
   Without this logic, the aggregate schema ends up with an empty name for each aggregate. This is because the [aggregate decoding](https://github.com/apache/arrow/blob/16dbd98e8f525884649e5a33096c74a455f64610/cpp/src/arrow/engine/substrait/extension_set.cc#L929) is [returning an aggregate with an empty name](https://github.com/apache/arrow/blob/16dbd98e8f525884649e5a33096c74a455f64610/cpp/src/arrow/engine/substrait/extension_set.cc#L980-L981), which in turn is likely because it has no access to the input schema against which to resolve the field reference (`*arg_ref`). The code here does have access to this schema and resolves the field reference.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] rtpsw commented on pull request #35364: GH-35363: [C++] Fix Substrait schema names and for segmented aggregation

Posted by "rtpsw (via GitHub)" <gi...@apache.org>.
rtpsw commented on PR #35364:
URL: https://github.com/apache/arrow/pull/35364#issuecomment-1527380166

   cc @westonpace, @icexelloss 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] icexelloss commented on a diff in pull request #35364: GH-35363: [C++] Fix Substrait schema names and for segmented aggregation

Posted by "icexelloss (via GitHub)" <gi...@apache.org>.
icexelloss commented on code in PR #35364:
URL: https://github.com/apache/arrow/pull/35364#discussion_r1180690023


##########
cpp/src/arrow/engine/substrait/options.cc:
##########
@@ -205,12 +205,17 @@ class DefaultExtensionProvider : public BaseExtensionProvider {
       ARROW_ASSIGN_OR_RAISE(auto aggregate, internal::ParseAggregateMeasure(
                                                 agg_measure, ext_set, conv_opts,
                                                 /*is_hash=*/!keys.empty(), input_schema));
+      aggregate.name = aggregate.function;

Review Comment:
   Why do we need this logic?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] icexelloss commented on a diff in pull request #35364: GH-35363: [C++] Fix Substrait schema names and for segmented aggregation

Posted by "icexelloss (via GitHub)" <gi...@apache.org>.
icexelloss commented on code in PR #35364:
URL: https://github.com/apache/arrow/pull/35364#discussion_r1180689627


##########
cpp/src/arrow/engine/substrait/serde_test.cc:
##########
@@ -3937,7 +3937,7 @@ TEST(Substrait, ProjectWithMultiFieldExpressions) {
             }]
           }
         },
-        "names": ["A", "B", "C", "D"]

Review Comment:
   This seems wrong. The plan has 4 output columns:
   direct reference 0, 1, 2 and a new column with scalar Function



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] icexelloss commented on a diff in pull request #35364: GH-35363: [C++] Fix Substrait schema names and for segmented aggregation

Posted by "icexelloss (via GitHub)" <gi...@apache.org>.
icexelloss commented on code in PR #35364:
URL: https://github.com/apache/arrow/pull/35364#discussion_r1180688412


##########
cpp/src/arrow/engine/substrait/serde_test.cc:
##########
@@ -3937,7 +3937,7 @@ TEST(Substrait, ProjectWithMultiFieldExpressions) {
             }]
           }
         },
-        "names": ["A", "B", "C", "D"]

Review Comment:
   Why this change?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] icexelloss commented on a diff in pull request #35364: GH-35363: [C++] Fix Substrait schema names and for segmented aggregation

Posted by "icexelloss (via GitHub)" <gi...@apache.org>.
icexelloss commented on code in PR #35364:
URL: https://github.com/apache/arrow/pull/35364#discussion_r1180823651


##########
cpp/src/arrow/engine/substrait/options.cc:
##########
@@ -205,12 +205,17 @@ class DefaultExtensionProvider : public BaseExtensionProvider {
       ARROW_ASSIGN_OR_RAISE(auto aggregate, internal::ParseAggregateMeasure(
                                                 agg_measure, ext_set, conv_opts,
                                                 /*is_hash=*/!keys.empty(), input_schema));
+      aggregate.name = aggregate.function;

Review Comment:
   @rtpsw I am happy to merge the code that fixes the the segmented aggregation bug  (didn't pass the segmented key). and the change that addes validation of number of output names However for the internal aggregation names change we would need validate that this naming is consistent with other aggregation codepath and/or refactor the naming logic to be shared (e.g., move that logic into ParseAggregateMeasure). I would suggest revert that change for now so we can move forward with the bug fix faster. (Since the naming change seems cosmetic)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org