Posted to github@arrow.apache.org by "westonpace (via GitHub)" <gi...@apache.org> on 2023/03/22 15:32:38 UTC

[GitHub] [arrow] westonpace commented on a diff in pull request #34627: GH-34626: [C++] Add ordered/segmented aggregation Substrait extension

westonpace commented on code in PR #34627:
URL: https://github.com/apache/arrow/pull/34627#discussion_r1145005055


##########
cpp/src/arrow/engine/substrait/options.cc:
##########
@@ -166,6 +171,57 @@ class DefaultExtensionProvider : public BaseExtensionProvider {
                                      named_tap_rel.name(), std::move(renamed_schema)));
     return RelationInfo{{std::move(decl), std::move(renamed_schema)}, std::nullopt};
   }
+
+  Result<RelationInfo> MakeSegmentedAggregateRel(
+      const ConversionOptions& conv_opts, const std::vector<DeclarationInfo>& inputs,
+      const substrait_ext::SegmentedAggregateRel& seg_agg_rel,
+      const ExtensionSet& ext_set) {
+    if (inputs.size() != 1) {
+      return Status::Invalid(
+          "substrait_ext::SegmentedAggregateRel requires a single input but got: ",
+          inputs.size());
+    }
+
+    auto input_schema = inputs[0].output_schema;
+
+    ConversionOptions conversion_options;
+
+    // store segment key fields to be used when output schema is created
+    std::vector<int> segment_key_field_ids;
+    std::vector<FieldRef> segment_keys;
+    if (seg_agg_rel.segment_groupings_size() > 0) {
+      ARROW_RETURN_NOT_OK(internal::ParseAggregateGrouping(
+          seg_agg_rel.segment_groupings(0), ext_set, conversion_options, input_schema,
+          &segment_key_field_ids, &segment_keys));
+    }
+
+    const auto& aggregate = seg_agg_rel.aggregate();
+    ARROW_ASSIGN_OR_RAISE(
+        auto decl_info,
+        internal::ParseAggregateDeclaration(

Review Comment:
   In Substrait itself we have been discouraging this kind of approach (embedding an existing relation such as AggregateRel inside a new physical relation) because:
   
    * Too expressive: we don't consume every kind of AggregateRel (e.g. expressions in a grouping have to be direct references) and, since this is a physical relation, it should only expose what we can consume.
    * Unnecessary coupling: AggregateRel can't change without potentially changing every extension that embeds it, and it's not clear those extensions would always need to change.
    * Awkwardness: directly including the Rel itself means you now have multiple "inputs".
   
   That said, the alternative does mean duplicating most of the parsing code, so for something internal I don't know that the current approach is unworkable. I don't have a strong opinion, but I would lean slightly towards something like:
   
   ```
   message SegmentedAggregateRel {
     repeated Expression.FieldReference grouping_keys = 1;
     repeated Expression.FieldReference segment_keys = 2;
     repeated substrait.AggregateRel.Measure measures = 3;
   }
   ```
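
To make the "only expose what we can consume" point concrete, here is a minimal sketch of how a consumer might handle the flatter message. It assumes plain structs in place of the generated protobuf classes: `FieldReference`, `Measure`, `AggregatePlan`, `ResolveFieldIds`, and `PlanSegmentedAggregate` are hypothetical names for illustration, not Arrow or Substrait API. Because keys can only be direct field references, validation reduces to a bounds check, with no need to reject unsupported expression forms.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for the generated protobuf classes (assumption).
struct FieldReference {
  int field_index;
};

struct Measure {
  std::string function_name;
  FieldReference argument;
};

struct SegmentedAggregateRel {
  std::vector<FieldReference> grouping_keys;
  std::vector<FieldReference> segment_keys;
  std::vector<Measure> measures;
};

// Hypothetical result of planning: resolved column indices and a
// printable description of each aggregate.
struct AggregatePlan {
  std::vector<int> grouping_key_ids;
  std::vector<int> segment_key_ids;
  std::vector<std::string> aggregates;
};

// Appends the index of each reference to `ids`; fails if any index is
// outside the input schema.
bool ResolveFieldIds(const std::vector<FieldReference>& refs, int num_input_fields,
                     std::vector<int>* ids, std::string* error) {
  for (const FieldReference& ref : refs) {
    if (ref.field_index < 0 || ref.field_index >= num_input_fields) {
      *error = "field reference out of range: " + std::to_string(ref.field_index);
      return false;
    }
    ids->push_back(ref.field_index);
  }
  return true;
}

// Resolves grouping keys, segment keys, and measure arguments in turn.
bool PlanSegmentedAggregate(const SegmentedAggregateRel& rel, int num_input_fields,
                            AggregatePlan* plan, std::string* error) {
  if (!ResolveFieldIds(rel.grouping_keys, num_input_fields, &plan->grouping_key_ids,
                       error)) {
    return false;
  }
  if (!ResolveFieldIds(rel.segment_keys, num_input_fields, &plan->segment_key_ids,
                       error)) {
    return false;
  }
  for (const Measure& measure : rel.measures) {
    std::vector<int> arg_id;
    if (!ResolveFieldIds({measure.argument}, num_input_fields, &arg_id, error)) {
      return false;
    }
    plan->aggregates.push_back(measure.function_name + "(#" +
                               std::to_string(arg_id[0]) + ")");
  }
  return true;
}
```

The real parsing code would still need to map measures to Arrow aggregate functions; the sketch only shows that the key handling needs no "is this expression a direct reference?" branch.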


