Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2022/04/26 03:40:00 UTC

[jira] [Commented] (ARROW-15582) [C++] Add support for registering tricky functions with the Substrait consumer (or add a bunch of substrait meta functions)

    [ https://issues.apache.org/jira/browse/ARROW-15582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527868#comment-17527868 ] 

Weston Pace commented on ARROW-15582:
-------------------------------------

There are fewer than 100 "standard" Substrait functions right now but this list will probably grow.  In general I do not think it is safe to assume that Substrait functions and Arrow functions will share the same name.  Even if two functions do exist with the same name I don't think it's safe to assume they will have the same behavior.  I think some kind of "mapping" object is going to have to be maintained.

At a minimum one would think this mapping object would be a simple bidirectional string:string map which goes from Arrow function name to Substrait function name and back.  Unfortunately, as the ticket describes, I do not think this is possible today.

The worst case scenario is that we require two functions for every entry in the mapping: one that converts a Substrait "call" to an Arrow "call" and one that does the reverse.  I think, as a first attempt, we should tackle this with a very manual mapping, probably with some kind of convenience option for the functions that are simple aliases, and then we can look at how to improve from there.
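
As a rough illustration of the "convenience option" for simple aliases, here is a minimal sketch of a bidirectional name table.  The class name {{AliasTable}} and its methods are hypothetical and not part of Arrow; the sketch only covers functions that are pure renames with identical behavior, which is exactly the case that cannot be assumed in general.

```cpp
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical helper (not an Arrow API): records functions that are pure
// renames between Arrow and Substrait, so no custom conversion is needed.
class AliasTable {
 public:
  // Registers the pair in both directions.
  void AddAlias(const std::string& arrow_name, const std::string& substrait_name) {
    arrow_to_substrait_[arrow_name] = substrait_name;
    substrait_to_arrow_[substrait_name] = arrow_name;
  }

  // Returns the Substrait name for an Arrow function, if one was registered.
  std::optional<std::string> ToSubstrait(const std::string& arrow_name) const {
    return Lookup(arrow_to_substrait_, arrow_name);
  }

  // Returns the Arrow name for a Substrait function, if one was registered.
  std::optional<std::string> ToArrow(const std::string& substrait_name) const {
    return Lookup(substrait_to_arrow_, substrait_name);
  }

 private:
  using Map = std::unordered_map<std::string, std::string>;

  static std::optional<std::string> Lookup(const Map& map, const std::string& key) {
    auto it = map.find(key);
    if (it == map.end()) return std::nullopt;
    return it->second;
  }

  Map arrow_to_substrait_;
  Map substrait_to_arrow_;
};
```

A lookup that returns an empty optional would then fall through to the manual, per-function conversion path.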

A Substrait "call" is a name (string), a vector of arguments (expressions), and a vector of options (literal expressions).  An Arrow "call" is a name (string), a vector of arguments (expressions), and an options object (POCO).

So my suggestion for the mapping would be something like...

{noformat}
using ArrowToSubstrait = std::function<substrait::Expression::ScalarFunction(const arrow::compute::Expression::Call&, std::vector<substrait::Expression>)>;
// Converting from Substrait can fail (e.g. an option Arrow cannot express), so it returns a Result.
using SubstraitToArrow = std::function<arrow::Result<arrow::compute::Expression::Call>(const substrait::Expression::ScalarFunction&)>;
class FunctionMapping {
 public:
  // Registration API
  void AddArrowToSubstrait(std::string arrow_function_name, ArrowToSubstrait conversion_func);
  void AddSubstraitToArrow(std::string substrait_function_name, SubstraitToArrow conversion_func);

  // Usage API
  substrait::Expression::ScalarFunction ToProto(const arrow::compute::Expression::Call& call);
  arrow::Result<arrow::compute::Expression::Call> FromProto(const substrait::Expression::ScalarFunction& call);
};
{noformat}
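
To make the registration and usage halves of such a mapping concrete, here is a minimal, self-contained sketch.  Everything in it is an assumption for illustration: {{SubstraitCall}} and {{ArrowCall}} are simplified stand-ins for the protobuf and Arrow types, the Arrow-to-Substrait callback takes only the call (not the pre-converted arguments), and a missing entry is reported with an exception instead of an Arrow Status.

```cpp
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-ins (assumptions) for the real Substrait and Arrow types.
struct SubstraitCall {
  std::string name;
  std::vector<std::string> args;  // operands followed by options
};
struct ArrowCall {
  std::string name;
  std::vector<std::string> args;
  std::string options;  // stand-in for an Arrow options object
};

using ArrowToSubstrait = std::function<SubstraitCall(const ArrowCall&)>;
using SubstraitToArrow = std::function<ArrowCall(const SubstraitCall&)>;

class FunctionMapping {
 public:
  // Registration API: one conversion function per function name and direction.
  void AddArrowToSubstrait(std::string arrow_name, ArrowToSubstrait fn) {
    arrow_to_substrait_[std::move(arrow_name)] = std::move(fn);
  }
  void AddSubstraitToArrow(std::string substrait_name, SubstraitToArrow fn) {
    substrait_to_arrow_[std::move(substrait_name)] = std::move(fn);
  }

  // Usage API: look up the conversion by the source function name and apply it.
  SubstraitCall ToProto(const ArrowCall& call) const {
    auto it = arrow_to_substrait_.find(call.name);
    if (it == arrow_to_substrait_.end()) {
      throw std::invalid_argument("no Substrait mapping for Arrow function: " + call.name);
    }
    return it->second(call);
  }
  ArrowCall FromProto(const SubstraitCall& call) const {
    auto it = substrait_to_arrow_.find(call.name);
    if (it == substrait_to_arrow_.end()) {
      throw std::invalid_argument("no Arrow mapping for Substrait function: " + call.name);
    }
    return it->second(call);
  }

 private:
  std::map<std::string, ArrowToSubstrait> arrow_to_substrait_;
  std::map<std::string, SubstraitToArrow> substrait_to_arrow_;
};
```

Keeping the two directions in separate tables reflects the point that the mapping is not one-to-one: several Arrow functions may convert to the same Substrait function, and one Substrait function may convert to different Arrow functions depending on its options.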

The add function is an interesting example (some pseudo-code / imaginary helper functions for brevity):

{noformat}
SubstraitToArrow substrait_add_to_arrow = [] (const substrait::Expression::ScalarFunction& substrait_call) -> arrow::Result<arrow::compute::Expression::Call> {
  // Note, Substrait scalar functions don't distinguish between options and arguments so the
  // index of this option is 2 because it comes after the operands (at index 0 and 1).
  // This is why we have to specify how many args there are in the GetArgs invocation.
  auto args = GetArgs(substrait_call, 2);
  EnumLiteral overflow_handling = GetOption<EnumLiteral>(substrait_call, 2);
  if (!IsSpecified(overflow_handling)) {
    // Default to unchecked add because SILENT => unchecked and SILENT
    // is the first option in the enum (and thus the highest priority when
    // not specified)
    return call("add", args);
  }
  // C++ cannot switch on strings, so compare the enum value explicitly.
  const std::string value = GetEnumValue(overflow_handling);
  if (value == "SILENT") return call("add", args);
  if (value == "ERROR") return call("add_checked", args);
  if (value == "SATURATE") return Status::Invalid("Arrow does not have a saturating add");
  return Status::Invalid("Unrecognized overflow option: ", value);
};
// Note, we can automatically do the conversion from arrow args to Substrait args because
// we distinguish between args and options in Arrow.
ArrowToSubstrait arrow_add_to_substrait = [] (const arrow::compute::Expression::Call& call, std::vector<substrait::Expression> args) {
  // Arrow's "add" is the unchecked variant, which corresponds to SILENT.
  auto overflow_behavior = MakeEnum("SILENT");
  auto all_args = Concat(std::move(args), {overflow_behavior});
  return MakeSubstraitCall("add", std::move(all_args));
};
ArrowToSubstrait arrow_add_checked_to_substrait = [] (const arrow::compute::Expression::Call& call, std::vector<substrait::Expression> args) {
  // Arrow's "add_checked" raises on overflow, which corresponds to ERROR.
  auto overflow_behavior = MakeEnum("ERROR");
  auto all_args = Concat(std::move(args), {overflow_behavior});
  return MakeSubstraitCall("add", std::move(all_args));
};
function_mapping.AddSubstraitToArrow("add", substrait_add_to_arrow);
function_mapping.AddArrowToSubstrait("add", arrow_add_to_substrait);
function_mapping.AddArrowToSubstrait("add_checked", arrow_add_checked_to_substrait);
{noformat}
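
For anyone who wants to experiment with the add example without the imaginary helpers, here is a self-contained sketch of the same logic.  The types {{FlatCall}}, {{NamedCall}} and {{ConversionResult}} are assumptions standing in for the Substrait call, the Arrow call and an Arrow Result; expressions and the enum option are reduced to plain strings.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Stand-in for a Substrait call: operands and options share one argument list.
struct FlatCall {
  std::string name;
  std::vector<std::string> args;
};
// Stand-in for an Arrow call: the overflow behavior is encoded in the name.
struct NamedCall {
  std::string name;
  std::vector<std::string> args;
};
// Either a converted call or an error message (stand-in for an Arrow Result).
using ConversionResult = std::variant<NamedCall, std::string>;

ConversionResult SubstraitAddToArrow(const FlatCall& call) {
  constexpr std::size_t kNumOperands = 2;  // the option, if present, is at index 2
  if (call.args.size() < kNumOperands) {
    return std::string("add requires two operands");
  }
  std::vector<std::string> operands(call.args.begin(),
                                    call.args.begin() + kNumOperands);
  // An unspecified option falls back to SILENT, the first enum value.
  const std::string overflow = call.args.size() > kNumOperands
                                   ? call.args[kNumOperands]
                                   : std::string("SILENT");
  if (overflow == "SILENT") return NamedCall{"add", std::move(operands)};
  if (overflow == "ERROR") return NamedCall{"add_checked", std::move(operands)};
  if (overflow == "SATURATE") {
    return std::string("Arrow does not have a saturating add");
  }
  return "unrecognized overflow option: " + overflow;
}

// Handles both Arrow variants: the function name selects the option to append.
FlatCall ArrowAddToSubstrait(const NamedCall& call) {
  std::vector<std::string> args = call.args;
  args.push_back(call.name == "add_checked" ? "ERROR" : "SILENT");
  return FlatCall{"add", std::move(args)};
}
```

Under these assumptions a round trip preserves the overflow behavior: converting an {{add_checked}} call to Substrait yields an {{add}} with ERROR appended, and converting that back selects {{add_checked}} again.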

> [C++] Add support for registering tricky functions with the Substrait consumer (or add a bunch of substrait meta functions)
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-15582
>                 URL: https://issues.apache.org/jira/browse/ARROW-15582
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Priority: Major
>              Labels: substrait
>
> Sometimes one Substrait function will map to multiple Arrow functions.  For example, the Substrait {{add}} function might be referring to Arrow's {{add}} or {{add_checked}}.  We need to figure out how to register this correctly (e.g. one possible approach would be a {{substrait_add}} meta function).
> Other times a substrait function will encode something Arrow considers an "option" as a function argument.  For example, the is_in Arrow function is unary with an option for the lookup set.  The substrait function is binary but the second argument must be constant and be the lookup set.  Neither of which is to be confused with a truly binary is_in function which takes in a different set at every row.
> It's possible there is no work to do here other than adding a bunch of substrait_ meta functions in Arrow.  In that case all the work will be done in other JIRAs.  Or, it is possible that there is some kind of extension we can make to the function registry that bypasses the need for the meta functions.  I'm leaving this JIRA open so future contributors can consider this second option.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)