Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/08/03 00:13:16 UTC

[GitHub] [arrow] wesm opened a new pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

wesm opened a new pull request #10856:
URL: https://github.com/apache/arrow/pull/10856


   A rough sketch of a data representation to communicate computational expressions against Arrow array-like and table-like data. This is not comprehensive and will need many changes/additions to be able to express the capabilities of existing production systems. 
   
   See accompanying mailing list discussion for context!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cpcloud commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
cpcloud commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r692167351



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence among query engines is the
+/// names and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  /// The name of the referenced table.
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {
+  order_by:[SortKey];
+
+  partition_by:[ArrayExpr];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to include the function output type when used in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr:FunctionDescr (required);
+  inputs:[ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options:[NamedLiteral];
+
+  /// Optional window expression for window functions only.
+  ///
+  /// TODO: Decide if window functions should be specified in a different way.
+  window:WindowFrame;
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition:ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then:ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else:ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input:ArrayExpr (required);
+  in_exprs:[ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated:bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.

Review comment:
       I think a `BETWEEN` function could be canonicalized in https://github.com/apache/arrow/pull/10934 if desired. OTOH, it doesn't really need to be since it fits neatly into `Call`s and a consumer can choose to do what it wants with that (rewrite it, leave it as is, etc.)
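As a sketch of the rewrite the comment describes, a consumer could lower a `BETWEEN` call into the compound predicate it abbreviates. The tuple encoding and function names below are purely illustrative stand-ins (the real IR uses Flatbuffers tables), not part of the proposed format:

```python
# Hypothetical IR nodes encoded as tuples: ("call", name, *args).
# This only illustrates the rewrite; ComputeIR itself is Flatbuffers.
def lower_between(value, low, high):
    """Rewrite BETWEEN(value, low, high) into
    AND(low <= value, value <= high)."""
    return ("call", "and",
            ("call", "greater_equal", value, low),
            ("call", "less_equal", value, high))

expr = lower_between(("column", "x"), ("literal", 1), ("literal", 10))
```

A consumer that prefers to keep `BETWEEN` opaque can equally leave the call as-is, which is the flexibility the comment points out.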







[GitHub] [arrow] jorgecarleitao commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r681416940



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,

Review comment:
       we will need a definition of equality (each implementation currently defines their own in the integration tests)







[GitHub] [arrow] wesm commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r682596358



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);

Review comment:
       "string" in Flatbuffers is UTF-8, is that enough of a constraint?

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,

Review comment:
       Yes, `/` vs. `//` (`__floordiv__`) in Python. Will add
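The distinction referenced is Python's two division operators: `/` always produces a float, while `//` floors toward negative infinity (unlike C-style integer division, which truncates toward zero), so a separate opcode is needed to preserve semantics:

```python
# True division vs. floor division in Python.
assert 7 / 2 == 3.5     # true division: float result
assert 7 // 2 == 3      # floor division: integer result
assert -7 // 2 == -4    # floors toward -inf; C truncation would give -3
```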

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
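As a concrete illustration of the literal encoding described above, here is a minimal Python sketch (not generated FlatBuffers code; little-endian byte order is assumed, consistent with the rest of the Arrow format):

```python
import struct

def encode_double_literal(value: float) -> bytes:
    # PrimitiveLiteralData for a Literal of type FloatingPoint(DOUBLE):
    # an 8-byte value, assumed little-endian.
    return struct.pack("<d", value)

def encode_bool_literal(value: bool) -> bytes:
    # Booleans are a single byte with value 1 (true) or 0 (false).
    return struct.pack("<B", 1 if value else 0)
```

A consumer reverses this using the Type stored in the enclosing Literal: knowing the type is DOUBLE, it unpacks the 8 bytes back into a floating-point value.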
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  /// The name of the referenced table.
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,

Review comment:
       Good question. I didn't think so but let's check

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {
+  order_by:[SortKey];
+
+  partition_by:[ArrayExpr];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to put the function output type when using in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr:FunctionDescr (required);
+  inputs:[ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options:[NamedLiteral];
+
+  /// Optional window expression for window functions only.
+  ///
+  /// TODO: Decide if window functions should be specified in a different way.
+  window:WindowFrame;
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition:ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then:ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else:ArrayExpr (required);
+}
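The selection semantics of IfElse, including broadcasting a scalar branch to array shape, can be sketched in plain Python (a model of the semantics only, not how an engine evaluates ArrayExpr trees):

```python
def if_else(condition, then, otherwise):
    """Elementwise IfElse sketch: lists model arrays, bare values model
    scalars, and scalars broadcast to the length of the condition."""
    n = len(condition)

    def broadcast(value):
        # A scalar branch is repeated to match the array length.
        return value if isinstance(value, list) else [value] * n

    then, otherwise = broadcast(then), broadcast(otherwise)
    return [t if c else e for c, t, e in zip(condition, then, otherwise)]
```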
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input:ArrayExpr (required);
+  in_exprs:[ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated:bool = false;
+}
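The equivalence stated in the doc comment can be modeled with a small Python sketch (plain values standing in for ArrayExpr inputs):

```python
def is_in(value, in_values, negated=False):
    # IsIn(input, [v0, v1, ...]) is equivalent to
    # Or(Or(Eq(input, v0), Eq(input, v1)), ...)
    result = any(value == v for v in in_values)
    return not result if negated else result
```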
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input:ArrayExpr (required);
+  left_bound:ArrayExpr (required);
+  right_bound:ArrayExpr (required);
+}
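The desugaring in the doc comment reduces to a compound comparison; a minimal Python sketch:

```python
def between(value, left_bound, right_bound):
    # input BETWEEN left AND right == input >= left AND input <= right
    return left_bound <= value <= right_bound
```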
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+/// An array operation plus an optional name and expected output type.
+///
+/// An expression yielding a scalar value can be broadcast to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op:ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name:string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type:Type;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producers expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {
+  /// The unique name to identify the data source.
+  name:string (required);
+
+  /// The schema of the data source. This may be a partial schema (ignoring
+  /// unused fields), but it at least asserts the fields and types that are
+  /// expected to exist in the data source.
+  schema:Schema (required);
+
+  /// Optional opaque table serialization data, for passing engine-specific
+  /// instructions to enable the data to be accessed.
+  serde_type:string;
+  serde_data:[ubyte];
+}
+
+/// An auxiliary helper instruction to "include all columns" in the projection,
+/// sparing the IR producer the need to enumerate all column references in a
+/// projection.
+table StarSelection {
+  ref:TableReference;
+}
+
+/// A helper union to permit the "SELECT *, $expr0, ..." construct from SQL.
+union ProjectionExpr {
+  ArrayExpr,
+  StarSelection
+}
+
+/// Computes a new table given a set of column selections or array expressions.
+table Projection {
+  input:TableExpr (required);
+
+  /// Each expression must reference fields found in the input table
+  /// expression.
+  exprs:[ProjectionExpr] (required);
+}
+
+/// Select rows from table for given boolean condition.
+table Filter {
+  input:TableExpr (required);
+
+  /// Array expression using input table expression yielding boolean output
+  /// type.
+  condition:ArrayExpr (required);
+}
+
+/// A "group by" table aggregation: data is grouped using the group
+/// expressions, and the aggregate expressions are evaluated within each group.
+table Aggregate {
+  input:TableExpr;
+  aggregate_exprs:[ArrayExpr] (required);
+
+  /// Expressions to use as group keys. If not provided, then the aggregate
+  /// operation yields a table with a single row.
+  group_exprs:[ArrayExpr];
+
+  /// A filter condition for the aggregation which may include aggregate
+  /// functions.
+  having:[ArrayExpr];
+}
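The grouping semantics can be sketched in plain Python (the helper names here are hypothetical, not part of the IR): rows are grouped on the group-key values and the aggregate is evaluated once per group; with no group keys, the result is a single row.

```python
from collections import defaultdict

def aggregate(rows, agg, group_key=None):
    """Model of Aggregate: group `rows` by group_key(row) and reduce
    each group with agg(rows_in_group). With no group key, one global
    group yields a single output row."""
    if group_key is None:
        return [agg(rows)]
    groups = defaultdict(list)
    for row in rows:
        groups[group_key(row)].append(row)
    return [agg(group) for group in groups.values()]
```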
+
+/// Select up to the indicated number of rows from the input expression based
+/// on the first-emitted rows when evaluating the input expression. Generally,
+/// no particular order is guaranteed unless combined with a sort expression.
+table Limit {
+  input:TableExpr;
+
+  /// Number of logical rows to select from input.
+  limit:long;
+}
+
+enum RelationalJoinType : int {
+  INNER = 0,
+  LEFT = 1,
+  RIGHT = 2,
+  OUTER = 3,
+  SEMI = 4,
+  ANTI = 5,
+}
+
+/// A standard relational / SQL-style equijoin.
+///
+/// Providing no left/right columns produces the cross product of the two
+/// tables.
+table EqualityJoin {
+  // TODO: complete and document
+  type:RelationalJoinType = INNER;
+
+  left:TableExpr (required);
+  right:TableExpr (required);
+
+  /// The columns from the left expression to use when joining
+  left_columns:[ColumnReference] (required);
+
+  /// The columns from the right expression to use when joining. Must have the
+  /// same length as left_columns.
+  ///
+  /// If omitted, the names provided in left_columns must match the names in
+  /// the right expression.
+  right_columns:[ColumnReference];
+}

Review comment:
       There is the `NonEqualityJoin` which allows for arbitrary expressions. This could be collapsed to be just a single expression-based join and leave it to the engine to decide how to execute the join




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm closed pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
wesm closed pull request #10856:
URL: https://github.com/apache/arrow/pull/10856


   





[GitHub] [arrow] Jimexist commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
Jimexist commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r684646226



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {

Review comment:
       would specifying arity here be too stringent?







[GitHub] [arrow] cpcloud commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
cpcloud commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r692165082



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines are the names
+/// and semantics of functions that operation on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data typesa
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {
+  order_by:[SortKey];
+
+  partition_by:[ArrayExpr];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to include the function's output type when it is used in
+/// an ArrayExpr. The type may be omitted if it is the same as all of the
+/// inputs (for example, math functions where double input yields double
+/// output).

Review comment:
       Per the Google doc, types will be explicit in the IR. Resolution of types should occur on the producer side.
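    As background for the PrimitiveLiteralData encoding in the schema above, here is a minimal sketch of how a producer might pack literal bytes and a consumer unpack them. Little-endian byte order is an assumption here; the draft does not pin down a byte-order convention.

    ```python
    import struct

    def pack_double_literal(value):
        # PrimitiveLiteralData.data for a FloatingPoint/DOUBLE literal:
        # 8 bytes, assumed little-endian.
        return struct.pack("<d", value)

    def pack_bool_literal(value):
        # Booleans are a single byte: 1 (true) or 0 (false).
        return struct.pack("<B", 1 if value else 0)

    def unpack_double_literal(data):
        # The consumer needs the Literal's Type field to know how to
        # interpret the raw bytes; here we assume FloatingPoint/DOUBLE.
        (value,) = struct.unpack("<d", data)
        return value
    ```

    The point of the review comment stands: the bytes alone are ambiguous, so the explicit Type on the Literal (resolved by the producer) is what makes them decodable.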




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r681385921



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines are the names
+/// and semantics of functions that operation on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data typesa
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,

Review comment:
       I forget: does zero need to be reserved in Flatbuffers for forward-compatibility checks?
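    For context on this question: FlatBuffers omits fields that equal their declared default, so a reader that finds no stored `type` on a FunctionDescr falls back to the default (`SCALAR = 0`), meaning 0 effectively doubles as "unspecified". A sketch of that behavior, simulated with a plain dict rather than the generated FlatBuffers accessors:

    ```python
    # Enum values from the schema's FunctionType.
    SCALAR, AGGREGATE, WINDOW, TABLE = 0, 1, 2, 3

    def read_function_type(table_fields):
        # Mirrors `type:FunctionType = SCALAR;` -- a field absent from the
        # serialized table reads back as its declared default.
        return table_fields.get("type", SCALAR)

    # A writer that never set `type` (e.g. an older producer):
    old_message = {"name": "sum"}
    # A writer that set it explicitly:
    new_message = {"name": "rank", "type": WINDOW}
    ```

    This is why some schemas reserve 0 for an `UNKNOWN`/`UNSPECIFIED` member instead of a meaningful value like `SCALAR`.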







[GitHub] [arrow] wesm commented on pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#issuecomment-925088150


   Closing indeed. Thank you all





[GitHub] [arrow] alamb commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r682833506



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines are the names
+/// and semantics of functions that operation on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data typesa
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function)..
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {
+  order_by:[SortKey];
+
+  partition_by:[ArrayExpr];
+}
+
+/// A general array function call, which may be built-in or user-defined.

Review comment:
       Often a distinction is made between a "scalar function", which produces one output row for each input row; an "aggregate", which produces a single output row for all input rows; and a "table function", which produces something in between.
   
   Given that `ArrayFunction` is used in `Filter` below, it seems like `ArrayFunction` must be a 'scalar function' by the above definition. 
   
   
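    The row-cardinality contracts described in the comment above can be sketched with plain Python lists (illustrative only; these are not IR or engine APIs):

    ```python
    # scalar: n rows in -> n rows out, elementwise.
    def scalar_abs(column):
        return [abs(v) for v in column]

    # aggregate: n rows in -> 1 row out.
    def aggregate_sum(column):
        return [sum(column)]

    # table function: each input row may emit zero or more output rows,
    # so n rows in -> m rows out (here, an "explode" over list values).
    def table_explode(column_of_lists):
        return [v for row in column_of_lists for v in row]

    rows = [-1, 2, -3]
    ```

    Under these contracts, only a scalar function preserves row alignment with its input, which is why a filter predicate must be one.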

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines are the names
+/// and semantics of functions that operation on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data typesa
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function)..
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {
+  order_by:[SortKey];
+
+  partition_by:[ArrayExpr];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to put the function output type when using in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr:FunctionDescr (required);
+  inputs:[ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options:[NamedLiteral];
+
+  /// Optional window expression for window functions only.
+  ///
+  /// TODO: Decide if window functions should be specified in a different way.
+  window:WindowFrame;

Review comment:
       I would expect WindowFrame to appear on the definition of some relational node that was responsible for organizing the data per the frame definition and then calling a function that produced one output row for each window of data in the input
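    To illustrate the semantics described in the comment above, here is a minimal sketch of a window computation that organizes data per a frame definition (partition keys plus an ordering) and produces one output row per input row. The helper is hypothetical and not part of the IR; it approximates the shape of `SUM(x) OVER (PARTITION BY k ORDER BY t)`:

    ```python
    from collections import defaultdict

    def windowed_cumsum(rows, partition_key, order_key, value_key):
        # Group row indices by partition key, sort each group by the
        # order key, then accumulate; emit results in the original
        # row order so output cardinality matches input cardinality.
        groups = defaultdict(list)
        for i, row in enumerate(rows):
            groups[row[partition_key]].append(i)
        out = {}
        for idxs in groups.values():
            idxs.sort(key=lambda i: rows[i][order_key])
            running = 0
            for i in idxs:
                running += rows[i][value_key]
                out[i] = running
        return [out[i] for i in range(len(rows))]
    ```

    Attaching the frame to a relational node, as suggested, would let the engine own this organize-then-apply step rather than each function call.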

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines are the names
+/// and semantics of functions that operation on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data typesa
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  /// The name of the referenced table.
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+/// function).
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {
+  order_by:[SortKey];
+
+  partition_by:[ArrayExpr];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to specify the function output type when using this in
+/// an ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr:FunctionDescr (required);
+  inputs:[ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options:[NamedLiteral];
+
+  /// Optional window expression for window functions only.
+  ///
+  /// TODO: Decide if window functions should be specified in a different way.
+  window:WindowFrame;
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition:ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then:ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else:ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input:ArrayExpr (required);
+  in_exprs:[ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated:bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input:ArrayExpr (required);
+  left_bound:ArrayExpr (required);
+  right_bound:ArrayExpr (required);
+}
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+/// An array expression: an array operation together with an optional name and
+/// expected output type.
+///
+/// An expression yielding a scalar value can be broadcasted to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op:ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name:string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type:Type;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producer expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {
+  /// The unique name to identify the data source.
+  name:string (required);
+
+  /// The schema of the data source. This may be a partial schema (ignoring
+  /// unused fields), but it at least asserts the fields and types that are
+  /// expected to exist in the data source.
+  schema:Schema (required);
+
+  /// Optional opaque table serialization data, for passing engine-specific
+  /// instructions to enable the data to be accessed.
+  serde_type:string;
+  serde_data:[ubyte];
+}
+
+/// An auxiliary helper instruction to "include all columns" in the projection,
+/// sparing the IR producer the need to enumerate all column references in a
+/// projection.
+table StarSelection {
+  ref:TableReference;
+}
+
+/// A helper union to permit the "SELECT *, $expr0, ..." construct from SQL.
+union ProjectionExpr {
+  ArrayExpr,
+  StarSelection
+}
+
+/// Computes a new table given a set of column selections or array expressions.
+table Projection {
+  input:TableExpr (required);
+
+  /// Each expression must reference fields found in the input table
+  /// expression.
+  exprs:[ProjectionExpr] (required);
+}
+
+/// Select rows from table for given boolean condition.
+table Filter {
+  input:TableExpr (required);
+
+  /// Array expression using input table expression yielding boolean output
+  /// type.
+  condition:ArrayExpr (required);
+}
+
+/// A "group by" table aggregation: data is grouped using the group
+/// expressions, and the aggregate expressions are evaluated within each group.
+table Aggregate {
+  input:TableExpr;
+  aggregate_exprs:[ArrayExpr] (required);
+
+  /// Expressions to use as group keys. If not provided, then the aggregate
+  /// operation yields a table with a single row.
+  group_exprs:[ArrayExpr];
+
+  /// A filter condition for the aggregation which may include aggregate
+  /// functions.
+  having:[ArrayExpr];

Review comment:
       Most IRs I have seen model this as a Filter after the Aggregate (and then physical implementations might push the having expressions into the specific aggregate operator)
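
   The Filter-after-Aggregate modeling described above can be sketched in Python (all helper names hypothetical, for illustration only): `HAVING` becomes an ordinary post-aggregation filter, so no `having` field is needed on the Aggregate node itself.

   ```python
   from collections import defaultdict

   def aggregate(rows, key, agg):
       """Group rows by `key` and compute `agg` over each group."""
       groups = defaultdict(list)
       for row in rows:
           groups[row[key]].append(row)
       return [{key: k, "agg": agg(g)} for k, g in groups.items()]

   def filter_rows(rows, predicate):
       """A plain Filter relational operator."""
       return [r for r in rows if predicate(r)]

   orders = [
       {"customer": "a", "amount": 10},
       {"customer": "a", "amount": 20},
       {"customer": "b", "amount": 5},
   ]

   # SELECT customer, SUM(amount) FROM orders
   #   GROUP BY customer HAVING SUM(amount) > 10
   # modeled as Filter(Aggregate(...)) rather than a `having` field:
   result = filter_rows(
       aggregate(orders, "customer", lambda g: sum(r["amount"] for r in g)),
       lambda r: r["agg"] > 10,
   )
   ```

   A physical planner is then free to push the post-aggregation filter back into the aggregate operator when that is profitable.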

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence among query engines is the
+/// names and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  /// The name of the referenced table.
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {

Review comment:
       It is also possible to model things like `IS NOT NULL` as a unary function rather than a unique node in the IR if we want to reduce the number of types in this tree
   
    For example, you could model `IS NOT NULL <column>` like `ArrayFunction(descr="IsNotNull", inputs=[column])`, which perhaps is what @pitrou  is suggesting in  https://github.com/apache/arrow/pull/10856/files#r681583577
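
   A rough Python sketch of that approach (all names hypothetical): null checks become entries in a function registry rather than dedicated IR node types, which keeps the union of IR operations small at the cost of agreeing on canonical function names.

   ```python
   from dataclasses import dataclass

   @dataclass
   class ColumnReference:
       name: str

   @dataclass
   class ArrayFunction:
       name: str    # e.g. "is_not_null" -- hypothetical canonical name
       inputs: list

   # A function registry replaces per-operation IR node types.
   FUNCTIONS = {
       "is_null": lambda col: [v is None for v in col],
       "is_not_null": lambda col: [v is not None for v in col],
   }

   def evaluate(expr, table):
       """Evaluate a tiny expression tree against a columnar table."""
       if isinstance(expr, ColumnReference):
           return table[expr.name]
       if isinstance(expr, ArrayFunction):
           args = [evaluate(i, table) for i in expr.inputs]
           return FUNCTIONS[expr.name](*args)
       raise TypeError(expr)

   table = {"x": [1, None, 3]}
   expr = ArrayFunction("is_not_null", [ColumnReference("x")])
   # evaluate(expr, table) yields one boolean per input value.
   ```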
   

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence among query engines is the
+/// names and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  /// The name of the referenced table.
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function)..
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {
+  order_by:[SortKey];
+
+  partition_by:[ArrayExpr];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to specify the function output type when using this in
+/// an ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr:FunctionDescr (required);
+  inputs:[ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options:[NamedLiteral];
+
+  /// Optional window expression for window functions only.
+  ///
+  /// TODO: Decide if window functions should be specified in a different way.
+  window:WindowFrame;
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition:ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then:ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else:ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input:ArrayExpr (required);
+  in_exprs:[ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated:bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input:ArrayExpr (required);
+  left_bound:ArrayExpr (required);
+  right_bound:ArrayExpr (required);
+}
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+/// An array expression: an array operation together with an optional name and
+/// expected output type.
+///
+/// An expression yielding a scalar value can be broadcasted to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op:ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name:string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type:Type;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producer expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {
+  /// The unique name to identify the data source.
+  name:string (required);
+
+  /// The schema of the data source. This may be a partial schema (ignoring
+  /// unused fields), but it at least asserts the fields and types that are
+  /// expected to exist in the data source.
+  schema:Schema (required);
+
+  /// Optional opaque table serialization data, for passing engine-specific
+  /// instructions to enable the data to be accessed.
+  serde_type:string;
+  serde_data:[ubyte];
+}
+
+/// An auxiliary helper instruction to "include all columns" in the projection,
+/// sparing the IR producer the need to enumerate all column references in a
+/// projection.
+table StarSelection {
+  ref:TableReference;
+}
+
+/// A helper union to permit the "SELECT *, $expr0, ..." construct from SQL.
+union ProjectionExpr {
+  ArrayExpr,
+  StarSelection
+}
+
+/// Computes a new table given a set of column selections or array expressions.
+table Projection {
+  input:TableExpr (required);
+
+  /// Each expression must reference fields found in the input table
+  /// expression.
+  exprs:[ProjectionExpr] (required);
+}
+
+/// Select rows from table for given boolean condition.
+table Filter {
+  input:TableExpr (required);
+
+  /// Array expression using input table expression yielding boolean output
+  /// type.
+  condition:ArrayExpr (required);
+}
+
+/// A "group by" table aggregation: data is grouped using the group
+/// expressions, and the aggregate expressions are evaluated within each group.
+table Aggregate {
+  input:TableExpr;
+  aggregate_exprs:[ArrayExpr] (required);
+
+  /// Expressions to use as group keys. If not provided, then the aggregate
+  /// operation yields a table with a single row.
+  group_exprs:[ArrayExpr];
+
+  /// A filter condition for the aggregation which may include aggregate
+  /// functions.
+  having:[ArrayExpr];
+}
+
+/// Select up to the indicated number of rows from the input expression based
+/// on the first-emitted rows when evaluating the input expression. Generally,
+/// no particular order is guaranteed unless combined with a sort expression.
+table Limit {
+  input:TableExpr;
+
+  /// Number of logical rows to select from input.
+  limit:long;
+}
+
+enum RelationalJoinType : int {
+  INNER = 0,
+  LEFT = 1,
+  RIGHT = 2,
+  OUTER = 3,
+  SEMI = 4,
+  ANTI = 5,
+}
+
+/// A standard relational / SQL-style equijoin.
+///
+/// Providing no left/right columns produces the cross product of the two
+/// tables.
+table EqualityJoin {
+  // TODO: complete and document
+  type:RelationalJoinType = INNER;
+
+  left:TableExpr (required);
+  right:TableExpr (required);
+
+  /// The columns from the left expression to use when joining
+  left_columns:[ColumnReference] (required);
+
+  /// The columns from the right expression to use when joining. Must have the
+  /// same length as left_columns.
+  ///
+  /// If omitted, the names provided in left_columns must match the names in
+  /// the right expression.
+  right_columns:[ColumnReference];
+}

Review comment:
       SQL-style joins allow any arbitrary predicate (often called `on_exprs` as they appear in the `ON` clause). The break-out of left/right columns for equijoins is almost always required for performance reasons, but there can be other predicates.
   
   For example here is a query you can not represent using the Compute IR in this PR:
   
   ```sql
   SELECT * 
   FROM 
      orders LEFT JOIN lineitem ON (l_orderkey = o_orderkey AND l_comments LIKE '%one star%')
   ```
     which would produce values for all orders, even if they didn't have any "one star" reviews.
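
   A minimal Python sketch (hypothetical, illustration only) of why the residual predicate matters for LEFT JOIN semantics: every left row survives even when the non-equi condition filters out all of its matches.

   ```python
   def left_join(left, right, on):
       """Left outer join keeping every left row; `on` is an arbitrary
       row-pair predicate, not just an equality of key columns.
       Assumes `right` is non-empty (used to derive its column names)."""
       out = []
       for l in left:
           matched = False
           for r in right:
               if on(l, r):
                   out.append({**l, **r})
                   matched = True
           if not matched:
               # No match: emit the left row padded with NULLs.
               out.append({**l, **{k: None for k in right[0]}})
       return out

   orders = [{"o_orderkey": 1}, {"o_orderkey": 2}]
   lineitem = [
       {"l_orderkey": 1, "l_comments": "one star, terrible"},
       {"l_orderkey": 2, "l_comments": "five stars"},
   ]

   # ON (l_orderkey = o_orderkey AND l_comments LIKE '%one star%'):
   result = left_join(
       orders, lineitem,
       on=lambda l, r: r["l_orderkey"] == l["o_orderkey"]
                       and "one star" in r["l_comments"],
   )
   # Order 2 still appears, with NULL lineitem columns.
   ```

   An IR with only `left_columns`/`right_columns` equijoin keys cannot carry the `LIKE` part of this predicate.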




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r681385279



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence among query engines is the
+/// names and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);

Review comment:
       Are any restrictions placed on the contents?
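Related to that question: whatever restrictions end up applying to `name`, the `table` field's "required when there is referential ambiguity" rule could be exercised roughly as in this Python sketch (the names `resolve_column` and `schemas` are hypothetical, not part of the proposal):

```python
def resolve_column(schemas, name, table=None):
    """Resolve a ColumnReference against antecedent table schemas.

    schemas: dict mapping table name -> list of column names.
    table: optional TableReference name used for disambiguation.
    """
    matches = [t for t, cols in schemas.items()
               if name in cols and (table is None or t == table)]
    if not matches:
        raise KeyError("no column named %r" % name)
    if len(matches) > 1:
        # This is the "referential ambiguity" case where the schema says
        # a TableReference becomes required.
        raise ValueError("ambiguous column %r; supply a TableReference" % name)
    return matches[0], name

# resolve_column({"t1": ["a"], "t2": ["a", "b"]}, "b") -> ("t2", "b")
# resolve_column({"t1": ["a"], "t2": ["a"]}, "a", table="t1") -> ("t1", "a")
```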







[GitHub] [arrow] julianhyde commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
julianhyde commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r683053333



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types.
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function)..
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {
+  order_by:[SortKey];
+
+  partition_by:[ArrayExpr];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to specify the function output type when used in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr:FunctionDescr (required);
+  inputs:[ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options:[NamedLiteral];
+
+  /// Optional window expression for window functions only.
+  ///
+  /// TODO: Decide if window functions should be specified in a different way.
+  window:WindowFrame;
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition:ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then:ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else:ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input:ArrayExpr (required);
+  in_exprs:[ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated:bool = false;

Review comment:
       Be sure to specify the null semantics you intend here. SQL's `NOT IN` has horrendous semantics if input is null or in_exprs contains nulls. And people will want `IsIn` to implement those semantics.
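To make the hazard concrete, here is a rough Python sketch of SQL's three-valued logic for `NOT IN` (the helper names are illustrative only, not part of the proposal):

```python
def sql_eq(a, b):
    """SQL '=': UNKNOWN (None) if either operand is NULL."""
    if a is None or b is None:
        return None
    return a == b

def sql_not(a):
    """SQL NOT under three-valued logic."""
    return None if a is None else (not a)

def sql_and(a, b):
    """SQL AND under three-valued logic."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True

def sql_not_in(value, values):
    """value NOT IN (v0, v1, ...) = AND over (value <> v_i)."""
    result = True  # NOT IN over an empty list is TRUE
    for v in values:
        result = sql_and(result, sql_not(sql_eq(value, v)))
    return result

# sql_not_in(1, [2, 3])     -> True
# sql_not_in(1, [1, 2])     -> False
# sql_not_in(1, [2, None])  -> None: one NULL keeps NOT IN from ever being True
# sql_not_in(None, [1, 2])  -> None
```

A spec for `IsIn` with `negated: true` would need to say whether it follows these semantics or treats nulls as ordinary non-matching values.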







[GitHub] [arrow] cpcloud commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
cpcloud commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r692188796



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,510 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types.
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data: [ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code: int;  // required
+
+  value: LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type: Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data: LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type: Type (required);
+  data: [LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name: string;
+  value: Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name: string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name: string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table: TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD,
+  SUBTRACT,
+  MULTIPLY,
+  DIVIDE,
+  EQUAL,
+  NOT_EQUAL,
+  LESS,
+  LESS_EQUAL,
+  GREATER,
+  GREATER_EQUAL,
+  AND,
+  OR,
+  XOR
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type: BinaryOpType;
+  left: ArrayExpr (required);
+  right: ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR,
+  AGGREGATE,
+  WINDOW,
+  TABLE
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name: string (required);
+
+  type: FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function)..
+  data: [ubyte];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to specify the function output type when used in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr: FunctionDescr (required);
+  inputs: [ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options: [NamedLiteral];
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition: ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then: ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else: ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input: ArrayExpr (required);
+  in_exprs: [ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated: bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input: ArrayExpr (required);
+  left_bound: ArrayExpr (required);
+  right_bound: ArrayExpr (required);
+}
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+/// An expression yielding a scalar value can be broadcasted to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op: ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name: string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type: Type;
+
+  window: Frame;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producer expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {
+  /// The unique name to identify the data source.
+  name: string (required);
+
+  /// The schema of the data source. This may be a partial schema (ignoring
+  /// unused fields), but it at least asserts the fields and types that are
+  /// expected to exist in the data source.
+  schema: Schema (required);
+
+  /// Optional opaque table serialization data, for passing engine-specific
+  /// instructions to enable the data to be accessed.
+  serde_type: string;
+  serde_data: [ubyte];
+}
+
+enum FrameClause : uint8 {
+  ROWS,
+  RANGE
+}
+
+// Unbounded and CurrentRow are empty tables used as empty variants
+// in the Bound union.
+table Unbounded {}
+table CurrentRow {}
+
+// `Bound` represents the window bound computation in a window function like
+// `sum(x) over (rows between unbounded preceding and current row)`.
+union Bound {
+  ArrayExpr,
+  Unbounded,
+  CurrentRow
+}
+
+// `Frame` models a window frame clause, capturing the kind of clause
+// (ROWS/RANGE), how to partition the window, how to order within partitions,
+// and the bounds of the window.
+table Frame {
+  clause: FrameClause;
+  partition_by: [ArrayExpr];
+  order_by: [SortKey];
+  preceding: Bound (required);
+  following: Bound (required);
+}
+
+/// Computes a new table given a set of column selections or array expressions.
+table Project {
+  input: TableExpr (required);
+
+  /// Each expression must reference fields found in the input table
+  /// expression.
+  exprs: [ArrayExpr] (required);
+}
+
+/// Select rows from table for given boolean condition.
+table Filter {
+  input: TableExpr (required);
+
+  /// Array expression using input table expression yielding boolean output
+  /// type.
+  condition: ArrayExpr (required);
+}
+
+/// A "group by" table aggregation: data is grouped using the group
+/// expressions, and the aggregate expressions are evaluated within each group.
+table Aggregate {
+  input: TableExpr;
+  aggregate_exprs: [ArrayExpr] (required);
+
+  /// Expressions to use as group keys. If not provided, then the aggregate
+  /// operation yields a table with a single row.
+  group_exprs: [ArrayExpr];
+
+  /// A filter condition for the aggregation which may include aggregate
+  /// functions.
+  having: [ArrayExpr];
+}
+
+/// Select up to the indicated number of rows from the input expression based
+/// on the first-emitted rows when evaluating the input expression. Generally,
+/// no particular order is guaranteed unless combined with a sort expression.
+table Limit {
+  input: TableExpr;
+
+  /// Number of logical rows to select from input.
+  limit: long;
+}
+
+// The kind of join being produced
+enum RelationalJoinType : int {
+  INNER,
+  LEFT,
+  RIGHT,
+  FULL,
+  SEMI,
+  ANTI,

Review comment:
       #10934 has a canonical set, and allows for user-defined join types. Should these be canonicalized?
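For readers less familiar with the last two variants in the enum, a minimal Python sketch of SEMI and ANTI join semantics (equality keys only; all names are illustrative):

```python
def semi_join(left, right, key):
    """SEMI join: left rows with at least one match in right; never duplicated."""
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] in right_keys]

def anti_join(left, right, key):
    """ANTI join: left rows with no match in right."""
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] not in right_keys]

left = [{"id": 1}, {"id": 2}, {"id": 3}]
right = [{"id": 2}, {"id": 2}]
# semi_join(left, right, "id") -> [{"id": 2}]
# anti_join(left, right, "id") -> [{"id": 1}, {"id": 3}]
```

Note that unlike an INNER join, a SEMI join emits each matching left row once regardless of how many right rows match.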







[GitHub] [arrow] cpcloud commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
cpcloud commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r692182740



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,510 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types.
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data: [ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code: int;  // required
+
+  value: LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type: Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data: LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type: Type (required);
+  data: [LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name: string;
+  value: Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  /// The name of the table being referenced.
+  name: string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name: string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table: TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD,
+  SUBTRACT,
+  MULTIPLY,
+  DIVIDE,
+  EQUAL,
+  NOT_EQUAL,
+  LESS,
+  LESS_EQUAL,
+  GREATER,
+  GREATER_EQUAL,
+  AND,
+  OR,
+  XOR
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type: BinaryOpType;
+  left: ArrayExpr (required);
+  right: ArrayExpr (required);
+}
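
To make the expression-tree shape concrete, a minimal evaluation sketch (plain Python tuples standing in for the Flatbuffers tables; `OPS` covers only a few of the enum values):

```python
import operator

# Hypothetical in-memory mirror of BinaryOp/Literal nodes; not the
# Flatbuffers API itself.
OPS = {"ADD": operator.add, "MULTIPLY": operator.mul, "LESS": operator.lt}

def evaluate(node):
    # A node is either ("lit", value) or ("bin", op_name, left, right).
    if node[0] == "lit":
        return node[1]
    op, left, right = node[1], evaluate(node[2]), evaluate(node[3])
    return OPS[op](left, right)

# (2 + 3) * 4 expressed as nested BinaryOp nodes
expr = ("bin", "MULTIPLY", ("bin", "ADD", ("lit", 2), ("lit", 3)), ("lit", 4))
assert evaluate(expr) == 20
```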
+
+enum FunctionType : int {
+  SCALAR,
+  AGGREGATE,
+  WINDOW,
+  TABLE
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name: string (required);
+
+  type: FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data: [ubyte];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to populate the function's output type when it is used
+/// in an ArrayExpr. It is acceptable to omit the type if it is the same as
+/// all the inputs (for example, math functions where double input yields
+/// double output).
+table ArrayFunction {
+  descr: FunctionDescr (required);
+  inputs: [ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options: [NamedLiteral];
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition: ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then: ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else: ArrayExpr (required);
+}
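
The selection semantics, including the scalar-to-array broadcast that ArrayExpr allows, can be sketched as (hypothetical helper, using Python lists in place of Arrow arrays):

```python
def if_else(condition, then_vals, else_vals):
    # Broadcast scalars (non-list operands) to the condition's length,
    # then pick element-wise based on the boolean condition.
    n = len(condition)
    broadcast = lambda v: v if isinstance(v, list) else [v] * n
    t, e = broadcast(then_vals), broadcast(else_vals)
    return [tv if c else ev for c, tv, ev in zip(condition, t, e)]

assert if_else([True, False, True], [1, 2, 3], 0) == [1, 0, 3]
```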
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input: ArrayExpr (required);
+  in_exprs: [ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated: bool = false;
+}
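
The desugaring described in the comment above can be sketched directly (hypothetical helper, scalar values standing in for expressions):

```python
from functools import reduce

def is_in(value, in_values, negated=False):
    # IsIn(x, [v0, v1, ...]) == Or(Or(Eq(x, v0), Eq(x, v1)), ...)
    result = reduce(lambda acc, v: acc or value == v, in_values, False)
    return not result if negated else result

assert is_in(2, [1, 2, 3]) is True
assert is_in(5, [1, 2, 3], negated=True) is True
```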
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input: ArrayExpr (required);
+  left_bound: ArrayExpr (required);
+  right_bound: ArrayExpr (required);
+}
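
As with IsIn, the equivalence stated in the comment is easy to sketch (hypothetical helper over scalar values):

```python
def between(x, left_bound, right_bound):
    # x BETWEEN left_bound AND right_bound
    # == x >= left_bound AND x <= right_bound
    return x >= left_bound and x <= right_bound

assert between(5, 1, 10) is True
assert between(0, 1, 10) is False
```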
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+/// An expression yielding a scalar value can be broadcasted to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op: ArrayOperation (required);
+
+  /// Optional name for the array operation. If not provided, it may be
+  /// inferred from antecedent inputs. IR producers are encouraged to provide
+  /// names to avoid ambiguity.
+  name: string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type: Type;
+
+  /// Optional window frame, for window function expressions only.
+  window: Frame;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producer expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {
+  /// The unique name to identify the data source.
+  name: string (required);
+
+  /// The schema of the data source. This may be a partial schema (ignoring
+  /// unused fields), but it at least asserts the fields and types that are
+  /// expected to exist in the data source.
+  schema: Schema (required);
+
+  /// Optional opaque table serialization data, for passing engine-specific
+  /// instructions to enable the data to be accessed.
+  serde_type: string;
+  serde_data: [ubyte];
+}
+
+enum FrameClause : uint8 {
+  ROWS,
+  RANGE
+}
+
+/// Unbounded and CurrentRow are empty tables used as marker variants
+/// in the Bound union.
+table Unbounded {}
+table CurrentRow {}
+
+/// `Bound` represents a window bound computation in a window function like
+/// `sum(x) over (rows between unbounded preceding and current row)`.
+union Bound {
+  ArrayExpr,
+  Unbounded,
+  CurrentRow
+}
+
+/// `Frame` models a window frame clause, capturing the kind of clause
+/// (ROWS/RANGE), how to partition the window, how to order within partitions,
+/// and the bounds of the window.
+table Frame {
+  clause: FrameClause;
+  partition_by: [ArrayExpr];
+  order_by: [SortKey];
+  preceding: Bound (required);
+  following: Bound (required);
+}
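
The ROWS-clause semantics can be sketched as follows (a hypothetical helper; `None` stands in for the Unbounded variant, an integer offset for an ArrayExpr bound):

```python
def rows_window_sum(values, preceding, following):
    # ROWS frame: for each row i, aggregate over the index range
    # [i - preceding, i + following]; None means UNBOUNDED.
    out = []
    for i in range(len(values)):
        lo = 0 if preceding is None else max(0, i - preceding)
        hi = len(values) if following is None else min(len(values), i + following + 1)
        out.append(sum(values[lo:hi]))
    return out

# sum(x) over (rows between unbounded preceding and current row)
assert rows_window_sum([1, 2, 3, 4], preceding=None, following=0) == [1, 3, 6, 10]
```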
+
+/// Computes a new table given a set of column selections or array expressions.
+table Project {
+  input: TableExpr (required);
+
+  /// Each expression must reference fields found in the input table
+  /// expression.
+  exprs: [ArrayExpr] (required);
+}
+
+/// Select rows from table for given boolean condition.
+table Filter {
+  input: TableExpr (required);
+
+  /// Array expression using input table expression yielding boolean output
+  /// type.
+  condition: ArrayExpr (required);
+}
+
+/// A "group by" table aggregation: data is grouped using the group
+/// expressions, and the aggregate expressions are evaluated within each group.
+table Aggregate {
+  input: TableExpr;
+  aggregate_exprs: [ArrayExpr] (required);

Review comment:
       Yup, both keys and aggregates are optional in #10934.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r681385680



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,

Review comment:
       should integer division be a separate operation then floating point division?







[GitHub] [arrow] julianhyde commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
julianhyde commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r683047963



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// An auxiliary helper instruction to "include all columns" in the projection,
+/// sparing the IR producer the need to enumerate all column references in a
+/// projection.
+table StarSelection {
+  ref:TableReference;
+}
+
+/// A helper union to permit the "SELECT *, $expr0, ..." construct from SQL.
+union ProjectionExpr {
+  ArrayExpr,
+  StarSelection
+}
+
+/// Computes a new table given a set of column selections or array expressions.
+table Projection {

Review comment:
       consider renaming to `Project`. `Filter`, `Aggregate` are verbs.







[GitHub] [arrow] cpcloud commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
cpcloud commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r692172728



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,510 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines are the names
+/// and semantics of functions that operation on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data typesa
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data: [ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code: int;  // required
+
+  value: LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type: Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data: LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type: Type (required);
+  data: [LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name: string;
+  value: Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  /// The name identifying the referenced table.
+  name: string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name: string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table: TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {
+  input: ArrayExpr (required);
+}
+
+/// Operation flips sign of numeric expression
+table Negate {
+  input: ArrayExpr (required);
+}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD,
+  SUBTRACT,
+  MULTIPLY,
+  DIVIDE,
+  EQUAL,
+  NOT_EQUAL,
+  LESS,
+  LESS_EQUAL,
+  GREATER,
+  GREATER_EQUAL,
+  AND,
+  OR,
+  XOR
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type: BinaryOpType;
+  left: ArrayExpr (required);
+  right: ArrayExpr (required);
+}
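For illustration only (plain Python dicts standing in for the Flatbuffers tables above; field and enum names mirror the schema but this is not generated code), the expression `price * 1.1` might be modeled as:

```python
# A plain-dict stand-in for the ArrayExpr/BinaryOp/ColumnReference/Literal
# tables: a MULTIPLY node whose left input is a column reference and whose
# right input is a double literal.
expr = {
    "op": {
        "BinaryOp": {
            "type": "MULTIPLY",
            "left": {"op": {"ColumnReference": {"name": "price"}}},
            "right": {"op": {"Literal": {"type": "FloatingPoint/DOUBLE",
                                         "data": 1.1}}},
        }
    },
    # Recommended: the expected output type of the operation.
    "out_type": "FloatingPoint/DOUBLE",
}

assert expr["op"]["BinaryOp"]["type"] == "MULTIPLY"
```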
+
+enum FunctionType : int {
+  SCALAR,
+  AGGREGATE,
+  WINDOW,
+  TABLE
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name: string (required);
+
+  type: FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data: [ubyte];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to provide the function output type in the enclosing
+/// ArrayExpr. It is acceptable to omit the type if it is the same as that of
+/// all the inputs (for example, math functions where double input yields
+/// double output).
+table ArrayFunction {
+  descr: FunctionDescr (required);
+  inputs: [ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options: [NamedLiteral];
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Condition expression; must have Boolean output type
+  condition: ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then: ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else: ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input: ArrayExpr (required);
+  in_exprs: [ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated: bool = false;
+}
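The equivalence stated in the IsIn doc comment can be sketched as a desugaring (editorial example; tuples stand in for the expression tables):

```python
# Desugar IsIn(input, [v0, v1, ...]) into the equivalent
# Or(Or(Eq(input, v0), Eq(input, v1)), ...) chain of BinaryOps,
# using plain tuples as a stand-in expression tree.
def eq(left, right):
    return ("EQUAL", left, right)

def or_(left, right):
    return ("OR", left, right)

def desugar_is_in(input_expr, in_exprs, negated=False):
    expr = eq(input_expr, in_exprs[0])
    for value in in_exprs[1:]:
        expr = or_(expr, eq(input_expr, value))
    # negated=True flips the whole membership check.
    return ("NOT", expr) if negated else expr

assert desugar_is_in("x", [1, 2]) == \
    ("OR", ("EQUAL", "x", 1), ("EQUAL", "x", 2))
```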
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input: ArrayExpr (required);
+  left_bound: ArrayExpr (required);
+  right_bound: ArrayExpr (required);
+}
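Between admits the same kind of desugaring into the compound predicate its doc comment describes (editorial sketch with tuples as the expression tree):

```python
# Desugar Between(input, left_bound, right_bound) into
# And(GreaterEqual(input, left_bound), LessEqual(input, right_bound)).
def desugar_between(input_expr, left_bound, right_bound):
    return ("AND",
            ("GREATER_EQUAL", input_expr, left_bound),
            ("LESS_EQUAL", input_expr, right_bound))

assert desugar_between("x", 0, 10) == (
    "AND", ("GREATER_EQUAL", "x", 0), ("LESS_EQUAL", "x", 10))
```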
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+/// An expression yielding a scalar value can be broadcasted to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op: ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name: string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type: Type;
+
+  window: Frame;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producer expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {
+  /// The unique name to identify the data source.
+  name: string (required);
+
+  /// The schema of the data source. This may be a partial schema (ignoring
+  /// unused fields), but it at least asserts the fields and types that are
+  /// expected to exist in the data source.
+  schema: Schema (required);
+
+  /// Optional opaque table serialization data, for passing engine-specific
+  /// instructions to enable the data to be accessed.
+  serde_type: string;
+  serde_data: [ubyte];
+}
+
+enum FrameClause : uint8 {
+  ROWS,
+  RANGE
+}
+
+// Unbounded and CurrentRow are empty tables used as empty variants
+// in the Bound union.
+table Unbounded {}
+table CurrentRow {}
+
+// `Bound` represents the window bound computation in a window function like
+// `sum(x) over (unbounded preceding and current row)`.
+union Bound {
+  ArrayExpr,
+  Unbounded,
+  CurrentRow
+}
+
+// `Frame` models a window frame clause, capturing the kind of clause
+// (ROWS/RANGE), how to partition the window, how to order within partitions,
+// and the bounds of the window.
+table Frame {
+  clause: FrameClause;
+  partition_by: [ArrayExpr];
+  order_by: [SortKey];
+  preceding: Bound (required);

Review comment:
       Good point. I will make sure to incorporate that into windows in #10934 







[GitHub] [arrow] cpcloud commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
cpcloud commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r692185880



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,510 @@
+// `Frame` models a window frame clause, capturing the kind of clause
+// (ROWS/RANGE), how to partition the window, how to order within partitions,
+// and the bounds of the window.
+table Frame {
+  clause: FrameClause;
+  partition_by: [ArrayExpr];
+  order_by: [SortKey];
+  preceding: Bound (required);
+  following: Bound (required);
+}
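To make the Frame fields concrete (editorial sketch; plain dicts stand in for the Flatbuffers tables, and the SortKey shape is assumed), a frame for `SUM(x) OVER (PARTITION BY g ORDER BY t ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)` might look like:

```python
# Plain-dict stand-in for a Frame describing
#   SUM(x) OVER (PARTITION BY g ORDER BY t
#                ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
frame = {
    "clause": "ROWS",
    "partition_by": [{"ColumnReference": {"name": "g"}}],
    # SortKey shape is assumed here (expression plus ordering).
    "order_by": [{"expr": {"ColumnReference": {"name": "t"}},
                  "ordering": "ASCENDING"}],
    # The empty Unbounded/CurrentRow tables act as variants of Bound.
    "preceding": "Unbounded",
    "following": "CurrentRow",
}

assert frame["clause"] == "ROWS"
```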
+
+/// Computes a new table given a set of column selections or array expressions.
+table Project {
+  input: TableExpr (required);
+
+  /// Each expression must reference fields found in the input table
+  /// expression.
+  exprs: [ArrayExpr] (required);
+}
+
+/// Select rows from table for given boolean condition.
+table Filter {
+  input: TableExpr (required);
+
+  /// Array expression using input table expression yielding boolean output
+  /// type.
+  condition: ArrayExpr (required);
+}
+
+/// A "group by" table aggregation: data is grouped using the group
+/// expressions, and the aggregate expressions are evaluated within each group.
+table Aggregate {
+  input: TableExpr;
+  aggregate_exprs: [ArrayExpr] (required);
+
+  /// Expressions to use as group keys. If not provided, then the aggregate
+  /// operation yields a table with a single row.
+  group_exprs: [ArrayExpr];
+
+  /// A filter condition for the aggregation which may include aggregate
+  /// functions.
+  having: [ArrayExpr];
+}
+
+/// Select up to the indicated number of rows from the input expression based
+/// on the first-emitted rows when evaluating the input expression. Generally,
+/// no particular order is guaranteed unless combined with a sort expression.
+table Limit {
+  input: TableExpr;
+
+  /// Number of logical rows to select from input.
+  limit: long;

Review comment:
       Good call. Thank you.







[GitHub] [arrow] wesm commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r682596358



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);

Review comment:
       "string" in Flatbuffers is UTF-8, is that enough of a constraint?







[GitHub] [arrow] julianhyde commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
julianhyde commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r683049332



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines are the names
+/// and semantics of functions that operation on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data typesa
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
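To show how BinaryOp nodes nest into an expression tree, here is a sketch using plain Python dataclasses as illustrative stand-ins (not the generated flatbuffers classes); the tree below encodes `(a + b) < 10`:

```python
from dataclasses import dataclass
from typing import Any

# Stand-ins for ColumnReference, Literal, and BinaryOp respectively.
@dataclass
class Column:
    name: str

@dataclass
class Lit:
    value: Any

@dataclass
class BinOp:
    op: str        # a BinaryOpType name such as "ADD" or "LESS"
    left: Any
    right: Any

# (a + b) < 10: the comparison is the root; the addition is its left child.
expr = BinOp("LESS", BinOp("ADD", Column("a"), Column("b")), Lit(10))
assert expr.op == "LESS"
assert expr.left.op == "ADD"
assert expr.right.value == 10
```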
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {
+  order_by:[SortKey];
+
+  partition_by:[ArrayExpr];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to put the function output type when using in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr:FunctionDescr (required);
+  inputs:[ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options:[NamedLiteral];
+
+  /// Optional window expression for window functions only.
+  ///
+  /// TODO: Decide if window functions should be specified in a different way.
+  window:WindowFrame;
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition:ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then:ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else:ArrayExpr (required);
+}
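A minimal sketch of the elementwise semantics, using plain Python lists in place of arrays:

```python
# IfElse sketch: select elementwise from the then- or else-branch based on
# the boolean condition.
def if_else(condition, then_values, else_values):
    return [t if c else e
            for c, t, e in zip(condition, then_values, else_values)]

assert if_else([True, False, True], [1, 2, 3], [9, 9, 9]) == [1, 9, 3]
```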
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input:ArrayExpr (required);
+  in_exprs:[ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated:bool = false;
+}
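The equivalence stated in the comment can be checked with a small Python sketch (plain values standing in for array expressions):

```python
# IsIn sketch: membership is the OR of per-value equality checks;
# `negated` flips the result.
def is_in(value, candidates, negated=False):
    result = any(value == c for c in candidates)
    return (not result) if negated else result

# The explicit Or(Or(Eq(...), Eq(...)), ...) chain from the comment above.
def or_of_equals(value, candidates):
    acc = False
    for c in candidates:
        acc = acc or (value == c)
    return acc

for v in (1, 4):
    assert is_in(v, [1, 2, 3]) == or_of_equals(v, [1, 2, 3])
assert is_in(4, [1, 2, 3], negated=True)
```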
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input:ArrayExpr (required);
+  left_bound:ArrayExpr (required);
+  right_bound:ArrayExpr (required);
+}
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+/// An expression yielding a scalar value can be broadcasted to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op:ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name:string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type:Type;
+}
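The broadcasting rule mentioned above can be sketched with plain Python lists (a hypothetical `broadcast` helper, not an Arrow API):

```python
# Broadcasting sketch: a scalar-yielding expression expands to the array's
# length when the two are combined elementwise.
def broadcast(scalar, length):
    return [scalar] * length

column = [1, 2, 3]
literal = broadcast(10, len(column))   # scalar 10 broadcast to array shape
assert [a + b for a, b in zip(column, literal)] == [11, 12, 13]
```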
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producers expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {
+  /// The unique name to identify the data source.
+  name:string (required);
+
+  /// The schema of the data source. This may be a partial schema (ignoring
+  /// unused fields), but it at least asserts the fields and types that are
+  /// expected to exist in the data source.
+  schema:Schema (required);
+
+  /// Optional opaque table serialization data, for passing engine-specific
+  /// instructions to enable the data to be accessed.
+  serde_type:string;
+  serde_data:[ubyte];
+}
+
+/// An auxiliary helper instruction to "include all columns" in the projection,
+/// sparing the IR producer the need to enumerate all column references in a
+/// projection.
+table StarSelection {
+  ref:TableReference;
+}
+
+/// A helper union to permit the "SELECT *, $expr0, ..." construct from SQL.
+union ProjectionExpr {
+  ArrayExpr,
+  StarSelection
+}
+
+/// Computes a new table given a set of column selections or array expressions.
+table Projection {
+  input:TableExpr (required);
+
+  /// Each expression must reference fields found in the input table
+  /// expression.
+  exprs:[ProjectionExpr] (required);
+}
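A sketch of the projection semantics over a column-oriented table (plain dicts and lists standing in for Arrow tables and arrays):

```python
# Projection sketch: each output column is either a column reference or a
# computed expression over the input table's fields.
table = {"a": [1, 2, 3], "b": [10, 20, 30]}

projected = {
    "a": table["a"],                                              # ColumnReference
    "a_plus_b": [x + y for x, y in zip(table["a"], table["b"])],  # ArrayExpr
}

assert projected["a_plus_b"] == [11, 22, 33]
assert list(projected) == ["a", "a_plus_b"]
```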
+
+/// Select rows from table for given boolean condition.
+table Filter {
+  input:TableExpr (required);
+
+  /// Array expression using input table expression yielding boolean output
+  /// type.
+  condition:ArrayExpr (required);
+}
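The filter semantics can be sketched as evaluating the boolean condition once per row, then keeping the rows where it is true:

```python
# Filter sketch over row-oriented data; the mask plays the role of the
# boolean-typed condition ArrayExpr.
rows = [{"x": 1}, {"x": 5}, {"x": 3}]
mask = [r["x"] > 2 for r in rows]
kept = [r for r, keep in zip(rows, mask) if keep]

assert mask == [False, True, True]
assert kept == [{"x": 5}, {"x": 3}]
```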
+
+/// A "group by" table aggregation: data is grouped using the group
+/// expressions, and the aggregate expressions are evaluated within each group.
+table Aggregate {
+  input:TableExpr;
+  aggregate_exprs:[ArrayExpr] (required);
+
+  /// Expressions to use as group keys. If not provided, then the aggregate
+  /// operation yields a table with a single row.
+  group_exprs:[ArrayExpr];
+
+  /// A filter condition for the aggregation which may include aggregate
+  /// functions.
+  having:[ArrayExpr];

Review comment:
       +1 to remove `having`. `HAVING` was added to SQL only because SQL at the time didn't allow nested queries.
   
   We can talk later about having 'fused' operators (e.g. `Project` followed by `Aggregate` followed by `Filter`) but let's keep the core operators minimal.







[GitHub] [arrow] cpcloud commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
cpcloud commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r692169651



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function)..
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {
+  order_by:[SortKey];
+
+  partition_by:[ArrayExpr];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to put the function output type when using in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr:FunctionDescr (required);
+  inputs:[ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options:[NamedLiteral];
+
+  /// Optional window expression for window functions only.
+  ///
+  /// TODO: Decide if window functions should be specified in a different way.
+  window:WindowFrame;
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition:ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then:ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else:ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input:ArrayExpr (required);
+  in_exprs:[ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated:bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input:ArrayExpr (required);
+  left_bound:ArrayExpr (required);
+  right_bound:ArrayExpr (required);
+}
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}

Review comment:
       See https://github.com/apache/arrow/pull/10934/files#diff-36ffcd270fce14cd204af5b0224821cf9c2cf5aff6c4885cb84553b640dd86f8R120. There are a number of standard calls (including arithmetic and logical operators such as and)







[GitHub] [arrow] github-actions[bot] commented on pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#issuecomment-891414580


   
   Thanks for opening a pull request!
   
   If this is not a [minor PR](https://github.com/apache/arrow/blob/master/CONTRIBUTING.md#Minor-Fixes). Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW
   
   Opening JIRAs ahead of time contributes to the [Openness](http://theapacheway.com/open/#:~:text=Openness%20allows%20new%20users%20the,must%20happen%20in%20the%20open.) of the Apache Arrow project.
   
   Then could you also rename pull request title in the following format?
   
       ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   or
   
       MINOR: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
     * [Other pull requests](https://github.com/apache/arrow/pulls/)
     * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   





[GitHub] [arrow] Mytherin commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
Mytherin commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r687105936



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,510 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data: [ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code: int;  // required
+
+  value: LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type: Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data: LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type: Type (required);
+  data: [LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name: string;
+  value: Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name: string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name: string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table: TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD,
+  SUBTRACT,
+  MULTIPLY,
+  DIVIDE,
+  EQUAL,
+  NOT_EQUAL,
+  LESS,
+  LESS_EQUAL,
+  GREATER,
+  GREATER_EQUAL,
+  AND,
+  OR,
+  XOR
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type: BinaryOpType;
+  left: ArrayExpr (required);
+  right: ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR,
+  AGGREGATE,
+  WINDOW,
+  TABLE
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name: string (required);
+
+  type: FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function)..
+  data: [ubyte];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to put the function output type when using in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr: FunctionDescr (required);
+  inputs: [ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options: [NamedLiteral];
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition: ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then: ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else: ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input: ArrayExpr (required);
+  in_exprs: [ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated: bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input: ArrayExpr (required);
+  left_bound: ArrayExpr (required);
+  right_bound: ArrayExpr (required);
+}
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+/// An expression yielding a scalar value can be broadcasted to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op: ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name: string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type: Type;
+
+  window: Frame;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producer expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {
+  /// The unique name to identify the data source.
+  name: string (required);
+
+  /// The schema of the data source. This may be a partial schema (ignoring
+  /// unused fields), but it at least asserts the fields and types that are
+  /// expected to exist in the data source.
+  schema: Schema (required);
+
+  /// Optional opaque table serialization data, for passing engine-specific
+  /// instructions to enable the data to be accessed.
+  serde_type: string;
+  serde_data: [ubyte];
+}
+
+enum FrameClause : uint8 {
+  ROWS,
+  RANGE
+}
+
+// Unbounded and CurrentRow are empty tables used as empty variants
+// in the Bound union.
+table Unbounded {}
+table CurrentRow {}
+
+// `Bound` represents the window bound computation in a window function like
+// `sum(x) over (unbounded preceding and current row)`.
+union Bound {
+  ArrayExpr,
+  Unbounded,
+  CurrentRow
+}
+
+// `Frame` models a window frame clause, capturing the kind of clause
+// (ROWS/RANGE), how to partition the window, how to order within partitions,
+// and the bounds of the window.
+table Frame {
+  clause: FrameClause;
+  partition_by: [ArrayExpr];
+  order_by: [SortKey];
+  preceding: Bound (required);
+  following: Bound (required);
+}
+
+/// Computes a new table given a set of column selections or array expressions.
+table Project {
+  input: TableExpr (required);
+
+  /// Each expression must reference fields found in the input table
+  /// expression.
+  exprs: [ArrayExpr] (required);
+}
+
+/// Select rows from table for given boolean condition.
+table Filter {
+  input: TableExpr (required);
+
+  /// Array expression using input table expression yielding boolean output
+  /// type.
+  condition: ArrayExpr (required);
+}
+
+/// A "group by" table aggregation: data is grouped using the group
+/// expressions, and the aggregate expressions are evaluated within each group.
+table Aggregate {
+  input: TableExpr;
+  aggregate_exprs: [ArrayExpr] (required);
+
+  /// Expressions to use as group keys. If not provided, then the aggregate
+  /// operation yields a table with a single row.
+  group_exprs: [ArrayExpr];
+
+  /// A filter condition for the aggregation which may include aggregate
+  /// functions.
+  having: [ArrayExpr];
+}
+
+/// Select up to the indicated number of rows from the input expression based
+/// on the first-emitted rows when evaluating the input expression. Generally,
+/// no particular order is guaranteed unless combined with a sort expression.
+table Limit {
+  input: TableExpr;
+
+  /// Number of logical rows to select from input.
+  limit: long;

Review comment:
       OFFSET should also be added.
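   As a sketch of why (plain Python with sqlite3, not the Arrow IR itself): pagination needs OFFSET to skip rows in addition to LIMIT capping the result size.

   ```python
   import sqlite3

   # Toy illustration of LIMIT/OFFSET pairing; a Limit IR node without an
   # offset field cannot express "skip N rows, then take M".
   conn = sqlite3.connect(":memory:")
   conn.execute("CREATE TABLE t (i INTEGER)")
   conn.executemany("INSERT INTO t VALUES (?)", [(n,) for n in range(10)])

   # Skip the first 3 rows, then take the next 2.
   rows = conn.execute("SELECT i FROM t ORDER BY i LIMIT 2 OFFSET 3").fetchall()
   print(rows)  # [(3,), (4,)]
   ```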

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,510 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data: [ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code: int;  // required
+
+  value: LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type: Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data: LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type: Type (required);
+  data: [LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name: string;
+  value: Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  /// The name of the referenced table.
+  name: string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name: string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table: TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD,
+  SUBTRACT,
+  MULTIPLY,
+  DIVIDE,
+  EQUAL,
+  NOT_EQUAL,
+  LESS,
+  LESS_EQUAL,
+  GREATER,
+  GREATER_EQUAL,
+  AND,
+  OR,
+  XOR
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type: BinaryOpType;
+  left: ArrayExpr (required);
+  right: ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR,
+  AGGREGATE,
+  WINDOW,
+  TABLE
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name: string (required);
+
+  type: FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data: [ubyte];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to put the function output type when using in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr: FunctionDescr (required);
+  inputs: [ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options: [NamedLiteral];
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition: ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then: ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else: ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input: ArrayExpr (required);
+  in_exprs: [ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated: bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. A convenience over specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input: ArrayExpr (required);
+  left_bound: ArrayExpr (required);
+  right_bound: ArrayExpr (required);
+}
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+/// An expression yielding a scalar value can be broadcasted to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op: ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name: string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type: Type;
+
+  window: Frame;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producer expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {
+  /// The unique name to identify the data source.
+  name: string (required);
+
+  /// The schema of the data source. This may be a partial schema (ignoring
+  /// unused fields), but it at least asserts the fields and types that are
+  /// expected to exist in the data source.
+  schema: Schema (required);
+
+  /// Optional opaque table serialization data, for passing engine-specific
+  /// instructions to enable the data to be accessed.
+  serde_type: string;
+  serde_data: [ubyte];
+}
+
+enum FrameClause : uint8 {
+  ROWS,
+  RANGE
+}
+
+// Unbounded and CurrentRow are empty tables used as empty variants
+// in the Bound union.
+table Unbounded {}
+table CurrentRow {}
+
+// `Bound` represents the window bound computation in a window function like
+// `sum(x) over (unbounded preceding and current row)`.
+union Bound {
+  ArrayExpr,
+  Unbounded,
+  CurrentRow
+}
+
+// `Frame` models a window frame clause, capturing the kind of clause
+// (ROWS/RANGE), how to partition the window, how to order within partitions,
+// and the bounds of the window.
+table Frame {
+  clause: FrameClause;
+  partition_by: [ArrayExpr];
+  order_by: [SortKey];
+  preceding: Bound (required);
+  following: Bound (required);
+}
+
+/// Computes a new table given a set of column selections or array expressions.
+table Project {
+  input: TableExpr (required);
+
+  /// Each expression must reference fields found in the input table
+  /// expression.
+  exprs: [ArrayExpr] (required);
+}
+
+/// Select rows from table for given boolean condition.
+table Filter {
+  input: TableExpr (required);
+
+  /// Array expression using input table expression yielding boolean output
+  /// type.
+  condition: ArrayExpr (required);
+}
+
+/// A "group by" table aggregation: data is grouped using the group
+/// expressions, and the aggregate expressions are evaluated within each group.
+table Aggregate {
+  input: TableExpr;
+  aggregate_exprs: [ArrayExpr] (required);
+
+  /// Expressions to use as group keys. If not provided, then the aggregate
+  /// operation yields a table with a single row.
+  group_exprs: [ArrayExpr];
+
+  /// A filter condition for the aggregation which may include aggregate
+  /// functions.
+  having: [ArrayExpr];
+}
+
+/// Select up to the indicated number of rows from the input expression based
+/// on the first-emitted rows when evaluating the input expression. Generally,
+/// no particular order is guaranteed unless combined with a sort expression.
+table Limit {
+  input: TableExpr;
+
+  /// Number of logical rows to select from input.
+  limit: long;

Review comment:
       In certain database engines, LIMIT and OFFSET are not limited to scalar values. e.g. in Postgres (and DuckDB) the following is valid SQL:
   
   ```sql
   SELECT *
   FROM integers
   LIMIT
     (SELECT min(i)
      FROM integers);
   ```
   
   Modeling this is not super critical but perhaps something to keep in mind.
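   For what it's worth, SQLite accepts the same construct, so the behavior is easy to try locally (a Python sketch against sqlite3, not the Arrow IR):

   ```python
   import sqlite3

   conn = sqlite3.connect(":memory:")
   conn.execute("CREATE TABLE integers (i INTEGER)")
   conn.executemany("INSERT INTO integers VALUES (?)", [(n,) for n in (3, 1, 4, 1, 5)])

   # The LIMIT row count is itself the result of a scalar subquery (the
   # minimum value in the table, here 1), so a Limit IR node modeled as a
   # plain scalar field could not express this directly.
   rows = conn.execute(
       "SELECT i FROM integers LIMIT (SELECT min(i) FROM integers)"
   ).fetchall()
   print(len(rows))  # 1
   ```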

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,510 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data: [ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code: int;  // required
+
+  value: LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type: Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data: LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type: Type (required);
+  data: [LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name: string;
+  value: Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  /// The name of the referenced table.
+  name: string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name: string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table: TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD,
+  SUBTRACT,
+  MULTIPLY,
+  DIVIDE,
+  EQUAL,
+  NOT_EQUAL,
+  LESS,
+  LESS_EQUAL,
+  GREATER,
+  GREATER_EQUAL,
+  AND,
+  OR,
+  XOR
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type: BinaryOpType;
+  left: ArrayExpr (required);
+  right: ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR,
+  AGGREGATE,
+  WINDOW,
+  TABLE
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name: string (required);
+
+  type: FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data: [ubyte];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to put the function output type when using in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr: FunctionDescr (required);
+  inputs: [ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options: [NamedLiteral];
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition: ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then: ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else: ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input: ArrayExpr (required);
+  in_exprs: [ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated: bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. A convenience over specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input: ArrayExpr (required);
+  left_bound: ArrayExpr (required);
+  right_bound: ArrayExpr (required);
+}
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+/// An expression yielding a scalar value can be broadcasted to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op: ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name: string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type: Type;
+
+  window: Frame;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producer expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {
+  /// The unique name to identify the data source.
+  name: string (required);
+
+  /// The schema of the data source. This may be a partial schema (ignoring
+  /// unused fields), but it at least asserts the fields and types that are
+  /// expected to exist in the data source.
+  schema: Schema (required);
+
+  /// Optional opaque table serialization data, for passing engine-specific
+  /// instructions to enable the data to be accessed.
+  serde_type: string;
+  serde_data: [ubyte];
+}
+
+enum FrameClause : uint8 {
+  ROWS,
+  RANGE
+}
+
+// Unbounded and CurrentRow are empty tables used as empty variants
+// in the Bound union.
+table Unbounded {}
+table CurrentRow {}
+
+// `Bound` represents the window bound computation in a window function like
+// `sum(x) over (unbounded preceding and current row)`.
+union Bound {
+  ArrayExpr,
+  Unbounded,
+  CurrentRow
+}
+
+// `Frame` models a window frame clause, capturing the kind of clause
+// (ROWS/RANGE), how to partition the window, how to order within partitions,
+// and the bounds of the window.
+table Frame {
+  clause: FrameClause;
+  partition_by: [ArrayExpr];
+  order_by: [SortKey];
+  preceding: Bound (required);
+  following: Bound (required);
+}
+
+/// Computes a new table given a set of column selections or array expressions.
+table Project {
+  input: TableExpr (required);
+
+  /// Each expression must reference fields found in the input table
+  /// expression.
+  exprs: [ArrayExpr] (required);
+}
+
+/// Select rows from table for given boolean condition.
+table Filter {
+  input: TableExpr (required);
+
+  /// Array expression using input table expression yielding boolean output
+  /// type.
+  condition: ArrayExpr (required);
+}
+
+/// A "group by" table aggregation: data is grouped using the group
+/// expressions, and the aggregate expressions are evaluated within each group.
+table Aggregate {
+  input: TableExpr;
+  aggregate_exprs: [ArrayExpr] (required);
+
+  /// Expressions to use as group keys. If not provided, then the aggregate
+  /// operation yields a table with a single row.
+  group_exprs: [ArrayExpr];
+
+  /// A filter condition for the aggregation which may include aggregate
+  /// functions.
+  having: [ArrayExpr];
+}
+
+/// Select up to the indicated number of rows from the input expression based
+/// on the first-emitted rows when evaluating the input expression. Generally,
+/// no particular order is guaranteed unless combined with a sort expression.
+table Limit {
+  input: TableExpr;
+
+  /// Number of logical rows to select from input.
+  limit: long;
+}
+
+// The kind of join being produced
+enum RelationalJoinType : int {
+  INNER,
+  LEFT,
+  RIGHT,
+  FULL,
+  SEMI,
+  ANTI,

Review comment:
       In order to support correlated subqueries in SQL, I recommend extending the join types at least with the SINGLE join and the MARK join. To support truly arbitrary correlated subqueries we also need to model a DEPENDENT join, but this is more difficult to model as it is not a standard relational join. 
   
   See also these two papers: [Unnesting Arbitrary Queries](https://cs.emis.de/LNI/Proceedings/Proceedings241/383.pdf) and [The Complete Story of Joins (in HyPer)](http://btw2017.informatik.uni-stuttgart.de/slidesandpapers/F1-10-37/paper_web.pdf).
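   
   To make the distinction concrete, here is a toy sketch (plain Python lists, not Arrow IR) of semi-, anti-, and mark-join semantics for a key-equality predicate:
   
   ```python
   # Semi/anti joins filter the left side; a mark join keeps every left row
   # but annotates it with whether a match exists, which is what correlated
   # EXISTS subqueries can be decorrelated into.
   left = [{"k": 1}, {"k": 2}, {"k": 3}]
   right_keys = {2, 3}

   # SEMI join: left rows with at least one match on the right.
   semi = [row for row in left if row["k"] in right_keys]

   # ANTI join: left rows with no match on the right.
   anti = [row for row in left if row["k"] not in right_keys]

   # MARK join: all left rows, each carrying a boolean match mark.
   mark = [dict(row, mark=row["k"] in right_keys) for row in left]

   print(semi)  # [{'k': 2}, {'k': 3}]
   print(anti)  # [{'k': 1}]
   ```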
   
   

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,510 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence among query engines is the
+/// naming and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data: [ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code: int;  // required
+
+  value: LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type: Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data: LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type: Type (required);
+  data: [LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name: string;
+  value: Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name: string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name: string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table: TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD,
+  SUBTRACT,
+  MULTIPLY,
+  DIVIDE,
+  EQUAL,
+  NOT_EQUAL,
+  LESS,
+  LESS_EQUAL,
+  GREATER,
+  GREATER_EQUAL,
+  AND,
+  OR,
+  XOR
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type: BinaryOpType;
+  left: ArrayExpr (required);
+  right: ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR,
+  AGGREGATE,
+  WINDOW,
+  TABLE
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name: string (required);
+
+  type: FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data: [ubyte];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to put the function output type when using in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr: FunctionDescr (required);
+  inputs: [ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options: [NamedLiteral];
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition: ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then: ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else: ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input: ArrayExpr (required);
+  in_exprs: [ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated: bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input: ArrayExpr (required);
+  left_bound: ArrayExpr (required);
+  right_bound: ArrayExpr (required);
+}
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+/// An expression yielding a scalar value can be broadcast to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op: ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name: string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type: Type;
+
+  window: Frame;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producer expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {
+  /// The unique name to identify the data source.
+  name: string (required);
+
+  /// The schema of the data source. This may be a partial schema (ignoring
+  /// unused fields), but it at least asserts the fields and types that are
+  /// expected to exist in the data source.
+  schema: Schema (required);
+
+  /// Optional opaque table serialization data, for passing engine-specific
+  /// instructions to enable the data to be accessed.
+  serde_type: string;
+  serde_data: [ubyte];
+}
+
+enum FrameClause : uint8 {
+  ROWS,
+  RANGE
+}
+
+// Unbounded and CurrentRow are empty tables used as empty variants
+// in the Bound union.
+table Unbounded {}
+table CurrentRow {}
+
+// `Bound` represents the window bound computation in a window function like
+// `sum(x) over (unbounded preceding and current row)`.
+union Bound {
+  ArrayExpr,
+  Unbounded,
+  CurrentRow
+}
+
+// `Frame` models a window frame clause, capturing the kind of clause
+// (ROWS/RANGE), how to partition the window, how to order within partitions,
+// and the bounds of the window.
+table Frame {
+  clause: FrameClause;
+  partition_by: [ArrayExpr];
+  order_by: [SortKey];
+  preceding: Bound (required);

Review comment:
       Window frames do not necessarily have one PRECEDING and one FOLLOWING clause, e.g. the following is valid SQL:
   
   ```sql
   SELECT i, SUM(i) OVER (ORDER BY i ROWS BETWEEN 2 FOLLOWING AND 4 FOLLOWING)
   FROM range(10) tbl(i);
   ```
   
   See [here](https://duckdb.org/docs/sql/window_functions) for the frame spec used in DuckDB, and [here](https://www.sqlite.org/syntax/frame-spec.html) for the frame spec used in SQLite.
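    For readers unfamiliar with frames whose endpoints both follow the current row, the semantics of the query above can be sketched in plain Python (a hand-rolled illustration, not an implementation of this IR):

```python
# Hand-rolled illustration of ROWS BETWEEN 2 FOLLOWING AND 4 FOLLOWING:
# for each row i the frame covers rows i+2 .. i+4, clamped to the
# partition; an empty frame yields NULL (None here), matching SQL's SUM.
def rows_frame_sum(values, start_offset, end_offset):
    out = []
    n = len(values)
    for i in range(n):
        frame = values[max(i + start_offset, 0):min(i + end_offset, n - 1) + 1]
        out.append(sum(frame) if frame else None)
    return out

# SUM(i) OVER (ORDER BY i ROWS BETWEEN 2 FOLLOWING AND 4 FOLLOWING), i in 0..9
print(rows_frame_sum(list(range(10)), 2, 4))
# -> [9, 12, 15, 18, 21, 24, 17, 9, None, None]
```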

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,510 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence among query engines is the
+/// naming and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data: [ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code: int;  // required
+
+  value: LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type: Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data: LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type: Type (required);
+  data: [LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name: string;
+  value: Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name: string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name: string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table: TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD,
+  SUBTRACT,
+  MULTIPLY,
+  DIVIDE,
+  EQUAL,
+  NOT_EQUAL,
+  LESS,
+  LESS_EQUAL,
+  GREATER,
+  GREATER_EQUAL,
+  AND,
+  OR,
+  XOR
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type: BinaryOpType;
+  left: ArrayExpr (required);
+  right: ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR,
+  AGGREGATE,
+  WINDOW,
+  TABLE
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name: string (required);
+
+  type: FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data: [ubyte];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to put the function output type when using in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr: FunctionDescr (required);
+  inputs: [ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options: [NamedLiteral];
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition: ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then: ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else: ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input: ArrayExpr (required);
+  in_exprs: [ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated: bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input: ArrayExpr (required);
+  left_bound: ArrayExpr (required);
+  right_bound: ArrayExpr (required);
+}
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+/// An expression yielding a scalar value can be broadcast to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op: ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name: string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type: Type;
+
+  window: Frame;

Review comment:
       Window expressions should always be functions, e.g. I wouldn't know the semantics of a query like this:
   
   ```sql
   SELECT colname OVER (...)
   FROM tbl
   ```
   
   I suppose it would just return `colname`, but then the frame is not very useful.
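    One way a consumer could enforce that reading is to treat a frame as valid only when the operation is a window function call, and reject it elsewhere. A hypothetical validation sketch (names are illustrative, not taken from the schema):

```python
# Hypothetical consumer-side validation (names illustrative, not from the
# schema): a Frame only has defined semantics on a window function call,
# so reject any expression carrying one on a plain column reference or
# scalar function.
def validate_frame(op_kind, function_type=None, frame=None):
    if frame is not None and (op_kind != "function" or function_type != "WINDOW"):
        raise ValueError("window frame attached to a non-window operation")
    return True

validate_frame("function", function_type="WINDOW", frame={"clause": "ROWS"})  # ok
# validate_frame("column_reference", frame={"clause": "ROWS"})  # would raise
```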

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,510 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence among query engines is the
+/// naming and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data: [ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code: int;  // required
+
+  value: LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type: Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data: LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type: Type (required);
+  data: [LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name: string;
+  value: Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name: string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name: string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table: TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD,
+  SUBTRACT,
+  MULTIPLY,
+  DIVIDE,
+  EQUAL,
+  NOT_EQUAL,
+  LESS,
+  LESS_EQUAL,
+  GREATER,
+  GREATER_EQUAL,
+  AND,
+  OR,
+  XOR
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type: BinaryOpType;
+  left: ArrayExpr (required);
+  right: ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR,
+  AGGREGATE,
+  WINDOW,
+  TABLE
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name: string (required);
+
+  type: FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data: [ubyte];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to put the function output type when using in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr: FunctionDescr (required);
+  inputs: [ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options: [NamedLiteral];
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition: ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then: ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else: ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input: ArrayExpr (required);
+  in_exprs: [ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated: bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input: ArrayExpr (required);
+  left_bound: ArrayExpr (required);
+  right_bound: ArrayExpr (required);
+}
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+/// An expression yielding a scalar value can be broadcast to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op: ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name: string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type: Type;
+
+  window: Frame;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producer expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {
+  /// The unique name to identify the data source.
+  name: string (required);
+
+  /// The schema of the data source. This may be a partial schema (ignoring
+  /// unused fields), but it at least asserts the fields and types that are
+  /// expected to exist in the data source.
+  schema: Schema (required);
+
+  /// Optional opaque table serialization data, for passing engine-specific
+  /// instructions to enable the data to be accessed.
+  serde_type: string;
+  serde_data: [ubyte];
+}
+
+enum FrameClause : uint8 {
+  ROWS,
+  RANGE
+}
+
+// Unbounded and CurrentRow are empty tables used as empty variants
+// in the Bound union.
+table Unbounded {}
+table CurrentRow {}
+
+// `Bound` represents the window bound computation in a window function like
+// `sum(x) over (unbounded preceding and current row)`.
+union Bound {
+  ArrayExpr,
+  Unbounded,
+  CurrentRow
+}
+
+// `Frame` models a window frame clause, capturing the kind of clause
+// (ROWS/RANGE), how to partition the window, how to order within partitions, and
+// the bounds of the window.
+table Frame {
+  clause: FrameClause;
+  partition_by: [ArrayExpr];
+  order_by: [SortKey];
+  preceding: Bound (required);
+  following: Bound (required);
+}
+
+/// Computes a new table given a set of column selections or array expressions.
+table Project {
+  input: TableExpr (required);
+
+  /// Each expression must reference fields found in the input table
+  /// expression.
+  exprs: [ArrayExpr] (required);
+}
+
+/// Select rows from table for given boolean condition.
+table Filter {
+  input: TableExpr (required);
+
+  /// Array expression using input table expression yielding boolean output
+  /// type.
+  condition: ArrayExpr (required);
+}
+
+/// A "group by" table aggregation: data is grouped using the group
+/// expressions, and the aggregate expressions are evaluated within each group.
+table Aggregate {
+  input: TableExpr;
+  aggregate_exprs: [ArrayExpr] (required);
+
+  /// Expressions to use as group keys. If not provided, then the aggregate
+  /// operation yields a table with a single row.
+  group_exprs: [ArrayExpr];
+
+  /// A filter condition for the aggregation which may include aggregate
+  /// functions.
+  having: [ArrayExpr];
+}
+
+/// Select up to the indicated number of rows from the input expression based
+/// on the first-emitted rows when evaluating the input expression. Generally,
+/// no particular order is guaranteed unless combined with a sort expression.
+table Limit {
+  input: TableExpr;
+
+  /// Number of logical rows to select from input.
+  limit: long;
+}
+
+// The kind of join being produced
+enum RelationalJoinType : int {
+  INNER,
+  LEFT,
+  RIGHT,
+  FULL,
+  SEMI,
+  ANTI,
+}
+
+/// A standard relational / SQL-style equijoin.
+///
+/// Providing no left/right columns produces the cross product of the two
+/// tables.
+table Join {
+  // TODO: complete and document
+  type: RelationalJoinType = INNER;
+
+  left: TableExpr (required);
+  right: TableExpr (required);
+
+  // The expression to use for joining `left` and `right` tables
+  on_expr: ArrayExpr; // a missing on_expr indicates a cross join.
+}
+
+/// A temporal join type
+///
+/// TODO: complete and document
+table AsOfJoin {
+  left: TableExpr (required);
+  right: TableExpr;
+
+  /// The column in the left expression to use for data ordering to determine
+  /// the "as of" time.
+  left_asof: ColumnReference (required);
+
+  /// The column in the right expression to use for data ordering to determine
+  /// the "as of" time. If the column name is the same as left_asof, may be
+  /// omitted.
+  right_asof: ColumnReference;
+
+  /// TODO: Define means of providing time deltas in this IR.
+  tolerance: Literal;
+
+  /// If true, rows whose "as of" values are exactly equal are allowed to match.
+  allow_equal: bool = true;
+}
+
+/// An extension of as-of join which allows applying an aggregate function to
+/// the data falling within the indicated time interval.
+///
+/// TODO: Define semantics of "identity" window where all elements of window
+/// become a List<T> element in the result.
+table WindowJoin {
+  // TODO
+}
+
+// The order in which to sort rows.
+enum Ordering : uint8 {
+  ASCENDING,
+  DESCENDING,
+}
+
+// The way in which NULL values should be ordered when sorting.
+enum NullOrdering : uint8 {
+  FIRST,
+  LAST
+}
+
+/// An expression to use for sorting. The key expression determines the values
+/// to be used for ordering the table's rows.
+table SortKey {
+  key: ArrayExpr (required);
+  ordering: Ordering = ASCENDING;
+  null_ordering: NullOrdering;
+}
+
+/// A table-generating function.
+///
+/// TODO
+table TableFunction {
+  descr: FunctionDescr (required);
+
+  /// An optional output schema for the table function. If not provided, must
+  /// be determined by the IR consumer.
+  out_schema: Schema;
+}
+
+union TableOperation {
+  ExternalTable,
+  Project,
+  Filter,
+  Aggregate,
+  Limit,
+  Join,
+  TableFunction

Review comment:
    At least in DuckDB, Window expressions are extracted into a separate operator, much like Aggregate expressions, e.g.:
   
   ```sql
   explain SELECT depname, empno, salary, sum(salary) OVER (PARTITION BY depname ORDER BY empno) FROM empsalary ORDER BY depname, empno
   ┌───────────────────────────┐
   │          ORDER_BY         │
   │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
   │           #0 ASC          │
   │           #1 ASC          │
   └─────────────┬─────────────┘                             
   ┌─────────────┴─────────────┐
   │           WINDOW          │
   │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
   │ sum(salary) OVER(PARTITION│
   │  BY depname ORDER BY empno│
   │      ASC NULLS FIRST)     │
   └─────────────┬─────────────┘                             
   ┌─────────────┴─────────────┐
   │          SEQ_SCAN         │
   │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
   │         empsalary         │
   │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
   │          depname          │
   │           empno           │
   │           salary          │
   └───────────────────────────┘ 
   ```
   
   Since Window expressions are evaluated in a completely different manner than regular scalar functions, perhaps that is also a good idea here.
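
The extraction described above can be sketched with hypothetical plan-node classes (the names `Scan`, `WindowExpr`, `Window`, and `OrderBy` are illustrative only, not part of the proposed ComputeIR schema):

```python
from dataclasses import dataclass

# Hypothetical logical-plan nodes illustrating DuckDB-style extraction of
# window expressions into a dedicated Window operator.

@dataclass
class Scan:
    table: str
    columns: list

@dataclass
class WindowExpr:
    func: str
    arg: str
    partition_by: list
    order_by: list

@dataclass
class Window:
    input: object
    exprs: list

@dataclass
class OrderBy:
    input: object
    keys: list

# SELECT depname, empno, salary,
#        sum(salary) OVER (PARTITION BY depname ORDER BY empno)
# FROM empsalary ORDER BY depname, empno
plan = OrderBy(
    input=Window(
        input=Scan("empsalary", ["depname", "empno", "salary"]),
        exprs=[WindowExpr("sum", "salary", ["depname"], ["empno"])],
    ),
    keys=["depname", "empno"],
)
```

Keeping window expressions in a dedicated operator lets a consumer evaluate them with partition-aware machinery rather than the scalar-expression path.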

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  /// The name of the referenced table.
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {
+  order_by:[SortKey];
+
+  partition_by:[ArrayExpr];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to provide the function output type when used in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).

Review comment:
       I agree that types seem out of place here. The way I see this IR is that it is a way of writing SQL without requiring a parser - i.e. the nodes in the operator tree are defined, but nothing is yet resolved or bound. Types are not resolved, columns and functions exist only as strings and are not resolved to the catalog, etc.
   
   So instead of writing `SELECT i+1 FROM table WHERE i>3`
   I would write `PROJECT(i + 1, FILTER(i > 3, TABLE(table)))`
    Much like with the raw SQL string, nothing is known about types at this stage.
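
The unbound, parser-level view described here can be sketched with hypothetical Python dataclasses (node names are illustrative; columns and functions stay as plain strings and nothing is resolved against a catalog or typed):

```python
from dataclasses import dataclass

# Hypothetical unbound-IR nodes: no types, no catalog resolution.

@dataclass
class Table:
    name: str

@dataclass
class Filter:
    condition: object
    input: object

@dataclass
class Project:
    exprs: list
    input: object

@dataclass
class BinaryOp:
    op: str
    left: object
    right: object

# SELECT i + 1 FROM table WHERE i > 3
# becomes PROJECT(i + 1, FILTER(i > 3, TABLE(table)))
plan = Project(
    exprs=[BinaryOp("+", "i", 1)],
    input=Filter(BinaryOp(">", "i", 3), Table("table")),
)
```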

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,510 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data: [ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code: int;  // required
+
+  value: LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type: Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data: LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type: Type (required);
+  data: [LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name: string;
+  value: Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  /// The name of the referenced table.
+  name: string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name: string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table: TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD,
+  SUBTRACT,
+  MULTIPLY,
+  DIVIDE,
+  EQUAL,
+  NOT_EQUAL,
+  LESS,
+  LESS_EQUAL,
+  GREATER,
+  GREATER_EQUAL,
+  AND,
+  OR,
+  XOR
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type: BinaryOpType;
+  left: ArrayExpr (required);
+  right: ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR,
+  AGGREGATE,
+  WINDOW,
+  TABLE
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name: string (required);
+
+  type: FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data: [ubyte];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to provide the function output type when used in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr: FunctionDescr (required);
+  inputs: [ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options: [NamedLiteral];
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition: ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then: ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else: ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input: ArrayExpr (required);
+  in_exprs: [ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated: bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input: ArrayExpr (required);
+  left_bound: ArrayExpr (required);
+  right_bound: ArrayExpr (required);
+}
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+/// An expression yielding a scalar value can be broadcasted to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op: ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name: string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type: Type;
+
+  window: Frame;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producer expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {
+  /// The unique name to identify the data source.
+  name: string (required);
+
+  /// The schema of the data source. This may be a partial schema (ignoring
+  /// unused fields), but it at least asserts the fields and types that are
+  /// expected to exist in the data source.
+  schema: Schema (required);
+
+  /// Optional opaque table serialization data, for passing engine-specific
+  /// instructions to enable the data to be accessed.
+  serde_type: string;
+  serde_data: [ubyte];
+}
+
+enum FrameClause : uint8 {
+  ROWS,
+  RANGE
+}
+
+// Unbounded and CurrentRow are empty tables used as empty variants
+// in the Bound union.
+table Unbounded {}
+table CurrentRow {}
+
+// `Bound` represents the window bound computation in a window function like
+// `sum(x) over (unbounded preceding and current row)`.
+union Bound {
+  ArrayExpr,
+  Unbounded,
+  CurrentRow
+}
+
+// `Frame` models a window frame clause, capturing the kind of clause
+// (ROWS/RANGE), how to partition the window, how to order within partitions, and
+// the bounds of the window.
+table Frame {
+  clause: FrameClause;
+  partition_by: [ArrayExpr];
+  order_by: [SortKey];
+  preceding: Bound (required);
+  following: Bound (required);
+}
+
+/// Computes a new table given a set of column selections or array expressions.
+table Project {
+  input: TableExpr (required);
+
+  /// Each expression must reference fields found in the input table
+  /// expression.
+  exprs: [ArrayExpr] (required);
+}
+
+/// Select rows from table for given boolean condition.
+table Filter {
+  input: TableExpr (required);
+
+  /// Array expression using input table expression yielding boolean output
+  /// type.
+  condition: ArrayExpr (required);
+}
+
+/// A "group by" table aggregation: data is grouped using the group
+/// expressions, and the aggregate expressions are evaluated within each group.
+table Aggregate {
+  input: TableExpr;
+  aggregate_exprs: [ArrayExpr] (required);
+
+  /// Expressions to use as group keys. If not provided, then the aggregate
+  /// operation yields a table with a single row.
+  group_exprs: [ArrayExpr];
+
+  /// A filter condition for the aggregation which may include aggregate
+  /// functions.
+  having: [ArrayExpr];

Review comment:
       Is the `having` clause here necessary? Having is a filter that refers to the output of the aggregate, so it is possible to model this as a Filter stacked onto the Aggregate.
   
   There should be no need for these two queries to be different after parsing:
   
   ```sql
   SELECT a, SUM(b) FROM tbl GROUP BY a HAVING SUM(b) < 10;
   SELECT * FROM (SELECT a, SUM(b) FROM tbl GROUP BY a) tbl(a, sum_b) WHERE sum_b < 10;
   ```
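
The equivalence argued above can be sketched with hypothetical dataclasses: both queries lower to the same Filter-over-Aggregate plan (node names are illustrative, not the schema's):

```python
from dataclasses import dataclass

# Hypothetical nodes showing HAVING modeled as a Filter stacked on an
# Aggregate, so no dedicated `having` field is needed.

@dataclass
class Table:
    name: str

@dataclass
class Aggregate:
    input: object
    group_exprs: list
    aggregate_exprs: list

@dataclass
class Filter:
    input: object
    condition: str

# SELECT a, SUM(b) FROM tbl GROUP BY a HAVING SUM(b) < 10
having_form = Filter(
    input=Aggregate(Table("tbl"), ["a"], ["sum(b)"]),
    condition="sum(b) < 10",
)

# SELECT * FROM (SELECT a, SUM(b) FROM tbl GROUP BY a) tbl(a, sum_b)
# WHERE sum_b < 10
subquery_form = Filter(
    input=Aggregate(Table("tbl"), ["a"], ["sum(b)"]),
    condition="sum(b) < 10",
)
```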

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,510 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data: [ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code: int;  // required
+
+  value: LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type: Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data: LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type: Type (required);
+  data: [LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name: string;
+  value: Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  /// The name of the referenced table.
+  name: string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name: string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table: TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation flips true/false values in a boolean expression
+table Not {
+  input: ArrayExpr (required);
+}
+
+/// Operation flips the sign of a numeric expression
+table Negate {
+  input: ArrayExpr (required);
+}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD,
+  SUBTRACT,
+  MULTIPLY,
+  DIVIDE,
+  EQUAL,
+  NOT_EQUAL,
+  LESS,
+  LESS_EQUAL,
+  GREATER,
+  GREATER_EQUAL,
+  AND,
+  OR,
+  XOR
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type: BinaryOpType;
+  left: ArrayExpr (required);
+  right: ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR,
+  AGGREGATE,
+  WINDOW,
+  TABLE
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name: string (required);
+
+  type: FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data: [ubyte];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to provide the function output type when used in an
+/// ArrayExpr. The type may be omitted if it is the same as that of all the
+/// inputs (for example, in the case of math functions where double input
+/// yields double output).
+table ArrayFunction {
+  descr: FunctionDescr (required);
+  inputs: [ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options: [NamedLiteral];
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition: ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then: ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else: ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input: ArrayExpr (required);
+  in_exprs: [ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated: bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input: ArrayExpr (required);
+  left_bound: ArrayExpr (required);
+  right_bound: ArrayExpr (required);
+}
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+/// An expression yielding a scalar value can be broadcast to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op: ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name: string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type: Type;
+
+  window: Frame;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producer expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {
+  /// The unique name to identify the data source.
+  name: string (required);
+
+  /// The schema of the data source. This may be a partial schema (ignoring
+  /// unused fields), but it at least asserts the fields and types that are
+  /// expected to exist in the data source.
+  schema: Schema (required);
+
+  /// Optional opaque table serialization data, for passing engine-specific
+  /// instructions to enable the data to be accessed.
+  serde_type: string;
+  serde_data: [ubyte];
+}
+
+enum FrameClause : uint8 {
+  ROWS,
+  RANGE
+}
+
+// Unbounded and CurrentRow are empty tables used as empty variants
+// in the Bound union.
+table Unbounded {}
+table CurrentRow {}
+
+// `Bound` represents the window bound computation in a window function like
+// `sum(x) over (rows between unbounded preceding and current row)`.
+union Bound {
+  ArrayExpr,
+  Unbounded,
+  CurrentRow
+}
+
+// `Frame` models a window frame clause, capturing the kind of clause
+// (ROWS/RANGE), how to partition the window, how to order rows within
+// partitions, and the bounds of the window.
+table Frame {
+  clause: FrameClause;
+  partition_by: [ArrayExpr];
+  order_by: [SortKey];
+  preceding: Bound (required);
+  following: Bound (required);
+}
+
+/// Computes a new table given a set of column selections or array expressions.
+table Project {
+  input: TableExpr (required);
+
+  /// Each expression must reference fields found in the input table
+  /// expression.
+  exprs: [ArrayExpr] (required);
+}
+
+/// Select rows from table for given boolean condition.
+table Filter {
+  input: TableExpr (required);
+
+  /// Array expression using input table expression yielding boolean output
+  /// type.
+  condition: ArrayExpr (required);
+}
+
+/// A "group by" table aggregation: data is grouped using the group
+/// expressions, and the aggregate expressions are evaluated within each group.
+table Aggregate {
+  input: TableExpr;
+  aggregate_exprs: [ArrayExpr] (required);
+
+  /// Expressions to use as group keys. If not provided, then the aggregate
+  /// operation yields a table with a single row.
+  group_exprs: [ArrayExpr];
+
+  /// A filter condition for the aggregation which may include aggregate
+  /// functions.
+  having: [ArrayExpr];
+}
+
+/// Select up to the indicated number of rows from the input expression based
+/// on the first-emitted rows when evaluating the input expression. Generally,
+/// no particular order is guaranteed unless combined with a sort expression.
+table Limit {
+  input: TableExpr;
+
+  /// Number of logical rows to select from input.
+  limit: long;
+}
+
+// The kind of join being produced
+enum RelationalJoinType : int {
+  INNER,
+  LEFT,
+  RIGHT,
+  FULL,
+  SEMI,
+  ANTI,
+}
+
+/// A standard relational / SQL-style equijoin.
+///
+/// Omitting the join expression produces the cross product of the two
+/// tables.
+table Join {
+  // TODO: complete and document
+  type: RelationalJoinType = INNER;
+
+  left: TableExpr (required);
+  right: TableExpr (required);
+
+  // The expression to use for joining `left` and `right` tables
+  on_expr: ArrayExpr; // a missing on_expr indicates a cross join.
+}
+
+/// A temporal join type
+///
+/// TODO: complete and document
+table AsOfJoin {
+  left: TableExpr (required);
+  right: TableExpr;
+
+  /// The column in the left expression to use for data ordering to determine
+  /// the "as of" time.
+  left_asof: ColumnReference (required);
+
+  /// The column in the right expression to use for data ordering to determine
+  /// the "as of" time. If the column name is the same as left_asof, may be
+  /// omitted.
+  right_asof: ColumnReference;
+
+  /// TODO: Define means of providing time deltas in this IR.
+  tolerance: Literal;
+
+  /// If true, rows whose "as of" values are exactly equal may be matched.
+  allow_equal: bool = true;
+}
+
+/// An extension of as-of join which allows applying an aggregate function to
+/// the data falling within the indicated time interval.
+///
+/// TODO: Define semantics of "identity" window where all elements of window
+/// become a List<T> element in the result.
+table WindowJoin {
+  // TODO
+}
+
+// The order in which to sort rows.
+enum Ordering : uint8 {
+  ASCENDING,
+  DESCENDING,
+}
+
+// The way in which NULL values should be ordered when sorting.
+enum NullOrdering : uint8 {
+  FIRST,
+  LAST
+}
+
+/// An expression to use for sorting. The key expression determines the values
+/// to be used for ordering the table's rows.
+table SortKey {

Review comment:
       SortKey is present but the ORDER node itself appears to be missing.
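For discussion, one possible shape for the missing node, following the conventions of the other relational operations in this file (the name `Sort` and its fields are assumptions, not part of the PR):

```
/// Sorts the input table by the given keys (sketch).
table Sort {
  input: TableExpr (required);

  /// Sort keys in order of precedence.
  keys: [SortKey] (required);
}
```

It would presumably also need to be added to the `TableOperation` union.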

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,510 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence among query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types.
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data: [ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code: int;  // required
+
+  value: LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type: Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data: LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type: Type (required);
+  data: [LiteralData] (required);

Review comment:
       The LiteralVector appears to be identical to a Literal of type List (i.e. ListLiteralData), are both required?
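For illustration, the Int32 sequence `[1, 2]` could already be expressed without `LiteralVector` (pseudo-notation, eliding the exact `Type` encoding):

```
Literal {
  type: List<Int32>,
  data: ListLiteralData {
    data: [PrimitiveLiteralData { <4-byte 1> },
           PrimitiveLiteralData { <4-byte 2> }]
  }
}
```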

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,510 @@
+/// An expression to use for sorting. The key expression determines the values
+/// to be used for ordering the table's rows.
+table SortKey {
+  key: ArrayExpr (required);
+  ordering: Ordering = ASCENDING;
+  null_ordering: NullOrdering;
+}
+
+/// A table-generating function.
+///
+/// TODO
+table TableFunction {
+  descr: FunctionDescr (required);
+
+  /// An optional output schema for the table function. If not provided, must
+  /// be determined by the IR consumer.
+  out_schema: Schema;
+}
+
+union TableOperation {
+  ExternalTable,
+  Project,
+  Filter,
+  Aggregate,
+  Limit,
+  Join,
+  TableFunction

Review comment:
       Perhaps we want to add a table of literals, e.g. the VALUES clause in SQL:
   
   ```sql
   SELECT * FROM (VALUES (1, 'hello'), (2, 'world')) tbl(id, name);
   ```
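A minimal sketch of such a variant, reusing the literal machinery defined earlier in the file (the name `LiteralTable` and its fields are assumptions for discussion):

```
/// An inline table of constant rows, analogous to a SQL VALUES clause (sketch).
table LiteralTable {
  /// Names and types of the produced columns.
  schema: Schema (required);

  /// Row-major literal data; each row must conform to the schema.
  rows: [StructLiteralData] (required);
}
```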

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,510 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data: [ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code: int;  // required
+
+  value: LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type: Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data: LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type: Type (required);
+  data: [LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name: string;
+  value: Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  /// The name of the referenced table.
+  name: string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name: string (required);

Review comment:
       ColumnReferences need to be able to refer to columns beyond those of the base tables. For example, suppose I have the following SQL query:
   
   ```sql
   SELECT a+1 FROM (SELECT col + col2 FROM tbl) subquery(a)
   ```
   
   My query plan looks like this:
   
   ```sql
   ┌───────────────────────────┐
   │         PROJECTION        │
   │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
   │          +(a, 1)          │
   └─────────────┬─────────────┘                             
   ┌─────────────┴─────────────┐
   │         PROJECTION        │
   │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
   │        +(col, col2)       │
   └─────────────┬─────────────┘                             
   ┌─────────────┴─────────────┐
   │          SEQ_SCAN         │
   │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
   │            tbl            │
   │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
   │            col            │
   │            col2           │
   └───────────────────────────┘   
   ```
   
    In order for this to work, the first projection will need to keep track of the names of its output columns (e.g. `a` in this case). Unless I am missing something, that does not appear to be the case right now. 
   
   The same is true for aggregates and groups:
   
   ```sql
   SELECT a+1 FROM (SELECT SUM(col) FROM tbl) subquery(a);
   
   ┌───────────────────────────┐
   │         PROJECTION        │
   │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
   │          +(a, 1)          │
   └─────────────┬─────────────┘                             
   ┌─────────────┴─────────────┐
   │      SIMPLE_AGGREGATE     │
   │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
   │          sum(col)         │
   └─────────────┬─────────────┘                             
   ┌─────────────┴─────────────┐
   │         PROJECTION        │
   │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
   │            col            │
   └─────────────┬─────────────┘                             
   ┌─────────────┴─────────────┐
   │          SEQ_SCAN         │
   │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
   │            tbl            │
   │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
   │            col            │
   └───────────────────────────┘  
   ```
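A toy name resolver makes the point concrete (plain Python dicts stand in for the flatbuffers tables here; this is purely illustrative, not the proposed IR): a column reference in an outer node must resolve against the *output* names of its child, not against the base table, so each projection has to export the names it produces.

```python
def resolve(column_name, plan):
    """Walk the plan from the outermost node down until one defines the name."""
    for node in plan:                      # outermost node first
        outputs = node["outputs"]          # output name -> defining expression
        if column_name in outputs:
            return node["name"], outputs[column_name]
    raise KeyError(column_name)

# The plan from the first example above, outermost first. The outer +(a, 1)
# can only be resolved because projection_1 exports the name "a".
plan = [
    {"name": "projection_2", "outputs": {}},                # computes +(a, 1)
    {"name": "projection_1", "outputs": {"a": "+(col, col2)"}},
    {"name": "seq_scan",     "outputs": {"col": "col", "col2": "col2"}},
]
```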
   
   

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,510 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data: [ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code: int;  // required
+
+  value: LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type: Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data: LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type: Type (required);
+  data: [LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name: string;
+  value: Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  /// The name of the referenced table.
+  name: string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name: string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table: TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD,
+  SUBTRACT,
+  MULTIPLY,
+  DIVIDE,
+  EQUAL,
+  NOT_EQUAL,
+  LESS,
+  LESS_EQUAL,
+  GREATER,
+  GREATER_EQUAL,
+  AND,
+  OR,
+  XOR
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type: BinaryOpType;
+  left: ArrayExpr (required);
+  right: ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR,
+  AGGREGATE,
+  WINDOW,
+  TABLE
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name: string (required);
+
+  type: FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data: [ubyte];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to put the function output type when using in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr: FunctionDescr (required);
+  inputs: [ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options: [NamedLiteral];
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition: ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then: ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else: ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input: ArrayExpr (required);
+  in_exprs: [ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated: bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input: ArrayExpr (required);
+  left_bound: ArrayExpr (required);
+  right_bound: ArrayExpr (required);
+}
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+/// An expression yielding a scalar value can be broadcasted to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op: ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name: string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type: Type;
+
+  window: Frame;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producer expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {
+  /// The unique name to identify the data source.
+  name: string (required);
+
+  /// The schema of the data source. This may be a partial schema (ignoring
+  /// unused fields), but it at least asserts the fields and types that are
+  /// expected to exist in the data source.
+  schema: Schema (required);
+
+  /// Optional opaque table serialization data, for passing engine-specific
+  /// instructions to enable the data to be accessed.
+  serde_type: string;
+  serde_data: [ubyte];
+}
+
+enum FrameClause : uint8 {
+  ROWS,
+  RANGE
+}
+
+// Unbounded and CurrentRow are empty tables used as empty variants
+// in the Bound union.
+table Unbounded {}
+table CurrentRow {}
+
+// `Bound` represents the window bound computation in a window function like
+// `sum(x) over (unbounded preceding and current row)`.
+union Bound {
+  ArrayExpr,
+  Unbounded,
+  CurrentRow
+}
+
+// `Frame` models a window frame clause, capturing the kind of clause
+// (ROWS/RANGE), how to partition the window, how to order within partitions, and
+// the bounds of the window.
+table Frame {
+  clause: FrameClause;
+  partition_by: [ArrayExpr];
+  order_by: [SortKey];
+  preceding: Bound (required);
+  following: Bound (required);
+}
+
+/// Computes a new table given a set of column selections or array expressions.
+table Project {
+  input: TableExpr (required);
+
+  /// Each expression must reference fields found in the input table
+  /// expression.
+  exprs: [ArrayExpr] (required);
+}
+
+/// Select rows from table for given boolean condition.
+table Filter {
+  input: TableExpr (required);
+
+  /// Array expression using input table expression yielding boolean output
+  /// type.
+  condition: ArrayExpr (required);
+}
+
+/// A "group by" table aggregation: data is grouped using the group
+/// expressions, and the aggregate expressions are evaluated within each group.
+table Aggregate {
+  input: TableExpr;
+  aggregate_exprs: [ArrayExpr] (required);
+
+  /// Expressions to use as group keys. If not provided, then the aggregate
+  /// operation yields a table with a single row.
+  group_exprs: [ArrayExpr];
+
+  /// A filter condition for the aggregation which may include aggregate
+  /// functions.
+  having: [ArrayExpr];
+}
+
+/// Select up to the indicated number of rows from the input expression based
+/// on the first-emitted rows when evaluating the input expression. Generally,
+/// no particular order is guaranteed unless combined with a sort expression.
+table Limit {
+  input: TableExpr;
+
+  /// Number of logical rows to select from input.
+  limit: long;
+}
+
+// The kind of join being produced
+enum RelationalJoinType : int {
+  INNER,
+  LEFT,
+  RIGHT,
+  FULL,
+  SEMI,
+  ANTI,
+}
+
+/// A standard relational / SQL-style equijoin.
+///
+/// Providing no join expression (on_expr) produces the cross product of the two
+/// tables.
+table Join {
+  // TODO: complete and document
+  type: RelationalJoinType = INNER;
+
+  left: TableExpr (required);
+  right: TableExpr (required);
+
+  // The expression to use for joining `left` and `right` tables
+  on_expr: ArrayExpr; // a missing on_expr indicates a cross join.
+}
+
+/// A temporal join type
+///
+/// TODO: complete and document
+table AsOfJoin {
+  left: TableExpr (required);
+  right: TableExpr;
+
+  /// The column in the left expression to use for data ordering to determine
+  /// the "as of" time.
+  left_asof: ColumnReference (required);
+
+  /// The column in the right expression to use for data ordering to determine
+  /// the "as of" time. If the column name is the same as left_asof, may be
+  /// omitted.
+  right_asof: ColumnReference;
+
+  /// TODO: Define means of providing time deltas in this IR.
+  tolerance: Literal;
+
+  /// If true, rows whose "as of" values are exactly equal are allowed to match.
+  allow_equal: bool = true;
+}
+
+/// An extension of as-of join which allows applying an aggregate function to
+/// the data falling within the indicated time interval.
+///
+/// TODO: Define semantics of "identity" window where all elements of window
+/// become a List<T> element in the result.
+table WindowJoin {
+  // TODO
+}
+
+// The order in which to sort rows.
+enum Ordering : uint8 {
+  ASCENDING,
+  DESCENDING,
+}
+
+// The way in which NULL values should be ordered when sorting.
+enum NullOrdering : uint8 {
+  FIRST,
+  LAST
+}
+
+/// An expression to use for sorting. The key expression determines the values
+/// to be used for ordering the table's rows.
+table SortKey {
+  key: ArrayExpr (required);
+  ordering: Ordering = ASCENDING;
+  null_ordering: NullOrdering;
+}
+
+/// A table-generating function.
+///
+/// TODO
+table TableFunction {
+  descr: FunctionDescr (required);
+
+  /// An optional output schema for the table function. If not provided, must
+  /// be determined by the IR consumer.
+  out_schema: Schema;
+}
+
+union TableOperation {
+  ExternalTable,
+  Project,
+  Filter,
+  Aggregate,
+  Limit,
+  Join,
+  TableFunction
+}

Review comment:
       `DISTINCT` can be a separate node, but it can also be modeled as an aggregate node without aggregate expressions, e.g. the following two queries are equivalent:
   
   ```sql
   SELECT DISTINCT a, b FROM tbl;
   SELECT a, b FROM tbl GROUP BY a, b;
   ```
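A sketch of that lowering against the draft schema's field names (plain dicts in place of the flatbuffers tables; note `aggregate_exprs` is currently marked `required`, so this assumes an empty vector is permitted):

```python
def distinct(input_expr, columns):
    # DISTINCT lowered onto the draft Aggregate node: no aggregate
    # expressions, group keys = the selected columns, so rows collapse
    # to one per distinct key combination.
    return {
        "op": "Aggregate",
        "input": input_expr,
        "aggregate_exprs": [],
        "group_exprs": [{"op": "ColumnReference", "name": c} for c in columns],
    }

node = distinct({"op": "ExternalTable", "name": "tbl"}, ["a", "b"])
```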

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,510 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data: [ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code: int;  // required
+
+  value: LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type: Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data: LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type: Type (required);
+  data: [LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name: string;
+  value: Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  /// The name of the referenced table.
+  name: string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name: string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table: TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD,
+  SUBTRACT,
+  MULTIPLY,
+  DIVIDE,
+  EQUAL,
+  NOT_EQUAL,
+  LESS,
+  LESS_EQUAL,
+  GREATER,
+  GREATER_EQUAL,
+  AND,
+  OR,
+  XOR
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type: BinaryOpType;
+  left: ArrayExpr (required);
+  right: ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR,
+  AGGREGATE,
+  WINDOW,
+  TABLE
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name: string (required);
+
+  type: FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data: [ubyte];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to put the function output type when using in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr: FunctionDescr (required);
+  inputs: [ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options: [NamedLiteral];
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition: ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then: ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else: ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input: ArrayExpr (required);
+  in_exprs: [ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated: bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input: ArrayExpr (required);
+  left_bound: ArrayExpr (required);
+  right_bound: ArrayExpr (required);
+}
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+/// An expression yielding a scalar value can be broadcasted to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op: ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name: string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type: Type;
+
+  window: Frame;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producer expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {
+  /// The unique name to identify the data source.
+  name: string (required);
+
+  /// The schema of the data source. This may be a partial schema (ignoring
+  /// unused fields), but it at least asserts the fields and types that are
+  /// expected to exist in the data source.
+  schema: Schema (required);
+
+  /// Optional opaque table serialization data, for passing engine-specific
+  /// instructions to enable the data to be accessed.
+  serde_type: string;
+  serde_data: [ubyte];
+}
+
+enum FrameClause : uint8 {
+  ROWS,
+  RANGE
+}
+
+// Unbounded and CurrentRow are empty tables used as empty variants
+// in the Bound union.
+table Unbounded {}
+table CurrentRow {}
+
+// `Bound` represents the window bound computation in a window function like
+// `sum(x) over (unbounded preceding and current row)`.
+union Bound {
+  ArrayExpr,
+  Unbounded,
+  CurrentRow
+}
+
+// `Frame` models a window frame clause, capturing the kind of clause
+// (ROWS/RANGE), how to partition the window, how to order within partitions,
+// and the bounds of the window.
+table Frame {
+  clause: FrameClause;
+  partition_by: [ArrayExpr];
+  order_by: [SortKey];
+  preceding: Bound (required);
+  following: Bound (required);
+}
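
As a concrete illustration of a ROWS frame, the sketch below evaluates the common `sum(x) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)` case in plain Python. This shows only the intended semantics, without partitioning or ordering, and is not tied to any IR consumer:

```python
def running_sum(values):
    """ROWS frame from UNBOUNDED PRECEDING to CURRENT ROW: each output row
    aggregates all input rows up to and including itself."""
    out, total = [], 0
    for v in values:
        total += v
        out.append(total)
    return out

assert running_sum([1, 2, 3, 4]) == [1, 3, 6, 10]
```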
+
+/// Computes a new table given a set of column selections or array expressions.
+table Project {
+  input: TableExpr (required);
+
+  /// Each expression must reference fields found in the input table
+  /// expression.
+  exprs: [ArrayExpr] (required);
+}
+
+/// Select rows from table for given boolean condition.
+table Filter {
+  input: TableExpr (required);
+
+  /// Array expression using input table expression yielding boolean output
+  /// type.
+  condition: ArrayExpr (required);
+}
+
+/// A "group by" table aggregation: data is grouped using the group
+/// expressions, and the aggregate expressions are evaluated within each group.
+table Aggregate {
+  input: TableExpr;
+  aggregate_exprs: [ArrayExpr] (required);
+
+  /// Expressions to use as group keys. If not provided, then the aggregate
+  /// operation yields a table with a single row.
+  group_exprs: [ArrayExpr];
+
+  /// A filter condition for the aggregation which may include aggregate
+  /// functions.
+  having: [ArrayExpr];
+}
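
The grouping semantics described above can be sketched in Python (illustrative helpers only, not part of the IR; group keys here are plain values rather than expressions):

```python
from collections import defaultdict

def aggregate(rows, group_key, agg):
    """Group rows by group_key, then evaluate agg within each group.
    With no group key, the whole input forms one group (a single output row)."""
    groups = defaultdict(list)
    for row in rows:
        groups[group_key(row)].append(row)
    return {k: agg(g) for k, g in groups.items()}

rows = [{"k": "a", "v": 1}, {"k": "b", "v": 2}, {"k": "a", "v": 3}]
result = aggregate(rows, lambda r: r["k"], lambda g: sum(r["v"] for r in g))
assert result == {"a": 4, "b": 2}
```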
+
+/// Select up to the indicated number of rows from the input expression based
+/// on the first-emitted rows when evaluating the input expression. Generally,
+/// no particular order is guaranteed unless combined with a sort expression.
+table Limit {
+  input: TableExpr;
+
+  /// Number of logical rows to select from input.
+  limit: long;
+}
+
+// The kind of join being produced
+enum RelationalJoinType : int {
+  INNER,
+  LEFT,
+  RIGHT,
+  FULL,
+  SEMI,
+  ANTI,
+}
+
+/// A standard relational / SQL-style equijoin.
+///
+/// Providing no join expression (on_expr) produces the cross product of the
+/// two tables.
+table Join {
+  // TODO: complete and document
+  type: RelationalJoinType = INNER;
+
+  left: TableExpr (required);
+  right: TableExpr (required);
+
+  // The expression to use for joining `left` and `right` tables
+  on_expr: ArrayExpr; // a missing on_expr indicates a cross join.
+}
+
+/// A temporal join type
+///
+/// TODO: complete and document
+table AsOfJoin {
+  left: TableExpr (required);
+  right: TableExpr;
+
+  /// The column in the left expression to use for data ordering to determine
+  /// the "as of" time.
+  left_asof: ColumnReference (required);
+
+  /// The column in the right expression to use for data ordering to determine
+  /// the "as of" time. If the column name is the same as left_asof, may be
+  /// omitted.
+  right_asof: ColumnReference;
+
+  /// TODO: Define means of providing time deltas in this IR.
+  tolerance: Literal;
+
+  /// If true, rows whose "as of" values compare exactly equal are considered
+  /// a match.
+  allow_equal: bool = true;
+}
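
One plausible reading of the "as of" matching rule, per left row, is the usual as-of semantics (as in pandas `merge_asof`): pick the latest right-side row at or before the left row's time. A hedged Python sketch, ignoring `tolerance`:

```python
def asof_match(left_time, right_times, allow_equal=True):
    """Return the latest right-side time at or before left_time, or None when
    nothing qualifies. With allow_equal=False, only strictly earlier times match."""
    ok = [t for t in right_times
          if (t <= left_time if allow_equal else t < left_time)]
    return max(ok) if ok else None

right = [1, 3, 5]
assert asof_match(4, right) == 3
assert asof_match(3, right) == 3
assert asof_match(3, right, allow_equal=False) == 1
assert asof_match(0, right) is None
```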
+
+/// An extension of as-of join which allows applying an aggregate function to
+/// the data falling within the indicated time interval.
+///
+/// TODO: Define semantics of "identity" window where all elements of window
+/// become a List<T> element in the result.
+table WindowJoin {
+  // TODO
+}
+
+// The order in which to sort rows.
+enum Ordering : uint8 {
+  ASCENDING,
+  DESCENDING,
+}
+
+// The way in which NULL values should be ordered when sorting.
+enum NullOrdering : uint8 {
+  FIRST,
+  LAST
+}
+
+/// An expression to use for sorting. The key expression determines the values
+/// to be used for ordering the table's rows.
+table SortKey {
+  key: ArrayExpr (required);
+  ordering: Ordering = ASCENDING;
+  null_ordering: NullOrdering;
+}
+
+/// A table-generating function.
+///
+/// TODO
+table TableFunction {
+  descr: FunctionDescr (required);
+
+  /// An optional output schema for the table function. If not provided, must
+  /// be determined by the IR consumer.
+  out_schema: Schema;
+}
+
+union TableOperation {
+  ExternalTable,
+  Project,
+  Filter,
+  Aggregate,
+  Limit,
+  Join,
+  TableFunction

Review comment:
       For nested types we also support an `UNNEST` operator. Unnest extracts the elements from a list into separate rows and repeats non-list elements, e.g.:
   
   ```sql
   select 1, unnest([1, 2, 3]);
   ┌───┬─────────────────────────────┐
   │ 1 │ unnest(list_value(1, 2, 3)) │
   ├───┼─────────────────────────────┤
   │ 1 │ 1                           │
   │ 1 │ 2                           │
   │ 1 │ 3                           │
   └───┴─────────────────────────────┘
   ```
   
   This is also a separate operator in DuckDB. As it changes the cardinality of the source tree it does not fit nicely into a projection node. Postgres and other database systems also support this operation, but I'm not sure how they implement this internally.
   
   This could also be modeled as a generic `TABLE IN -> TABLE OUT` function.
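
    The cardinality change can also be sketched in plain Python, mirroring the DuckDB output above (illustrative only):

```python
def unnest(rows, list_col):
    """Emit one output row per element of the list column, repeating the
    other columns; a row with an empty list produces no output rows."""
    out = []
    for row in rows:
        for elem in row[list_col]:
            out.append({**row, list_col: elem})
    return out

assert unnest([{"a": 1, "xs": [1, 2, 3]}], "xs") == [
    {"a": 1, "xs": 1}, {"a": 1, "xs": 2}, {"a": 1, "xs": 3}]
```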
   

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,510 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence between query engines is the
+/// names and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types.
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data: [ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code: int;  // required
+
+  value: LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type: Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data: LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type: Type (required);
+  data: [LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name: string;
+  value: Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name: string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name: string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table: TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD,
+  SUBTRACT,
+  MULTIPLY,
+  DIVIDE,
+  EQUAL,
+  NOT_EQUAL,
+  LESS,
+  LESS_EQUAL,
+  GREATER,
+  GREATER_EQUAL,
+  AND,
+  OR,
+  XOR
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type: BinaryOpType;
+  left: ArrayExpr (required);
+  right: ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR,
+  AGGREGATE,
+  WINDOW,
+  TABLE
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name: string (required);
+
+  type: FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data: [ubyte];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to put the function output type when using in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr: FunctionDescr (required);
+  inputs: [ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options: [NamedLiteral];
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition: ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then: ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else: ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input: ArrayExpr (required);
+  in_exprs: [ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated: bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input: ArrayExpr (required);
+  left_bound: ArrayExpr (required);
+  right_bound: ArrayExpr (required);
+}
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+/// An expression yielding a scalar value can be broadcasted to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op: ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are encouraged to provide names to avoid
+  /// ambiguity.
+  name: string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type: Type;
+
+  window: Frame;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producer expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {
+  /// The unique name to identify the data source.
+  name: string (required);
+
+  /// The schema of the data source. This may be a partial schema (ignoring
+  /// unused fields), but it at least asserts the fields and types that are
+  /// expected to exist in the data source.
+  schema: Schema (required);
+
+  /// Optional opaque table serialization data, for passing engine-specific
+  /// instructions to enable the data to be accessed.
+  serde_type: string;
+  serde_data: [ubyte];
+}
+
+enum FrameClause : uint8 {
+  ROWS,
+  RANGE
+}
+
+// Unbounded and CurrentRow are empty tables used as empty variants
+// in the Bound union.
+table Unbounded {}
+table CurrentRow {}
+
+// `Bound` represents the window bound computation in a window function like
+// `sum(x) over (unbounded preceding and current row)`.
+union Bound {
+  ArrayExpr,
+  Unbounded,
+  CurrentRow
+}
+
+// `Frame` models a window frame clause, capturing the kind of clause
+// (ROWS/RANGE), how to partition the window, how to order within partitions,
+// and the bounds of the window.
+table Frame {
+  clause: FrameClause;
+  partition_by: [ArrayExpr];
+  order_by: [SortKey];
+  preceding: Bound (required);
+  following: Bound (required);
+}
+
+/// Computes a new table given a set of column selections or array expressions.
+table Project {
+  input: TableExpr (required);
+
+  /// Each expression must reference fields found in the input table
+  /// expression.
+  exprs: [ArrayExpr] (required);
+}
+
+/// Select rows from table for given boolean condition.
+table Filter {
+  input: TableExpr (required);
+
+  /// Array expression using input table expression yielding boolean output
+  /// type.
+  condition: ArrayExpr (required);
+}
+
+/// A "group by" table aggregation: data is grouped using the group
+/// expressions, and the aggregate expressions are evaluated within each group.
+table Aggregate {
+  input: TableExpr;
+  aggregate_exprs: [ArrayExpr] (required);

Review comment:
       In SQL aggregate functions are slightly different from normal functions as they can have additional clauses, specifically:
   
   * DISTINCT (e.g. `COUNT(DISTINCT col)`)
    * FILTER (e.g. `COUNT(col) FILTER (WHERE col2 = 3)`)
   * ORDER (e.g. `STRING_AGG(col ORDER BY col2)`)
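
    To make the DISTINCT and FILTER clauses concrete, plain-Python equivalents (illustrative only; these helper names are not part of any engine's API):

```python
def count_distinct(values):
    """COUNT(DISTINCT col): the number of unique non-null values."""
    return len({v for v in values if v is not None})

def count_filter(rows, predicate):
    """COUNT(*) FILTER (WHERE ...): count only rows satisfying the predicate."""
    return sum(1 for r in rows if predicate(r))

assert count_distinct([1, 1, 2, None]) == 2
rows = [{"col2": 3}, {"col2": 4}, {"col2": 3}]
assert count_filter(rows, lambda r: r["col2"] == 3) == 2
```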
   
   

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,510 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence between query engines is the
+/// names and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types.
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data: [ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code: int;  // required
+
+  value: LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type: Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data: LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type: Type (required);
+  data: [LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name: string;
+  value: Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name: string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name: string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table: TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD,
+  SUBTRACT,
+  MULTIPLY,
+  DIVIDE,
+  EQUAL,
+  NOT_EQUAL,
+  LESS,
+  LESS_EQUAL,
+  GREATER,
+  GREATER_EQUAL,
+  AND,
+  OR,
+  XOR
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type: BinaryOpType;
+  left: ArrayExpr (required);
+  right: ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR,
+  AGGREGATE,
+  WINDOW,
+  TABLE
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name: string (required);
+
+  type: FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data: [ubyte];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to put the function output type when using in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr: FunctionDescr (required);
+  inputs: [ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options: [NamedLiteral];
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition: ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then: ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else: ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input: ArrayExpr (required);
+  in_exprs: [ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated: bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input: ArrayExpr (required);
+  left_bound: ArrayExpr (required);
+  right_bound: ArrayExpr (required);
+}
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+/// An expression yielding a scalar value can be broadcasted to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op: ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are encouraged to provide names to avoid
+  /// ambiguity.
+  name: string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type: Type;
+
+  window: Frame;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producer expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {
+  /// The unique name to identify the data source.
+  name: string (required);
+
+  /// The schema of the data source. This may be a partial schema (ignoring
+  /// unused fields), but it at least asserts the fields and types that are
+  /// expected to exist in the data source.
+  schema: Schema (required);
+
+  /// Optional opaque table serialization data, for passing engine-specific
+  /// instructions to enable the data to be accessed.
+  serde_type: string;
+  serde_data: [ubyte];
+}
+
+enum FrameClause : uint8 {
+  ROWS,
+  RANGE
+}
+
+// Unbounded and CurrentRow are empty tables used as empty variants
+// in the Bound union.
+table Unbounded {}
+table CurrentRow {}
+
+// `Bound` represents the window bound computation in a window function like
+// `sum(x) over (unbounded preceding and current row)`.
+union Bound {
+  ArrayExpr,
+  Unbounded,
+  CurrentRow
+}
+
+// `Frame` models a window frame clause, capturing the kind of clause
+// (ROWS/RANGE), how to partition the window, how to order within partitions,
+// and the bounds of the window.
+table Frame {
+  clause: FrameClause;
+  partition_by: [ArrayExpr];
+  order_by: [SortKey];
+  preceding: Bound (required);
+  following: Bound (required);
+}
+
+/// Computes a new table given a set of column selections or array expressions.
+table Project {
+  input: TableExpr (required);
+
+  /// Each expression must reference fields found in the input table
+  /// expression.
+  exprs: [ArrayExpr] (required);
+}
+
+/// Select rows from table for given boolean condition.
+table Filter {
+  input: TableExpr (required);
+
+  /// Array expression using input table expression yielding boolean output
+  /// type.
+  condition: ArrayExpr (required);
+}
+
+/// A "group by" table aggregation: data is grouped using the group
+/// expressions, and the aggregate expressions are evaluated within each group.
+table Aggregate {
+  input: TableExpr;
+  aggregate_exprs: [ArrayExpr] (required);
+
+  /// Expressions to use as group keys. If not provided, then the aggregate
+  /// operation yields a table with a single row.
+  group_exprs: [ArrayExpr];
+
+  /// A filter condition for the aggregation which may include aggregate
+  /// functions.
+  having: [ArrayExpr];
+}
+
+/// Select up to the indicated number of rows from the input expression based
+/// on the first-emitted rows when evaluating the input expression. Generally,
+/// no particular order is guaranteed unless combined with a sort expression.
+table Limit {
+  input: TableExpr;
+
+  /// Number of logical rows to select from input.
+  limit: long;
+}
+
+// The kind of join being produced
+enum RelationalJoinType : int {
+  INNER,
+  LEFT,
+  RIGHT,
+  FULL,
+  SEMI,
+  ANTI,
+}
+
+/// A standard relational / SQL-style equijoin.
+///
+/// Providing no join expression (on_expr) produces the cross product of the
+/// two tables.
+table Join {
+  // TODO: complete and document
+  type: RelationalJoinType = INNER;
+
+  left: TableExpr (required);
+  right: TableExpr (required);
+
+  // The expression to use for joining `left` and `right` tables
+  on_expr: ArrayExpr; // a missing on_expr indicates a cross join.
+}
+
+/// A temporal join type
+///
+/// TODO: complete and document
+table AsOfJoin {
+  left: TableExpr (required);
+  right: TableExpr;
+
+  /// The column in the left expression to use for data ordering to determine
+  /// the "as of" time.
+  left_asof: ColumnReference (required);
+
+  /// The column in the right expression to use for data ordering to determine
+  /// the "as of" time. If the column name is the same as left_asof, may be
+  /// omitted.
+  right_asof: ColumnReference;
+
+  /// TODO: Define means of providing time deltas in this IR.
+  tolerance: Literal;
+
+  /// If true, rows whose left and right "as of" values are exactly equal
+  /// are allowed to match.
+  allow_equal: bool = true;
+}
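A minimal sketch of the "as of" matching this table describes, using a sorted list and `bisect` (scalar timestamps standing in for the ordered columns; `tolerance` and `allow_equal` are interpreted as described in the field comments, which is an assumption since the time-delta semantics are still a TODO):

```python
import bisect

def asof_match(left_time, right_times, tolerance=None, allow_equal=True):
    # Find the latest right-side time at or before the left-side time
    # (strictly before when allow_equal is False). right_times must be sorted.
    cut = bisect.bisect_right if allow_equal else bisect.bisect_left
    idx = cut(right_times, left_time)
    if idx == 0:
        return None  # nothing on the right precedes this left row
    candidate = right_times[idx - 1]
    if tolerance is not None and left_time - candidate > tolerance:
        return None  # match is too far in the past
    return candidate

times = [1, 3, 7]
print(asof_match(5, times))                      # 3
print(asof_match(3, times, allow_equal=False))   # 1
print(asof_match(5, times, tolerance=1))         # None
```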
+
+/// An extension of as-of join which allows applying an aggregate function to
+/// the data falling within the indicated time interval.
+///
+/// TODO: Define semantics of "identity" window where all elements of window
+/// become a List<T> element in the result.
+table WindowJoin {
+  // TODO
+}
+
+// The order in which to sort rows.
+enum Ordering : uint8 {
+  ASCENDING,
+  DESCENDING,
+}
+
+// The way in which NULL values should be ordered when sorting.
+enum NullOrdering : uint8 {
+  FIRST,
+  LAST
+}
+
+/// An expression to use for sorting. The key expression determines the values
+/// to be used for ordering the table's rows.
+table SortKey {
+  key: ArrayExpr (required);
+  ordering: Ordering = ASCENDING;
+  null_ordering: NullOrdering;
+}
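The interaction of `Ordering` and `NullOrdering` can be illustrated with a small Python sketch (plain values standing in for key-expression results; the choice to group nulls independently of sort direction mirrors the separate `null_ordering` field):

```python
def sort_rows(rows, key, ascending=True, nulls_first=False):
    # Partition null keys out, sort the rest by direction, then place the
    # null group first or last per the NullOrdering setting.
    nulls = [r for r in rows if key(r) is None]
    non_null = sorted((r for r in rows if key(r) is not None),
                      key=key, reverse=not ascending)
    return nulls + non_null if nulls_first else non_null + nulls

rows = [3, None, 1, 2]
print(sort_rows(rows, key=lambda x: x))          # [1, 2, 3, None]
print(sort_rows(rows, key=lambda x: x,
                ascending=False, nulls_first=True))  # [None, 3, 2, 1]
```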
+
+/// A table-generating function.
+///
+/// TODO
+table TableFunction {
+  descr: FunctionDescr (required);
+
+  /// An optional output schema for the table function. If not provided, must
+  /// be determined by the IR consumer.
+  out_schema: Schema;
+}
+
+union TableOperation {
+  ExternalTable,
+  Project,
+  Filter,
+  Aggregate,
+  Limit,
+  Join,
+  TableFunction

Review comment:
       SetOperations should also be added (UNION, EXCEPT, INTERSECT)

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
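As a concrete sketch of the unpacking described in the section comment above (a FloatingPoint/DOUBLE literal carried as 8 bytes, a boolean as 1 byte), assuming little-endian byte order, which the schema does not state explicitly:

```python
import struct

def unpack_double_literal(data: bytes) -> float:
    # A DOUBLE literal's PrimitiveLiteralData is expected to be 8 bytes;
    # interpret them as a little-endian IEEE 754 double.
    assert len(data) == 8
    return struct.unpack("<d", data)[0]

def unpack_bool_literal(data: bytes) -> bool:
    # Booleans are a single byte with value 1 (true) or 0 (false).
    assert len(data) == 1
    return data[0] == 1

encoded = struct.pack("<d", 2.5)
print(unpack_double_literal(encoded))  # 2.5
print(unpack_bool_literal(b"\x01"))    # True
```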
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  /// The name of the referenced table.
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function)..
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {
+  order_by:[SortKey];
+
+  partition_by:[ArrayExpr];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to put the function output type when using in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr:FunctionDescr (required);
+  inputs:[ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options:[NamedLiteral];
+
+  /// Optional window expression for window functions only.
+  ///
+  /// TODO: Decide if window functions should be specified in a different way.
+  window:WindowFrame;
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition:ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then:ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else:ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input:ArrayExpr (required);
+  in_exprs:[ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated:bool = false;
+}
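The equivalence stated in the doc comment can be checked with a small sketch (plain Python values standing in for array expressions):

```python
from functools import reduce

def is_in(value, candidates, negated=False):
    # IsIn(input, [v0, v1, ...]) is the same as
    # Or(Or(Eq(input, v0), Eq(input, v1)), ...), optionally negated.
    hit = reduce(lambda acc, c: acc or value == c, candidates, False)
    return not hit if negated else hit

print(is_in(2, [1, 2, 3]))                # True
print(is_in(5, [1, 2, 3]))                # False
print(is_in(5, [1, 2, 3], negated=True))  # True
```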
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.

Review comment:
       `BETWEEN` is syntactic sugar in SQL, but also exists internally in engines as a performance optimization for doing `x >= l AND x <= r` in a single function. Whether or not it should be included depends on the goal of the IR, in my opinion. If the IR's goal is to simply specify *what* to execute, then it is unnecessary since it can be replaced with the comparison statements and a conjunction. If the IR's goal is to specify *how* to execute as well (i.e. "here is an already optimized plan, run that") then `BETWEEN` should be included since it is part of the optimized plan.
   
   In case of the latter, we might also want to include other nodes that are not strictly necessary but are the result of optimizations. For example, we could have a `Top-N` node (optimized variant of ORDER + LIMIT).
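The rewrite the comment describes is easy to state concretely (scalar stand-ins for array expressions; the fused form is what an engine might execute directly):

```python
def between(x, lo, hi):
    # Fused form: a single three-operand comparison.
    return lo <= x <= hi

def between_desugared(x, lo, hi):
    # Equivalent conjunction of two comparisons.
    return (x >= lo) and (x <= hi)

# The two forms agree on every input, including the boundary values.
for x in (0, 1, 5, 10, 11):
    assert between(x, 1, 10) == between_desugared(x, 1, 10)
print(between(5, 1, 10))  # True
```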

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,510 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data: [ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code: int;  // required
+
+  value: LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type: Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data: LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type: Type (required);
+  data: [LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name: string;
+  value: Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  /// The name of the referenced table.
+  name: string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name: string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table: TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD,
+  SUBTRACT,
+  MULTIPLY,
+  DIVIDE,
+  EQUAL,
+  NOT_EQUAL,
+  LESS,
+  LESS_EQUAL,
+  GREATER,
+  GREATER_EQUAL,
+  AND,
+  OR,
+  XOR
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type: BinaryOpType;
+  left: ArrayExpr (required);
+  right: ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR,
+  AGGREGATE,
+  WINDOW,
+  TABLE
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name: string (required);
+
+  type: FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function)..
+  data: [ubyte];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to put the function output type when using in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr: FunctionDescr (required);
+  inputs: [ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options: [NamedLiteral];
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition: ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then: ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else: ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input: ArrayExpr (required);
+  in_exprs: [ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated: bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input: ArrayExpr (required);
+  left_bound: ArrayExpr (required);
+  right_bound: ArrayExpr (required);
+}
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+/// An expression yielding a scalar value can be broadcasted to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op: ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name: string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type: Type;
+
+  window: Frame;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producers expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {

Review comment:
       If the goal is to serialize an already optimized plan, we might want to think about adding e.g. projection and filter pushdown information into the table scans. 

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  /// The name of the referenced table.
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function)..
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {
+  order_by:[SortKey];
+
+  partition_by:[ArrayExpr];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to put the function output type when using in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr:FunctionDescr (required);
+  inputs:[ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options:[NamedLiteral];
+
+  /// Optional window expression for window functions only.
+  ///
+  /// TODO: Decide if window functions should be specified in a different way.
+  window:WindowFrame;
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition:ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then:ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else:ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input:ArrayExpr (required);
+  in_exprs:[ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated:bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. A convenience to avoid specifying the compound predicate
+/// manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input:ArrayExpr (required);
+  left_bound:ArrayExpr (required);
+  right_bound:ArrayExpr (required);
+}
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}

Review comment:
       While I fully agree that all other operators can be modeled as functions, there needs to be some form of standardisation on `what is an addition` and `what is an AND conjunction`, otherwise the interoperability of the IR is lost. For example, if System A calls their AND clause `conjunction_and` and System B calls it `and`, the compatibility between the systems is lost. 
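To make that concrete, one option is to pin down a small set of canonical built-ins next to the free-form `FunctionDescr.name`. A rough sketch (the enum and its members are illustrative only, not part of the proposal):

```
// Sketch only: a "blessed" enum keeps built-in names unambiguous across
// producers, while FunctionDescr.name stays free-form for UDFs.
enum CanonicalFunction : int {
  NON_CANONICAL = 0,  // consumer must resolve FunctionDescr.name itself
  AND,
  OR,
  ADD,
  EQUAL
}
```

With something like this, System A's `conjunction_and` and System B's `and` would serialize to the same enum value, and the string name would only matter for user-defined functions.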

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,510 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data: [ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code: int;  // required
+
+  value: LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type: Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data: LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type: Type (required);
+  data: [LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name: string;
+  value: Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  /// The name identifying the referenced table.
+  name: string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name: string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table: TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD,
+  SUBTRACT,
+  MULTIPLY,
+  DIVIDE,
+  EQUAL,
+  NOT_EQUAL,
+  LESS,
+  LESS_EQUAL,
+  GREATER,
+  GREATER_EQUAL,
+  AND,
+  OR,
+  XOR
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type: BinaryOpType;
+  left: ArrayExpr (required);
+  right: ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR,
+  AGGREGATE,
+  WINDOW,
+  TABLE
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name: string (required);
+
+  type: FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function)..
+  data: [ubyte];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to set the function output type when used in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr: FunctionDescr (required);
+  inputs: [ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options: [NamedLiteral];
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition: ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then: ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else: ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input: ArrayExpr (required);
+  in_exprs: [ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated: bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. A convenience to avoid specifying the compound predicate
+/// manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input: ArrayExpr (required);
+  left_bound: ArrayExpr (required);
+  right_bound: ArrayExpr (required);
+}
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+/// An expression yielding a scalar value can be broadcasted to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op: ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name: string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type: Type;
+
+  window: Frame;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producers expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {
+  /// The unique name to identify the data source.
+  name: string (required);
+
+  /// The schema of the data source. This may be a partial schema (ignoring
+  /// unused fields), but it at least asserts the fields and types that are
+  /// expected to exist in the data source.
+  schema: Schema (required);
+
+  /// Optional opaque table serialization data, for passing engine-specific
+  /// instructions to enable the data to be accessed.
+  serde_type: string;
+  serde_data: [ubyte];
+}
+
+enum FrameClause : uint8 {
+  ROWS,
+  RANGE
+}
+
+// Unbounded and CurrentRow are empty tables used as empty variants
+// in the Bound union.
+table Unbounded {}
+table CurrentRow {}
+
+// `Bound` represents the window bound computation in a window function like
+// `sum(x) over (unbounded preceding and current row)`.
+union Bound {
+  ArrayExpr,
+  Unbounded,
+  CurrentRow
+}
+
+// `Frame` models a window frame clause, capturing the kind of clause
+// (ROWS/RANGE), how to partition the window, how to order within partitions,
+// and the bounds of the window.
+table Frame {
+  clause: FrameClause;
+  partition_by: [ArrayExpr];
+  order_by: [SortKey];
+  preceding: Bound (required);
+  following: Bound (required);
+}
+
+/// Computes a new table given a set of column selections or array expressions.
+table Project {
+  input: TableExpr (required);
+
+  /// Each expression must reference fields found in the input table
+  /// expression.
+  exprs: [ArrayExpr] (required);
+}
+
+/// Select rows from table for given boolean condition.
+table Filter {
+  input: TableExpr (required);
+
+  /// Array expression using input table expression yielding boolean output
+  /// type.
+  condition: ArrayExpr (required);
+}
+
+/// A "group by" table aggregation: data is grouped using the group
+/// expressions, and the aggregate expressions are evaluated within each group.
+table Aggregate {
+  input: TableExpr;
+  aggregate_exprs: [ArrayExpr] (required);

Review comment:
       Aggregates should be optional; it is possible to do a grouping without aggregate expressions:
   
   ```sql
   SELECT a, b FROM tbl GROUP BY a, b;
   ```
   
   But then the aggregate node needs either at least one aggregate to be defined, or at least one group to be defined (i.e. they can't both be empty). Not sure how to cleanly model that here. A possible solution could be to turn the ungrouped aggregate into a different node (e.g. `GroupedAggregate` and `UngroupedAggregate`).
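    A sketch of that split, so the "at least one of groups/aggregates" invariant is enforced structurally rather than by validation (field names are illustrative):

    ```
    // Sketch only: distinct grouped/ungrouped aggregate nodes.
    table GroupedAggregate {
      input: TableExpr (required);
      group_exprs: [ArrayExpr] (required);  // e.g. SELECT a, b ... GROUP BY a, b
      aggregate_exprs: [ArrayExpr];         // may be empty, as in the query above
    }

    table UngroupedAggregate {
      input: TableExpr (required);
      aggregate_exprs: [ArrayExpr] (required);  // e.g. SELECT sum(x) FROM tbl
    }
    ```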







[GitHub] [arrow] Jimexist commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
Jimexist commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r684645994



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data typesa
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  /// The name identifying the referenced table.
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,

Review comment:
       window functions are complex; maybe they can be phased in as part of a later version?







[GitHub] [arrow] cpcloud commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
cpcloud commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r685319801



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {

Review comment:
       I think I'm not totally clear on what computation this is modelling. Is it a struct field literal, a `1 as foo` construct, or something else?
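    From the `options` field of `ArrayFunction`, it reads as a key/value options entry rather than an aliasing construct like `1 as foo`. A pseudo-value sketch (the function and option names here are illustrative, not standardized):

    ```
    // Sketch only: NamedLiteral used as an options entry (pseudo-values):
    //
    //   ArrayFunction {
    //     descr:   FunctionDescr { name: "sum", type: AGGREGATE }
    //     inputs:  [ <column reference> ]
    //     options: [ NamedLiteral { name: "skip_nulls",
    //                               value: <boolean Literal> } ]
    //   }
    ```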

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);

Review comment:
       Why is this field called `value`, while the rest of the `LiteralData` typed fields are called `data`?

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required

Review comment:
       Is this field necessary? Isn't the type information captured in the `LiteralData` enum?
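For what it's worth, here is a toy sketch (plain Python with hypothetical names, not Arrow code) of why `type_code` isn't redundant: the `LiteralData` union only identifies the *shape* of the payload, but two union children can share the same storage type, so the payload alone can't say which child the value belongs to.

```python
import struct

# Hypothetical in-memory stand-ins for the Flatbuffers tables.
# Union<a: int32 (code 0), b: int32 (code 1)> -- both children are int32,
# so a 4-byte PrimitiveLiteralData payload is ambiguous without a type code.
union_children = {0: ("a", "int32"), 1: ("b", "int32")}

payload = struct.pack("<i", 7)                # PrimitiveLiteralData: 4 bytes
literal = {"type_code": 1, "value": payload}  # UnionLiteralData stand-in

child_name, child_type = union_children[literal["type_code"]]
value = struct.unpack("<i", literal["value"])[0]
print(child_name, value)  # b 7
```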

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;

Review comment:
       Why does this default to `SCALAR`? Per my above comment I'm not totally sure what the purpose of `FunctionType` is. A function could have `type == SCALAR`, but does that mean it takes a scalar and returns a scalar, or some combination of the two?
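My working interpretation (an assumption on my part, not something the draft states) is that `FunctionType` describes input/output cardinality rather than argument kinds: a SCALAR function maps N input rows to N output rows elementwise, while an AGGREGATE function reduces N rows to one value. A toy sketch of that reading:

```python
# Toy illustration (not the Arrow API) of what FunctionType might distinguish.
def scalar_abs(column):
    # FunctionType.SCALAR: elementwise, output length == input length
    return [abs(v) for v in column]

def aggregate_sum(column):
    # FunctionType.AGGREGATE: reduction, one output value per group
    return sum(column)

col = [-1, 2, -3]
print(scalar_abs(col))     # [1, 2, 3]
print(aggregate_sum(col))  # -2
```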

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).

Review comment:
       I think we should stipulate the semantics of a non-zero, non-one value in the byte, i.e., "it's undefined" or "anything non-zero is considered `true`".
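To make the ambiguity concrete, here is a sketch (plain Python, not Arrow code) of the strict reading, where a byte other than 0 or 1 is rejected; the spec text would need to say whether this, or a "non-zero is true" rule, is intended. The little-endian layout for the 8-byte DOUBLE payload is my assumption based on the surrounding comments, not something this file states.

```python
import struct

def decode_bool_strict(data: bytes) -> bool:
    # Strict interpretation: only b"\x01" and b"\x00" are valid.
    if data == b"\x01":
        return True
    if data == b"\x00":
        return False
    raise ValueError(f"invalid boolean literal byte: {data!r}")

def decode_double(data: bytes) -> float:
    # FloatingPoint/DOUBLE literal: exactly 8 bytes, per the file header comment.
    if len(data) != 8:
        raise ValueError("expected 8-byte payload for DOUBLE")
    return struct.unpack("<d", data)[0]

print(decode_bool_strict(b"\x01"))            # True
print(decode_double(struct.pack("<d", 1.5)))  # 1.5
```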

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {
+  order_by:[SortKey];
+
+  partition_by:[ArrayExpr];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to put the function output type when using in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr:FunctionDescr (required);
+  inputs:[ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options:[NamedLiteral];
+
+  /// Optional window expression for window functions only.
+  ///
+  /// TODO: Decide if window functions should be specified in a different way.
+  window:WindowFrame;
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition:ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then:ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else:ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input:ArrayExpr (required);
+  in_exprs:[ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated:bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input:ArrayExpr (required);
+  left_bound:ArrayExpr (required);
+  right_bound:ArrayExpr (required);
+}
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+///
+///
+/// An expression yielding a scalar value can be broadcasted to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op:ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name:string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type:Type;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producers expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {
+  /// The unique name to identify the data source.
+  name:string (required);
+
+  /// The schema of the data source. This may be a partial schema (ignoring
+  /// unused fields), but it at least asserts the fields and types that are
+  /// expected to exist in the data source.
+  schema:Schema (required);
+
+  /// Optional opaque table serialization data, for passing engine-specific
+  /// instructions to enable the data to be accessed.
+  serde_type:string;
+  serde_data:[ubyte];
+}
+
+/// An auxiliary helper instruction to "include all columns" in the projection,
+/// sparing the IR producer the need to enumerate all column references in a
+/// projection.
+table StarSelection {
+  ref:TableReference;
+}
+
+/// A helper union to permit the "SELECT *, $expr0, ..." construct from SQL.
+union ProjectionExpr {
+  ArrayExpr,
+  StarSelection
+}
+
+/// Computes a new table given a set of column selections or array expressions.
+table Projection {
+  input:TableExpr (required);
+
+  /// Each expression must reference fields found in the input table
+  /// expression.
+  exprs:[ProjectionExpr] (required);
+}
+
+/// Select rows from table for given boolean condition.
+table Filter {
+  input:TableExpr (required);
+
+  /// Array expression using input table expression yielding boolean output
+  /// type.
+  condition:ArrayExpr (required);
+}
+
+/// A "group by" table aggregation: data is grouped using the group
+/// expressions, and the aggregate expressions are evaluated within each group.
+table Aggregate {
+  input:TableExpr;
+  aggregate_exprs:[ArrayExpr] (required);
+
+  /// Expressions to use as group keys. If not provided, then the aggregate
+  /// operation yields a table with a single row.
+  group_exprs:[ArrayExpr];
+
+  /// A filter condition for the aggregation which may include aggregate
+  /// functions.
+  having:[ArrayExpr];
+}
+
+/// Select up to the indicated number of rows from the input expression based
+/// on the first-emitted rows when evaluating the input expression. Generally,
+/// no particular order is guaranteed unless combined with a sort expression.
+table Limit {
+  input:TableExpr;
+
+  /// Number of logical rows to select from input.
+  limit:long;
+}
+
+enum RelationalJoinType : int {
+  INNER = 0,
+  LEFT = 1,
+  RIGHT = 2,
+  OUTER = 3,
+  SEMI = 4,
+  ANTI = 5,
+}
+
+/// A standard relational / SQL-style equijoin.
+///
+/// Providing no left/right columns produces the cross product of the two
+/// tables.
+table EqualityJoin {
+  // TODO: complete and document
+  type:RelationalJoinType = INNER;
+
+  left:TableExpr (required);
+  right:TableExpr (required);
+
+  /// The columns from the left expression to use when joining
+  left_columns:[ColumnReference] (required);
+
+  /// The columns from the right expression to use when joining. Must have the
+  /// same length as left_columns.
+  ///
+  /// If omitted, the names provided in left_columns must match the names in
+  /// the right expression.
+  right_columns:[ColumnReference];
+}
+
+/// A relational non-equijoin containing expressions which may include
+/// inequality or range conditions.
+table NonEqualityJoin {
+  // TODO: complete and document
+  type:RelationalJoinType = INNER;
+  left:TableExpr (required);
+  right:TableExpr (required);
+  left_exprs:[ArrayExpr];
+  right_exprs:[ArrayExpr];
+}
+
+/// A temporal join type
+///
+/// TODO: complete and document
+table AsOfJoin {
+  left:TableExpr (required);
+  right:TableExpr;
+
+  /// The column in the left expression to use for data ordering to determine
+  /// the "as of" time.
+  left_asof:ColumnReference (required);
+
+  /// The column in the right expression to use for data ordering to determine
+  /// the "as of" time. If the column name is the same as left_asof, may be
+  /// omitted.
+  right_asof:ColumnReference;
+
+  /// TODO: Define means of providing time deltas in this IR.
+  tolerance:Literal;
+
+  /// If true, the
+  allow_equal:bool = true;
+}
+
+/// An extension of as-of join which allows applying an aggregate function to
+/// the data falling within the indicated time interval.
+///
+/// TODO: Define semantics of "identity" window where all elements of window
+/// become a List<T> element in the result.
+table WindowJoin {
+  // TODO
+}
+
+/// An expression to use for sorting. The key expression determines the values
+/// to be used for ordering the table's rows.
+table SortKey {
+  key:ArrayExpr (required);
+  ascending:bool = true;

Review comment:
       This is one of the API choices I really dislike about pandas, and one I think SQL did well: a specific language construct for indicating sort order. I personally find it really annoying to have to compute a logical negation in my head to get descending behavior. Code in pandas like `df.sort_values([("a", False), ("b", True)])` requires me to know what the bools correspond to, as opposed to an enum like `Order.Asc`, which reads clearly.
   
   IMO we should do the same and have an `enum SortOrder { Ascending, Descending }` and replace the `ascending` with an `order` field.
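A sketch of what that looks like at a call site (illustrative Python, not the IR itself; `SortOrder` and `sort_rows` are hypothetical names):

```python
from enum import Enum

class SortOrder(Enum):
    ASCENDING = 0
    DESCENDING = 1

def sort_rows(rows, keys):
    # keys: list of (column_index, SortOrder) pairs -- self-documenting,
    # unlike a bare bool. Apply keys right-to-left, relying on stable sort.
    for col, order in reversed(keys):
        rows.sort(key=lambda r: r[col],
                  reverse=(order is SortOrder.DESCENDING))
    return rows

rows = [(1, "b"), (2, "a"), (1, "a")]
print(sort_rows(rows, [(0, SortOrder.ASCENDING), (1, SortOrder.DESCENDING)]))
# [(1, 'b'), (1, 'a'), (2, 'a')]
```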

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types.
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+/// function).
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {
+  order_by:[SortKey];
+
+  partition_by:[ArrayExpr];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to specify the function output type when used in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr:FunctionDescr (required);
+  inputs:[ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options:[NamedLiteral];
+
+  /// Optional window expression for window functions only.
+  ///
+  /// TODO: Decide if window functions should be specified in a different way.
+  window:WindowFrame;
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition:ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then:ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else:ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input:ArrayExpr (required);
+  in_exprs:[ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated:bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.

Review comment:
       I think another reason to avoid specifying this operator altogether is that the IR producer can provide whatever API it wants, and produce the right comparison operations.
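
       As a sketch of that point (hypothetical helper and tuple encoding, not actual IR types), a producer can expose a convenient `between()` API while only ever emitting the primitive comparison and AND operations:

       ```python
       def between(input_expr, left_bound, right_bound):
           # Desugar: input BETWEEN left AND right
           #      ==  input >= left AND input <= right
           return ("and",
                   (">=", input_expr, left_bound),
                   ("<=", input_expr, right_bound))

       # The consumer never needs a Between operator; it only sees
       # comparisons and a conjunction.
       expr = between("x", 0, 10)
       print(expr)  # ('and', ('>=', 'x', 0), ('<=', 'x', 10))
       ```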

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types.
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {

Review comment:
       It's not clear to me what this `FunctionType` enum means. Does it imply something about the input type, output type, or both?
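
       For context, these categories usually refer to the row-shape semantics of a function in the SQL sense — an assumption here, since the schema doesn't say. A small Python sketch of the four shapes:

       ```python
       rows = [1, 2, 3, 4]

       # SCALAR: one output row per input row
       scalar = [x + 1 for x in rows]

       # AGGREGATE: one output value per input group
       aggregate = sum(rows)

       # WINDOW: one output row per input row, computed over a frame
       # (here: a running sum over all preceding rows)
       window = [sum(rows[: i + 1]) for i in range(len(rows))]

       # TABLE: the output is itself a table, possibly with a different shape
       table = [(x, x * x) for x in rows]

       print(scalar, aggregate, window, table)
       ```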

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types.
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function)..
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {
+  order_by:[SortKey];
+
+  partition_by:[ArrayExpr];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to specify the function output type when used in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr:FunctionDescr (required);
+  inputs:[ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options:[NamedLiteral];
+
+  /// Optional window expression for window functions only.
+  ///
+  /// TODO: Decide if window functions should be specified in a different way.
+  window:WindowFrame;
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition:ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then:ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else:ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input:ArrayExpr (required);
+  in_exprs:[ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated:bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input:ArrayExpr (required);
+  left_bound:ArrayExpr (required);
+  right_bound:ArrayExpr (required);
+}
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+///
+///
+/// An expression yielding a scalar value can be broadcasted to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op:ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name:string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type:Type;

Review comment:
       More and more I think we should make this required. Maybe there's a use case where the IR is still useful with this field null?

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types.
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function)..
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {
+  order_by:[SortKey];
+
+  partition_by:[ArrayExpr];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to specify the function output type when used in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr:FunctionDescr (required);
+  inputs:[ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options:[NamedLiteral];

Review comment:
       What is the use case for this? It sort of looks like a Python keyword argument, but maybe it's something else?

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines are the names
+/// and semantics of functions that operation on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
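
To make the encoding above concrete: per the comment on the Literal section, a FloatingPoint/DOUBLE literal carries 8 bytes and a boolean a single byte. A minimal Python sketch (helper names are mine, not part of the format; little-endian byte order is assumed to match the Arrow IPC convention):

```python
import struct

def encode_double_literal(value):
    # FloatingPoint with Precision::DOUBLE -> 8 little-endian IEEE 754 bytes
    return struct.pack("<d", value)

def encode_bool_literal(value):
    # Boolean literal: a single byte, 1 for true and 0 for false
    return bytes([1 if value else 0])
```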
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  /// The name identifying the referenced table.
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function)..
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {
+  order_by:[SortKey];
+
+  partition_by:[ArrayExpr];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to specify the function output type when used in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr:FunctionDescr (required);
+  inputs:[ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options:[NamedLiteral];
+
+  /// Optional window expression for window functions only.
+  ///
+  /// TODO: Decide if window functions should be specified in a different way.
+  window:WindowFrame;
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition:ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then:ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else:ArrayExpr (required);
+}
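
The selection semantics described for IfElse can be sketched elementwise in Python (an illustration, not part of the spec; the function name is mine):

```python
def if_else(condition, then_values, else_values):
    """Elementwise select: take from then_values where condition is True,
    otherwise from else_values. All three inputs have equal length."""
    return [t if c else e
            for c, t, e in zip(condition, then_values, else_values)]
```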
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input:ArrayExpr (required);
+  in_exprs:[ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated:bool = false;
+}
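
The desugaring stated in the comment above (IsIn as a chain of Or/Eq) can be sketched in Python for scalar values; the name `is_in` is illustrative only:

```python
from functools import reduce

def is_in(value, in_values, negated=False):
    # IsIn(input, [v0, v1, ...]) == Or(Or(Eq(input, v0), Eq(input, v1)), ...)
    result = reduce(lambda acc, v: acc or (value == v), in_values, False)
    # negated=True flips the result, i.e. "not equal to any of the values"
    return not result if negated else result
```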
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input:ArrayExpr (required);
+  left_bound:ArrayExpr (required);
+  right_bound:ArrayExpr (required);
+}
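
Likewise, the compound-predicate equivalence given for Between is, for scalar inputs (again just an illustrative sketch, names mine):

```python
def between(value, left_bound, right_bound):
    # input BETWEEN left AND right == input >= left AND input <= right
    return left_bound <= value <= right_bound
```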
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+/// An array operation together with an optional name and expected output
+/// type.
+///
+/// An expression yielding a scalar value can be broadcast to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op:ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name:string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type:Type;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producer expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {
+  /// The unique name to identify the data source.
+  name:string (required);
+
+  /// The schema of the data source. This may be a partial schema (ignoring
+  /// unused fields), but it at least asserts the fields and types that are
+  /// expected to exist in the data source.
+  schema:Schema (required);
+
+  /// Optional opaque table serialization data, for passing engine-specific
+  /// instructions to enable the data to be accessed.
+  serde_type:string;
+  serde_data:[ubyte];
+}
+
+/// An auxiliary helper instruction to "include all columns" in the projection,
+/// sparing the IR producer the need to enumerate all column references in a
+/// projection.
+table StarSelection {
+  ref:TableReference;
+}
+
+/// A helper union to permit the "SELECT *, $expr0, ..." construct from SQL.
+union ProjectionExpr {
+  ArrayExpr,
+  StarSelection
+}
+
+/// Computes a new table given a set of column selections or array expressions.
+table Projection {
+  input:TableExpr (required);
+
+  /// Each expression must reference fields found in the input table
+  /// expression.
+  exprs:[ProjectionExpr] (required);
+}
+
+/// Select rows from table for given boolean condition.
+table Filter {
+  input:TableExpr (required);
+
+  /// Array expression using input table expression yielding boolean output
+  /// type.
+  condition:ArrayExpr (required);

Review comment:
       Tiniest of nits: can we call this `predicate`?
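
Whatever the field ends up being called, the Filter semantics amount to row selection by a boolean mask. A hedged Python sketch (function name mine; rows modeled as dicts, and only rows whose condition value is exactly True are kept):

```python
def filter_table(rows, mask):
    # Select rows of the input table for which the boolean condition
    # expression evaluated to True.
    return [row for row, keep in zip(rows, mask) if keep is True]
```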

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines are the names
+/// and semantics of functions that operation on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  /// The name identifying the referenced table.
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function)..
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {
+  order_by:[SortKey];
+
+  partition_by:[ArrayExpr];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to specify the function output type when used in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).

Review comment:
       I think this is inconsistent with the decision to not let any details about input type rules and checking into the IR for no clear gain. There isn't really a reason to assume the input and output types are equal because the output type is omitted.
   
   Users won't be interacting with the IR directly, so there's no reason to add in anything convenient for IR producers/consumers. The producers can implement the elision above the IR, and IMO consumers should never receive IR that is missing types.

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  /// The name identifying the referenced table.
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,

Review comment:
       Doesn't introducing this leak a Python language detail into something that is language agnostic?

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree.
+table TableReference {
+  /// The name identifying the referenced table.
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in a boolean expression
+table Not {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips the sign of a numeric expression
+table Negate {
+  input:ArrayExpr (required);
+}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
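To show how compound predicates nest, here is a hypothetical in-memory mirror of the `BinaryOp` tree using plain Python tuples (`("column", name)` standing in for `ColumnReference`, `("literal", v)` for `Literal`); the predicate `a > 5 AND b < 10` becomes an AND node over two comparison nodes, exactly as it would in the IR:

```python
# Toy mirror of the BinaryOp expression tree, for illustration only.
def col(name):
    return ("column", name)

def lit(value):
    return ("literal", value)

def binary(op, left, right):
    return ("binary", op, left, right)

predicate = binary("AND",
                   binary("GREATER", col("a"), lit(5)),
                   binary("LESS", col("b"), lit(10)))

def evaluate(expr, row):
    """Evaluate the toy tree against a dict row."""
    kind = expr[0]
    if kind == "column":
        return row[expr[1]]
    if kind == "literal":
        return expr[1]
    _, op, left, right = expr
    ops = {
        "AND": lambda a, b: a and b,
        "OR": lambda a, b: a or b,
        "GREATER": lambda a, b: a > b,
        "LESS": lambda a, b: a < b,
        "EQUAL": lambda a, b: a == b,
    }
    return ops[op](evaluate(left, row), evaluate(right, row))
```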
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {
+  order_by:[SortKey];
+
+  partition_by:[ArrayExpr];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to provide the function output type when used in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr:FunctionDescr (required);
+  inputs:[ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options:[NamedLiteral];
+
+  /// Optional window expression for window functions only.
+  ///
+  /// TODO: Decide if window functions should be specified in a different way.
+  window:WindowFrame;
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition:ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then:ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else:ArrayExpr (required);
+}
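A hypothetical element-wise sketch of the `IfElse` semantics over plain lists (not an engine implementation): values come from the then-branch where the condition is true and from the else-branch otherwise.

```python
# Element-wise if-then-else over already-broadcast inputs.
def if_else(condition, then_values, else_values):
    return [t if c else e
            for c, t, e in zip(condition, then_values, else_values)]
```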
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input:ArrayExpr (required);
+  in_exprs:[ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated:bool = false;
+}
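The stated equivalence can be sketched directly; this hypothetical helper mirrors `IsIn` (including `negated`) over plain Python values:

```python
# IsIn(x, [v0, v1, ...]) == Or(Eq(x, v0), Eq(x, v1), ...)
def is_in(value, candidates, negated=False):
    result = any(value == v for v in candidates)
    return not result if negated else result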
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input:ArrayExpr (required);
+  left_bound:ArrayExpr (required);
+  right_bound:ArrayExpr (required);
+}
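The `Between` expansion can likewise be sketched with a hypothetical one-liner over comparable Python values:

```python
# input BETWEEN left AND right == input >= left AND input <= right
def between(value, left_bound, right_bound):
    return left_bound <= value <= right_bound
```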
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  IsNull,
+  IsNotNull,
+  Not,
+  Negate,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+/// An array expression: an array operation together with an optional name and
+/// expected output type.
+///
+/// An expression yielding a scalar value can be broadcast to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op:ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name:string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type:Type;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producer expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {
+  /// The unique name to identify the data source.
+  name:string (required);
+
+  /// The schema of the data source. This may be a partial schema (ignoring
+  /// unused fields), but it at least asserts the fields and types that are
+  /// expected to exist in the data source.
+  schema:Schema (required);
+
+  /// Optional opaque table serialization data, for passing engine-specific
+  /// instructions to enable the data to be accessed.
+  serde_type:string;
+  serde_data:[ubyte];
+}
+
+/// An auxiliary helper instruction to "include all columns" in the projection,
+/// sparing the IR producer the need to enumerate all column references in a
+/// projection.
+table StarSelection {
+  ref:TableReference;
+}
+
+/// A helper union to permit the "SELECT *, $expr0, ..." construct from SQL.
+union ProjectionExpr {
+  ArrayExpr,
+  StarSelection
+}
+
+/// Computes a new table given a set of column selections or array expressions.
+table Projection {
+  input:TableExpr (required);
+
+  /// Each expression must reference fields found in the input table
+  /// expression.
+  exprs:[ProjectionExpr] (required);
+}
+
+/// Select rows from table for given boolean condition.
+table Filter {
+  input:TableExpr (required);
+
+  /// Array expression using input table expression yielding boolean output
+  /// type.
+  condition:ArrayExpr (required);
+}
+
+/// A "group by" table aggregation: data is grouped using the group
+/// expressions, and the aggregate expressions are evaluated within each group.
+table Aggregate {
+  input:TableExpr;
+  aggregate_exprs:[ArrayExpr] (required);
+
+  /// Expressions to use as group keys. If not provided, then the aggregate
+  /// operation yields a table with a single row.
+  group_exprs:[ArrayExpr];
+
+  /// A filter condition for the aggregation which may include aggregate
+  /// functions.
+  having:[ArrayExpr];
+}
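A hypothetical sketch of the `Aggregate` semantics over lists of dict rows, under simplifying assumptions (a single group-key column and a single aggregate function): rows are grouped by the key, the aggregate is evaluated per group, and a `having` predicate filters groups after aggregation.

```python
# Toy group-by aggregation with an optional post-aggregation filter.
from collections import defaultdict

def aggregate(rows, group_key, agg, having=lambda v: True):
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)
    aggregated = {key: agg(members) for key, members in groups.items()}
    return {key: value for key, value in aggregated.items() if having(value)}

rows = [{"k": "a", "v": 1}, {"k": "a", "v": 2}, {"k": "b", "v": 5}]
sums = aggregate(rows, "k", lambda ms: sum(m["v"] for m in ms))
```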
+
+/// Select up to the indicated number of rows from the input expression based
+/// on the first-emitted rows when evaluating the input expression. Generally,
+/// no particular order is guaranteed unless combined with a sort expression.
+table Limit {
+  input:TableExpr;
+
+  /// Number of logical rows to select from input.
+  limit:long;
+}
+
+enum RelationalJoinType : int {
+  INNER = 0,
+  LEFT = 1,
+  RIGHT = 2,
+  OUTER = 3,
+  SEMI = 4,
+  ANTI = 5
+}
+
+/// A standard relational / SQL-style equijoin.
+///
+/// Providing no left/right columns produces the cross product of the two
+/// tables.
+table EqualityJoin {
+  // TODO: complete and document
+  type:RelationalJoinType = INNER;
+
+  left:TableExpr (required);
+  right:TableExpr (required);
+
+  /// The columns from the left expression to use when joining
+  left_columns:[ColumnReference] (required);
+
+  /// The columns from the right expression to use when joining. Must have the
+  /// same length as left_columns.
+  ///
+  /// If omitted, the names provided in left_columns must match the names in
+  /// the right expression.
+  right_columns:[ColumnReference];
+}
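For the INNER case, the join condition can be sketched with a hypothetical nested-loop equijoin over lists of dict rows (a naive illustration of the semantics, not a suggested execution strategy):

```python
# Naive INNER equijoin: emit the merged row wherever the key columns match.
def inner_equijoin(left, right, left_col, right_col):
    out = []
    for lrow in left:
        for rrow in right:
            if lrow[left_col] == rrow[right_col]:
                merged = dict(lrow)
                merged.update(rrow)
                out.append(merged)
    return out
```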
+
+/// A relational non-equijoin containing expressions which may include
+/// inequality or range conditions.
+table NonEqualityJoin {
+  // TODO: complete and document
+  type:RelationalJoinType = INNER;
+  left:TableExpr (required);
+  right:TableExpr (required);
+  left_exprs:[ArrayExpr];
+  right_exprs:[ArrayExpr];
+}
+
+/// A temporal join type
+///
+/// TODO: complete and document
+table AsOfJoin {
+  left:TableExpr (required);
+  right:TableExpr;
+
+  /// The column in the left expression to use for data ordering to determine
+  /// the "as of" time.
+  left_asof:ColumnReference (required);
+
+  /// The column in the right expression to use for data ordering to determine
+  /// the "as of" time. If the column name is the same as left_asof, may be
+  /// omitted.
+  right_asof:ColumnReference;
+
+  /// TODO: Define means of providing time deltas in this IR.
+  tolerance:Literal;
+
+  /// If true, exact (equal) matches on the "as of" ordering column are
+  /// permitted when joining.
+  allow_equal:bool = true;
+}
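Since the table above is still a TODO, here is a hypothetical sketch of the usual as-of matching rule it implies: for each left-side time, pick the latest right-side time that does not exceed it, optionally within a tolerance, with `allow_equal` controlling whether exact matches count.

```python
# Toy as-of match: latest right time <= left time, within tolerance.
def asof_match(left_time, right_times, tolerance=None, allow_equal=True):
    best = None
    for t in right_times:
        if t > left_time or (t == left_time and not allow_equal):
            continue
        if tolerance is not None and left_time - t > tolerance:
            continue
        if best is None or t > best:
            best = t
    return best
```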
+
+/// An extension of as-of join which allows applying an aggregate function to
+/// the data falling within the indicated time interval.
+///
+/// TODO: Define semantics of "identity" window where all elements of window
+/// become a List<T> element in the result.
+table WindowJoin {
+  // TODO
+}
+
+/// An expression to use for sorting. The key expression determines the values
+/// to be used for ordering the table's rows.
+table SortKey {
+  key:ArrayExpr (required);
+  ascending:bool = true;
+}
+
+/// A (possibly hierarchical) sort operation with one or more sort keys.
+table Sort {
+  input:TableExpr (required);
+  keys:[SortKey] (required);
+}
+
+/// A table-generating function.
+///
+/// TODO
+table TableFunction {
+  descr:FunctionDescr (required);
+
+  /// An optional output schema for the table function. If not provided, must
+  /// be determined by the IR consumer.
+  out_schema:Schema;
+}
+
+union TableOperation {
+  ExternalTable,

Review comment:
       Doesn't this get into DDL land? I thought DDL was explicitly out of scope for IR (for now). Additionally, `ExternalTable` implies something about how to access the data (even if very little about how to do that), and I think we already declared that consumers basically have access to a table name and Arrow schema, and it's completely on them to figure out how to access the data. I think we should remove this operation.

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// An expression to use for sorting. The key expression determines the values
+/// to be used for ordering the table's rows.
+table SortKey {
+  key:ArrayExpr (required);
+  ascending:bool = true;
+}
+
+/// A (possibly hierarchical) sort operation with one or more sort keys.
+table Sort {

Review comment:
       I don't think this is necessary, `[SortKey]` covers the use case.

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines are the names
+/// and semantics of functions that operation on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data typesa
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  /// The name of the referenced table.
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
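
To make the nesting concrete, here is how a producer might conceptually assemble the predicate `(price * qty) > 100` from BinaryOp, ColumnReference, and Literal nodes. Plain Python dicts stand in for the Flatbuffers tables; this is an illustration of the tree shape, not a real serialization:

```python
# Nested-dict rendering of (price * qty) > 100; keys mirror the
# table/field names above but carry no Flatbuffers semantics.
def col(name):
    return {"op": {"ColumnReference": {"name": name}}}

predicate = {
    "op": {"BinaryOp": {
        "type": "GREATER",
        "left": {"op": {"BinaryOp": {
            "type": "MULTIPLY",
            "left": col("price"),
            "right": col("qty"),
        }}},
        "right": {"op": {"Literal": {"type": "Double", "data": 100.0}}},
    }},
    "out_type": "Bool",
}
```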
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {

Review comment:
       Yup, just to list these out explicitly:
   
    1. `RANGE` vs `ROWS` clauses
    2. ordering key expression (`ORDER BY`), where the element type has an ordering not including `NULL`s
    3. partitioning key expression (`PARTITION BY`)
    4. preceding expression (a union of an expression and an `UNBOUNDED` constant)
    5. following expression (same type as preceding)
    6. the ability to dictate how to order `NULL`s (e.g., `NULLS FIRST`, `NULLS LAST`)
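
To ground those clauses, here is a miniature pure-Python evaluation of a windowed sum with a ROWS frame (`N PRECEDING` to `CURRENT ROW` by default), including partition and ordering keys. Function and parameter names are made up for illustration; this is not engine code:

```python
from collections import defaultdict

def rows_frame_sum(rows, partition_key, order_key, value,
                   preceding=1, following=0):
    # Bucket rows by the PARTITION BY key.
    groups = defaultdict(list)
    for r in rows:
        groups[r[partition_key]].append(r)
    out = {}
    for grp in groups.values():
        # Apply the ORDER BY key within each partition.
        grp.sort(key=lambda r: r[order_key])
        for i, r in enumerate(grp):
            # ROWS frame: physical offsets around the current row.
            lo = max(0, i - preceding)
            hi = min(len(grp), i + following + 1)
            out[id(r)] = sum(g[value] for g in grp[lo:hi])
    # Emit results in the original row order.
    return [out[id(r)] for r in rows]
```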

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  /// The name of the referenced table.
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {
+  order_by:[SortKey];
+
+  partition_by:[ArrayExpr];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to specify the function output type when wrapping this
+/// in an ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr:FunctionDescr (required);
+  inputs:[ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options:[NamedLiteral];
+
+  /// Optional window expression for window functions only.
+  ///
+  /// TODO: Decide if window functions should be specified in a different way.
+  window:WindowFrame;
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition:ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then:ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else:ArrayExpr (required);
+}
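
The element-wise semantics can be sketched in pure Python (illustrative only; whether a null condition yields null is an engine choice, assumed here):

```python
def if_else(condition, then_vals, else_vals):
    # Select then_vals[i] where condition[i] is true, else_vals[i]
    # where false; a None condition propagates null (an assumption,
    # the IR does not mandate null handling here).
    return [
        None if c is None else (t if c else e)
        for c, t, e in zip(condition, then_vals, else_vals)
    ]
```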
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input:ArrayExpr (required);
+  in_exprs:[ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated:bool = false;
+}
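
The equivalence stated in the comment can be demonstrated directly (pure-Python sketch; names are illustrative):

```python
from functools import reduce

def is_in(values, in_values, negated=False):
    # IsIn(input, [v0, v1, ...]) expanded to the equivalent chain
    # Or(Or(Eq(input, v0), Eq(input, v1)), ...).
    def one(x):
        hit = reduce(lambda acc, v: acc or x == v, in_values, False)
        return not hit if negated else hit
    return [one(x) for x in values]
```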
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input:ArrayExpr (required);
+  left_bound:ArrayExpr (required);
+  right_bound:ArrayExpr (required);
+}
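
The same expansion, with both bounds inclusive, as a pure-Python sketch (illustrative only):

```python
def between(values, left_bound, right_bound):
    # input BETWEEN left_bound AND right_bound, expanded to the
    # compound predicate input >= left_bound AND input <= right_bound.
    return [left_bound <= x <= right_bound for x in values]
```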
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+/// An array-valued expression: an ArrayOperation plus an optional name and
+/// expected output type.
+///
+/// An expression yielding a scalar value can be broadcasted to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op:ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name:string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type:Type;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producer expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {
+  /// The unique name to identify the data source.
+  name:string (required);
+
+  /// The schema of the data source. This may be a partial schema (ignoring
+  /// unused fields), but it at least asserts the fields and types that are
+  /// expected to exist in the data source.
+  schema:Schema (required);
+
+  /// Optional opaque table serialization data, for passing engine-specific
+  /// instructions to enable the data to be accessed.
+  serde_type:string;
+  serde_data:[ubyte];
+}
+
+/// An auxiliary helper instruction to "include all columns" in the projection,
+/// sparing the IR producer the need to enumerate all column references in a
+/// projection.
+table StarSelection {
+  ref:TableReference;
+}
+
+/// A helper union to permit the "SELECT *, $expr0, ..." construct from SQL.
+union ProjectionExpr {
+  ArrayExpr,
+  StarSelection
+}
+
+/// Computes a new table given a set of column selections or array expressions.
+table Projection {
+  input:TableExpr (required);
+
+  /// Each expression must reference fields found in the input table
+  /// expression.
+  exprs:[ProjectionExpr] (required);
+}
+
+/// Select rows from table for given boolean condition.
+table Filter {
+  input:TableExpr (required);
+
+  /// Array expression using input table expression yielding boolean output
+  /// type.
+  condition:ArrayExpr (required);
+}
+
+/// A "group by" table aggregation: data is grouped using the group
+/// expressions, and the aggregate expressions are evaluated within each group.
+table Aggregate {
+  input:TableExpr;
+  aggregate_exprs:[ArrayExpr] (required);
+
+  /// Expressions to use as group keys. If not provided, then the aggregate
+  /// operation yields a table with a single row.
+  group_exprs:[ArrayExpr];
+
+  /// A filter condition for the aggregation which may include aggregate
+  /// functions.
+  having:[ArrayExpr];
+}
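
A hash-based sketch of the grouping semantics (pure Python, illustrative only): rows are bucketed by the group key and an aggregate function is evaluated per bucket; with no group key the whole input forms a single group, yielding a single row.

```python
from collections import defaultdict

def aggregate(rows, group_key, agg):
    # Bucket rows by group_key; group_key=None means one global group.
    groups = defaultdict(list)
    for r in rows:
        groups[r[group_key] if group_key else None].append(r)
    # Evaluate the aggregate function within each group.
    return {k: agg(v) for k, v in groups.items()}
```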
+
+/// Select up to the indicated number of rows from the input expression based
+/// on the first-emitted rows when evaluating the input expression. Generally,
+/// no particular order is guaranteed unless combined with a sort expression.
+table Limit {
+  input:TableExpr;
+
+  /// Number of logical rows to select from input.
+  limit:long;
+}
+
+enum RelationalJoinType : int {
+  INNER = 0,
+  LEFT = 1,
+  RIGHT = 2,
+  OUTER = 3,
+  SEMI = 4,
+  ANTI = 5,
+}
+
+/// A standard relational / SQL-style equijoin.
+///
+/// Providing no left/right columns produces the cross product of the two
+/// tables.
+table EqualityJoin {
+  // TODO: complete and document
+  type:RelationalJoinType = INNER;
+
+  left:TableExpr (required);
+  right:TableExpr (required);
+
+  /// The columns from the left expression to use when joining
+  left_columns:[ColumnReference] (required);
+
+  /// The columns from the right expression to use when joining. Must have the
+  /// same length as left_columns.
+  ///
+  /// If omitted, the names provided in left_columns must match the names in
+  /// the right expression.
+  right_columns:[ColumnReference];
+}

Review comment:
       I think having a single join construct for now makes the most sense. I don't see a reason to differentiate in the IR, since concerns about types of joins are IMO out of scope for the IR and would in every case fall solely to the consumer to differentiate.
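
For readers skimming the thread, the INNER case of the construct under discussion can be sketched as a classic hash join (pure Python, illustrative only; the other RelationalJoinType values differ in which matched or unmatched rows are emitted):

```python
def inner_equijoin(left, right, left_key, right_key):
    # Build a hash table on the right input's join key.
    table = {}
    for r in right:
        table.setdefault(r[right_key], []).append(r)
    # Probe with the left input, emitting merged rows on key equality.
    out = []
    for l in left:
        for r in table.get(l[left_key], []):
            out.append({**r, **l})
    return out
```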

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  /// The name of the referenced table.
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {
+  order_by:[SortKey];
+
+  partition_by:[ArrayExpr];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to specify the function output type when wrapping this
+/// in an ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr:FunctionDescr (required);
+  inputs:[ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options:[NamedLiteral];
+
+  /// Optional window expression for window functions only.
+  ///
+  /// TODO: Decide if window functions should be specified in a different way.
+  window:WindowFrame;
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition:ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then:ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else:ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input:ArrayExpr (required);
+  in_exprs:[ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated:bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input:ArrayExpr (required);
+  left_bound:ArrayExpr (required);
+  right_bound:ArrayExpr (required);
+}
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+/// An array-valued expression: an ArrayOperation plus an optional name and
+/// expected output type.
+///
+/// An expression yielding a scalar value can be broadcasted to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op:ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name:string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type:Type;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producer expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {
+  /// The unique name to identify the data source.
+  name:string (required);
+
+  /// The schema of the data source. This may be a partial schema (ignoring
+  /// unused fields), but it at least asserts the fields and types that are
+  /// expected to exist in the data source.
+  schema:Schema (required);
+
+  /// Optional opaque table serialization data, for passing engine-specific
+  /// instructions to enable the data to be accessed.
+  serde_type:string;
+  serde_data:[ubyte];
+}
+
+/// An auxiliary helper instruction to "include all columns" in the projection,
+/// sparing the IR producer the need to enumerate all column references in a
+/// projection.
+table StarSelection {
+  ref:TableReference;
+}
+
+/// A helper union to permit the "SELECT *, $expr0, ..." construct from SQL.
+union ProjectionExpr {
+  ArrayExpr,
+  StarSelection

Review comment:
       Thinking about window functions more, this is where they'd go IMO. You'd have something like a `WindowedExpr`:
   
   ```flatbuffers
   table WindowedExpr {
     func: ArrayFunction (required);
     frame: Frame;
   }
   
   enum Clause {
     Rows,
     Range,
   }
   
   enum NullOrdering {
     First,
     Last,
   }
   
   enum Unbounded {}
   
   union Bound {
     ArrayExpr,
     // not entirely sure how to model a sum type with an empty variant in flatbuffers
     Unbounded,
   }
   
   table Frame {
     clause: Clause;
     partition: [ArrayExpr];
     order: [SortKey];
     preceding: Bound;
     following: Bound;
   }
   ```

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence among query engines is the
+/// names and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  /// The name of the referenced table.
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in a boolean expression
+table Not {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips the sign of a numeric expression
+table Negate {
+  input:ArrayExpr (required);
+}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {
+  order_by:[SortKey];
+
+  partition_by:[ArrayExpr];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to specify the function's output type when used in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions where double input
+/// yields double output).
+table ArrayFunction {
+  descr:FunctionDescr (required);
+  inputs:[ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options:[NamedLiteral];
+
+  /// Optional window expression for window functions only.
+  ///
+  /// TODO: Decide if window functions should be specified in a different way.
+  window:WindowFrame;
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition:ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then:ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else:ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input:ArrayExpr (required);
+  in_exprs:[ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated:bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input:ArrayExpr (required);
+  left_bound:ArrayExpr (required);
+  right_bound:ArrayExpr (required);
+}
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+/// An array expression: an array operation together with an optional name
+/// and expected output type.
+///
+/// An expression yielding a scalar value can be broadcasted to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op:ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name:string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type:Type;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producer expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {
+  /// The unique name to identify the data source.
+  name:string (required);
+
+  /// The schema of the data source. This may be a partial schema (ignoring
+  /// unused fields), but it at least asserts the fields and types that are
+  /// expected to exist in the data source.
+  schema:Schema (required);
+
+  /// Optional opaque table serialization data, for passing engine-specific
+  /// instructions to enable the data to be accessed.
+  serde_type:string;
+  serde_data:[ubyte];
+}
+
+/// An auxiliary helper instruction to "include all columns" in the projection,
+/// sparing the IR producer the need to enumerate all column references in a
+/// projection.
+table StarSelection {
+  ref:TableReference;
+}
+
+/// A helper union to permit the "SELECT *, $expr0, ..." construct from SQL.
+union ProjectionExpr {
+  ArrayExpr,
+  StarSelection
+}
+
+/// Computes a new table given a set of column selections or array expressions.
+table Projection {
+  input:TableExpr (required);
+
+  /// Each expression must reference fields found in the input table
+  /// expression.
+  exprs:[ProjectionExpr] (required);
+}
+
+/// Select rows from table for given boolean condition.
+table Filter {
+  input:TableExpr (required);
+
+  /// Array expression using input table expression yielding boolean output
+  /// type.
+  condition:ArrayExpr (required);
+}
+
+/// A "group by" table aggregation: data is grouped using the group
+/// expressions, and the aggregate expressions are evaluated within each group.
+table Aggregate {
+  input:TableExpr;
+  aggregate_exprs:[ArrayExpr] (required);
+
+  /// Expressions to use as group keys. If not provided, then the aggregate
+  /// operation yields a table with a single row.
+  group_exprs:[ArrayExpr];
+
+  /// A filter condition for the aggregation which may include aggregate
+  /// functions.
+  having:[ArrayExpr];
+}
+
+/// Select up to the indicated number of rows from the input expression based
+/// on the first-emitted rows when evaluating the input expression. Generally,
+/// no particular order is guaranteed unless combined with a sort expression.
+table Limit {
+  input:TableExpr;
+
+  /// Number of logical rows to select from input.
+  limit:long;
+}
+
+enum RelationalJoinType : int {
+  INNER = 0,
+  LEFT = 1,
+  RIGHT = 2,
+  OUTER = 3,
+  SEMI = 4,
+  ANTI = 5
+}
+
+/// A standard relational / SQL-style equijoin.
+///
+/// Providing no left/right columns produces the cross product of the two
+/// tables.
+table EqualityJoin {
+  // TODO: complete and document
+  type:RelationalJoinType = INNER;
+
+  left:TableExpr (required);
+  right:TableExpr (required);
+
+  /// The columns from the left expression to use when joining
+  left_columns:[ColumnReference] (required);
+
+  /// The columns from the right expression to use when joining. Must have the
+  /// same length as left_columns.
+  ///
+  /// If omitted, the names provided in left_columns must match the names in
+  /// the right expression.
+  right_columns:[ColumnReference];
+}
+
+/// A relational non-equijoin containing expressions which may include
+/// inequality or range conditions.
+table NonEqualityJoin {
+  // TODO: complete and document
+  type:RelationalJoinType = INNER;
+  left:TableExpr (required);
+  right:TableExpr (required);
+  left_exprs:[ArrayExpr];
+  right_exprs:[ArrayExpr];
+}
+
+/// A temporal join type
+///
+/// TODO: complete and document
+table AsOfJoin {
+  left:TableExpr (required);
+  right:TableExpr;
+
+  /// The column in the left expression to use for data ordering to determine
+  /// the "as of" time.
+  left_asof:ColumnReference (required);
+
+  /// The column in the right expression to use for data ordering to determine
+  /// the "as of" time. If the column name is the same as left_asof, may be
+  /// omitted.
+  right_asof:ColumnReference;
+
+  /// TODO: Define means of providing time deltas in this IR.
+  tolerance:Literal;
+
+  /// If true, an exact (equal) match on the as-of key is allowed.
+  allow_equal:bool = true;
+}
+
+/// An extension of as-of join which allows applying an aggregate function to
+/// the data falling within the indicated time interval.
+///
+/// TODO: Define semantics of "identity" window where all elements of window
+/// become a List<T> element in the result.
+table WindowJoin {
+  // TODO
+}
+
+/// An expression to use for sorting. The key expression determines the values
+/// to be used for ordering the table's rows.
+table SortKey {
+  key:ArrayExpr (required);
+  ascending:bool = true;
+}
+
+/// A (possibly hierarchical) sort operation with one or more sort keys.
+table Sort {
+  // TODO
+  keys:[SortKey] (required);
+}
+
+/// A table-generating function.
+///
+/// TODO
+table TableFunction {
+  descr:FunctionDescr (required);
+
+  /// An optional output schema for the table function. If not provided, must
+  /// be determined by the IR consumer.

Review comment:
       What would a consumer do given a missing `out_schema`? I think there's no ambiguity if we stipulate that the consumer must conform to the output type.

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}

Review comment:
       I agree with that. I think special casing some set of operations is probably not worth the effort, unless a function really truly is special in some way. It looks like all of the current functions are N-ary M -> M functions, so we can model things like `IsIn`, `IfElse`, etc using the generic structures.
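The equivalence stated in the `IsIn` doc comment can be checked directly. A small sketch of the semantics (plain Python, not tied to any Arrow API) shows why a generic N-ary function node is sufficient to express it:

```python
from functools import reduce
import operator


def is_in(value, candidates):
    # IsIn(input, [v0, v1, ...]) is equivalent to
    # Or(Or(Eq(input, v0), Eq(input, v1)), ...), so a generic N-ary
    # function call node can express it without a dedicated IsIn table.
    return reduce(operator.or_, (value == v for v in candidates), False)
```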

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to specify the function's output type when used in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions where double input
+/// yields double output).
+table ArrayFunction {
+  descr:FunctionDescr (required);
+  inputs:[ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options:[NamedLiteral];
+
+  /// Optional window expression for window functions only.
+  ///
+  /// TODO: Decide if window functions should be specified in a different way.
+  window:WindowFrame;

Review comment:
       The function and whether it's operating over a window are orthogonal concerns. An aggregate can be applied over a window without the function itself needing to know anything about how it's being applied.
   
   What I've seen and done previously is having 3 classes of functions: 1) `N -> N` (I think @alamb calls this a scalar function in a previous comment), 2) `N -> 1` aggregate functions, and 3) tabular functions (relation to relation)
   
   Windows can be applied over specialized window-only functions typically called "analytic" functions (rank-based, quantiles and a few others), and aggregates.
   
   I think that window functions are sufficiently complex that we may want to avoid them in the prototype phase of this.
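    To make the orthogonality concrete, here is a sketch in Python (illustrative names only, not part of the proposed IR): a windowed aggregate is just an ordinary `N -> 1` aggregate re-applied to each frame, so the aggregate itself never knows it is being windowed.

    ```python
    def aggregate_sum(values):
        # Ordinary N -> 1 aggregate; knows nothing about windows.
        return sum(values)

    def scalar_add_one(values):
        # N -> N scalar (elementwise) function.
        return [v + 1 for v in values]

    def apply_over_window(aggregate, values, frame_size):
        # Windowing is pure framing: slice out each frame and hand it to the
        # unchanged aggregate. The function and the frame stay orthogonal.
        return [
            aggregate(values[max(0, i - frame_size + 1):i + 1])
            for i in range(len(values))
        ]

    apply_over_window(aggregate_sum, [1, 2, 3, 4], 2)  # -> [1, 3, 5, 7]
    ```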

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {
+  order_by:[SortKey];
+
+  partition_by:[ArrayExpr];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to provide the function output type when used in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr:FunctionDescr (required);
+  inputs:[ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options:[NamedLiteral];
+
+  /// Optional window expression for window functions only.
+  ///
+  /// TODO: Decide if window functions should be specified in a different way.
+  window:WindowFrame;
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition:ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then:ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else:ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input:ArrayExpr (required);
+  in_exprs:[ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated:bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input:ArrayExpr (required);
+  left_bound:ArrayExpr (required);
+  right_bound:ArrayExpr (required);
+}
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+/// An expression yielding a scalar value can be broadcasted to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op:ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name:string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type:Type;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producer expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {
+  /// The unique name to identify the data source.
+  name:string (required);
+
+  /// The schema of the data source. This may be a partial schema (ignoring
+  /// unused fields), but it at least asserts the fields and types that are
+  /// expected to exist in the data source.
+  schema:Schema (required);
+
+  /// Optional opaque table serialization data, for passing engine-specific
+  /// instructions to enable the data to be accessed.
+  serde_type:string;
+  serde_data:[ubyte];
+}
+
+/// An auxiliary helper instruction to "include all columns" in the projection,
+/// sparing the IR producer the need to enumerate all column references in a
+/// projection.
+table StarSelection {
+  ref:TableReference;
+}
+
+/// A helper union to permit the "SELECT *, $expr0, ..." construct from SQL.
+union ProjectionExpr {
+  ArrayExpr,
+  StarSelection
+}
+
+/// Computes a new table given a set of column selections or array expressions.
+table Projection {
+  input:TableExpr (required);
+
+  /// Each expression must reference fields found in the input table
+  /// expression.
+  exprs:[ProjectionExpr] (required);
+}
+
+/// Select rows from table for given boolean condition.
+table Filter {
+  input:TableExpr (required);
+
+  /// Array expression using input table expression yielding boolean output
+  /// type.
+  condition:ArrayExpr (required);
+}
+
+/// A "group by" table aggregation: data is grouped using the group
+/// expressions, and the aggregate expressions are evaluated within each group.
+table Aggregate {
+  input:TableExpr;
+  aggregate_exprs:[ArrayExpr] (required);
+
+  /// Expressions to use as group keys. If not provided, then the aggregate
+  /// operation yields a table with a single row.
+  group_exprs:[ArrayExpr];
+
+  /// A filter condition for the aggregation which may include aggregate
+  /// functions.
+  having:[ArrayExpr];
+}
+
+/// Select up to the indicated number of rows from the input expression based
+/// on the first-emitted rows when evaluating the input expression. Generally,
+/// no particular order is guaranteed unless combined with a sort expression.
+table Limit {
+  input:TableExpr;
+
+  /// Number of logical rows to select from input.
+  limit:long;
+}
+
+enum RelationalJoinType : int {
+  INNER = 0,
+  LEFT = 1,
+  RIGHT = 2,
+  OUTER = 3,
+  SEMI = 4,
+  ANTI = 5,
+}
+
+/// A standard relational / SQL-style equijoin.
+///
+/// Providing no left/right columns produces the cross product of the two
+/// tables.
+table EqualityJoin {
+  // TODO: complete and document
+  type:RelationalJoinType = INNER;
+
+  left:TableExpr (required);
+  right:TableExpr (required);
+
+  /// The columns from the left expression to use when joining
+  left_columns:[ColumnReference] (required);
+
+  /// The columns from the right expression to use when joining. Must have the
+  /// same length as left_columns.
+  ///
+  /// If omitted, the names provided in left_columns must match the names in
+  /// the right expression.
+  right_columns:[ColumnReference];
+}
+
+/// A relational non-equijoin containing expressions which may include
+/// inequality or range conditions.
+table NonEqualityJoin {
+  // TODO: complete and document
+  type:RelationalJoinType = INNER;
+  left:TableExpr (required);
+  right:TableExpr (required);
+  left_exprs:[ArrayExpr];
+  right_exprs:[ArrayExpr];
+}
+
+/// A temporal join type
+///
+/// TODO: complete and document
+table AsOfJoin {
+  left:TableExpr (required);
+  right:TableExpr;
+
+  /// The column in the left expression to use for data ordering to determine
+  /// the "as of" time.
+  left_asof:ColumnReference (required);
+
+  /// The column in the right expression to use for data ordering to determine
+  /// the "as of" time. If the column name is the same as left_asof, may be
+  /// omitted.
+  right_asof:ColumnReference;
+
+  /// TODO: Define means of providing time deltas in this IR.
+  tolerance:Literal;
+
+  /// If true, rows whose "as of" values are exactly equal are considered a
+  /// match.
+  allow_equal:bool = true;
+}
+
+/// An extension of as-of join which allows applying an aggregate function to
+/// the data falling within the indicated time interval.
+///
+/// TODO: Define semantics of "identity" window where all elements of window
+/// become a List<T> element in the result.
+table WindowJoin {
+  // TODO
+}
+
+/// An expression to use for sorting. The key expression determines the values
+/// to be used for ordering the table's rows.
+table SortKey {

Review comment:
       This should definitely be in there.







[GitHub] [arrow] jorgecarleitao commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r681415445



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines are the names
+/// and semantics of functions that operation on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data typesa
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression

Review comment:
       ```suggestion
   /// Operation flips true/false values in boolean expression preserving validity
   ```
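    
    For clarity, "preserving validity" here means a null input stays null rather than being flipped. A minimal sketch (using Python `None` to stand in for a null slot):
    
    ```python
    def not_preserving_validity(values):
        # Null (None) slots pass through unchanged; valid booleans are flipped.
        return [None if v is None else (not v) for v in values]

    not_preserving_validity([True, False, None])  # -> [False, True, None]
    ```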







[GitHub] [arrow] cpcloud commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
cpcloud commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r692185590



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,510 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data: [ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code: int;  // required
+
+  value: LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type: Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data: LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type: Type (required);
+  data: [LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name: string;
+  value: Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name: string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name: string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table: TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD,
+  SUBTRACT,
+  MULTIPLY,
+  DIVIDE,
+  EQUAL,
+  NOT_EQUAL,
+  LESS,
+  LESS_EQUAL,
+  GREATER,
+  GREATER_EQUAL,
+  AND,
+  OR,
+  XOR
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type: BinaryOpType;
+  left: ArrayExpr (required);
+  right: ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR,
+  AGGREGATE,
+  WINDOW,
+  TABLE
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name: string (required);
+
+  type: FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data: [ubyte];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to provide the function output type when used in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr: FunctionDescr (required);
+  inputs: [ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options: [NamedLiteral];
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition: ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then: ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else: ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input: ArrayExpr (required);
+  in_exprs: [ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated: bool = false;
+}
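
The equivalence stated in the IsIn comment above can be sketched in Python. The tuple-based expression trees here are purely illustrative and not part of any real IR API:

```python
# Sketch of the IsIn desugaring: IsIn(input, [v0, v1, ...]) is equivalent
# to a chain Or(Or(Eq(input, v0), Eq(input, v1)), ...). The `negated`
# flag wraps the whole chain in Not. Node names are hypothetical.
from functools import reduce

def desugar_is_in(input_expr, in_exprs, negated=False):
    # Build one Eq node per candidate value, then left-fold with Or.
    equalities = [("Eq", input_expr, v) for v in in_exprs]
    expr = reduce(lambda acc, eq: ("Or", acc, eq), equalities)
    return ("Not", expr) if negated else expr

print(desugar_is_in("x", ["a", "b", "c"]))
# ('Or', ('Or', ('Eq', 'x', 'a'), ('Eq', 'x', 'b')), ('Eq', 'x', 'c'))
```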
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input: ArrayExpr (required);
+  left_bound: ArrayExpr (required);
+  right_bound: ArrayExpr (required);
+}
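
The Between equivalence can be sketched the same way (again with hypothetical, tuple-based expression nodes rather than real IR types):

```python
# input BETWEEN left_bound AND right_bound desugars to
# (input >= left_bound) AND (input <= right_bound).
def desugar_between(input_expr, left_bound, right_bound):
    return ("And",
            ("GreaterEqual", input_expr, left_bound),
            ("LessEqual", input_expr, right_bound))

print(desugar_between("x", 1, 9))
# ('And', ('GreaterEqual', 'x', 1), ('LessEqual', 'x', 9))
```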
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+/// An expression yielding a scalar value can be broadcasted to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op: ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name: string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type: Type;
+
+  window: Frame;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producer expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {
+  /// The unique name to identify the data source.
+  name: string (required);
+
+  /// The schema of the data source. This may be a partial schema (ignoring
+  /// unused fields), but it at least asserts the fields and types that are
+  /// expected to exist in the data source.
+  schema: Schema (required);
+
+  /// Optional opaque table serialization data, for passing engine-specific
+  /// instructions to enable the data to be accessed.
+  serde_type: string;
+  serde_data: [ubyte];
+}
+
+enum FrameClause : uint8 {
+  ROWS,
+  RANGE
+}
+
+// Unbounded and CurrentRow are empty tables used as empty variants
+// in the Bound union.
+table Unbounded {}
+table CurrentRow {}
+
+// `Bound` represents the window bound computation in a window function like
+// `sum(x) over (rows between unbounded preceding and current row)`.
+union Bound {
+  ArrayExpr,
+  Unbounded,
+  CurrentRow
+}
+
+// `Frame` models a window frame clause, capturing the kind of clause
+// (ROWS/RANGE), how to partition the window, how to order within partitions,
+// and the bounds of the window.
+table Frame {
+  clause: FrameClause;
+  partition_by: [ArrayExpr];
+  order_by: [SortKey];
+  preceding: Bound (required);
+  following: Bound (required);
+}
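
To make the ROWS bounds concrete, here is an illustrative-only evaluation of a frame with numeric preceding/following offsets. `None` models the Unbounded variant and `0` models CURRENT ROW; none of this is real IR consumer code:

```python
# Evaluate sum(x) over a ROWS frame for each row. `preceding=None`
# means UNBOUNDED PRECEDING; `following=None` means UNBOUNDED FOLLOWING;
# 0 on either side means CURRENT ROW.
def rows_frame_sum(values, preceding, following):
    out = []
    n = len(values)
    for i in range(n):
        lo = 0 if preceding is None else max(0, i - preceding)
        hi = n if following is None else min(n, i + following + 1)
        out.append(sum(values[lo:hi]))
    return out

# sum(x) over (rows between unbounded preceding and current row):
print(rows_frame_sum([1, 2, 3, 4], preceding=None, following=0))
# [1, 3, 6, 10]
```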
+
+/// Computes a new table given a set of column selections or array expressions.
+table Project {
+  input: TableExpr (required);
+
+  /// Each expression must reference fields found in the input table
+  /// expression.
+  exprs: [ArrayExpr] (required);
+}
+
+/// Select rows from table for given boolean condition.
+table Filter {
+  input: TableExpr (required);
+
+  /// Array expression using input table expression yielding boolean output
+  /// type.
+  condition: ArrayExpr (required);
+}
+
+/// A "group by" table aggregation: data is grouped using the group
+/// expressions, and the aggregate expressions are evaluated within each group.
+table Aggregate {
+  input: TableExpr;
+  aggregate_exprs: [ArrayExpr] (required);
+
+  /// Expressions to use as group keys. If not provided, then the aggregate
+  /// operation yields a table with a single row.
+  group_exprs: [ArrayExpr];
+
+  /// A filter condition for the aggregation which may include aggregate
+  /// functions.
+  having: [ArrayExpr];

Review comment:
       Having isn't part of #10934. If a producer wants to provide a `having` API then it's up to them to generate the necessary IR.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] Jimexist commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
Jimexist commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r684645940



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
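
The comment above implies a simple byte-level encoding for primitive literals. A sketch of how such bytes could be produced follows; the little-endian choice is an assumption here, not something this draft specifies:

```python
# Hypothetical encodings for PrimitiveLiteralData payloads:
# fixed-size binary values for numerics, a single 0/1 byte for booleans.
import struct

double_literal = struct.pack("<d", 1.5)   # 8 bytes for FloatingPoint DOUBLE
int32_literal = struct.pack("<i", 42)     # 4 bytes for 32-bit integer
bool_literal = bytes([1])                 # single byte, 1 == true

print(len(double_literal), len(int32_literal), len(bool_literal))
# 8 4 1
```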
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+/// function).
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {
+  order_by:[SortKey];
+
+  partition_by:[ArrayExpr];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to provide the function output type when used in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr:FunctionDescr (required);
+  inputs:[ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options:[NamedLiteral];
+
+  /// Optional window expression for window functions only.
+  ///
+  /// TODO: Decide if window functions should be specified in a different way.
+  window:WindowFrame;
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition:ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then:ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else:ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input:ArrayExpr (required);
+  in_exprs:[ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated:bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input:ArrayExpr (required);
+  left_bound:ArrayExpr (required);
+  right_bound:ArrayExpr (required);
+}
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+///
+///
+/// An expression yielding a scalar value can be broadcasted to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op:ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name:string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type:Type;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producer expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {
+  /// The unique name to identify the data source.
+  name:string (required);
+
+  /// The schema of the data source. This may be a partial schema (ignoring
+  /// unused fields), but it at least asserts the fields and types that are
+  /// expected to exist in the data source.
+  schema:Schema (required);
+
+  /// Optional opaque table serialization data, for passing engine-specific
+  /// instructions to enable the data to be accessed.
+  serde_type:string;
+  serde_data:[ubyte];
+}
+
+/// An auxiliary helper instruction to "include all columns" in the projection,
+/// sparing the IR producer the need to enumerate all column references in a
+/// projection.
+table StarSelection {
+  ref:TableReference;
+}
+
+/// A helper union to permit the "SELECT *, $expr0, ..." construct from SQL.
+union ProjectionExpr {
+  ArrayExpr,
+  StarSelection
+}
+
+/// Computes a new table given a set of column selections or array expressions.
+table Projection {
+  input:TableExpr (required);
+
+  /// Each expression must reference fields found in the input table
+  /// expression.
+  exprs:[ProjectionExpr] (required);
+}
+
+/// Select rows from table for given boolean condition.
+table Filter {
+  input:TableExpr (required);
+
+  /// Array expression using input table expression yielding boolean output
+  /// type.
+  condition:ArrayExpr (required);
+}
+
+/// A "group by" table aggregation: data is grouped using the group
+/// expressions, and the aggregate expressions are evaluated within each group.
+table Aggregate {
+  input:TableExpr;
+  aggregate_exprs:[ArrayExpr] (required);
+
+  /// Expressions to use as group keys. If not provided, then the aggregate
+  /// operation yields a table with a single row.
+  group_exprs:[ArrayExpr];
+
+  /// A filter condition for the aggregation which may include aggregate
+  /// functions.
+  having:[ArrayExpr];
+}
+
+/// Select up to the indicated number of rows from the input expression based
+/// on the first-emitted rows when evaluating the input expression. Generally,
+/// no particular order is guaranteed unless combined with a sort expression.
+table Limit {
+  input:TableExpr;
+
+  /// Number of logical rows to select from input.
+  limit:long;
+}
+
+enum RelationalJoinType : int {
+  INNER = 0,
+  LEFT = 1,
+  RIGHT = 2,
+  OUTER = 3,
+  SEMI = 4,
+  ANTI = 5,
+}
+
+/// A standard relational / SQL-style equijoin.
+///
+/// Providing no left/right columns produces the cross product of the two
+/// tables.
+table EqualityJoin {
+  // TODO: complete and document
+  type:RelationalJoinType = INNER;
+
+  left:TableExpr (required);
+  right:TableExpr (required);
+
+  /// The columns from the left expression to use when joining
+  left_columns:[ColumnReference] (required);
+
+  /// The columns from the right expression to use when joining. Must have the
+  /// same length as left_columns.
+  ///
+  /// If omitted, the names provided in left_columns must match the names in
+  /// the right expression.
+  right_columns:[ColumnReference];
+}
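
The inner-join case of EqualityJoin can be sketched as a small hash join: rows match when the `left_columns` values equal the `right_columns` values. This is illustrative semantics only, not engine code:

```python
# Hash-style inner equijoin over rows represented as dicts. Build an
# index on the right side's key columns, then probe with each left row.
from collections import defaultdict

def inner_equijoin(left_rows, right_rows, left_cols, right_cols):
    index = defaultdict(list)
    for r in right_rows:
        index[tuple(r[c] for c in right_cols)].append(r)
    out = []
    for l in left_rows:
        key = tuple(l[c] for c in left_cols)
        for r in index.get(key, []):
            out.append({**l, **r})
    return out

left = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}]
right = [{"id": 1, "b": 10}]
print(inner_equijoin(left, right, ["id"], ["id"]))
# [{'id': 1, 'a': 'x', 'b': 10}]
```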
+
+/// A relational non-equijoin containing expressions which may include
+/// inequality or range conditions.
+table NonEqualityJoin {
+  // TODO: complete and document
+  type:RelationalJoinType = INNER;
+  left:TableExpr (required);
+  right:TableExpr (required);
+  left_exprs:[ArrayExpr];
+  right_exprs:[ArrayExpr];
+}
+
+/// A temporal join type
+///
+/// TODO: complete and document
+table AsOfJoin {
+  left:TableExpr (required);
+  right:TableExpr;
+
+  /// The column in the left expression to use for data ordering to determine
+  /// the "as of" time.
+  left_asof:ColumnReference (required);
+
+  /// The column in the right expression to use for data ordering to determine
+  /// the "as of" time. If the column name is the same as left_asof, may be
+  /// omitted.
+  right_asof:ColumnReference;
+
+  /// TODO: Define means of providing time deltas in this IR.
+  tolerance:Literal;
+
+  /// If true, the
+  allow_equal:bool = true;
+}
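
The "as of" matching behavior described above can be sketched on sorted key columns: for each left value, take the latest right value at or before it (when `allow_equal` is true), rejecting matches outside the tolerance. A simplified, assumption-laden sketch:

```python
# For each left time, find the most recent right time <= it
# (allow_equal=True) or strictly < it (allow_equal=False), optionally
# within `tolerance`. Assumes both inputs are sorted ascending.
import bisect

def asof_match(left_times, right_times, tolerance=None, allow_equal=True):
    out = []
    for t in left_times:
        if allow_equal:
            i = bisect.bisect_right(right_times, t)
        else:
            i = bisect.bisect_left(right_times, t)
        if i == 0:
            out.append(None)  # nothing at or before t
            continue
        cand = right_times[i - 1]
        out.append(None if tolerance is not None and t - cand > tolerance
                   else cand)
    return out

print(asof_match([1, 5, 10], [0, 4, 9]))
# [0, 4, 9]
```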
+
+/// An extension of as-of join which allows applying an aggregate function to
+/// the data falling within the indicated time interval.
+///
+/// TODO: Define semantics of "identity" window where all elements of window
+/// become a List<T> element in the result.
+table WindowJoin {
+  // TODO
+}
+
+/// An expression to use for sorting. The key expression determines the values
+/// to be used for ordering the table's rows.
+table SortKey {

Review comment:
       null first and null last?







[GitHub] [arrow] pitrou commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r681583577



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function)..
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {
+  order_by:[SortKey];
+
+  partition_by:[ArrayExpr];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to provide the function output type when used in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr:FunctionDescr (required);
+  inputs:[ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options:[NamedLiteral];
+
+  /// Optional window expression for window functions only.
+  ///
+  /// TODO: Decide if window functions should be specified in a different way.
+  window:WindowFrame;
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition:ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then:ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else:ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input:ArrayExpr (required);
+  in_exprs:[ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated:bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input:ArrayExpr (required);
+  left_bound:ArrayExpr (required);
+  right_bound:ArrayExpr (required);
+}
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}

Review comment:
       Why do we have hardcoded functions (BinaryOp, IfElse...) in addition to ArrayFunction? This seems a bit like pointless complication.







[GitHub] [arrow] jorgecarleitao commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r681449081



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);

Review comment:
       In DataFusion this is `(optional)`, to cope with the idea of a typed `NULL`, just like individual items of arrow arrays. AFAIK this would enable optional function arguments.
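       The typed-NULL idea can be sketched as follows. The dataclass names mirror the draft schema tables, but this is only an assumed in-memory mirror showing how a consumer might treat an optional `data` field; it is not generated Flatbuffers code.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PrimitiveLiteralData:
    data: Optional[bytes]  # None models the proposed optional field

@dataclass
class Literal:
    type: str              # stand-in for the Arrow Type table
    data: PrimitiveLiteralData

def unpack_int32(lit: Literal):
    """Decode an Int32 literal, mapping absent data to a typed NULL."""
    if lit.data.data is None:
        return None  # the type is known, the value is absent
    return int.from_bytes(lit.data.data, "little", signed=True)

assert unpack_int32(Literal("Int32", PrimitiveLiteralData(None))) is None
assert unpack_int32(
    Literal("Int32", PrimitiveLiteralData((7).to_bytes(4, "little")))) == 7
```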







[GitHub] [arrow] cpcloud commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
cpcloud commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r692190596



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,510 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data: [ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code: int;  // required
+
+  value: LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type: Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data: LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type: Type (required);
+  data: [LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name: string;
+  value: Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name: string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name: string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table: TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD,
+  SUBTRACT,
+  MULTIPLY,
+  DIVIDE,
+  EQUAL,
+  NOT_EQUAL,
+  LESS,
+  LESS_EQUAL,
+  GREATER,
+  GREATER_EQUAL,
+  AND,
+  OR,
+  XOR
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type: BinaryOpType;
+  left: ArrayExpr (required);
+  right: ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR,
+  AGGREGATE,
+  WINDOW,
+  TABLE
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name: string (required);
+
+  type: FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data: [ubyte];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to put the function output type when using in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr: FunctionDescr (required);
+  inputs: [ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options: [NamedLiteral];
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition: ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then: ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else: ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input: ArrayExpr (required);
+  in_exprs: [ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated: bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. A convenience over spelling out the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input: ArrayExpr (required);
+  left_bound: ArrayExpr (required);
+  right_bound: ArrayExpr (required);
+}
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+/// An expression yielding a scalar value can be broadcasted to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op: ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name: string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type: Type;
+
+  window: Frame;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producers expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {
+  /// The unique name to identify the data source.
+  name: string (required);
+
+  /// The schema of the data source. This may be a partial schema (ignoring
+  /// unused fields), but it at least asserts the fields and types that are
+  /// expected to exist in the data source.
+  schema: Schema (required);
+
+  /// Optional opaque table serialization data, for passing engine-specific
+  /// instructions to enable the data to be accessed.
+  serde_type: string;
+  serde_data: [ubyte];
+}
+
+enum FrameClause : uint8 {
+  ROWS,
+  RANGE
+}
+
+// Unbounded and CurrentRow are empty tables used as empty variants
+// in the Bound union.
+table Unbounded {}
+table CurrentRow {}
+
+// `Bound` represents the window bound computation in a window function like
+// `sum(x) over (unbounded preceding and current row)`.
+union Bound {
+  ArrayExpr,
+  Unbounded,
+  CurrentRow
+}
+
+// `Frame` models a window frame clause, capturing the kind of clause
+// (ROWS/RANGE), how to partition the window, how to order within partitions, and
+// the bounds of the window.
+table Frame {
+  clause: FrameClause;
+  partition_by: [ArrayExpr];
+  order_by: [SortKey];
+  preceding: Bound (required);
+  following: Bound (required);
+}
+
+/// Computes a new table given a set of column selections or array expressions.
+table Project {
+  input: TableExpr (required);
+
+  /// Each expression must reference fields found in the input table
+  /// expression.
+  exprs: [ArrayExpr] (required);
+}
+
+/// Select rows from table for given boolean condition.
+table Filter {
+  input: TableExpr (required);
+
+  /// Array expression using input table expression yielding boolean output
+  /// type.
+  condition: ArrayExpr (required);
+}
+
+/// A "group by" table aggregation: data is grouped using the group
+/// expressions, and the aggregate expressions are evaluated within each group.
+table Aggregate {
+  input: TableExpr;
+  aggregate_exprs: [ArrayExpr] (required);
+
+  /// Expressions to use as group keys. If not provided, then the aggregate
+  /// operation yields a table with a single row.
+  group_exprs: [ArrayExpr];
+
+  /// A filter condition for the aggregation which may include aggregate
+  /// functions.
+  having: [ArrayExpr];
+}
+
+/// Select up to the indicated number of rows from the input expression based
+/// on the first-emitted rows when evaluating the input expression. Generally,
+/// no particular order is guaranteed unless combined with a sort expression.
+table Limit {
+  input: TableExpr;
+
+  /// Number of logical rows to select from input.
+  limit: long;
+}
+
+// The kind of join being produced
+enum RelationalJoinType : int {
+  INNER,
+  LEFT,
+  RIGHT,
+  FULL,
+  SEMI,
+  ANTI,
+}
+
+/// A standard relational / SQL-style equijoin.
+///
+/// Providing no left/right columns produces the cross product of the two
+/// tables.
+table Join {
+  // TODO: complete and document
+  type: RelationalJoinType = INNER;
+
+  left: TableExpr (required);
+  right: TableExpr (required);
+
+  // The expression to use for joining `left` and `right` tables
+  on_expr: ArrayExpr; // a missing on_expr indicates a cross join.
+}
+
+/// A temporal join type
+///
+/// TODO: complete and document
+table AsOfJoin {
+  left: TableExpr (required);
+  right: TableExpr;
+
+  /// The column in the left expression to use for data ordering to determine
+  /// the "as of" time.
+  left_asof: ColumnReference (required);
+
+  /// The column in the right expression to use for data ordering to determine
+  /// the "as of" time. If the column name is the same as left_asof, may be
+  /// omitted.
+  right_asof: ColumnReference;
+
+  /// TODO: Define means of providing time deltas in this IR.
+  tolerance: Literal;
+
+  /// If true, the
+  allow_equal: bool = true;
+}
+
+/// An extension of as-of join which allows applying an aggregate function to
+/// the data falling within the indicated time interval.
+///
+/// TODO: Define semantics of "identity" window where all elements of window
+/// become a List<T> element in the result.
+table WindowJoin {
+  // TODO
+}
+
+// The order in which to sort rows.
+enum Ordering : uint8 {
+  ASCENDING,
+  DESCENDING,
+}
+
+// The way in which NULL values should be ordered when sorting.
+enum NullOrdering : uint8 {
+  FIRST,
+  LAST
+}
+
+/// An expression to use for sorting. The key expression determines the values
+/// to be used for ordering the table's rows.
+table SortKey {
+  key: ArrayExpr (required);
+  ordering: Ordering = ASCENDING;
+  null_ordering: NullOrdering;
+}
+
+/// A table-generating function.
+///
+/// TODO
+table TableFunction {
+  descr: FunctionDescr (required);
+
+  /// An optional output schema for the table function. If not provided, must
+  /// be determined by the IR consumer.
+  out_schema: Schema;
+}
+
+union TableOperation {
+  ExternalTable,
+  Project,
+  Filter,
+  Aggregate,
+  Limit,
+  Join,
+  TableFunction

Review comment:
       This is supported in #10934 with https://github.com/apache/arrow/pull/10934/files#diff-36ffcd270fce14cd204af5b0224821cf9c2cf5aff6c4885cb84553b640dd86f8R186







[GitHub] [arrow] cpcloud commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
cpcloud commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r692170370



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,510 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data: [ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code: int;  // required
+
+  value: LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type: Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data: LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type: Type (required);
+  data: [LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name: string;
+  value: Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name: string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name: string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table: TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD,
+  SUBTRACT,
+  MULTIPLY,
+  DIVIDE,
+  EQUAL,
+  NOT_EQUAL,
+  LESS,
+  LESS_EQUAL,
+  GREATER,
+  GREATER_EQUAL,
+  AND,
+  OR,
+  XOR
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type: BinaryOpType;
+  left: ArrayExpr (required);
+  right: ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR,
+  AGGREGATE,
+  WINDOW,
+  TABLE
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name: string (required);
+
+  type: FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data: [ubyte];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to put the function output type when using in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr: FunctionDescr (required);
+  inputs: [ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options: [NamedLiteral];
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition: ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then: ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else: ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input: ArrayExpr (required);
+  in_exprs: [ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated: bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. A convenience over spelling out the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input: ArrayExpr (required);
+  left_bound: ArrayExpr (required);
+  right_bound: ArrayExpr (required);
+}
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+/// An expression yielding a scalar value can be broadcasted to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op: ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name: string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type: Type;
+
+  window: Frame;

Review comment:
       Yup. I haven't yet added windows to #10934, but I plan on doing that soon.
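       For readers following along, the frame the schema's own example names, `sum(x) over (unbounded preceding and current row)`, i.e. `Frame(clause: ROWS, preceding: Unbounded, following: CurrentRow)`, corresponds to a running aggregate. A minimal semantic sketch in plain Python (not Flatbuffers code):

```python
def running_sum(values):
    """sum(x) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW).

    Each output row aggregates every row from the start of the
    partition up to and including the current row.
    """
    out, acc = [], 0
    for v in values:
        acc += v
        out.append(acc)
    return out

assert running_sum([1, 2, 3, 4]) == [1, 3, 6, 10]
```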







[GitHub] [arrow] cpcloud commented on pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
cpcloud commented on pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#issuecomment-949672069


   @baibaichen Yup, the substrait effort came out of arrow-dev mailing list discussions around arrow compute IR!





[GitHub] [arrow] julianhyde commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
julianhyde commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r683052587



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {
+  order_by:[SortKey];
+
+  partition_by:[ArrayExpr];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to put the function output type when using in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr:FunctionDescr (required);
+  inputs:[ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options:[NamedLiteral];
+
+  /// Optional window expression for window functions only.
+  ///
+  /// TODO: Decide if window functions should be specified in a different way.
+  window:WindowFrame;
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition:ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then:ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else:ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input:ArrayExpr (required);
+  in_exprs:[ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated:bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.

Review comment:
       Probably not necessary. In practice you need Closed-Closed, Open-Closed, and Closed-Open just as often as you need Open-Open.
   
   If you want to go in this direction, look at Calcite Sarg or Guava [RangeSet](https://guava.dev/releases/23.0/api/docs/com/google/common/collect/RangeSet.html). You can model ordered lists of points and intervals. Perfect for range-scans of indexes or sorted lists, spatial operations, etc.
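   As a rough illustration of that direction, an interval-set predicate could be sketched in the same FlatBuffers IDL along these lines (all type and field names here are hypothetical, not part of the proposal):

```flatbuffers
/// Whether an interval endpoint is inclusive, exclusive, or absent.
enum BoundType : byte { CLOSED = 0, OPEN = 1, UNBOUNDED = 2 }

table Endpoint {
  /// Endpoint value; ignored when bound_type is UNBOUNDED.
  value:ArrayExpr;
  bound_type:BoundType = CLOSED;
}

table Interval {
  lower:Endpoint (required);
  upper:Endpoint (required);
}

/// True when input falls inside any of the listed intervals,
/// e.g. x < 0 OR 10 <= x <= 20.
table InIntervals {
  input:ArrayExpr (required);
  intervals:[Interval] (required);
}
```

   Because each endpoint carries its own open/closed/unbounded flag, Closed-Closed, Open-Closed, Closed-Open, and Open-Open all fall out of the one construct, and an ordered list of intervals models a Calcite Sarg or Guava RangeSet.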







[GitHub] [arrow] pitrou commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r681581497



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).

Review comment:
       What endianness is this expressed in?







[GitHub] [arrow] wmalpica commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
wmalpica commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r681829802



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {

Review comment:
       For the WindowFrame to fully support SQL window functions we would also need to be able to specify:
   whether the window frame is ROWS based or RANGE based, and the start and end of the window frame, where the start or end can also be unbounded. https://www.sqltutorial.org/sql-window-functions/
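
   One possible shape for that, sketched in the same FlatBuffers IDL (all names here are illustrative rather than a concrete proposal):

```flatbuffers
enum FrameMode : byte { ROWS = 0, RANGE = 1 }

enum FrameBoundType : byte {
  UNBOUNDED_PRECEDING = 0,
  PRECEDING = 1,
  CURRENT_ROW = 2,
  FOLLOWING = 3,
  UNBOUNDED_FOLLOWING = 4
}

table FrameBound {
  type:FrameBoundType;
  /// Offset for PRECEDING/FOLLOWING: a row count in ROWS mode,
  /// a typed value in RANGE mode. Unused for the other bound types.
  offset:Literal;
}

table WindowFrame {
  order_by:[SortKey];
  partition_by:[ArrayExpr];
  mode:FrameMode = RANGE;
  /// When omitted, the SQL defaults apply:
  /// RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
  start:FrameBound;
  end:FrameBound;
}
```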

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {

Review comment:
       We probably want to have a `UnaryOp` and `UnaryOpType` the way you have for `BinaryOp` below, and have `IsNull`, `IsNotNull`, `Not` and `Negate` be `UnaryOpType`s. We will need to have a lot more of these, for example string functions such as `UpperCase` and `LowerCase` just to mention a pair.
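
   Concretely, that could mirror BinaryOp like so (a sketch only; the op list and names are illustrative, and string ops would keep growing):

```flatbuffers
/// Built-in unary operations; others can go through
/// ArrayFunction/FunctionDescr.
enum UnaryOpType : int {
  IS_NULL = 0,
  IS_NOT_NULL = 1,
  NOT = 2,
  NEGATE = 3,
  UPPER_CASE = 4,
  LOWER_CASE = 5
}

/// Built-in unary operation, analogous to BinaryOp.
table UnaryOp {
  type:UnaryOpType;
  input:ArrayExpr (required);
}
```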







[GitHub] [arrow] jorgecarleitao commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r681449821



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);

Review comment:
       It may be worthwhile skimming through the [function signatures](https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/functions.rs#L61) declared in DataFusion.
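     To make the encoding in the quoted doc comment concrete, here is a small sketch of how a producer/consumer might pack and unpack `PrimitiveLiteralData` payloads. The helper names are hypothetical (they are not part of the proposed schema); the byte layouts follow the doc comment above: 8 little-endian bytes for a DOUBLE literal, one `0`/`1` byte for a boolean.

```python
import struct

# Hypothetical helpers illustrating the PrimitiveLiteralData payload
# encoding described in the doc comment above.

def pack_double_literal(value: float) -> bytes:
    """Encode a FloatingPoint/DOUBLE literal as an 8-byte payload."""
    return struct.pack("<d", value)

def unpack_double_literal(data: bytes) -> float:
    """Decode the 8-byte payload back into a Python float."""
    if len(data) != 8:
        raise ValueError("DOUBLE literal payload must be exactly 8 bytes")
    return struct.unpack("<d", data)[0]

def pack_bool_literal(value: bool) -> bytes:
    """Encode a boolean literal as a single byte: 1 (true) or 0 (false)."""
    return bytes([1 if value else 0])
```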




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r681583998



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function)..
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {
+  order_by:[SortKey];
+
+  partition_by:[ArrayExpr];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to specify the function output type when used in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr:FunctionDescr (required);
+  inputs:[ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options:[NamedLiteral];
+
+  /// Optional window expression for window functions only.
+  ///
+  /// TODO: Decide if window functions should be specified in a different way.
+  window:WindowFrame;
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition:ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then:ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else:ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input:ArrayExpr (required);
+  in_exprs:[ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated:bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input:ArrayExpr (required);
+  left_bound:ArrayExpr (required);
+  right_bound:ArrayExpr (required);
+}
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}

Review comment:
       Ideally, this would be `union ArrayOperation { ColumnReference, Literal, ArrayFunction }`
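     A rough sketch of what that minimal union implies: everything beyond column references and literals desugars into generic `ArrayFunction` calls. The dataclasses and function names below (`and`, `greater_equal`, `less_equal`) are illustrative placeholders, not canonical names from the spec.

```python
from dataclasses import dataclass
from typing import List, Union

# Hypothetical in-memory mirror of the minimal union
# { ColumnReference, Literal, ArrayFunction }.

@dataclass
class ColumnReference:
    name: str

@dataclass
class Literal:
    value: object

@dataclass
class ArrayFunction:
    name: str
    inputs: List["Expr"]

Expr = Union[ColumnReference, Literal, ArrayFunction]

def between(x: Expr, lo: Expr, hi: Expr) -> Expr:
    """Desugar BETWEEN into and(x >= lo, x <= hi) using only ArrayFunction."""
    return ArrayFunction("and", [
        ArrayFunction("greater_equal", [x, lo]),
        ArrayFunction("less_equal", [x, hi]),
    ])
```

Under this scheme a dedicated `Between` table becomes unnecessary; the producer emits the nested function calls directly.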







[GitHub] [arrow] julianhyde commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
julianhyde commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r683050439



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function)..
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {
+  order_by:[SortKey];
+
+  partition_by:[ArrayExpr];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to specify the function output type when used in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr:FunctionDescr (required);
+  inputs:[ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options:[NamedLiteral];
+
+  /// Optional window expression for window functions only.
+  ///
+  /// TODO: Decide if window functions should be specified in a different way.
+  window:WindowFrame;
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition:ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then:ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else:ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input:ArrayExpr (required);
+  in_exprs:[ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated:bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input:ArrayExpr (required);
+  left_bound:ArrayExpr (required);
+  right_bound:ArrayExpr (required);
+}
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+///
+///
+/// An expression yielding a scalar value can be broadcasted to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op:ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name:string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type:Type;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producer expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {
+  /// The unique name to identify the data source.
+  name:string (required);
+
+  /// The schema of the data source. This may be a partial schema (ignoring
+  /// unused fields), but it at least asserts the fields and types that are
+  /// expected to exist in the data source.
+  schema:Schema (required);
+
+  /// Optional opaque table serialization data, for passing engine-specific
+  /// instructions to enable the data to be accessed.
+  serde_type:string;
+  serde_data:[ubyte];
+}
+
+/// An auxiliary helper instruction to "include all columns" in the projection,
+/// sparing the IR producer the need to enumerate all column references in a
+/// projection.
+table StarSelection {
+  ref:TableReference;
+}
+
+/// A helper union to permit the "SELECT *, $expr0, ..." construct from SQL.
+union ProjectionExpr {
+  ArrayExpr,
+  StarSelection

Review comment:
       I think that `StarSelection` is unnecessary syntactic sugar. Will the star be expanded based on alphabetical column names or column ordinals?
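     For reference, a minimal sketch of the ordinal interpretation of the expansion (assuming the star expands in schema field order, not alphabetically; the helper name is hypothetical):

```python
def expand_projection(schema_fields, extra_exprs):
    """Expand "SELECT *, <exprs>" into an explicit projection list.

    schema_fields: field names in schema (ordinal) order.
    extra_exprs:   names of additional projected expressions.
    """
    # Star expands by ordinal position, preserving schema order;
    # extra expressions are appended after the expanded columns.
    return list(schema_fields) + list(extra_exprs)
```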







[GitHub] [arrow] kkraus14 commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
kkraus14 commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r683802355



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {
+  order_by:[SortKey];
+
+  partition_by:[ArrayExpr];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to include the function's output type when used in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr:FunctionDescr (required);
+  inputs:[ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options:[NamedLiteral];
+
+  /// Optional window expression for window functions only.
+  ///
+  /// TODO: Decide if window functions should be specified in a different way.
+  window:WindowFrame;
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition:ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then:ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else:ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input:ArrayExpr (required);
+  in_exprs:[ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated:bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input:ArrayExpr (required);
+  left_bound:ArrayExpr (required);
+  right_bound:ArrayExpr (required);
+}
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+/// An array expression: an array operation plus an optional name and output
+/// type.
+/// An expression yielding a scalar value can be broadcasted to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op:ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name:string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type:Type;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producer expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {
+  /// The unique name to identify the data source.
+  name:string (required);
+
+  /// The schema of the data source. This may be a partial schema (ignoring
+  /// unused fields), but it at least asserts the fields and types that are
+  /// expected to exist in the data source.
+  schema:Schema (required);
+
+  /// Optional opaque table serialization data, for passing engine-specific
+  /// instructions to enable the data to be accessed.
+  serde_type:string;
+  serde_data:[ubyte];
+}
+
+/// An auxiliary helper instruction to "include all columns" in the projection,
+/// sparing the IR producer the need to enumerate all column references in a
+/// projection.
+table StarSelection {
+  ref:TableReference;
+}
+
+/// A helper union to permit the "SELECT *, $expr0, ..." construct from SQL.
+union ProjectionExpr {
+  ArrayExpr,
+  StarSelection
+}
+
+/// Computes a new table given a set of column selections or array expressions.
+table Projection {
+  input:TableExpr (required);
+
+  /// Each expression must reference fields found in the input table
+  /// expression.
+  exprs:[ProjectionExpr] (required);
+}
+
+/// Select rows from table for given boolean condition.
+table Filter {
+  input:TableExpr (required);
+
+  /// Array expression using input table expression yielding boolean output
+  /// type.
+  condition:ArrayExpr (required);
+}
+
+/// A "group by" table aggregation: data is grouped using the group
+/// expressions, and the aggregate expressions are evaluated within each group.
+table Aggregate {
+  input:TableExpr;
+  aggregate_exprs:[ArrayExpr] (required);
+
+  /// Expressions to use as group keys. If not provided, then the aggregate
+  /// operation yields a table with a single row.
+  group_exprs:[ArrayExpr];
+
+  /// A filter condition for the aggregation which may include aggregate
+  /// functions.
+  having:[ArrayExpr];
+}
+
+/// Select up to the indicated number of rows from the input expression based
+/// on the first-emitted rows when evaluating the input expression. Generally,
+/// no particular order is guaranteed unless combined with a sort expression.
+table Limit {
+  input:TableExpr;
+
+  /// Number of logical rows to select from input.
+  limit:long;
+}
+
+enum RelationalJoinType : int {
+  INNER = 0,
+  LEFT = 1,
+  RIGHT = 2,
+  OUTER = 3,
+  SEMI = 4,
+  ANTI = 5,
+}
+
+/// A standard relational / SQL-style equijoin.
+///
+/// Providing no left/right columns produces the cross product of the two
+/// tables.
+table EqualityJoin {
+  // TODO: complete and document
+  type:RelationalJoinType = INNER;
+
+  left:TableExpr (required);
+  right:TableExpr (required);
+
+  /// The columns from the left expression to use when joining
+  left_columns:[ColumnReference] (required);
+
+  /// The columns from the right expression to use when joining. Must have the
+  /// same length as left_columns.
+  ///
+  /// If omitted, the names provided in left_columns must match the names in
+  /// the right expression.
+  right_columns:[ColumnReference];
+}
+
+/// A relational non-equijoin containing expressions which may include
+/// inequality or range conditions.
+table NonEqualityJoin {
+  // TODO: complete and document
+  type:RelationalJoinType = INNER;
+  left:TableExpr (required);
+  right:TableExpr (required);
+  left_exprs:[ArrayExpr];
+  right_exprs:[ArrayExpr];
+}
+
+/// A temporal join type
+///
+/// TODO: complete and document
+table AsOfJoin {
+  left:TableExpr (required);
+  right:TableExpr;
+
+  /// The column in the left expression to use for data ordering to determine
+  /// the "as of" time.
+  left_asof:ColumnReference (required);
+
+  /// The column in the right expression to use for data ordering to determine
+  /// the "as of" time. If the column name is the same as left_asof, may be
+  /// omitted.
+  right_asof:ColumnReference;
+
+  /// TODO: Define means of providing time deltas in this IR.
+  tolerance:Literal;
+
+  /// If true, an exact (equal) match on the as-of columns is permitted.
+  allow_equal:bool = true;

Review comment:
       Should we add the matching behavior with regards to whether it searches for the previous, the next, or the nearest value?
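The matching behavior in question corresponds to the `direction` option in systems such as pandas' `merge_asof` (backward, forward, nearest). As a hedged sketch of those three candidate semantics over sorted keys (the function name and signature here are illustrative only, not part of the proposed IR):

```python
import bisect

def asof_match(left_time, right_times, direction="backward", tolerance=None):
    """Return the index in sorted right_times matched "as of" left_time.

    direction="backward": largest right value <= left_time (the classic as-of).
    direction="forward":  smallest right value >= left_time.
    direction="nearest":  whichever candidate is closest in absolute distance.
    Returns None when no candidate exists or the tolerance is exceeded.
    """
    i = bisect.bisect_right(right_times, left_time)
    prev_i = i - 1 if i > 0 else None             # candidate at or before left_time
    next_i = i if i < len(right_times) else None  # candidate strictly after

    if direction == "backward":
        cand = prev_i
    elif direction == "forward":
        # an exact match sits at prev_i, so prefer it when equal
        if prev_i is not None and right_times[prev_i] == left_time:
            cand = prev_i
        else:
            cand = next_i
    else:  # "nearest"
        choices = [j for j in (prev_i, next_i) if j is not None]
        cand = min(choices, key=lambda j: abs(right_times[j] - left_time),
                   default=None)

    if cand is None:
        return None
    if tolerance is not None and abs(right_times[cand] - left_time) > tolerance:
        return None
    return cand
```

In this framing the schema's `tolerance` field bounds the allowed distance, and `allow_equal` decides whether an exactly equal key counts as a match.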




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] julianhyde commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
julianhyde commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r683047963



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {
+  order_by:[SortKey];
+
+  partition_by:[ArrayExpr];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to include the function's output type when used in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr:FunctionDescr (required);
+  inputs:[ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options:[NamedLiteral];
+
+  /// Optional window expression for window functions only.
+  ///
+  /// TODO: Decide if window functions should be specified in a different way.
+  window:WindowFrame;
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition:ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then:ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else:ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input:ArrayExpr (required);
+  in_exprs:[ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated:bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input:ArrayExpr (required);
+  left_bound:ArrayExpr (required);
+  right_bound:ArrayExpr (required);
+}
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+/// An array expression: an array operation plus an optional name and output
+/// type.
+/// An expression yielding a scalar value can be broadcasted to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op:ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name:string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type:Type;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producer expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {
+  /// The unique name to identify the data source.
+  name:string (required);
+
+  /// The schema of the data source. This may be a partial schema (ignoring
+  /// unused fields), but it at least asserts the fields and types that are
+  /// expected to exist in the data source.
+  schema:Schema (required);
+
+  /// Optional opaque table serialization data, for passing engine-specific
+  /// instructions to enable the data to be accessed.
+  serde_type:string;
+  serde_data:[ubyte];
+}
+
+/// An auxiliary helper instruction to "include all columns" in the projection,
+/// sparing the IR producer the need to enumerate all column references in a
+/// projection.
+table StarSelection {
+  ref:TableReference;
+}
+
+/// A helper union to permit the "SELECT *, $expr0, ..." construct from SQL.
+union ProjectionExpr {
+  ArrayExpr,
+  StarSelection
+}
+
+/// Computes a new table given a set of column selections or array expressions.
+table Projection {

Review comment:
       consider renaming to `Project`. `Filter`, `Aggregate` are verbs.

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {
+  order_by:[SortKey];
+
+  partition_by:[ArrayExpr];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to include the function's output type when used in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr:FunctionDescr (required);
+  inputs:[ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options:[NamedLiteral];
+
+  /// Optional window expression for window functions only.
+  ///
+  /// TODO: Decide if window functions should be specified in a different way.
+  window:WindowFrame;
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition:ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then:ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else:ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input:ArrayExpr (required);
+  in_exprs:[ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated:bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input:ArrayExpr (required);
+  left_bound:ArrayExpr (required);
+  right_bound:ArrayExpr (required);
+}
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+/// An array expression: an array operation plus an optional name and output
+/// type.
+/// An expression yielding a scalar value can be broadcasted to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op:ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name:string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type:Type;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producer expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {
+  /// The unique name to identify the data source.
+  name:string (required);
+
+  /// The schema of the data source. This may be a partial schema (ignoring
+  /// unused fields), but it at least asserts the fields and types that are
+  /// expected to exist in the data source.
+  schema:Schema (required);
+
+  /// Optional opaque table serialization data, for passing engine-specific
+  /// instructions to enable the data to be accessed.
+  serde_type:string;
+  serde_data:[ubyte];
+}
+
+/// An auxiliary helper instruction to "include all columns" in the projection,
+/// sparing the IR producer the need to enumerate all column references in a
+/// projection.
+table StarSelection {
+  ref:TableReference;
+}
+
+/// A helper union to permit the "SELECT *, $expr0, ..." construct from SQL.
+union ProjectionExpr {
+  ArrayExpr,
+  StarSelection
+}
+
+/// Computes a new table given a set of column selections or array expressions.
+table Projection {
+  input:TableExpr (required);
+
+  /// Each expression must reference fields found in the input table
+  /// expression.
+  exprs:[ProjectionExpr] (required);
+}
+
+/// Select rows from the input table satisfying the given boolean condition.
+table Filter {
+  input:TableExpr (required);
+
+  /// Array expression using input table expression yielding boolean output
+  /// type.
+  condition:ArrayExpr (required);
+}
+
+/// A "group by" table aggregation: data is grouped using the group
+/// expressions, and the aggregate expressions are evaluated within each group.
+table Aggregate {
+  input:TableExpr;
+  aggregate_exprs:[ArrayExpr] (required);
+
+  /// Expressions to use as group keys. If not provided, then the aggregate
+  /// operation yields a table with a single row.
+  group_exprs:[ArrayExpr];
+
+  /// A filter condition for the aggregation which may include aggregate
+  /// functions.
+  having:[ArrayExpr];
+}
+
+/// Select up to the indicated number of rows from the input expression based
+/// on the first-emitted rows when evaluating the input expression. Generally,
+/// no particular order is guaranteed unless combined with a sort expression.
+table Limit {
+  input:TableExpr;
+
+  /// Number of logical rows to select from input.
+  limit:long;
+}
+
+enum RelationalJoinType : int {
+  INNER = 0,
+  LEFT = 1,
+  RIGHT = 2,
+  OUTER = 3,

Review comment:
       rename `OUTER` to `FULL`
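
    For concreteness, a sketch of the enum after the rename (only `OUTER`
    changes; any variants following it stay as-is):

    ```
    enum RelationalJoinType : int {
      INNER = 0,
      LEFT = 1,
      RIGHT = 2,
      FULL = 3  // was OUTER
    }
    ```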

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence among query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  /// The name of the referenced table.
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {
+  order_by:[SortKey];
+
+  partition_by:[ArrayExpr];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to include the function output type when used in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr:FunctionDescr (required);
+  inputs:[ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options:[NamedLiteral];
+
+  /// Optional window expression for window functions only.
+  ///
+  /// TODO: Decide if window functions should be specified in a different way.
+  window:WindowFrame;
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition:ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then:ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else:ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input:ArrayExpr (required);
+  in_exprs:[ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated:bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input:ArrayExpr (required);
+  left_bound:ArrayExpr (required);
+  right_bound:ArrayExpr (required);
+}
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+/// An array expression: an array operation plus an optional name and an
+/// expected output type.
+///
+/// An expression yielding a scalar value can be broadcast to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op:ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name:string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type:Type;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producer expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {
+  /// The unique name to identify the data source.
+  name:string (required);
+
+  /// The schema of the data source. This may be a partial schema (ignoring
+  /// unused fields), but it at least asserts the fields and types that are
+  /// expected to exist in the data source.
+  schema:Schema (required);
+
+  /// Optional opaque table serialization data, for passing engine-specific
+  /// instructions to enable the data to be accessed.
+  serde_type:string;
+  serde_data:[ubyte];
+}
+
+/// An auxiliary helper instruction to "include all columns" in the projection,
+/// sparing the IR producer the need to enumerate all column references in a
+/// projection.
+table StarSelection {
+  ref:TableReference;
+}
+
+/// A helper union to permit the "SELECT *, $expr0, ..." construct from SQL.
+union ProjectionExpr {
+  ArrayExpr,
+  StarSelection
+}
+
+/// Computes a new table given a set of column selections or array expressions.
+table Projection {
+  input:TableExpr (required);
+
+  /// Each expression must reference fields found in the input table
+  /// expression.
+  exprs:[ProjectionExpr] (required);
+}
+
+/// Select rows from the input table satisfying the given boolean condition.
+table Filter {
+  input:TableExpr (required);
+
+  /// Array expression using input table expression yielding boolean output
+  /// type.
+  condition:ArrayExpr (required);
+}
+
+/// A "group by" table aggregation: data is grouped using the group
+/// expressions, and the aggregate expressions are evaluated within each group.
+table Aggregate {
+  input:TableExpr;
+  aggregate_exprs:[ArrayExpr] (required);
+
+  /// Expressions to use as group keys. If not provided, then the aggregate
+  /// operation yields a table with a single row.
+  group_exprs:[ArrayExpr];
+
+  /// A filter condition for the aggregation which may include aggregate
+  /// functions.
+  having:[ArrayExpr];

Review comment:
       +1 to remove `having`. `HAVING` was added to SQL only because SQL at the time didn't allow nested queries.
   
   We can talk later about having 'fused' operators (e.g. `Project` followed by `Aggregate` followed by `Filter`) but let's keep the core operators minimal.
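
    To make that concrete — a hedged sketch in the draft's own terms
    (pseudocode, not valid Flatbuffers) of how a `HAVING` query lowers
    without the field:

    ```
    // SELECT k, sum(v) FROM t GROUP BY k HAVING sum(v) > 10
    Filter {
      input: Aggregate {
        input: t,
        aggregate_exprs: [ sum(v) AS sum_v ],
        group_exprs: [ k ]
      },
      condition: sum_v > 10
    }
    ```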

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence among query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  /// The name of the referenced table.
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function)..
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {
+  order_by:[SortKey];
+
+  partition_by:[ArrayExpr];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to include the function output type when used in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr:FunctionDescr (required);
+  inputs:[ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options:[NamedLiteral];
+
+  /// Optional window expression for window functions only.
+  ///
+  /// TODO: Decide if window functions should be specified in a different way.
+  window:WindowFrame;
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition:ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then:ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else:ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input:ArrayExpr (required);
+  in_exprs:[ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated:bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input:ArrayExpr (required);
+  left_bound:ArrayExpr (required);
+  right_bound:ArrayExpr (required);
+}
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+/// An array expression: an array operation plus an optional name and an
+/// expected output type.
+///
+/// An expression yielding a scalar value can be broadcast to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op:ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name:string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type:Type;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producer expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {
+  /// The unique name to identify the data source.
+  name:string (required);
+
+  /// The schema of the data source. This may be a partial schema (ignoring
+  /// unused fields), but it at least asserts the fields and types that are
+  /// expected to exist in the data source.
+  schema:Schema (required);
+
+  /// Optional opaque table serialization data, for passing engine-specific
+  /// instructions to enable the data to be accessed.
+  serde_type:string;
+  serde_data:[ubyte];
+}
+
+/// An auxiliary helper instruction to "include all columns" in the projection,
+/// sparing the IR producer the need to enumerate all column references in a
+/// projection.
+table StarSelection {

Review comment:
       Let's not use the word 'selection'. It will only confuse. In academic relational algebra 'select' means 'Filter'.
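
    Whichever name is chosen, the construct itself is useful — e.g.
    `SELECT *, a + b AS c FROM t` would roughly serialize as (pseudocode
    using the draft's table names):

    ```
    Projection {
      input: t,
      exprs: [
        StarSelection { ref: t },                         // all columns of t
        ArrayExpr { op: BinaryOp(ADD, a, b), name: "c" }  // the extra column
      ]
    }
    ```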

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence among query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  /// The name of the referenced table.
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {
+  order_by:[SortKey];
+
+  partition_by:[ArrayExpr];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to supply the function output type when used in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr:FunctionDescr (required);
+  inputs:[ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options:[NamedLiteral];
+
+  /// Optional window expression for window functions only.
+  ///
+  /// TODO: Decide if window functions should be specified in a different way.
+  window:WindowFrame;
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition:ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then:ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else:ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input:ArrayExpr (required);
+  in_exprs:[ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated:bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input:ArrayExpr (required);
+  left_bound:ArrayExpr (required);
+  right_bound:ArrayExpr (required);
+}
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+/// An array operation together with an optional name and expected output
+/// type.
+///
+/// An expression yielding a scalar value can be broadcast to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op:ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name:string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type:Type;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producer expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {
+  /// The unique name to identify the data source.
+  name:string (required);
+
+  /// The schema of the data source. This may be a partial schema (ignoring
+  /// unused fields), but it at least asserts the fields and types that are
+  /// expected to exist in the data source.
+  schema:Schema (required);
+
+  /// Optional opaque table serialization data, for passing engine-specific
+  /// instructions to enable the data to be accessed.
+  serde_type:string;
+  serde_data:[ubyte];
+}
+
+/// An auxiliary helper instruction to "include all columns" in the projection,
+/// sparing the IR producer the need to enumerate all column references in a
+/// projection.
+table StarSelection {
+  ref:TableReference;
+}
+
+/// A helper union to permit the "SELECT *, $expr0, ..." construct from SQL.
+union ProjectionExpr {
+  ArrayExpr,
+  StarSelection

Review comment:
       I think that `StarSelection` is unnecessary syntactic sugar. Will the star be expanded based on alphabetical column names or column ordinals?
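
   The ordinal-vs-alphabetical question can be pinned down with a small sketch (Python; `expand_projection` and the list-of-names schema are illustrative assumptions, not part of the proposed IR):

   ```python
   # Hypothetical sketch of StarSelection expansion. Most SQL engines expand
   # "*" in schema (ordinal) order, not alphabetically; the IR would need to
   # state which rule applies.

   def expand_projection(projection, schema_columns):
       """Expand any "*" entries into explicit column references."""
       expanded = []
       for item in projection:
           if item == "*":
               expanded.extend(schema_columns)  # ordinal order assumed here
           else:
               expanded.append(item)
       return expanded

   # SELECT *, a + 1 against a table with columns (b, a, c):
   assert expand_projection(["*", "a + 1"], ["b", "a", "c"]) == \
       ["b", "a", "c", "a + 1"]
   ```

   Whichever rule is chosen, consumers must agree on it for plans to be portable.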

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence among query engines is the
+/// names and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types.
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.

Review comment:
       Probably not necessary. In practice you need Closed-Closed, Open-Closed, and Closed-Open just as often as you need Open-Open.
   
   If you want to go in this direction, look at Calcite Sarg or Guava [RangeSet](https://guava.dev/releases/23.0/api/docs/com/google/common/collect/RangeSet.html). You can model ordered lists of points and intervals. Perfect for range-scans of indexes or sorted lists, spatial operations, etc.
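
   The four bound combinations mentioned above can be modeled with an explicit closedness flag per bound (an illustrative Python sketch, not the Guava or Calcite API):

   ```python
   from dataclasses import dataclass

   @dataclass
   class Interval:
       """A 1-D interval where each bound is independently open or closed."""
       lo: float
       hi: float
       lo_closed: bool = True
       hi_closed: bool = True

       def contains(self, x):
           above = (x >= self.lo) if self.lo_closed else (x > self.lo)
           below = (x <= self.hi) if self.hi_closed else (x < self.hi)
           return above and below

   # BETWEEN as drafted covers only the Closed-Closed case:
   assert Interval(0, 10).contains(10)
   # A typical range scan wants Closed-Open, which BETWEEN cannot express:
   assert not Interval(0, 10, hi_closed=False).contains(10)
   ```

   A `Between` table with two closedness flags (or a general interval-set type) would cover all four cases.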

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input:ArrayExpr (required);
+  in_exprs:[ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated:bool = false;

Review comment:
       Be sure to specify the null semantics you intend here. SQL's `NOT IN` has horrendous semantics if input is null or in_exprs contains nulls. And people will want `IsIn` to implement those semantics.
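
   The pitfall here is SQL's three-valued `NOT IN`. A sketch with `None` standing in for NULL (illustrative only, not proposed IR semantics):

   ```python
   # SQL's `x NOT IN (v0, v1, ...)`: the result is UNKNOWN (None), not False,
   # whenever x is NULL, or when the list contains a NULL and no definite
   # match exists -- the behavior the comment calls "horrendous".

   def sql_not_in(x, values):
       if x is None:
           return None                    # NULL NOT IN (...) -> UNKNOWN
       if any(v == x for v in values if v is not None):
           return False                   # a definite match -> FALSE
       if any(v is None for v in values):
           return None                    # the NULL might have matched -> UNKNOWN
       return True

   assert sql_not_in(1, [2, 3]) is True
   assert sql_not_in(1, [1, 2]) is False
   assert sql_not_in(1, [2, None]) is None   # the surprising case
   assert sql_not_in(None, [1, 2]) is None
   ```

   The IR would need to state whether `negated` follows these SQL semantics or simple elementwise inequality.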




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r681415906



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,

Review comment:
       Is this the [kleene](https://en.wikipedia.org/wiki/Three-valued_logic) or non-kleene? E.g. Datafusion (and postgres) use kleene.
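For context on the distinction being asked about, a minimal sketch of Kleene (three-valued) AND, where None stands in for SQL NULL (illustration only, not part of the proposal):

```python
def kleene_and(a, b):
    # Kleene AND: False dominates even an unknown (NULL) operand;
    # otherwise NULL propagates. A non-Kleene AND would instead return
    # NULL whenever either operand is NULL.
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return a and b

print(kleene_and(False, None))  # False (non-Kleene AND would yield None)
print(kleene_and(True, None))   # None
```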




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] baibaichen commented on pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
baibaichen commented on pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#issuecomment-949610252


   hi @wesm what is the difference between https://github.com/substrait-io/substrait and this?
   
   it looks like they are doing the same thing





[GitHub] [arrow] Jimexist commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
Jimexist commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r684645702



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3

Review comment:
       what does this mean?
   
   can a function that generates a series be included in this?
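For what it's worth, the TABLE function type presumably covers set-returning functions like Postgres's generate_series, which yield rows rather than a single scalar. A rough Python analogue of such a series-generating table function (purely illustrative, not tied to the IR):

```python
def generate_series(start: int, stop: int, step: int = 1):
    # A table function produces a stream of rows; here each "row"
    # is a one-column tuple.
    value = start
    while value <= stop:
        yield (value,)
        value += step

print(list(generate_series(1, 5, 2)))  # [(1,), (3,), (5,)]
```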







[GitHub] [arrow] houqp commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
houqp commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r681486225



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);

Review comment:
       i think a similar question can be asked for `name` in TableReference as well.







[GitHub] [arrow] julianhyde commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
julianhyde commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r683048143



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {
+  order_by:[SortKey];
+
+  partition_by:[ArrayExpr];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to put the function output type when using in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr:FunctionDescr (required);
+  inputs:[ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options:[NamedLiteral];
+
+  /// Optional window expression for window functions only.
+  ///
+  /// TODO: Decide if window functions should be specified in a different way.
+  window:WindowFrame;
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition:ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then:ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else:ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input:ArrayExpr (required);
+  in_exprs:[ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated:bool = false;
+}
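The equivalence stated in the IsIn doc comment can be sketched in Python, treating arrays as lists and setting aside null semantics (illustration only):

```python
def is_in(values, in_values, negated=False):
    # IsIn(input, [v0, v1, ...]) is equivalent to
    # Or(Or(Eq(input, v0), Eq(input, v1)), ...), applied elementwise.
    result = [any(v == w for w in in_values) for v in values]
    return [not r for r in result] if negated else result

print(is_in([1, 2, 3], [2, 3]))                # [False, True, True]
print(is_in([1, 2, 3], [2, 3], negated=True))  # [True, False, False]
```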
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input:ArrayExpr (required);
+  left_bound:ArrayExpr (required);
+  right_bound:ArrayExpr (required);
+}
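Likewise, the Between expansion given in the doc comment amounts to this elementwise check (a sketch, ignoring null semantics):

```python
def between(values, left_bound, right_bound):
    # input BETWEEN left AND right is equivalent to
    # input >= left AND input <= right, applied elementwise.
    return [left_bound <= v <= right_bound for v in values]

print(between([1, 5, 10], 2, 9))  # [False, True, False]
```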
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+///
+///
+/// An expression yielding a scalar value can be broadcasted to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op:ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name:string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type:Type;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producers expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {
+  /// The unique name to identify the data source.
+  name:string (required);
+
+  /// The schema of the data source. This may be a partial schema (ignoring
+  /// unused fields), but it at least asserts the fields and types that are
+  /// expected to exist in the data source.
+  schema:Schema (required);
+
+  /// Optional opaque table serialization data, for passing engine-specific
+  /// instructions to enable the data to be accessed.
+  serde_type:string;
+  serde_data:[ubyte];
+}
+
+/// An auxiliary helper instruction to "include all columns" in the projection,
+/// sparing the IR producer the need to enumerate all column references in a
+/// projection.
+table StarSelection {
+  ref:TableReference;
+}
+
+/// A helper union to permit the "SELECT *, $expr0, ..." construct from SQL.
+union ProjectionExpr {
+  ArrayExpr,
+  StarSelection
+}
+
+/// Computes a new table given a set of column selections or array expressions.
+table Projection {
+  input:TableExpr (required);
+
+  /// Each expression must reference fields found in the input table
+  /// expression.
+  exprs:[ProjectionExpr] (required);
+}
+
+/// Select rows from table for given boolean condition.
+table Filter {
+  input:TableExpr (required);
+
+  /// Array expression using input table expression yielding boolean output
+  /// type.
+  condition:ArrayExpr (required);
+}
+
+/// A "group by" table aggregation: data is grouped using the group
+/// expressions, and the aggregate expressions are evaluated within each group.
+table Aggregate {
+  input:TableExpr;
+  aggregate_exprs:[ArrayExpr] (required);
+
+  /// Expressions to use as group keys. If not provided, then the aggregate
+  /// operation yields a table with a single row.
+  group_exprs:[ArrayExpr];
+
+  /// A filter condition for the aggregation which may include aggregate
+  /// functions.
+  having:[ArrayExpr];
+}
+
+/// Select up to the indicated number of rows from the input expression based
+/// on the first-emitted rows when evaluating the input expression. Generally,
+/// no particular order is guaranteed unless combined with a sort expression.
+table Limit {
+  input:TableExpr;
+
+  /// Number of logical rows to select from input.
+  limit:long;
+}
+
+enum RelationalJoinType : int {
+  INNER = 0,
+  LEFT = 1,
+  RIGHT = 2,
+  OUTER = 3,

Review comment:
       rename `OUTER` to `FULL`
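To illustrate why FULL is the conventional name: a full (outer) join keeps unmatched rows from both sides, padding the missing side with NULL. A minimal sketch over single-key dicts, with None standing in for NULL (illustrative, not tied to the IR):

```python
def full_outer_join(left, right):
    # left/right map join key -> value; keys unmatched on either
    # side survive, padded with None on the missing side.
    keys = left.keys() | right.keys()
    return {k: (left.get(k), right.get(k)) for k in sorted(keys)}

print(full_outer_join({"a": 1, "b": 2}, {"b": 20, "c": 30}))
# {'a': (1, None), 'b': (2, 20), 'c': (None, 30)}
```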







[GitHub] [arrow] alamb commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r682833506



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
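To make the encoding described above concrete, here is a small illustrative sketch (plain Python, not normative and not part of the IR; the helper names are hypothetical): a FloatingPoint/DOUBLE literal is carried as 8 little-endian IEEE 754 bytes, and a boolean as a single 0/1 byte.

```python
import struct

def encode_double_literal(value: float) -> bytes:
    # A FloatingPoint(DOUBLE) literal is 8 bytes of IEEE 754 data,
    # little-endian to match the Arrow format.
    return struct.pack("<d", value)

def encode_bool_literal(value: bool) -> bytes:
    # Boolean values are a single byte: 1 (true) or 0 (false).
    return struct.pack("<B", 1 if value else 0)

payload = encode_double_literal(2.5)
assert len(payload) == 8
assert struct.unpack("<d", payload)[0] == 2.5
assert encode_bool_literal(True) == b"\x01"
```

A consumer that knows the literal's Type can unpack this PrimitiveLiteralData payload back into an appropriate scalar object.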
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  /// The name identifying the referenced table.
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
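To show how compound predicates nest, here is a rough sketch (plain Python with hypothetical dataclass mirrors of the IR tables, not generated Flatbuffers code) of how `a + b < c` decomposes into BinaryOp nodes:

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical in-memory mirrors of the IR tables, for illustration only.
@dataclass
class ColumnReference:
    name: str

@dataclass
class BinaryOp:
    type: str          # one of the BinaryOpType names
    left: "Expr"
    right: "Expr"

Expr = Union[ColumnReference, BinaryOp]

# a + b < c  becomes  LESS(ADD(a, b), c)
expr = BinaryOp("LESS",
                BinaryOp("ADD", ColumnReference("a"), ColumnReference("b")),
                ColumnReference("c"))
assert expr.type == "LESS" and expr.left.type == "ADD"
```

In the serialized form each of these nodes would be an ArrayExpr wrapping the corresponding ArrayOperation union member.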
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {
+  order_by:[SortKey];
+
+  partition_by:[ArrayExpr];
+}
+
+/// A general array function call, which may be built-in or user-defined.

Review comment:
       Often a distinction is made between a "scalar function", which produces one output row for each input row; an aggregate, which produces a single output row for all input rows; and a "table function", which produces something in between.
   
   Given that `ArrayFunction` is used in `Filter` below, it seems like `ArrayFunction` must be a "scalar function" by the above definition. 
   
   

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence among query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  /// The name identifying the referenced table.
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {
+  order_by:[SortKey];
+
+  partition_by:[ArrayExpr];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to specify the function's output type when using it in
+/// an ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr:FunctionDescr (required);
+  inputs:[ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options:[NamedLiteral];
+
+  /// Optional window expression for window functions only.
+  ///
+  /// TODO: Decide if window functions should be specified in a different way.
+  window:WindowFrame;

Review comment:
       I would expect `WindowFrame` to appear on the definition of some relational node that was responsible for organizing the data per the frame definition and then calling a function that produced one output row for each window of data in the input.

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence among query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  /// The name identifying the referenced table.
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function)..
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {
+  order_by:[SortKey];
+
+  partition_by:[ArrayExpr];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to specify the function's output type when using it in
+/// an ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr:FunctionDescr (required);
+  inputs:[ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options:[NamedLiteral];
+
+  /// Optional window expression for window functions only.
+  ///
+  /// TODO: Decide if window functions should be specified in a different way.
+  window:WindowFrame;
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition:ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then:ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else:ArrayExpr (required);
+}
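The value semantics of IfElse, including the broadcasting of scalar branches, can be sketched in plain Python (illustrative only, not part of the IR):

```python
def if_else(condition, then, else_):
    """Elementwise select: pick from `then` where the condition is true,
    from `else_` where it is false. Scalar branch inputs are broadcast
    to the length of the condition array."""
    def broadcast(x, n):
        return x if isinstance(x, list) else [x] * n

    n = len(condition)
    then, else_ = broadcast(then, n), broadcast(else_, n)
    return [t if c else e for c, t, e in zip(condition, then, else_)]

# The scalar 0 is broadcast against the 3-element condition array.
assert if_else([True, False, True], [1, 2, 3], 0) == [1, 0, 3]
```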
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input:ArrayExpr (required);
+  in_exprs:[ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated:bool = false;
+}
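The equivalence stated in the comment above can be checked with a small sketch (illustrative Python, not IR code):

```python
from functools import reduce

def is_in(value, candidates, negated=False):
    # IsIn(input, [v0, v1, ...]) is the same as
    # Or(Or(Eq(input, v0), Eq(input, v1)), ...), optionally negated.
    result = reduce(lambda acc, v: acc or value == v, candidates, False)
    return not result if negated else result

assert is_in(2, [1, 2, 3]) is True
assert is_in(5, [1, 2, 3]) is False
assert is_in(5, [1, 2, 3], negated=True) is True
```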
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input:ArrayExpr (required);
+  left_bound:ArrayExpr (required);
+  right_bound:ArrayExpr (required);
+}
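Likewise, the Between desugaring can be sketched as (illustrative Python, not IR code):

```python
def between(value, left_bound, right_bound):
    # input BETWEEN left AND right  ==  input >= left AND input <= right;
    # both bounds are inclusive.
    return value >= left_bound and value <= right_bound

assert between(5, 1, 10)
assert not between(0, 1, 10)
assert between(1, 1, 10)  # inclusive lower bound
```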
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+/// An array expression: an array operation together with an optional name and
+/// expected output type.
+///
+/// An expression yielding a scalar value can be broadcast to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op:ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name:string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type:Type;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producer expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {
+  /// The unique name to identify the data source.
+  name:string (required);
+
+  /// The schema of the data source. This may be a partial schema (ignoring
+  /// unused fields), but it at least asserts the fields and types that are
+  /// expected to exist in the data source.
+  schema:Schema (required);
+
+  /// Optional opaque table serialization data, for passing engine-specific
+  /// instructions to enable the data to be accessed.
+  serde_type:string;
+  serde_data:[ubyte];
+}
+
+/// An auxiliary helper instruction to "include all columns" in the projection,
+/// sparing the IR producer the need to enumerate all column references in a
+/// projection.
+table StarSelection {
+  ref:TableReference;
+}
+
+/// A helper union to permit the "SELECT *, $expr0, ..." construct from SQL.
+union ProjectionExpr {
+  ArrayExpr,
+  StarSelection
+}
+
+/// Computes a new table given a set of column selections or array expressions.
+table Projection {
+  input:TableExpr (required);
+
+  /// Each expression must reference fields found in the input table
+  /// expression.
+  exprs:[ProjectionExpr] (required);
+}
+
+/// Select rows from table for given boolean condition.
+table Filter {
+  input:TableExpr (required);
+
+  /// Array expression using input table expression yielding boolean output
+  /// type.
+  condition:ArrayExpr (required);
+}
+
+/// A "group by" table aggregation: data is grouped using the group
+/// expressions, and the aggregate expressions are evaluated within each group.
+table Aggregate {
+  input:TableExpr;
+  aggregate_exprs:[ArrayExpr] (required);
+
+  /// Expressions to use as group keys. If not provided, then the aggregate
+  /// operation yields a table with a single row.
+  group_exprs:[ArrayExpr];
+
+  /// A filter condition for the aggregation which may include aggregate
+  /// functions.
+  having:[ArrayExpr];

Review comment:
       Most IRs I have seen model this as a Filter after the Aggregate (and then physical implementations might push the having expressions into the specific aggregate operator)

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence among query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  /// The name identifying the referenced table.
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {

Review comment:
       It is also possible to model things like `IS NOT NULL` as a unary function rather than a unique node in the IR if we want to reduce the number of types in this tree
   
   For example, you could model `IS NOT NULL <column>` like `ArrayFunction(descr="IsNotNull", inputs=[column])`, which perhaps is what @pitrou is suggesting in https://github.com/apache/arrow/pull/10856/files#r681583577
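
   A minimal sketch of the two encodings being compared, using plain Python dataclasses as a hypothetical in-memory mirror of the draft IR tables (names follow the schema under review, but this is not a real Arrow API):

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class ColumnReference:
    name: str

@dataclass
class IsNotNull:
    # dedicated IR node, as in the current draft
    input: Any

@dataclass
class FunctionDescr:
    name: str

@dataclass
class ArrayFunction:
    # generic function-call encoding of the same operation
    descr: FunctionDescr
    inputs: List[Any]

col = ColumnReference("l_comments")
as_node = IsNotNull(input=col)
as_call = ArrayFunction(descr=FunctionDescr("is_not_null"), inputs=[col])
```

   Both trees carry the same information; the generic form needs no new IR type per operation, at the cost of pushing validation from the schema into the consumer.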
   

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {
+  order_by:[SortKey];
+
+  partition_by:[ArrayExpr];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to put the function output type when using in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr:FunctionDescr (required);
+  inputs:[ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options:[NamedLiteral];
+
+  /// Optional window expression for window functions only.
+  ///
+  /// TODO: Decide if window functions should be specified in a different way.
+  window:WindowFrame;
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition:ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then:ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else:ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input:ArrayExpr (required);
+  in_exprs:[ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated:bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input:ArrayExpr (required);
+  left_bound:ArrayExpr (required);
+  right_bound:ArrayExpr (required);
+}
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+///
+///
+/// An expression yielding a scalar value can be broadcasted to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op:ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name:string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type:Type;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producers expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {
+  /// The unique name to identify the data source.
+  name:string (required);
+
+  /// The schema of the data source. This may be a partial schema (ignoring
+  /// unused fields), but it at least asserts the fields and types that are
+  /// expected to exist in the data source.
+  schema:Schema (required);
+
+  /// Optional opaque table serialization data, for passing engine-specific
+  /// instructions to enable the data to be accessed.
+  serde_type:string;
+  serde_data:[ubyte];
+}
+
+/// An auxiliary helper instruction to "include all columns" in the projection,
+/// sparing the IR producer the need to enumerate all column references in a
+/// projection.
+table StarSelection {
+  ref:TableReference;
+}
+
+/// A helper union to permit the "SELECT *, $expr0, ..." construct from SQL.
+union ProjectionExpr {
+  ArrayExpr,
+  StarSelection
+}
+
+/// Computes a new table given a set of column selections or array expressions.
+table Projection {
+  input:TableExpr (required);
+
+  /// Each expression must reference fields found in the input table
+  /// expression.
+  exprs:[ProjectionExpr] (required);
+}
+
+/// Select rows from table for given boolean condition.
+table Filter {
+  input:TableExpr (required);
+
+  /// Array expression using input table expression yielding boolean output
+  /// type.
+  condition:ArrayExpr (required);
+}
+
+/// A "group by" table aggregation: data is grouped using the group
+/// expressions, and the aggregate expressions are evaluated within each group.
+table Aggregate {
+  input:TableExpr;
+  aggregate_exprs:[ArrayExpr] (required);
+
+  /// Expressions to use as group keys. If not provided, then the aggregate
+  /// operation yields a table with a single row.
+  group_exprs:[ArrayExpr];
+
+  /// A filter condition for the aggregation which may include aggregate
+  /// functions.
+  having:[ArrayExpr];
+}
+
+/// Select up to the indicated number of rows from the input expression based
+/// on the first-emitted rows when evaluating the input expression. Generally,
+/// no particular order is guaranteed unless combined with a sort expression.
+table Limit {
+  input:TableExpr;
+
+  /// Number of logical rows to select from input.
+  limit:long;
+}
+
+enum RelationalJoinType : int {
+  INNER = 0,
+  LEFT = 1,
+  RIGHT = 2,
+  OUTER = 3,
+  SEMI = 4,
+  ANTI = 5,
+}
+
+/// A standard relational / SQL-style equijoin.
+///
+/// Providing no left/right columns produces the cross product of the two
+/// tables.
+table EqualityJoin {
+  // TODO: complete and document
+  type:RelationalJoinType = INNER;
+
+  left:TableExpr (required);
+  right:TableExpr (required);
+
+  /// The columns from the left expression to use when joining
+  left_columns:[ColumnReference] (required);
+
+  /// The columns from the right expression to use when joining. Must have the
+  /// same length as left_columns.
+  ///
+  /// If omitted, the names provided in left_columns must match the names in
+  /// the right expression.
+  right_columns:[ColumnReference];
+}

Review comment:
       SQL-style joins allow any arbitrary predicate (often called `on_exprs`, as they appear in the `ON` clause). The break-out of left/right columns for an equijoin is almost always required for performance reasons, but there can be other predicates.
   
   For example, here is a query you cannot represent using the Compute IR in this PR:
   
   ```sql
   SELECT * 
   FROM 
     orders LEFT JOIN lineitem ON (l_orderkey = o_orderkey AND l_comments LIKE '%one star%')
   ```
    which would produce values for all orders, even if they didn't have any "one star" reviews.
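
   A toy illustration (plain Python over lists of dicts, not Arrow) of the LEFT-join semantics above: the ON clause combines an equi-key with a residual predicate, and unmatched left rows are kept. Column names mirror the TPC-H-style SQL; everything here is illustrative only.

```python
def left_join(left, right, key_l, key_r, residual):
    """LEFT join on an equi-key plus an arbitrary residual ON predicate."""
    out = []
    for l in left:
        matched = False
        for r in right:
            # equi-key (the part engines fast-path) AND residual predicate
            if r[key_r] == l[key_l] and residual(l, r):
                out.append({**l, **r})
                matched = True
        if not matched:
            out.append(dict(l))  # LEFT join: keep unmatched left rows
    return out

orders = [{"o_orderkey": 1}, {"o_orderkey": 2}]
lineitem = [{"l_orderkey": 1, "l_comments": "one star, would not buy"}]
rows = left_join(orders, lineitem, "o_orderkey", "l_orderkey",
                 lambda l, r: "one star" in r["l_comments"])
```

   Order 2 survives with no lineitem columns; representing the `LIKE '%one star%'` part of the ON clause would require a general predicate field in the join table, not just `left_columns`/`right_columns`.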

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {

Review comment:
       It is also possible to model things like `IS NOT NULL` as a unary function rather than a unique node in the IR if we want to reduce the number of types in this tree
   
   For example, you could model `IS NOT NULL <column>` like `ArrayFunction(descr="IsNotNull", inputs=[column])`, which perhaps is what @pitrou is suggesting in https://github.com/apache/arrow/pull/10856/files#r681583577
   







[GitHub] [arrow] emkornfield commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r681388356



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data:[ubyte];

Review comment:
       At least I think these are trying to map the same concepts; I could be mistaken.
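
   One hypothetical way the opaque `data` sidecar could be used, sketched in plain Python (pickle stands in for a real UDF serialization here; it pickles the function by reference, so a production system would need a self-contained format):

```python
import pickle

def twice(x):
    # toy user-defined scalar function
    return x * 2

# Hypothetical dict mirror of a FunctionDescr carrying the UDF as bytes.
descr = {
    "name": "twice",              # FunctionDescr.name
    "type": "SCALAR",             # FunctionType.SCALAR
    "data": pickle.dumps(twice),  # the [ubyte] sidecar
}

# A consumer that trusts the producer rehydrates and registers the UDF.
udf = pickle.loads(descr["data"])
```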







[GitHub] [arrow] wesm commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r682597052



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  /// The name of the referenced table.
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,

Review comment:
       Good question. I didn't think so but let's check




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r682596714



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence among query engines is the
+/// names and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  /// The name of the referenced table.
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,

Review comment:
       Yes, `/` vs. `//` (`__floordiv__`) in Python. Will add
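
For readers unfamiliar with the distinction being discussed, Python's two
division operators illustrate why a separate floor-division op code matters
(note that Python floors toward negative infinity, while some SQL engines
truncate integer division toward zero, so the choice of semantics must be
pinned down by the IR):

```python
# True division vs. floor division
assert 7 / 2 == 3.5    # DIVIDE (true division)
assert 7 // 2 == 3     # floor division (__floordiv__)
assert -7 // 2 == -4   # Python floors toward negative infinity
```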







[GitHub] [arrow] cpcloud commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
cpcloud commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r685522985



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,510 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence among query engines is the
+/// names and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data: [ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code: int;  // required
+
+  value: LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type: Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data: LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type: Type (required);
+  data: [LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name: string;
+  value: Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  /// The name of the referenced table.
+  name: string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name: string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table: TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD,
+  SUBTRACT,
+  MULTIPLY,
+  DIVIDE,
+  EQUAL,
+  NOT_EQUAL,
+  LESS,
+  LESS_EQUAL,
+  GREATER,
+  GREATER_EQUAL,
+  AND,
+  OR,
+  XOR
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type: BinaryOpType;
+  left: ArrayExpr (required);
+  right: ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR,
+  AGGREGATE,
+  WINDOW,
+  TABLE
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name: string (required);
+
+  type: FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data: [ubyte];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to include the function output type when used in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr: FunctionDescr (required);
+  inputs: [ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options: [NamedLiteral];
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition: ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then: ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else: ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input: ArrayExpr (required);
+  in_exprs: [ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated: bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input: ArrayExpr (required);
+  left_bound: ArrayExpr (required);
+  right_bound: ArrayExpr (required);
+}
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+/// An expression yielding a scalar value can be broadcasted to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op: ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name: string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type: Type;
+
+  window: Frame;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producer expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {
+  /// The unique name to identify the data source.
+  name: string (required);
+
+  /// The schema of the data source. This may be a partial schema (ignoring
+  /// unused fields), but it at least asserts the fields and types that are
+  /// expected to exist in the data source.
+  schema: Schema (required);
+
+  /// Optional opaque table serialization data, for passing engine-specific
+  /// instructions to enable the data to be accessed.
+  serde_type: string;
+  serde_data: [ubyte];
+}
+
+enum FrameClause : uint8 {
+  ROWS,
+  RANGE
+}
+
+// Unbounded and CurrentRow are empty tables used as empty variants
+// in the Bound union.
+table Unbounded {}
+table CurrentRow {}
+
+// `Bound` represents the window bound computation in a window function like
+// `sum(x) over (rows between unbounded preceding and current row)`.
+union Bound {
+  ArrayExpr,
+  Unbounded,
+  CurrentRow
+}
+
+// `Frame` models a window frame clause, capturing the kind of clause
+// (ROWS/RANGE), how to partition the window, how to order within partitions,
+// and the bounds of the window.
+table Frame {
+  clause: FrameClause;
+  partition_by: [ArrayExpr];
+  order_by: [SortKey];
+  preceding: Bound (required);
+  following: Bound (required);
+}
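
As a hypothetical scalar sketch of ROWS-frame evaluation within a single
partition (the `preceding`/`following` parameters mirror the Frame fields
above; clipping the frame to the partition edges is an assumption here, and
engines may define boundary behavior differently):

```python
def rows_frame_sum(values, preceding, following):
    # For each row i, sum the values in the ROWS frame
    # [i - preceding, i + following], clipped to the partition.
    out = []
    for i in range(len(values)):
        lo = max(0, i - preceding)
        hi = min(len(values), i + following + 1)
        out.append(sum(values[lo:hi]))
    return out

# A running sum: each row plus one preceding row.
assert rows_frame_sum([1, 2, 3, 4], preceding=1, following=0) == [1, 3, 5, 7]
```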
+
+/// Computes a new table given a set of column selections or array expressions.
+table Project {
+  input: TableExpr (required);
+
+  /// Each expression must reference fields found in the input table
+  /// expression.
+  exprs: [ArrayExpr] (required);
+}
+
+/// Select rows from table for given boolean condition.
+table Filter {
+  input: TableExpr (required);
+
+  /// Array expression using input table expression yielding boolean output
+  /// type.
+  condition: ArrayExpr (required);
+}
+
+/// A "group by" table aggregation: data is grouped using the group
+/// expressions, and the aggregate expressions are evaluated within each group.
+table Aggregate {
+  input: TableExpr;
+  aggregate_exprs: [ArrayExpr] (required);
+
+  /// Expressions to use as group keys. If not provided, then the aggregate
+  /// operation yields a table with a single row.
+  group_exprs: [ArrayExpr];
+
+  /// A filter condition for the aggregation which may include aggregate
+  /// functions.
+  having: [ArrayExpr];
+}
+
+/// Select up to the indicated number of rows from the input expression based
+/// on the first-emitted rows when evaluating the input expression. Generally,
+/// no particular order is guaranteed unless combined with a sort expression.
+table Limit {
+  input: TableExpr;
+
+  /// Number of logical rows to select from input.
+  limit: long;
+}
+
+// The kind of join being produced
+enum RelationalJoinType : int {
+  INNER,
+  LEFT,
+  RIGHT,
+  FULL,
+  SEMI,
+  ANTI,
+}
+
+/// A standard relational / SQL-style equijoin.
+///
+/// Providing no left/right columns produces the cross product of the two
+/// tables.
+table Join {
+  // TODO: complete and document
+  type: RelationalJoinType = INNER;
+
+  left: TableExpr (required);
+  right: TableExpr (required);
+
+  // The expression to use for joining `left` and `right` tables
+  on_expr: ArrayExpr; // a missing on_expr indicates a cross join.
+}
+
+/// A temporal join type
+///
+/// TODO: complete and document
+table AsOfJoin {
+  left: TableExpr (required);
+  right: TableExpr;
+
+  /// The column in the left expression to use for data ordering to determine
+  /// the "as of" time.
+  left_asof: ColumnReference (required);
+
+  /// The column in the right expression to use for data ordering to determine
+  /// the "as of" time. If the column name is the same as left_asof, may be
+  /// omitted.
+  right_asof: ColumnReference;
+
+  /// TODO: Define means of providing time deltas in this IR.
+  tolerance: Literal;
+
+  /// If true, values that are exactly equal in the as-of columns may match.
+  allow_equal: bool = true;
+}
+
+/// An extension of as-of join which allows applying an aggregate function to
+/// the data falling within the indicated time interval.
+///
+/// TODO: Define semantics of "identity" window where all elements of window
+/// become a List<T> element in the result.
+table WindowJoin {
+  // TODO
+}
+
+// The order in which to sort rows.
+enum Ordering : uint8 {
+  ASCENDING,
+  DESCENDING,
+}
+
+// The way in which NULL values should be ordered when sorting.
+enum NullOrdering : uint8 {
+  FIRST,
+  LAST
+}
+
+/// An expression to use for sorting. The key expression determines the values
+/// to be used for ordering the table's rows.
+table SortKey {
+  key: ArrayExpr (required);
+  ordering: Ordering = ASCENDING;
+  null_ordering: NullOrdering;
+}
+
+/// A table-generating function.
+///
+/// TODO
+table TableFunction {
+  descr: FunctionDescr (required);
+
+  /// An optional output schema for the table function. If not provided, must
+  /// be determined by the IR consumer.
+  out_schema: Schema;
+}
+
+union TableOperation {
+  ExternalTable,
+  Project,
+  Filter,
+  Aggregate,
+  Limit,
+  Join,
+  TableFunction
+}
+
+/// An expression
+table TableExpr {
+  /// The operation that yields data.
+  op: TableOperation (required);
+
+  /// An optional explicit name for this table expression, to enable
+  /// unambiguous column references. If not set, the name can be inherited from
+  /// an antecedent table in some cases.
+  name: string;
+
+  /// Optional output schema. A schema can be serialized here for informational
+  /// purposes, or to provide a checkpoint/assertion to the IR consumer about
+  /// what you expect the schema to be at this point. Always requiring it would
+  /// increase the on-wire size of a Table

Review comment:
       I'm skeptical that we should be concerned about the on-wire size of `TableExpr` here as opposed to getting the semantics we want.







[GitHub] [arrow] pitrou commented on pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#issuecomment-923095463


   Should this be closed in favor of #10934?





[GitHub] [arrow] cpcloud commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
cpcloud commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r685523322



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence among query engines is the
+/// names and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  /// The name of the referenced table.
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips sign of numeric expression
+table Negate {
+  input:ArrayExpr (required);
+}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {
+  order_by:[SortKey];
+
+  partition_by:[ArrayExpr];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to put the function output type when using in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr:FunctionDescr (required);
+  inputs:[ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options:[NamedLiteral];
+
+  /// Optional window expression for window functions only.
+  ///
+  /// TODO: Decide if window functions should be specified in a different way.
+  window:WindowFrame;
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition:ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then:ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else:ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input:ArrayExpr (required);
+  in_exprs:[ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated:bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input:ArrayExpr (required);
+  left_bound:ArrayExpr (required);
+  right_bound:ArrayExpr (required);
+}
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+/// An array expression: an array operation together with an optional name and
+/// expected output type.
+///
+/// An expression yielding a scalar value can be broadcast to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op:ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name:string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type:Type;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producer expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {
+  /// The unique name to identify the data source.
+  name:string (required);
+
+  /// The schema of the data source. This may be a partial schema (ignoring
+  /// unused fields), but it at least asserts the fields and types that are
+  /// expected to exist in the data source.
+  schema:Schema (required);
+
+  /// Optional opaque table serialization data, for passing engine-specific
+  /// instructions to enable the data to be accessed.
+  serde_type:string;
+  serde_data:[ubyte];
+}
+
+/// An auxiliary helper instruction to "include all columns" in the projection,
+/// sparing the IR producer the need to enumerate all column references in a
+/// projection.
+table StarSelection {
+  ref:TableReference;
+}
+
+/// A helper union to permit the "SELECT *, $expr0, ..." construct from SQL.
+union ProjectionExpr {
+  ArrayExpr,
+  StarSelection
+}
+
+/// Computes a new table given a set of column selections or array expressions.
+table Projection {
+  input:TableExpr (required);
+
+  /// Each expression must reference fields found in the input table
+  /// expression.
+  exprs:[ProjectionExpr] (required);
+}
+
+/// Select rows from table for given boolean condition.
+table Filter {
+  input:TableExpr (required);
+
+  /// Array expression using input table expression yielding boolean output
+  /// type.
+  condition:ArrayExpr (required);
+}
+
+/// A "group by" table aggregation: data is grouped using the group
+/// expressions, and the aggregate expressions are evaluated within each group.
+table Aggregate {
+  input:TableExpr;
+  aggregate_exprs:[ArrayExpr] (required);
+
+  /// Expressions to use as group keys. If not provided, then the aggregate
+  /// operation yields a table with a single row.
+  group_exprs:[ArrayExpr];
+
+  /// A filter condition for the aggregation which may include aggregate
+  /// functions.
+  having:[ArrayExpr];
+}
+
+/// Select up to the indicated number of rows from the input expression based
+/// on the first-emitted rows when evaluating the input expression. Generally,
+/// no particular order is guaranteed unless combined with a sort expression.
+table Limit {
+  input:TableExpr;
+
+  /// Number of logical rows to select from input.
+  limit:long;
+}
+
+enum RelationalJoinType : int {
+  INNER = 0,
+  LEFT = 1,
+  RIGHT = 2,
+  OUTER = 3,
+  SEMI = 4,
+  ANTI = 5
+}
+
+/// A standard relational / SQL-style equijoin.
+///
+/// Providing no left/right columns produces the cross product of the two
+/// tables.
+table EqualityJoin {
+  // TODO: complete and document
+  type:RelationalJoinType = INNER;
+
+  left:TableExpr (required);
+  right:TableExpr (required);
+
+  /// The columns from the left expression to use when joining
+  left_columns:[ColumnReference] (required);
+
+  /// The columns from the right expression to use when joining. Must have the
+  /// same length as left_columns.
+  ///
+  /// If omitted, the names provided in left_columns must match the names in
+  /// the right expression.
+  right_columns:[ColumnReference];
+}
+
+/// A relational non-equijoin containing expressions which may include
+/// inequality or range conditions.
+table NonEqualityJoin {
+  // TODO: complete and document
+  type:RelationalJoinType = INNER;
+  left:TableExpr (required);
+  right:TableExpr (required);
+  left_exprs:[ArrayExpr];
+  right_exprs:[ArrayExpr];
+}
+
+/// A temporal join type
+///
+/// TODO: complete and document
+table AsOfJoin {
+  left:TableExpr (required);
+  right:TableExpr;
+
+  /// The column in the left expression to use for data ordering to determine
+  /// the "as of" time.
+  left_asof:ColumnReference (required);
+
+  /// The column in the right expression to use for data ordering to determine
+  /// the "as of" time. If the column name is the same as left_asof, may be
+  /// omitted.
+  right_asof:ColumnReference;
+
+  /// TODO: Define means of providing time deltas in this IR.
+  tolerance:Literal;
+
+  /// If true, rows whose as-of values are exactly equal are considered a
+  /// match; otherwise only strictly earlier rows match.
+  allow_equal:bool = true;
+}
+
+/// An extension of as-of join which allows applying an aggregate function to
+/// the data falling within the indicated time interval.
+///
+/// TODO: Define semantics of "identity" window where all elements of window
+/// become a List<T> element in the result.
+table WindowJoin {
+  // TODO
+}
+
+/// An expression to use for sorting. The key expression determines the values
+/// to be used for ordering the table's rows.
+table SortKey {

Review comment:
       I've added a `NullOrdering` type to support this.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r682851923



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// A standard relational / SQL-style equijoin.
+///
+/// Providing no left/right columns produces the cross product of the two
+/// tables.
+table EqualityJoin {
+  // TODO: complete and document
+  type:RelationalJoinType = INNER;
+
+  left:TableExpr (required);
+  right:TableExpr (required);
+
+  /// The columns from the left expression to use when joining
+  left_columns:[ColumnReference] (required);
+
+  /// The columns from the right expression to use when joining. Must have the
+  /// same length as left_columns.
+  ///
+  /// If omitted, the names provided in left_columns must match the names in
+  /// the right expression.
+  right_columns:[ColumnReference];
+}

Review comment:
       There is the `NonEqualityJoin` which allows for arbitrary expressions. This could be collapsed to be just a single expression-based join and leave it to the engine to decide how to execute the join




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org
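As a side note on the semantics the schema's doc comments spell out for IsIn and Between, the stated equivalences can be checked with a small sketch. Plain Python lists stand in for arrays here; the function names are illustrative and not part of the IR.

```python
from functools import reduce

def is_in(input_values, in_values, negated=False):
    # IsIn(input, [v0, v1, ...]) checks membership per element.
    result = [any(x == v for v in in_values) for x in input_values]
    return [not r for r in result] if negated else result

def is_in_desugared(input_values, in_values):
    # The Or(Or(Eq(input, v0), Eq(input, v1)), ...) expansion
    # from the IsIn doc comment.
    return [reduce(lambda acc, v: acc or (x == v), in_values, False)
            for x in input_values]

def between(input_values, left_bound, right_bound):
    # input BETWEEN lo AND hi == input >= lo AND input <= hi
    return [left_bound <= x <= right_bound for x in input_values]
```

Both operators are thus pure conveniences over existing comparison and boolean operations, which is why an engine may lower them rather than implement them natively.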



[GitHub] [arrow] emkornfield commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r681384803



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.

Review comment:
       nit: is it worth mentioning endianness?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org
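The endianness question above is easy to illustrate: the same DOUBLE literal yields different PrimitiveLiteralData bytes depending on byte order, so the spec would need to fix one (or record it alongside the data). A quick sketch:

```python
import struct

value = 1.0

# Little-endian and big-endian encodings of the same 8-byte double.
le_bytes = struct.pack("<d", value)
be_bytes = struct.pack(">d", value)

# Both are valid 8-byte payloads, but they are not interchangeable:
# a consumer must know which byte order the producer used.
```

Arrow IPC resolves the analogous problem by declaring endianness in the Schema message, which could be one model for the IR as well.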



[GitHub] [arrow] jagill commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
jagill commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r702012964



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,510 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {

Review comment:
       How would we model the `VALUES` table-valued literal, such as in
   ```
   SELECT * FROM (
     VALUES
     (1, 'abc'),
     (2, 'def')
   ) AS t(x, y)
   ```
   ?
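Purely to illustrate the question: one conceivable encoding of that `VALUES` literal under the proposed types would be a ListLiteralData of per-row StructLiteralData. The plain-dict sketch below is hypothetical (dicts stand in for the Flatbuffers tables; field names/types for `t(x, y)` would live in an accompanying Type/Schema, which the current draft has no table-literal slot for).

```python
import struct

def int32_data(v: int) -> dict:
    # PrimitiveLiteralData: 4-byte little-endian payload for an Int32 value
    return {"PrimitiveLiteralData": {"data": struct.pack("<i", v)}}

def utf8_data(s: str) -> dict:
    # PrimitiveLiteralData: raw UTF-8 bytes for a Utf8 value
    return {"PrimitiveLiteralData": {"data": s.encode("utf-8")}}

# VALUES (1, 'abc'), (2, 'def') AS t(x, y):
# each row becomes a StructLiteralData over (x, y); the table body is a
# ListLiteralData of those row structs.
values_literal = {
    "ListLiteralData": {
        "data": [
            {"StructLiteralData": {"data": [int32_data(1), utf8_data("abc")]}},
            {"StructLiteralData": {"data": [int32_data(2), utf8_data("def")]}},
        ]
    }
}

rows = values_literal["ListLiteralData"]["data"]
assert len(rows) == 2
```

Whether this belongs in LiteralData or in a dedicated table-valued node is exactly the open question above.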







[GitHub] [arrow] cpcloud commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
cpcloud commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r692171382



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,510 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data: [ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code: int;  // required
+
+  value: LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type: Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data: LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type: Type (required);
+  data: [LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name: string;
+  value: Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name: string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name: string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table: TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD,
+  SUBTRACT,
+  MULTIPLY,
+  DIVIDE,
+  EQUAL,
+  NOT_EQUAL,
+  LESS,
+  LESS_EQUAL,
+  GREATER,
+  GREATER_EQUAL,
+  AND,
+  OR,
+  XOR
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type: BinaryOpType;
+  left: ArrayExpr (required);
+  right: ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR,
+  AGGREGATE,
+  WINDOW,
+  TABLE
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name: string (required);
+
+  type: FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data: [ubyte];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to put the function output type when using in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr: FunctionDescr (required);
+  inputs: [ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options: [NamedLiteral];
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition: ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then: ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else: ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input: ArrayExpr (required);
+  in_exprs: [ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated: bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input: ArrayExpr (required);
+  left_bound: ArrayExpr (required);
+  right_bound: ArrayExpr (required);
+}
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+/// An expression yielding a scalar value can be broadcasted to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op: ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name: string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type: Type;
+
+  window: Frame;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producer expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {

Review comment:
       An optimizer can operate directly on the IR, but there's no requirement that any given IR be optimized at any point.







[GitHub] [arrow] alamb commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r682831365



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {

Review comment:
       It is also possible to model things like `IS NOT NULL` as a unary function rather than a unique node in the IR, if we want to reduce the number of types in this tree.
   
   For example, you could model `IS NOT NULL <column>` like `ArrayFunction(descr="IsNull", inputs=[column])`, which is perhaps what @pitrou is suggesting in https://github.com/apache/arrow/pull/10856/files#r681583577
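To make the trade-off concrete, here is a rough sketch of the two encodings of `x IS NOT NULL`, with plain Python dicts standing in for the Flatbuffers tables. The canonical function name "is_not_null" is an assumption; the draft schema does not yet define a name registry.

```python
# Encoding 1: a dedicated IR node (IsNotNull) wrapping a column reference.
as_node = {
    "IsNotNull": {
        "input": {"ColumnReference": {"name": "x"}},
    }
}

# Encoding 2: the same predicate as a generic unary ArrayFunction call,
# which keeps the set of IR node types smaller at the cost of relying on
# a canonical function-name registry ("is_not_null" is an assumed name).
as_function = {
    "ArrayFunction": {
        "descr": {"name": "is_not_null", "type": "SCALAR"},
        "inputs": [{"ColumnReference": {"name": "x"}}],
    }
}

# Both encodings reference the same input column; a consumer could
# normalize one form into the other.
assert (
    as_node["IsNotNull"]["input"]
    == as_function["ArrayFunction"]["inputs"][0]
)
```

The function-call form pushes the standardization burden onto the shared name list rather than the schema itself.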
   







[GitHub] [arrow] cpcloud commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
cpcloud commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r692175323



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,510 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data: [ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data: [LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code: int;  // required
+
+  value: LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type: Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data: LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type: Type (required);
+  data: [LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name: string;
+  value: Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name: string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name: string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table: TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input: ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD,
+  SUBTRACT,
+  MULTIPLY,
+  DIVIDE,
+  EQUAL,
+  NOT_EQUAL,
+  LESS,
+  LESS_EQUAL,
+  GREATER,
+  GREATER_EQUAL,
+  AND,
+  OR,
+  XOR
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type: BinaryOpType;
+  left: ArrayExpr (required);
+  right: ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR,
+  AGGREGATE,
+  WINDOW,
+  TABLE
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name: string (required);
+
+  type: FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data: [ubyte];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to put the function output type when using in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr: FunctionDescr (required);
+  inputs: [ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options: [NamedLiteral];
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition: ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then: ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else: ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input: ArrayExpr (required);
+  in_exprs: [ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated: bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input: ArrayExpr (required);
+  left_bound: ArrayExpr (required);
+  right_bound: ArrayExpr (required);
+}
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+/// An expression yielding a scalar value can be broadcasted to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op: ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name: string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type: Type;
+
+  window: Frame;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producer expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {
+  /// The unique name to identify the data source.
+  name: string (required);
+
+  /// The schema of the data source. This may be a partial schema (ignoring
+  /// unused fields), but it at least asserts the fields and types that are
+  /// expected to exist in the data source.
+  schema: Schema (required);
+
+  /// Optional opaque table serialization data, for passing engine-specific
+  /// instructions to enable the data to be accessed.
+  serde_type: string;
+  serde_data: [ubyte];
+}
+
+enum FrameClause : uint8 {
+  ROWS,
+  RANGE
+}
+
+// Unbounded and CurrentRow are empty tables used as empty variants
+// in the Bound union.
+table Unbounded {}
+table CurrentRow {}
+
+// `Bound` represents the window bound computation in a window function like
+// `sum(x) over (unbounded preceding and current row)`.
+union Bound {
+  ArrayExpr,
+  Unbounded,
+  CurrentRow
+}
+
+// `Frame` models a window frame clause, capturing the kind of clause
+// (ROWS/RANGE), how to partition the window, how to order within partitions, and
+// the bounds of the window.
+table Frame {
+  clause: FrameClause;
+  partition_by: [ArrayExpr];
+  order_by: [SortKey];
+  preceding: Bound (required);
+  following: Bound (required);
+}
+
+/// Computes a new table given a set of column selections or array expressions.
+table Project {
+  input: TableExpr (required);
+
+  /// Each expression must reference fields found in the input table
+  /// expression.
+  exprs: [ArrayExpr] (required);
+}
+
+/// Select rows from table for given boolean condition.
+table Filter {
+  input: TableExpr (required);
+
+  /// Array expression using input table expression yielding boolean output
+  /// type.
+  condition: ArrayExpr (required);
+}
+
+/// A "group by" table aggregation: data is grouped using the group
+/// expressions, and the aggregate expressions are evaluated within each group.
+table Aggregate {
+  input: TableExpr;
+  aggregate_exprs: [ArrayExpr] (required);

Review comment:
       Ah, right. It looks like these are properties of the given invocation. Can they be combined (even if not always meaningfully)? E.g., `AGG(DISTINCT col1 ORDER BY col2) FILTER(col2 = 3)`.
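   One hedged way the draft could carry such combined modifiers is through the existing `options: [NamedLiteral]` field on `ArrayFunction`. The sketch below uses plain dicts as stand-ins for the flatbuffer tables, and the option keys (`"distinct"`, `"order_by"`, `"filter"`) are hypothetical names not defined anywhere in the draft:
   
   ```python
   # Sketch of AGG(DISTINCT col1 ORDER BY col2) FILTER (WHERE col2 = 3)
   # expressed as a single ArrayFunction-like invocation whose per-call
   # modifiers ride along in the options list.
   agg_invocation = {
       "descr": {"name": "agg", "type": "AGGREGATE"},
       "inputs": ["col1"],
       "options": [
           {"name": "distinct", "value": True},
           {"name": "order_by", "value": ["col2"]},
           {"name": "filter", "value": "col2 = 3"},
       ],
   }
   
   # All three modifiers coexist on one invocation
   option_names = [opt["name"] for opt in agg_invocation["options"]]
   ```
   
   Whether modifiers like these deserve first-class fields rather than stringly-typed options is exactly the kind of question the RFC discussion is meant to settle.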







[GitHub] [arrow] emkornfield commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r681387739



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data:[ubyte];

Review comment:
       this might be undermodeled. At least for comparison, it might be worth seeing how ZetaSQL models some of these concepts:
   
   https://github.com/google/zetasql/blob/master/zetasql/public/builtin_function.proto
   and
   https://github.com/google/zetasql/blob/master/zetasql/public/function.proto







[GitHub] [arrow] julianhyde commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

Posted by GitBox <gi...@apache.org>.
julianhyde commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r683050000



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence in query engines is the names
+/// and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {
+  order_by:[SortKey];
+
+  partition_by:[ArrayExpr];
+}
+
+/// A general array function call, which may be built-in or user-defined.
+///
+/// It is recommended to put the function output type when using in an
+/// ArrayExpr. It is acceptable to omit the type if it is the same as all the
+/// inputs (for example, in the case of math functions when double input yields
+/// double output).
+table ArrayFunction {
+  descr:FunctionDescr (required);
+  inputs:[ArrayExpr] (required);
+
+  /// Optional non-data inputs for function invocation.
+  ///
+  /// It is recommended to limit use of options for functions that are expected
+  /// to be built-in in a generic IR consumer.
+  options:[NamedLiteral];
+
+  /// Optional window expression for window functions only.
+  ///
+  /// TODO: Decide if window functions should be specified in a different way.
+  window:WindowFrame;
+}
+
+/// Conditional if-then-else operation, selecting values from the then- or
+/// else-branch based on the provided boolean condition.
+///
+/// If the "then" and "else" expressions have different output types, it's
+/// recommended to indicate the promoted output type in an ArrayExpr when using
+/// this operator.
+table IfElse {
+  /// Boolean output type
+  condition:ArrayExpr (required);
+
+  /// Values to use when the condition is true
+  then:ArrayExpr (required);
+
+  /// Values to use when the condition is false
+  else:ArrayExpr (required);
+}
+
+/// Operation for expressing multiple equality checks with an expression.
+///
+/// IsIn(input, [value0, value1, ...])
+/// is the same as Or(Or(Eq(input, value0), Eq(input, value1)), ...)
+table IsIn {
+  input:ArrayExpr (required);
+  in_exprs:[ArrayExpr] (required);
+
+  /// If true, check whether values are not equal to any of the provided
+  /// expressions.
+  negated:bool = false;
+}
+
+/// Boolean operation checking whether input is bounded by the left and right
+/// expressions. Convenience for specifying the compound predicate manually.
+///
+/// input BETWEEN left_bound AND right_bound
+/// is the same as
+/// input >= left_bound AND input <= right_bound
+table Between {
+  input:ArrayExpr (required);
+  left_bound:ArrayExpr (required);
+  right_bound:ArrayExpr (required);
+}
+
+union ArrayOperation {
+  ColumnReference,
+  Literal,
+  BinaryOp,
+  ArrayFunction,
+  IfElse,
+  IsIn,
+  Between
+}
+
+/// An expression yielding a scalar value can be broadcasted to array shape as
+/// needed depending on use.
+table ArrayExpr {
+  op:ArrayOperation (required);
+
+  /// Optional name for array operation. If not provided, may be inferred from
+  /// antecedent inputs. IR producers are recommended to provide names to avoid
+  /// ambiguity.
+  name:string;
+
+  /// Expected output type of the array operation. While optional, IR producers
+  /// are encouraged to populate this field for the benefit of IR consumers.
+  out_type:Type;
+}
+
+/// ----------------------------------------------------------------------
+/// Table operations and TableExpr, which is a table operation plus an optional
+/// name and indicative output schema, and potentially other metadata.
+
+/// A named table which the IR producer expects the IR consumer to be able to
+/// access. A "table" in this context is anything that can produce
+/// Arrow-formatted data with the given schema. There is no notion of the
+/// physical layout of the data or its segmentation into multiple Arrow record
+/// batches.
+table ExternalTable {
+  /// The unique name to identify the data source.
+  name:string (required);
+
+  /// The schema of the data source. This may be a partial schema (ignoring
+  /// unused fields), but it at least asserts the fields and types that are
+  /// expected to exist in the data source.
+  schema:Schema (required);
+
+  /// Optional opaque table serialization data, for passing engine-specific
+  /// instructions to enable the data to be accessed.
+  serde_type:string;
+  serde_data:[ubyte];
+}
+
+/// An auxiliary helper instruction to "include all columns" in the projection,
+/// sparing the IR producer the need to enumerate all column references in a
+/// projection.
+table StarSelection {

Review comment:
       Let's not use the word 'selection'. It will only confuse. In academic relational algebra 'select' means 'Filter'.



