Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/08/03 14:56:05 UTC

[GitHub] [arrow] wmalpica commented on a change in pull request #10856: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

wmalpica commented on a change in pull request #10856:
URL: https://github.com/apache/arrow/pull/10856#discussion_r681829802



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence among query engines is the
+/// names and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation checks if values are not null
+table IsNotNull {
+  input:ArrayExpr (required);
+}
+
+/// Operation flips true/false values in boolean expression
+table Not {}
+
+/// Operation flips sign of numeric expression
+table Negate {}
+
+/// Built-in binary operations. Other binary operations can be implemented
+/// using ArrayFunction/FunctionDescr
+enum BinaryOpType : int {
+  ADD = 0,
+  SUBTRACT = 1,
+  MULTIPLY = 2,
+  DIVIDE = 3,
+  EQUAL = 4,
+  NOT_EQUAL = 5,
+  LESS = 6,
+  LESS_EQUAL = 7,
+  GREATER = 8,
+  GREATER_EQUAL = 9,
+  AND = 10,
+  OR = 11,
+  XOR = 12
+}
+
+/// Built-in binary operation
+table BinaryOp {
+  type:BinaryOpType;
+  left:ArrayExpr (required);
+  right:ArrayExpr (required);
+}
+
+enum FunctionType : int {
+  SCALAR = 0,
+  AGGREGATE = 1,
+  WINDOW = 2,
+  TABLE = 3
+}
+
+/// A general-purpose descriptor for a built-in or user-defined
+/// function. Producers of the IR are encouraged to reuse FunctionDescr objects
+/// (by reusing the Flatbuffers offset) when a particular function appears
+/// multiple times in an expression. Arguments to a particular function call
+/// are supplied in ArrayFunction.
+table FunctionDescr {
+  /// Function name from list of available function names. Built-in functions
+  /// are expected to be chosen from a list of "canonical" or "unambiguous"
+  /// function names to provide a measure of normalization across backends that
+  /// implement this Compute IR.
+  ///
+  /// The name may refer to a user-defined function which has been registered
+  /// with the target engine. User-defined function data can also be passed
+  /// with the "data" member.
+  name:string (required);
+
+  type:FunctionType = SCALAR;
+
+  /// Optional arbitrary sidecar data (such as a serialized user-defined
+  /// function).
+  data:[ubyte];
+}
+
+/// Auxiliary data structure providing parameters for a window function
+/// expression, as in the SQL OVER expression or in time series databases.
+///
+/// TODO: Finish this data type
+table WindowFrame {

Review comment:
       For the WindowFrame to fully support SQL window functions, we would also need to be able to specify:
   whether the window frame is ROWS based or RANGE based, and the start and end of the window frame, where either bound can also be unbounded. https://www.sqltutorial.org/sql-window-functions/
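   As a rough illustration only (none of these names are part of the proposal), a frame specification along those lines could be sketched in the same FlatBuffers style as:

```flatbuffers
/// Whether frame bounds are counted in physical rows or in ordering values.
enum FrameMode : int {
  ROWS = 0,
  RANGE = 1
}

enum FrameBoundType : int {
  UNBOUNDED = 0,
  PRECEDING = 1,
  CURRENT_ROW = 2,
  FOLLOWING = 3
}

/// One endpoint of a window frame. The offset is only meaningful for
/// PRECEDING / FOLLOWING bounds.
table FrameBound {
  type:FrameBoundType;
  offset:Literal;
}

table WindowFrame {
  mode:FrameMode = ROWS;
  start:FrameBound (required);
  end:FrameBound (required);
}
```

   This is a hypothetical sketch of the shape of the data, not a concrete counter-proposal; the exact naming and defaults would need discussion.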

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,521 @@
+/// Licensed to the Apache Software Foundation (ASF) under one
+/// or more contributor license agreements.  See the NOTICE file
+/// distributed with this work for additional information
+/// regarding copyright ownership.  The ASF licenses this file
+/// to you under the Apache License, Version 2.0 (the
+/// "License"); you may not use this file except in compliance
+/// with the License.  You may obtain a copy of the License at
+///
+///   http://www.apache.org/licenses/LICENSE-2.0
+///
+/// Unless required by applicable law or agreed to in writing,
+/// software distributed under the License is distributed on an
+/// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+/// KIND, either express or implied.  See the License for the
+/// specific language governing permissions and limitations
+/// under the License.
+
+/// Arrow Compute IR (Intermediate Representation)
+///
+/// The purpose of these data structures is to provide a language- and compute
+/// engine-agnostic representation of common analytical operations on Arrow
+/// data. This may include so-called "logical query plans" generated by SQL
+/// systems, but it can be used to serialize different types of expression or
+/// query fragments for various purposes. For example, a system could use this
+/// to serialize array expressions for transmitting filters/predicates.
+///
+/// The three main types of data objects dealt with in this IR are:
+///
+/// * Table: a data source having an Arrow schema, resolvable algebraically to
+///   a collection of Arrow record batches
+/// * Array: logically, a field in a Table
+/// * Scalar: a single value, which is broadcastable to Array as needed
+///
+/// This IR specifically does not provide for query planning or physical
+/// execution details. It also aims to be as comprehensive as possible in
+/// capturing compute operations expressible in different query engines or data
+/// frame libraries. Engines are not expected to implement everything here.
+///
+/// One of the most common areas of divergence among query engines is the
+/// names and semantics of functions that operate on scalar or array
+/// inputs. Efforts to standardize function names and their expected semantics
+/// will happen outside of the serialized IR format defined here.
+
+// We use the IPC Schema types to represent data types
+include "Schema.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// ----------------------------------------------------------------------
+/// Data serialization for literal (constant / scalar) values. This assumes
+/// that the consumer has basic knowledge of the Arrow format and data types
+/// such that the binary scalar data that is encoded here can be unpacked into
+/// an appropriate literal value object. For example, if the Type for a Literal
+/// is FloatingPoint with Precision::DOUBLE, then we would expect to have a
+/// PrimitiveLiteralData with an 8-byte value.
+
+/// Serialized data which, given a data type, can be unpacked into a scalar
+/// value data structure.
+///
+/// NB(wesm): This is simpler from a Flatbuffers perspective than having a
+/// separate data type for each Arrow type. Alternative proposals welcome.
+union LiteralData {
+  NullLiteralData,
+  PrimitiveLiteralData,
+  ListLiteralData,
+  StructLiteralData,
+  UnionLiteralData
+}
+
+/// Placeholder for any null value, whether with Null type or a different
+/// non-Null type.
+table NullLiteralData {}
+
+/// For all data types represented as fixed-size-binary value (numeric and
+/// binary/string types included). Boolean values are to be represented as a
+/// single byte with value 1 (true) or 0 (false).
+table PrimitiveLiteralData {
+  data:[ubyte] (required);
+}
+
+/// For List, LargeList, and FixedSizeList.
+table ListLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Struct
+table StructLiteralData {
+  data:[LiteralData] (required);
+}
+
+/// For Union
+table UnionLiteralData {
+  /// The type code (referencing the Union type) needed to reconstruct the
+  /// correct literal value.
+  type_code:int;  // required
+
+  value:LiteralData (required);
+}
+
+/// Literal serializes a scalar (constant) value in an array expression.
+table Literal {
+  type:Type (required);
+
+  /// The data needed to reconstruct the literal value.
+  data:LiteralData (required);
+}
+
+/// A sequence of literal values all having the same type.
+table LiteralVector {
+  type:Type (required);
+  data:[LiteralData] (required);
+}
+
+/// A name (key) and literal value, to use for map-like options fields.
+table NamedLiteral {
+  name:string;
+  value:Literal;
+}
+
+/// ----------------------------------------------------------------------
+/// One-dimensional operations (array/scalar input and output) and ArrayExpr,
+/// which is an operation plus a name and output type.
+
+/// A reference to an antecedent table schema in an expression tree
+table TableReference {
+  ///
+  name:string (required);
+}
+
+/// A reference to an antecedent column from a table schema in an expression
+/// tree.
+table ColumnReference {
+  name:string (required);
+
+  /// Optional reference to antecedent table in tree. Required when there is
+  /// referential ambiguity.
+  table:TableReference;
+}
+
+/// Operation checks if values are null
+table IsNull {

Review comment:
       We probably want a `UnaryOp` and `UnaryOpType`, the way you have `BinaryOp` below, and have `IsNull`, `IsNotNull`, `Not`, and `Negate` be `UnaryOpType`s. We will need many more of these, for example string functions such as `UpperCase` and `LowerCase`, to name just two.
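   A sketch of that shape, mirroring the `BinaryOp`/`BinaryOpType` pair from the proposal (the specific members and type codes here are illustrative, not a fixed list):

```flatbuffers
/// Built-in unary operations. Other unary operations can be implemented
/// using ArrayFunction/FunctionDescr.
enum UnaryOpType : int {
  IS_NULL = 0,
  IS_NOT_NULL = 1,
  NOT = 2,
  NEGATE = 3,
  UPPER_CASE = 4,
  LOWER_CASE = 5
}

/// Built-in unary operation
table UnaryOp {
  type:UnaryOpType;
  input:ArrayExpr (required);
}
```

   This would also give `Not` and `Negate` an explicit `input`, which the standalone empty tables currently lack.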




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org