Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/08/13 22:15:18 UTC

[GitHub] [arrow] kkraus14 commented on a change in pull request #10934: [RFC] Arrow Compute Serialized Intermediate Representation draft for discussion

kkraus14 commented on a change in pull request #10934:
URL: https://github.com/apache/arrow/pull/10934#discussion_r688802634



##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,210 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+include "Schema.fbs";
+include "Message.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// Wrapper for blobs of arbitrary bytes
+table Blob {
+  bytes: [ubyte];
+}
+
+/// An expression is one of
+/// - a Literal datum
+/// - a reference to a Field from a Relation
+/// - a call to a named function
+/// On evaluation, an Expression will have either array or scalar shape.
+union Expression {
+  Literal, FieldRef, Call
+}
+
+union Shape {
+  ArrayShape, ScalarShape
+}
+
+table ScalarShape {}
+
+table ArrayShape {
+  /// Number of slots.
+  length: long;
+}
+
+table Literal {
+  /// Shape of this literal.
+  shape: Shape (required);
+
+  /// The type of this literal.
+  type: Type (required);
+
+  /// Buffers containing `length` elements of arrow-formatted data.
+  /// If `length` is absent (this Literal is scalar), these buffers
+  /// are sized to accommodate a single element of arrow-formatted data.
+  /// XXX this can be optimized for trivial scalars later
+  buffers: [Buffer];
+}
+
+table FieldRef {
+  /// A sequence of field names to allow referencing potentially nested fields
+  path: [string];
+
+  /// For Expressions which might reference fields in multiple Relations,
+  /// this index may be provided to indicate which Relation's fields
+  /// `path` points into. For example in the case of a join,
+  /// 0 refers to the left relation and 1 to the right relation.
+  relation_index: int;
+
+  /// The type of data in the referenced Field.
+  type: Type;
+}
+
+table Call {
+  /// The name of the function whose invocation this Call represents.
+  function_name: string (required);
+
+  /// Parameters for `function_name`; content/format may be unique to each
+  /// value of `function_name`.
+  options: Blob;
+
+  /// The arguments passed to `function_name`.
+  arguments: [Expression] (required);
+
+  /// The type of data which invoking `function_name` will return.
+  type: Type;
+}
+
+/// A relation is a set of rows with consitent schema.
+table Relation {
+  /// The namespaced name of this Relation.
+  ///
+  /// Names with no namespace are reserved for pure relational
+  /// algebraic operations, which currently include:
+  ///   "filter"
+  ///   "project"
+  ///   "aggregate"
+  ///   "join"
+  ///   "order_by"
+  ///   "limit"
+  ///   "literal"
+  ///   "interactive_output"
+  relation_name: string (required);
+
+  /// Parameters for `relation_name`; content/format may be unique to each
+  /// value of `relation_name`.
+  options: Blob;
+
+  /// The arguments passed to `relation_name`.
+  arguments: [Relation] (required);
+
+  /// The schema of rows in this Relation
+  schema: Schema;
+}
+
+/// The contents of Relation.options will be FilterOptions
+/// if Relation.name = "filter"
+table FilterOptions {
+  /// The expression which will be evaluated against input rows
+  /// to determine whether they should be excluded from the
+  /// "filter" relation's output.
+  filter_expression: Expression (required);
+}
+
+/// The contents of Relation.options will be ProjectOptions
+/// if Relation.name = "project"
+table ProjectOptions {
+  /// Expressions which will be evaluated to produce to
+  /// the rows of the "project" relation's output.
+  expressions: [Expression] (required);
+}
+
+/// The contents of Relation.options will be AggregateOptions
+/// if Relation.name = "aggregate"
+table AggregateOptions {
+  /// Expressions which will be evaluated to produce to
+  /// the rows of the "aggregate" relation's output.
+  aggregations: [Expression] (required);
+  /// Keys by which `aggregations` will be grouped.
+  keys: [Expression];
+}
+
+/// The contents of Relation.options will be JoinOptions
+/// if Relation.name = "join"
+table JoinOptions {
+  /// The expression which will be evaluated against rows from each
+  /// input to determine whether they should be included in the
+  /// "join" relation's output.
+  on_expression: Expression (required);
+  join_kind: string;
+}
+
+/// Whether lesser values should precede greater or vice versa.
+enum Ordering : uint8 {
+  ASCENDING,
+  DESCENDING,
+}
+
+/// Whether nulls should precede or follow other values.
+enum NullOrdering : uint8 {
+  FIRST,
+  LAST
+}

Review comment:
       Maybe we should clarify whether this applies irrespective of `Ordering`?
   
   If it doesn't, then I'd suggest renaming the values to `GREATEST` and `LEAST`.
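
To make the two readings concrete, here is a minimal Python sketch. It is not part of the PR; the function names `sort_positional` and `sort_relative` and their flags are hypothetical and only illustrate the semantics in question. Under the positional reading, `FIRST`/`LAST` name an end of the output regardless of `Ordering`. Under the relative reading, nulls behave like the least (or greatest) value, so `DESCENDING` flips where they land.

```python
def sort_positional(values, descending=False, nulls_first=False):
    """Nulls go to the requested end of the output, whatever the direction."""
    non_null = sorted((v for v in values if v is not None), reverse=descending)
    nulls = [v for v in values if v is None]
    return nulls + non_null if nulls_first else non_null + nulls


def sort_relative(values, descending=False, nulls_least=False):
    """Nulls are ranked as the least (or greatest) value before sorting."""
    non_null = sorted((v for v in values if v is not None), reverse=descending)
    nulls = [v for v in values if v is None]
    # Least values come first when ascending and last when descending.
    nulls_at_front = nulls_least != descending
    return nulls + non_null if nulls_at_front else non_null + nulls


data = [3, None, 1]
print(sort_positional(data, descending=True, nulls_first=True))
print(sort_relative(data, descending=True, nulls_least=True))
```

With a descending sort the two readings place the null at opposite ends, which is why the naming matters.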

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,210 @@
+
+include "Schema.fbs";
+include "Message.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// Wrapper for blobs of arbitrary bytes
+table Blob {
+  bytes: [ubyte];
+}
+
+/// An expression is one of
+/// - a Literal datum
+/// - a reference to a Field from a Relation
+/// - a call to a named function
+/// On evaluation, an Expression will have either array or scalar shape.
+union Expression {
+  Literal, FieldRef, Call
+}
+
+union Shape {
+  ArrayShape, ScalarShape
+}
+
+table ScalarShape {}
+
+table ArrayShape {
+  /// Number of slots.
+  length: long;
+}
+
+table Literal {
+  /// Shape of this literal.
+  shape: Shape (required);
+
+  /// The type of this literal.
+  type: Type (required);
+
+  /// Buffers containing `length` elements of arrow-formatted data.
+  /// If `length` is absent (this Literal is scalar), these buffers
+  /// are sized to accommodate a single element of arrow-formatted data.
+  /// XXX this can be optimized for trivial scalars later
+  buffers: [Buffer];
+}
+
+table FieldRef {
+  /// A sequence of field names to allow referencing potentially nested fields
+  path: [string];
+
+  /// For Expressions which might reference fields in multiple Relations,
+  /// this index may be provided to indicate which Relation's fields
+  /// `path` points into. For example in the case of a join,
+  /// 0 refers to the left relation and 1 to the right relation.
+  relation_index: int;
+
+  /// The type of data in the referenced Field.
+  type: Type;
+}
+
+table Call {
+  /// The name of the function whose invocation this Call represents.
+  function_name: string (required);
+
+  /// Parameters for `function_name`; content/format may be unique to each
+  /// value of `function_name`.
+  options: Blob;
+
+  /// The arguments passed to `function_name`.
+  arguments: [Expression] (required);
+
+  /// The type of data which invoking `function_name` will return.
+  type: Type;
+}
+
+/// A relation is a set of rows with consitent schema.

Review comment:
       ```suggestion
   /// A relation is a set of rows with consistent schema.
   ```

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,210 @@
+
+include "Schema.fbs";
+include "Message.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// Wrapper for blobs of arbitrary bytes
+table Blob {
+  bytes: [ubyte];
+}
+
+/// An expression is one of
+/// - a Literal datum
+/// - a reference to a Field from a Relation
+/// - a call to a named function
+/// On evaluation, an Expression will have either array or scalar shape.
+union Expression {
+  Literal, FieldRef, Call
+}
+
+union Shape {
+  ArrayShape, ScalarShape
+}
+
+table ScalarShape {}
+
+table ArrayShape {
+  /// Number of slots.
+  length: long;
+}
+
+table Literal {
+  /// Shape of this literal.
+  shape: Shape (required);
+
+  /// The type of this literal.
+  type: Type (required);
+
+  /// Buffers containing `length` elements of arrow-formatted data.
+  /// If `length` is absent (this Literal is scalar), these buffers
+  /// are sized to accommodate a single element of arrow-formatted data.
+  /// XXX this can be optimized for trivial scalars later
+  buffers: [Buffer];
+}
+
+table FieldRef {
+  /// A sequence of field names to allow referencing potentially nested fields
+  path: [string];
+
+  /// For Expressions which might reference fields in multiple Relations,
+  /// this index may be provided to indicate which Relation's fields
+  /// `path` points into. For example in the case of a join,
+  /// 0 refers to the left relation and 1 to the right relation.
+  relation_index: int;
+
+  /// The type of data in the referenced Field.
+  type: Type;
+}
+
+table Call {
+  /// The name of the function whose invocation this Call represents.
+  function_name: string (required);
+
+  /// Parameters for `function_name`; content/format may be unique to each
+  /// value of `function_name`.
+  options: Blob;
+
+  /// The arguments passed to `function_name`.
+  arguments: [Expression] (required);
+
+  /// The type of data which invoking `function_name` will return.
+  type: Type;
+}
+
+/// A relation is a set of rows with consitent schema.
+table Relation {
+  /// The namespaced name of this Relation.
+  ///
+  /// Names with no namespace are reserved for pure relational
+  /// algebraic operations, which currently include:
+  ///   "filter"
+  ///   "project"
+  ///   "aggregate"
+  ///   "join"
+  ///   "order_by"
+  ///   "limit"
+  ///   "literal"
+  ///   "interactive_output"
+  relation_name: string (required);
+
+  /// Parameters for `relation_name`; content/format may be unique to each
+  /// value of `relation_name`.
+  options: Blob;
+
+  /// The arguments passed to `relation_name`.
+  arguments: [Relation] (required);
+
+  /// The schema of rows in this Relation
+  schema: Schema;
+}
+
+/// The contents of Relation.options will be FilterOptions
+/// if Relation.name = "filter"
+table FilterOptions {
+  /// The expression which will be evaluated against input rows
+  /// to determine whether they should be excluded from the
+  /// "filter" relation's output.
+  filter_expression: Expression (required);
+}
+
+/// The contents of Relation.options will be ProjectOptions
+/// if Relation.name = "project"
+table ProjectOptions {
+  /// Expressions which will be evaluated to produce to
+  /// the rows of the "project" relation's output.
+  expressions: [Expression] (required);
+}
+
+/// The contents of Relation.options will be AggregateOptions
+/// if Relation.name = "aggregate"
+table AggregateOptions {
+  /// Expressions which will be evaluated to produce to
+  /// the rows of the "aggregate" relation's output.
+  aggregations: [Expression] (required);
+  /// Keys by which `aggregations` will be grouped.
+  keys: [Expression];
+}

Review comment:
       Do we need the ability to control `null` and/or `NaN` behavior here? Are they always grouped together?
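
As a rough illustration of the choices involved, the Python sketch below groups values by key under different policies. It is an assumption-laden toy, not the proposed IR or Arrow's behavior: `group_sum`, `merge_nans`, `keep_nulls`, and `NAN_KEY` are hypothetical names.

```python
import math
from collections import defaultdict

NAN_KEY = ("nan",)  # hypothetical sentinel standing in for "the NaN group"


def group_sum(keys, values, merge_nans=True, keep_nulls=True):
    """Sum `values` per key, with explicit policies for null and NaN keys."""
    groups = defaultdict(float)
    for i, (key, value) in enumerate(zip(keys, values)):
        if key is None and not keep_nulls:
            continue  # policy: rows with a null key are dropped
        if isinstance(key, float) and math.isnan(key):
            # policy: either all NaN keys share one group, or each row is its own group
            key = NAN_KEY if merge_nans else ("nan", i)
        groups[key] += value
    return dict(groups)


keys = [1.0, float("nan"), None, float("nan"), None]
values = [1, 2, 3, 4, 5]
print(len(group_sum(keys, values)))
print(len(group_sum(keys, values, merge_nans=False)))
print(len(group_sum(keys, values, keep_nulls=False)))
```

The same input yields a different number of groups under each policy, so a consumer of the IR would need to know which one the producer intended.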

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,210 @@
+
+include "Schema.fbs";
+include "Message.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// Wrapper for blobs of arbitrary bytes
+table Blob {
+  bytes: [ubyte];
+}
+
+/// An expression is one of
+/// - a Literal datum
+/// - a reference to a Field from a Relation
+/// - a call to a named function
+/// On evaluation, an Expression will have either array or scalar shape.
+union Expression {
+  Literal, FieldRef, Call
+}
+
+union Shape {
+  ArrayShape, ScalarShape
+}
+
+table ScalarShape {}
+
+table ArrayShape {
+  /// Number of slots.
+  length: long;
+}
+
+table Literal {
+  /// Shape of this literal.
+  shape: Shape (required);
+
+  /// The type of this literal.
+  type: Type (required);
+
+  /// Buffers containing `length` elements of arrow-formatted data.
+  /// If `length` is absent (this Literal is scalar), these buffers
+  /// are sized to accommodate a single element of arrow-formatted data.
+  /// XXX this can be optimized for trivial scalars later
+  buffers: [Buffer];
+}
+
+table FieldRef {
+  /// A sequence of field names to allow referencing potentially nested fields
+  path: [string];
+
+  /// For Expressions which might reference fields in multiple Relations,
+  /// this index may be provided to indicate which Relation's fields
+  /// `path` points into. For example in the case of a join,
+  /// 0 refers to the left relation and 1 to the right relation.
+  relation_index: int;
+
+  /// The type of data in the referenced Field.
+  type: Type;
+}
+
+table Call {
+  /// The name of the function whose invocation this Call represents.
+  function_name: string (required);
+
+  /// Parameters for `function_name`; content/format may be unique to each
+  /// value of `function_name`.
+  options: Blob;
+
+  /// The arguments passed to `function_name`.
+  arguments: [Expression] (required);
+
+  /// The type of data which invoking `function_name` will return.
+  type: Type;
+}
+
+/// A relation is a set of rows with consitent schema.
+table Relation {
+  /// The namespaced name of this Relation.
+  ///
+  /// Names with no namespace are reserved for pure relational
+  /// algebraic operations, which currently include:
+  ///   "filter"
+  ///   "project"
+  ///   "aggregate"
+  ///   "join"
+  ///   "order_by"
+  ///   "limit"
+  ///   "literal"
+  ///   "interactive_output"
+  relation_name: string (required);
+
+  /// Parameters for `relation_name`; content/format may be unique to each
+  /// value of `relation_name`.
+  options: Blob;
+
+  /// The arguments passed to `relation_name`.
+  arguments: [Relation] (required);
+
+  /// The schema of rows in this Relation
+  schema: Schema;
+}
+
+/// The contents of Relation.options will be FilterOptions
+/// if Relation.name = "filter"
+table FilterOptions {
+  /// The expression which will be evaluated against input rows
+  /// to determine whether they should be excluded from the
+  /// "filter" relation's output.
+  filter_expression: Expression (required);
+}
+
+/// The contents of Relation.options will be ProjectOptions
+/// if Relation.name = "project"
+table ProjectOptions {
+  /// Expressions which will be evaluated to produce to
+  /// the rows of the "project" relation's output.
+  expressions: [Expression] (required);
+}
+
+/// The contents of Relation.options will be AggregateOptions
+/// if Relation.name = "aggregate"
+table AggregateOptions {
+  /// Expressions which will be evaluated to produce to
+  /// the rows of the "aggregate" relation's output.
+  aggregations: [Expression] (required);
+  /// Keys by which `aggregations` will be grouped.
+  keys: [Expression];
+}
+
+/// The contents of Relation.options will be JoinOptions
+/// if Relation.name = "join"
+table JoinOptions {
+  /// The expression which will be evaluated against rows from each
+  /// input to determine whether they should be included in the
+  /// "join" relation's output.
+  on_expression: Expression (required);
+  join_kind: string;
+}

Review comment:
       Do we need the ability to control whether `null` and/or `NaN` values are considered equal here?
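
A small Python sketch of what such a control could mean for an equality join follows. The helper names `keys_match` and `inner_join` and the flags `nulls_equal` and `nans_equal` are hypothetical and are not fields of `JoinOptions`; the nested loop is only for clarity.

```python
import math


def keys_match(left, right, nulls_equal=False, nans_equal=False):
    """Compare two join keys under explicit null and NaN equality policies."""
    if left is None or right is None:
        return nulls_equal and left is None and right is None
    if (isinstance(left, float) and isinstance(right, float)
            and math.isnan(left) and math.isnan(right)):
        return nans_equal
    return left == right


def inner_join(left_keys, right_keys, **flags):
    """Return (left_index, right_index) pairs whose keys match."""
    return [
        (i, j)
        for i, lk in enumerate(left_keys)
        for j, rk in enumerate(right_keys)
        if keys_match(lk, rk, **flags)
    ]


left = [1.0, None, float("nan")]
right = [None, float("nan"), 1.0]
print(inner_join(left, right))
print(inner_join(left, right, nulls_equal=True, nans_equal=True))
```

With both flags off only the `1.0` rows pair up, which matches SQL-style `=` semantics for nulls; turning the flags on adds the null and NaN pairs.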

##########
File path: format/ComputeIR.fbs
##########
@@ -0,0 +1,210 @@
+
+include "Schema.fbs";
+include "Message.fbs";
+
+namespace org.apache.arrow.flatbuf.computeir;
+
+/// Wrapper for blobs of arbitrary bytes
+table Blob {
+  bytes: [ubyte];
+}
+
+/// An expression is one of
+/// - a Literal datum
+/// - a reference to a Field from a Relation
+/// - a call to a named function
+/// On evaluation, an Expression will have either array or scalar shape.
+union Expression {
+  Literal, FieldRef, Call
+}
+
+union Shape {
+  ArrayShape, ScalarShape
+}
+
+table ScalarShape {}
+
+table ArrayShape {
+  /// Number of slots.
+  length: long;
+}
+
+table Literal {
+  /// Shape of this literal.
+  shape: Shape (required);
+
+  /// The type of this literal.
+  type: Type (required);
+
+  /// Buffers containing `length` elements of arrow-formatted data.
+  /// If `length` is absent (this Literal is scalar), these buffers
+  /// are sized to accommodate a single element of arrow-formatted data.
+  /// XXX this can be optimized for trivial scalars later
+  buffers: [Buffer];
+}
+
+table FieldRef {
+  /// A sequence of field names to allow referencing potentially nested fields
+  path: [string];
+
+  /// For Expressions which might reference fields in multiple Relations,
+  /// this index may be provided to indicate which Relation's fields
+  /// `path` points into. For example in the case of a join,
+  /// 0 refers to the left relation and 1 to the right relation.
+  relation_index: int;
+
+  /// The type of data in the referenced Field.
+  type: Type;
+}
+
+table Call {
+  /// The name of the function whose invocation this Call represents.
+  function_name: string (required);
+
+  /// Parameters for `function_name`; content/format may be unique to each
+  /// value of `function_name`.
+  options: Blob;
+
+  /// The arguments passed to `function_name`.
+  arguments: [Expression] (required);
+
+  /// The type of data which invoking `function_name` will return.
+  type: Type;
+}
+
+/// A relation is a set of rows with consitent schema.
+table Relation {
+  /// The namespaced name of this Relation.
+  ///
+  /// Names with no namespace are reserved for pure relational
+  /// algebraic operations, which currently include:
+  ///   "filter"
+  ///   "project"
+  ///   "aggregate"
+  ///   "join"
+  ///   "order_by"
+  ///   "limit"
+  ///   "literal"
+  ///   "interactive_output"
+  relation_name: string (required);
+
+  /// Parameters for `relation_name`; content/format may be unique to each
+  /// value of `relation_name`.
+  options: Blob;
+
+  /// The arguments passed to `relation_name`.
+  arguments: [Relation] (required);
+
+  /// The schema of rows in this Relation
+  schema: Schema;
+}
+
+/// The contents of Relation.options will be FilterOptions
+/// if Relation.name = "filter"
+table FilterOptions {
+  /// The expression which will be evaluated against input rows
+  /// to determine whether they should be excluded from the
+  /// "filter" relation's output.
+  filter_expression: Expression (required);
+}
+
+/// The contents of Relation.options will be ProjectOptions
+/// if Relation.name = "project"
+table ProjectOptions {
+  /// Expressions which will be evaluated to produce to
+  /// the rows of the "project" relation's output.
+  expressions: [Expression] (required);
+}
+
+/// The contents of Relation.options will be AggregateOptions
+/// if Relation.name = "aggregate"
+table AggregateOptions {
+  /// Expressions which will be evaluated to produce to
+  /// the rows of the "aggregate" relation's output.
+  aggregations: [Expression] (required);
+  /// Keys by which `aggregations` will be grouped.
+  keys: [Expression];
+}
+
+/// The contents of Relation.options will be JoinOptions
+/// if Relation.name = "join"
+table JoinOptions {
+  /// The expression which will be evaluated against rows from each
+  /// input to determine whether they should be included in the
+  /// "join" relation's output.
+  on_expression: Expression (required);
+  join_kind: string;
+}
+
+/// Whether lesser values should precede greater or vice versa.
+enum Ordering : uint8 {
+  ASCENDING,
+  DESCENDING,
+}
+
+/// Whether nulls should precede or follow other values.
+enum NullOrdering : uint8 {
+  FIRST,
+  LAST
+}
+
+table SortKey {
+  value: Expression (required);
+  ordering: Ordering = ASCENDING;
+  null_ordering: NullOrdering = LAST;

Review comment:
       Do we need to be able to control `NaN` ordering as well for floating point columns? And `NaN` versus `null` ordering?
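
To show how many distinct placements are possible, here is a hedged Python sketch with three independent knobs. The names `sort_floats`, `nulls_first`, `nans_first`, `nulls_outermost`, and `describe` are hypothetical; nothing here reflects fields that exist in `SortKey`.

```python
import math


def sort_floats(values, nulls_first=False, nans_first=False, nulls_outermost=True):
    """Ascending sort with separate placement policies for nulls and NaNs."""
    nulls = [v for v in values if v is None]
    nans = [v for v in values if v is not None and math.isnan(v)]
    rest = sorted(v for v in values if v is not None and not math.isnan(v))

    # Groups are listed outermost first, i.e. farthest from the regular values.
    groups = [(nulls, nulls_first), (nans, nans_first)]
    if not nulls_outermost:
        groups.reverse()

    head, tail = [], []
    for group, first in groups:
        if first:
            head = head + group  # outermost group ends up at the very front
        else:
            tail = group + tail  # outermost group ends up at the very back
    return head + rest + tail


def describe(values):
    """Render NaN and None as labels so results can be compared and printed."""
    return ["null" if v is None else "nan" if math.isnan(v) else v for v in values]


data = [2.0, float("nan"), None, 1.0]
print(describe(sort_floats(data)))
print(describe(sort_floats(data, nulls_outermost=False)))
print(describe(sort_floats(data, nans_first=True)))
```

Even for a single ascending key, the relative order of NaN and null is a separate decision from where each sits relative to ordinary values.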



