You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/06/02 04:50:02 UTC

[GitHub] [arrow-datafusion] houqp commented on a change in pull request #443: add invariants spec

houqp commented on a change in pull request #443:
URL: https://github.com/apache/arrow-datafusion/pull/443#discussion_r643649837



##########
File path: docs/specification/invariants.md
##########
@@ -0,0 +1,327 @@
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+# DataFusion's Invariants
+
+This document enumerates invariants of DataFusion's logical and physical planes
+(functions, and nodes). Some of these invariants are currently not enforced.
+This document assumes that the reader is familiar with some of the codebase,
+including rust arrow's RecordBatch and Array.
+
+## Rational
+
+DataFusion's computational model is built on top of a dynamically typed arrow
+object, Array, that offers the interface `Array::as_any` to downcast itself to
+its statically typed versions (e.g. `Int32Array`). DataFusion uses
+`Array::data_type` to perform the respective downcasting on its physical
+operations. DataFusion uses a dynamic type system because the queries being
+executed are not always known at compile time: they are only known during the
+runtime (or query time) of programs built with DataFusion. This document is
+built on top of this principle.
+
+In dynamically typed interfaces, it is up to developers to enforce type
+invariances. This document declares some of these invariants, so that users
+know what they can expect from a query in DataFusion, and DataFusion developers
+know what they need to enforce at the coding level.
+
+## Notation
+
+* Field or physical field: the tuple name, `arrow::DataType` and nullability flag (a bool whether values can be null), represented in this document by `PF(name, type, nullable)`
+* Logical field: Field with a relation name. Represented in this document by `LF(relation, name, type, nullable)`
+* Projected plan: plan with projection as the root node.
+* Logical schema: a vector of logical fields, used by logical plan.
+* Physical schema: a vector of physical fields, used by both physical plan and Arrow record batch.
+
+### Logical
+
+#### Function
+
+An object that knows its valid incoming logical fields and how to derive its
+output logical field from its arguments' logical fields. A functions' output
+field is itself a function of its input fields:
+
+```
+logical_field(lf1: LF, lf2: LF, ...) -> LF
+```
+
+Examples:
+
+* `plus(a,b) -> LF(None, "{a} Plus {b}", d(a.type,b.type), a.nullable | b.nullable)` where d is the function mapping input types to output type (`get_supertype` in our current implementation).
+* `length(a) -> LF(None, "length({a})", u32, a.nullable)`
+
+#### Plan
+
+A tree composed of other plans and functions (e.g. `Projection c1 + c2, c1 - c2 AS sum12; Scan c1 as u32, c2 as u64`)
+that knows how to derive its schema.
+
+Certain plans have a frozen schema (e.g. Scan), while others derive their
+schema from their child nodes.
+
+#### Column
+
+A type of logical node in a logical plan consists of field name and relation name.
+
+### Physical
+
+#### Function
+
+An object that knows how to derive its physical field from its arguments'
+physical fields, and also how to actually perform the computation on data. A
+functions' output physical field is a function of its input physical fields:
+
+```
+physical_field(PF1, PF2, ...) -> PF
+```
+
+Examples:
+
+* `plus(a,b) -> PF("{a} Plus {b}", d(a.type,b.type), a.nullable | b.nullable)` where d is a complex function (`get_supertype` in our current implementation) whose computation is for each element in the columns, sum the two entries together and return it in the same type as the smallest type of both columns.
+* `length(&str) -> PF("length({a})", u32, a.nullable)` whose computation is "count number of bytes in the string".
+
+#### Plan
+
+A tree (e.g. `Projection c1 + c2, c1 - c2 AS sum12; Scan c1 as u32, c2 as u64`)
+that knows how to derive its metadata and compute itself.
+
+Note how the physical plane does not know how to derive field names: field
+names are solely a property of the logical plane, as they are not needed in the

Review comment:
       @jorgecarleitao please correct me if I am wrong. I think the focus on this statement is on generation of field names. In the current code base, physical field names are generated using metadata from the logical plane, more specifically  from logical `Expr`s. The actual physical plan and physical expressions are not used to generated the field name.
   
   > names are used to line up inputs to PhysicalExpr
   
   This used to be true at the time when Jorge wrote it, but with https://github.com/apache/arrow-datafusion/pull/55, we will be changing it to line up inputs using index instead that. But regardless we make this change or not, it doesn't change the fact that physical filed names are only generated using metadata from the logical plane.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org