You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@arrow.apache.org by al...@apache.org on 2021/05/28 11:01:16 UTC

[arrow-datafusion] branch master updated: add output field name rfc (#422)

This is an automated email from the ASF dual-hosted git repository.

alamb pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-datafusion.git


The following commit(s) were added to refs/heads/master by this push:
     new c9ed34c  add output field name rfc (#422)
c9ed34c is described below

commit c9ed34c16e47bd5da58b98f64a017e38ab8c946f
Author: QP Hou <qp...@scribd.com>
AuthorDate: Fri May 28 04:01:09 2021 -0700

    add output field name rfc (#422)
    
    * add output field name rfc
    
    * move to spec model
    
    * add link to developers docs & add ASF header
---
 DEVELOPERS.md                                    |  13 ++
 docs/specification/output-field-name-semantic.md | 220 +++++++++++++++++++++++
 2 files changed, 233 insertions(+)

diff --git a/DEVELOPERS.md b/DEVELOPERS.md
index c2daf3a..9223d99 100644
--- a/DEVELOPERS.md
+++ b/DEVELOPERS.md
@@ -94,6 +94,19 @@ can be displayed. For example, the following command creates a
 dot -Tpdf < /tmp/plan.dot > /tmp/plan.pdf
 ```
 
+## Specification
+
+We formalize Datafusion semantics and behaviors through specification
+documents. These specifications are useful to be used as references to help
+resolve ambiguities during development or code reviews.
+
+You are also welcome to propose changes to existing specifications or create
+new specifications as you see fit.
+
+Here is the list current active specifications:
+
+* [Output field name semantic](docs/specification/output-field-name-semantic.md)
+
 ## How to format `.md` document
 
 We are using `prettier` to format `.md` files.
diff --git a/docs/specification/output-field-name-semantic.md b/docs/specification/output-field-name-semantic.md
new file mode 100644
index 0000000..fd28d11
--- /dev/null
+++ b/docs/specification/output-field-name-semantic.md
@@ -0,0 +1,220 @@
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+# Datafusion output field name semantic
+
+This specification documents how field names in output record batches should be
+generated based on given user queries. The filed name rules apply to
+Datafusion queries planned from both SQL queries and Dataframe APIs.
+
+##  Field name rules
+
+* All field names MUST not contain relation/table qualifier.
+  * Both `SELECT t1.id`, `SELECT id` and `df.select_columns(&["id"])` SHOULD result in field name: `id`
+* Function names MUST be converted to lowercase.
+  * `SELECT AVG(c1)` SHOULD result in field name: `avg(c1)`
+* Literal string MUST not be wrapped with quotes or double quotes.
+  * `SELECT 'foo'` SHOULD result in field name: `foo`
+* Operator expressions MUST be wrapped with parentheses.
+  * `SELECT -2` SHOULD result in field name: `(- 2)`
+* Operator and operand MUST be separated by spaces.
+  * `SELECT 1+2` SHOULD result in field name: `(1 + 2)`
+* Function arguments MUST be separated by a comma `,` and a space.
+  * `SELECT f(c1,c2)` and `df.select(vec![f.udf("f")?.call(vec![col("c1"), col("c2")])])`  SHOULD result in field name: `f(c1, c2)`
+
+## Appendices
+
+### Examples and comparison with other systems
+
+Data schema for test sample queries:
+
+```
+CREATE TABLE t1 (id INT, a VARCHAR(5));
+INSERT INTO t1 (id, a) VALUES (1, 'foo');
+INSERT INTO t1 (id, a) VALUES (2, 'bar');
+
+CREATE TABLE t2 (id INT, b VARCHAR(5));
+INSERT INTO t2 (id, b) VALUES (1, 'hello');
+INSERT INTO t2 (id, b) VALUES (2, 'world');
+```
+
+#### Projected columns
+
+Query:
+
+```
+SELECT t1.id, a, t2.id, b
+FROM t1
+JOIN t2 ON t1.id = t2.id
+```
+
+Datafusion Arrow record batches output:
+
+| id | a   | id | b     |
+|----|-----|----|-------|
+| 1  | foo | 1  | hello |
+| 2  | bar | 2  | world |
+
+
+Spark, MySQL 8 and PostgreSQL 13 output:
+
+| id | a   | id | b     |
+|----|-----|----|-------|
+| 1  | foo | 1  | hello |
+| 2  | bar | 2  | world |
+
+SQLite 3 output:
+
+| id | a   | b     |
+|----|-----|-------|
+| 1  | foo | hello |
+| 2  | bar | world |
+
+
+#### Function transformed columns
+
+Query:
+
+```
+SELECT ABS(t1.id), abs(-id) FROM t1;
+```
+
+Datafusion Arrow record batches output:
+
+| abs(id) | abs((- id)) |
+|---------|-------------|
+| 1       | 1           |
+| 2       | 2           |
+
+
+Spark output:
+
+| abs(id) | abs((- id)) |
+|---------|-------------|
+| 1       | 1           |
+| 2       | 2           |
+
+
+MySQL 8 output:
+
+| ABS(t1.id) | abs(-id) |
+|------------|----------|
+| 1          | 1        |
+| 2          | 2        |
+
+PostgreSQL 13 output:
+
+| abs | abs |
+|-----|-----|
+| 1   | 1   |
+| 2   | 2   |
+
+SQlite 3 output:
+
+| ABS(t1.id) | abs(-id) |
+|------------|----------|
+| 1          | 1        |
+| 2          | 2        |
+
+
+#### Function with operators
+
+Query:
+
+```
+SELECT t1.id + ABS(id), ABS(id * t1.id) FROM t1;
+```
+
+Datafusion Arrow record batches output:
+
+| id + abs(id) | abs(id * id) |
+|--------------|--------------|
+| 2            | 1            |
+| 4            | 4            |
+
+
+Spark output:
+
+| id + abs(id) | abs(id * id) |
+|--------------|--------------|
+| 2            | 1            |
+| 4            | 4            |
+
+MySQL 8 output:
+
+| t1.id + ABS(id) | ABS(id * t1.id) |
+|-----------------|-----------------|
+| 2               | 1               |
+| 4               | 4               |
+
+PostgreSQL output:
+
+| ?column? | abs |
+|----------|-----|
+| 2        | 1   |
+| 4        | 4   |
+
+SQLite output:
+
+| t1.id + ABS(id) | ABS(id * t1.id) |
+|-----------------|-----------------|
+| 2               | 1               |
+| 4               | 4               |
+
+
+#### Project literals
+
+Query:
+
+```
+SELECT 1, 2+5, 'foo_bar';
+```
+
+Datafusion Arrow record batches output:
+
+| 1 | (2 + 5) | foo_bar |
+|---|---------|---------|
+| 1 | 7       | foo_bar |
+
+
+Spark output:
+
+| 1 | (2 + 5) | foo_bar |
+|---|---------|---------|
+| 1 | 7       | foo_bar |
+
+MySQL output:
+
+| 1 | 2+5 | foo_bar |
+|---|-----|---------|
+| 1 | 7   | foo_bar |
+
+
+PostgreSQL output:
+
+| ?column? | ?column? | ?column? |
+|----------|----------|----------|
+| 1        | 7        | foo_bar  |
+
+
+SQLite 3 output:
+
+| 1 | 2+5 | 'foo_bar' |
+|---|-----|-----------|
+| 1 | 7   | foo_bar   |