You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@arrow.apache.org by al...@apache.org on 2021/05/28 11:01:16 UTC
[arrow-datafusion] branch master updated: add output field name rfc
(#422)
This is an automated email from the ASF dual-hosted git repository.
alamb pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-datafusion.git
The following commit(s) were added to refs/heads/master by this push:
new c9ed34c add output field name rfc (#422)
c9ed34c is described below
commit c9ed34c16e47bd5da58b98f64a017e38ab8c946f
Author: QP Hou <qp...@scribd.com>
AuthorDate: Fri May 28 04:01:09 2021 -0700
add output field name rfc (#422)
* add output field name rfc
* move to spec model
* add link to developers docs & add ASF header
---
DEVELOPERS.md | 13 ++
docs/specification/output-field-name-semantic.md | 220 +++++++++++++++++++++++
2 files changed, 233 insertions(+)
diff --git a/DEVELOPERS.md b/DEVELOPERS.md
index c2daf3a..9223d99 100644
--- a/DEVELOPERS.md
+++ b/DEVELOPERS.md
@@ -94,6 +94,19 @@ can be displayed. For example, the following command creates a
dot -Tpdf < /tmp/plan.dot > /tmp/plan.pdf
```
+## Specification
+
+We formalize Datafusion semantics and behaviors through specification
+documents. These specifications are useful to be used as references to help
+resolve ambiguities during development or code reviews.
+
+You are also welcome to propose changes to existing specifications or create
+new specifications as you see fit.
+
+Here is the list current active specifications:
+
+* [Output field name semantic](docs/specification/output-field-name-semantic.md)
+
## How to format `.md` document
We are using `prettier` to format `.md` files.
diff --git a/docs/specification/output-field-name-semantic.md b/docs/specification/output-field-name-semantic.md
new file mode 100644
index 0000000..fd28d11
--- /dev/null
+++ b/docs/specification/output-field-name-semantic.md
@@ -0,0 +1,220 @@
+<!---
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements. See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership. The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations
+ under the License.
+-->
+
+# Datafusion output field name semantic
+
+This specification documents how field names in output record batches should be
+generated based on given user queries. The filed name rules apply to
+Datafusion queries planned from both SQL queries and Dataframe APIs.
+
+## Field name rules
+
+* All field names MUST not contain relation/table qualifier.
+ * Both `SELECT t1.id`, `SELECT id` and `df.select_columns(&["id"])` SHOULD result in field name: `id`
+* Function names MUST be converted to lowercase.
+ * `SELECT AVG(c1)` SHOULD result in field name: `avg(c1)`
+* Literal string MUST not be wrapped with quotes or double quotes.
+ * `SELECT 'foo'` SHOULD result in field name: `foo`
+* Operator expressions MUST be wrapped with parentheses.
+ * `SELECT -2` SHOULD result in field name: `(- 2)`
+* Operator and operand MUST be separated by spaces.
+ * `SELECT 1+2` SHOULD result in field name: `(1 + 2)`
+* Function arguments MUST be separated by a comma `,` and a space.
+ * `SELECT f(c1,c2)` and `df.select(vec![f.udf("f")?.call(vec![col("c1"), col("c2")])])` SHOULD result in field name: `f(c1, c2)`
+
+## Appendices
+
+### Examples and comparison with other systems
+
+Data schema for test sample queries:
+
+```
+CREATE TABLE t1 (id INT, a VARCHAR(5));
+INSERT INTO t1 (id, a) VALUES (1, 'foo');
+INSERT INTO t1 (id, a) VALUES (2, 'bar');
+
+CREATE TABLE t2 (id INT, b VARCHAR(5));
+INSERT INTO t2 (id, b) VALUES (1, 'hello');
+INSERT INTO t2 (id, b) VALUES (2, 'world');
+```
+
+#### Projected columns
+
+Query:
+
+```
+SELECT t1.id, a, t2.id, b
+FROM t1
+JOIN t2 ON t1.id = t2.id
+```
+
+Datafusion Arrow record batches output:
+
+| id | a | id | b |
+|----|-----|----|-------|
+| 1 | foo | 1 | hello |
+| 2 | bar | 2 | world |
+
+
+Spark, MySQL 8 and PostgreSQL 13 output:
+
+| id | a | id | b |
+|----|-----|----|-------|
+| 1 | foo | 1 | hello |
+| 2 | bar | 2 | world |
+
+SQLite 3 output:
+
+| id | a | b |
+|----|-----|-------|
+| 1 | foo | hello |
+| 2 | bar | world |
+
+
+#### Function transformed columns
+
+Query:
+
+```
+SELECT ABS(t1.id), abs(-id) FROM t1;
+```
+
+Datafusion Arrow record batches output:
+
+| abs(id) | abs((- id)) |
+|---------|-------------|
+| 1 | 1 |
+| 2 | 2 |
+
+
+Spark output:
+
+| abs(id) | abs((- id)) |
+|---------|-------------|
+| 1 | 1 |
+| 2 | 2 |
+
+
+MySQL 8 output:
+
+| ABS(t1.id) | abs(-id) |
+|------------|----------|
+| 1 | 1 |
+| 2 | 2 |
+
+PostgreSQL 13 output:
+
+| abs | abs |
+|-----|-----|
+| 1 | 1 |
+| 2 | 2 |
+
+SQlite 3 output:
+
+| ABS(t1.id) | abs(-id) |
+|------------|----------|
+| 1 | 1 |
+| 2 | 2 |
+
+
+#### Function with operators
+
+Query:
+
+```
+SELECT t1.id + ABS(id), ABS(id * t1.id) FROM t1;
+```
+
+Datafusion Arrow record batches output:
+
+| id + abs(id) | abs(id * id) |
+|--------------|--------------|
+| 2 | 1 |
+| 4 | 4 |
+
+
+Spark output:
+
+| id + abs(id) | abs(id * id) |
+|--------------|--------------|
+| 2 | 1 |
+| 4 | 4 |
+
+MySQL 8 output:
+
+| t1.id + ABS(id) | ABS(id * t1.id) |
+|-----------------|-----------------|
+| 2 | 1 |
+| 4 | 4 |
+
+PostgreSQL output:
+
+| ?column? | abs |
+|----------|-----|
+| 2 | 1 |
+| 4 | 4 |
+
+SQLite output:
+
+| t1.id + ABS(id) | ABS(id * t1.id) |
+|-----------------|-----------------|
+| 2 | 1 |
+| 4 | 4 |
+
+
+#### Project literals
+
+Query:
+
+```
+SELECT 1, 2+5, 'foo_bar';
+```
+
+Datafusion Arrow record batches output:
+
+| 1 | (2 + 5) | foo_bar |
+|---|---------|---------|
+| 1 | 7 | foo_bar |
+
+
+Spark output:
+
+| 1 | (2 + 5) | foo_bar |
+|---|---------|---------|
+| 1 | 7 | foo_bar |
+
+MySQL output:
+
+| 1 | 2+5 | foo_bar |
+|---|-----|---------|
+| 1 | 7 | foo_bar |
+
+
+PostgreSQL output:
+
+| ?column? | ?column? | ?column? |
+|----------|----------|----------|
+| 1 | 7 | foo_bar |
+
+
+SQLite 3 output:
+
+| 1 | 2+5 | 'foo_bar' |
+|---|-----|-----------|
+| 1 | 7 | foo_bar |