You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/25 06:31:46 UTC

[GitHub] [arrow-datafusion] houqp opened a new pull request #422: add output field name rfc

houqp opened a new pull request #422:
URL: https://github.com/apache/arrow-datafusion/pull/422


   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #302.
   Relates to https://github.com/apache/arrow-datafusion/pull/55.
   
   Once this is reviewed and accepted, I will send the update invariant doc in a follow up PR.
   
    # Rationale for this change
   
   Unblock PR #55.
   
   # What changes are included in this PR?
   
   First RFC for field name semantic and RFC process.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] codecov-commenter commented on pull request #422: add output field name rfc

Posted by GitBox <gi...@apache.org>.

codecov-commenter commented on pull request #422:
URL: https://github.com/apache/arrow-datafusion/pull/422#issuecomment-849764351


   # [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/422?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#422](https://codecov.io/gh/apache/arrow-datafusion/pull/422?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (6fcf0b5) into [master](https://codecov.io/gh/apache/arrow-datafusion/commit/9c0ad7b68387181e9d35d34c5dd55b6fe43b94d3?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (9c0ad7b) will **not change** coverage.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-datafusion/pull/422/graphs/tree.svg?width=650&height=150&src=pr&token=JXwWBKD3D9&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-datafusion/pull/422?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@           Coverage Diff           @@
   ##           master     #422   +/-   ##
   =======================================
     Coverage   75.27%   75.27%           
   =======================================
     Files         147      147           
     Lines       24834    24834           
   =======================================
     Hits        18694    18694           
     Misses       6140     6140           
   ```
   
   
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/422?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/422?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [9c0ad7b...6fcf0b5](https://codecov.io/gh/apache/arrow-datafusion/pull/422?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #422: add output field name rfc

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #422:
URL: https://github.com/apache/arrow-datafusion/pull/422#issuecomment-849084973


   Looks like we just need a RAT (apache copyright statement) to get a clean CI run and merge it. Perhaps we can add some section to the developer's guide once implemented with the "here is how output names are created" with a link to this RFC for the rationale


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #422: add output field name rfc

Posted by GitBox <gi...@apache.org>.

Dandandan commented on a change in pull request #422:
URL: https://github.com/apache/arrow-datafusion/pull/422#discussion_r638979116



##########
File path: docs/rfcs/output-field-name-semantic.md
##########
@@ -0,0 +1,236 @@
+# Datafusion output field name semantic
+
+Start Date: 2020-05-24
+
+## Summary
+
+Formally specify how Datafusion should construct its output field names based on
+provided user query.
+
+## Motivation
+
+By formalizing the output field name semantic, users will be able to access
+query output using consistent field names.
+
+## Detailed design
+
+The proposed semantic is chosen for the following reasons:
+
+* Ease of implementation, field names can be derived from physical expression
+without having to add extra logic to pass along arbitrary user provided input.
+Users are encouraged to use ALIAS expressions for full field name control.
+* Mostly compatible with Spark’s behavior except literal string handling.
+* Mostly backward compatible with current Datafusion’s behavior other than
+function name cases and parenthesis around operator expressions.
+
+###  Field name rules
+
+* All field names MUST not contain relation qualifier.
+  * Both `SELECT t1.id` and `SELECT id` SHOULD result in field name: `id`
+* Function names MUST be converted to lowercase.
+  * `SELECT AVG(c1)` SHOULD result in field name: `avg(c1)`

Review comment:
       One concern I had when implementing something very similar to this RFC here `https://github.com/apache/arrow-datafusion/pull/280` is that introducing an alias makes the column available under that name in the whole query. IMO that can be confusing if you do use this implicit convention. Any idea what other engines / databases are doing here?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #422: add output field name rfc

Posted by GitBox <gi...@apache.org>.

Dandandan commented on a change in pull request #422:
URL: https://github.com/apache/arrow-datafusion/pull/422#discussion_r638979116



##########
File path: docs/rfcs/output-field-name-semantic.md
##########
@@ -0,0 +1,236 @@
+# Datafusion output field name semantic
+
+Start Date: 2020-05-24
+
+## Summary
+
+Formally specify how Datafusion should construct its output field names based on
+provided user query.
+
+## Motivation
+
+By formalizing the output field name semantic, users will be able to access
+query output using consistent field names.
+
+## Detailed design
+
+The proposed semantic is chosen for the following reasons:
+
+* Ease of implementation, field names can be derived from physical expression
+without having to add extra logic to pass along arbitrary user provided input.
+Users are encouraged to use ALIAS expressions for full field name control.
+* Mostly compatible with Spark’s behavior except literal string handling.
+* Mostly backward compatible with current Datafusion’s behavior other than
+function name cases and parenthesis around operator expressions.
+
+###  Field name rules
+
+* All field names MUST not contain relation qualifier.
+  * Both `SELECT t1.id` and `SELECT id` SHOULD result in field name: `id`
+* Function names MUST be converted to lowercase.
+  * `SELECT AVG(c1)` SHOULD result in field name: `avg(c1)`

Review comment:
       One concern I had when implementing something very similar to this RFC here `https://github.com/apache/arrow-datafusion/pull/280` is that renaming the expressions / introducing an alias makes the column available under that name in the whole query. IMO that can be confusing if you do use this implicit convention. Any idea what other engines / databases are doing here?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #422: add output field name rfc

Posted by GitBox <gi...@apache.org>.

Dandandan commented on a change in pull request #422:
URL: https://github.com/apache/arrow-datafusion/pull/422#discussion_r638975536



##########
File path: docs/rfcs/output-field-name-semantic.md
##########
@@ -0,0 +1,236 @@
+# Datafusion output field name semantic

Review comment:
       Might be helpful to number/add a date to the RFCs? E.g.
   
   `2021-05-25-output-field-name-semantic.md` 
   or
   `00001-output-field-name-semantic.md `




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] houqp commented on a change in pull request #422: add output field name rfc

Posted by GitBox <gi...@apache.org>.

houqp commented on a change in pull request #422:
URL: https://github.com/apache/arrow-datafusion/pull/422#discussion_r639460465



##########
File path: docs/rfcs/output-field-name-semantic.md
##########
@@ -0,0 +1,236 @@
+# Datafusion output field name semantic
+
+Start Date: 2020-05-24
+
+## Summary
+
+Formally specify how Datafusion should construct its output field names based on
+provided user query.
+
+## Motivation
+
+By formalizing the output field name semantic, users will be able to access
+query output using consistent field names.
+
+## Detailed design
+
+The proposed semantic is chosen for the following reasons:
+
+* Ease of implementation, field names can be derived from physical expression
+without having to add extra logic to pass along arbitrary user provided input.
+Users are encouraged to use ALIAS expressions for full field name control.
+* Mostly compatible with Spark’s behavior except literal string handling.
+* Mostly backward compatible with current Datafusion’s behavior other than
+function name cases and parenthesis around operator expressions.
+
+###  Field name rules
+
+* All field names MUST not contain relation qualifier.
+  * Both `SELECT t1.id` and `SELECT id` SHOULD result in field name: `id`
+* Function names MUST be converted to lowercase.
+  * `SELECT AVG(c1)` SHOULD result in field name: `avg(c1)`

Review comment:
       @Dandandan I documented a survey of behavior from mysql/sqlite/postgres/spark in this doc as well, for example: https://github.com/houqp/arrow-datafusion/blob/qp_rfc/docs/rfcs/output-field-name-semantic.md#function-with-operators. Basically mysql and sqlite use the raw user query as the column name, postgres throws in the towel and just use `?column?` for everything while spark SQL constructs the column name based on the expression.
   
   I picked Spark's behavior because it's the one that is the closest to what we had at the time. But since you already implemented mysql and sqlite's behavior since then, i am happy to update the doc to account for that. In this case, we need two sets of rules, one for SQL queries, which is to just reuse what's provided in the query. The other one for dataframe queries, which is what I outlined in this doc.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on a change in pull request #422: add output field name rfc

Posted by GitBox <gi...@apache.org>.

alamb commented on a change in pull request #422:
URL: https://github.com/apache/arrow-datafusion/pull/422#discussion_r640491326



##########
File path: DEVELOPERS.md
##########
@@ -94,6 +94,19 @@ can be displayed. For example, the following command creates a
 dot -Tpdf < /tmp/plan.dot > /tmp/plan.pdf
 ```
 
+## Specification
+
+We formalize Datafusion semantics and behaviors through specification
+documents. These specifications are useful to be used as references to help
+resolve ambiguities during development or code reviews.
+
+You are also welcome to propose changes to existing specifications or create
+new specifications as you see fit.
+
+Here is the list current active specifications:
+
+* [Output field name semantic](docs/specification/output-field-name-semantic.md)

Review comment:
       👍 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on a change in pull request #422: add output field name rfc

Posted by GitBox <gi...@apache.org>.

alamb commented on a change in pull request #422:
URL: https://github.com/apache/arrow-datafusion/pull/422#discussion_r638965856



##########
File path: docs/rfcs/output-field-name-semantic.md
##########
@@ -0,0 +1,236 @@
+# Datafusion output field name semantic
+
+Start Date: 2020-05-24
+
+## Summary
+
+Formally specify how Datafusion should construct its output field names based on
+provided user query.
+
+## Motivation
+
+By formalizing the output field name semantic, users will be able to access
+query output using consistent field names.
+
+## Detailed design
+
+The proposed semantic is chosen for the following reasons:
+
+* Ease of implementation, field names can be derived from physical expression
+without having to add extra logic to pass along arbitrary user provided input.
+Users are encouraged to use ALIAS expressions for full field name control.
+* Mostly compatible with Spark’s behavior except literal string handling.
+* Mostly backward compatible with current Datafusion’s behavior other than
+function name cases and parenthesis around operator expressions.
+
+###  Field name rules
+
+* All field names MUST not contain relation qualifier.
+  * Both `SELECT t1.id` and `SELECT id` SHOULD result in field name: `id`
+* Function names MUST be converted to lowercase.
+  * `SELECT AVG(c1)` SHOULD result in field name: `avg(c1)`
+* Literal string MUST not be wrapped with quotes or double quotes.
+  * `SELECT 'foo'` SHOULD result in field name: `foo`
+* Operator expressions MUST be wrapped with parentheses.
+  * `SELECT -2` SHOULD result in field name: `(- 2)`

Review comment:
       Should there really be a space between `-` and `2`?
   
   ```suggestion
     * `SELECT -2` SHOULD result in field name: `(-2)`
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] houqp commented on a change in pull request #422: add output field name rfc

Posted by GitBox <gi...@apache.org>.

houqp commented on a change in pull request #422:
URL: https://github.com/apache/arrow-datafusion/pull/422#discussion_r639461793



##########
File path: docs/rfcs/output-field-name-semantic.md
##########
@@ -0,0 +1,236 @@
+# Datafusion output field name semantic
+
+Start Date: 2020-05-24
+
+## Summary
+
+Formally specify how Datafusion should construct its output field names based on
+provided user query.
+
+## Motivation
+
+By formalizing the output field name semantic, users will be able to access
+query output using consistent field names.
+
+## Detailed design
+
+The proposed semantic is chosen for the following reasons:
+
+* Ease of implementation, field names can be derived from physical expression
+without having to add extra logic to pass along arbitrary user provided input.
+Users are encouraged to use ALIAS expressions for full field name control.
+* Mostly compatible with Spark’s behavior except literal string handling.
+* Mostly backward compatible with current Datafusion’s behavior other than
+function name cases and parenthesis around operator expressions.
+
+###  Field name rules
+
+* All field names MUST not contain relation qualifier.
+  * Both `SELECT t1.id` and `SELECT id` SHOULD result in field name: `id`
+* Function names MUST be converted to lowercase.
+  * `SELECT AVG(c1)` SHOULD result in field name: `avg(c1)`
+* Literal string MUST not be wrapped with quotes or double quotes.
+  * `SELECT 'foo'` SHOULD result in field name: `foo`
+* Operator expressions MUST be wrapped with parentheses.
+  * `SELECT -2` SHOULD result in field name: `(- 2)`

Review comment:
       This came from the rule below: `Operator and operand MUST be separated by spaces.`. I don't have a strong opinion on this. Picked this because this is what spark does today. We can also change the rule below to require no spaces between operator and operands.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] houqp commented on a change in pull request #422: add output field name rfc

Posted by GitBox <gi...@apache.org>.

houqp commented on a change in pull request #422:
URL: https://github.com/apache/arrow-datafusion/pull/422#discussion_r639474603



##########
File path: docs/rfcs/output-field-name-semantic.md
##########
@@ -0,0 +1,236 @@
+# Datafusion output field name semantic
+
+Start Date: 2020-05-24
+
+## Summary
+
+Formally specify how Datafusion should construct its output field names based on
+provided user query.
+
+## Motivation
+
+By formalizing the output field name semantic, users will be able to access
+query output using consistent field names.
+
+## Detailed design
+
+The proposed semantic is chosen for the following reasons:
+
+* Ease of implementation, field names can be derived from physical expression
+without having to add extra logic to pass along arbitrary user provided input.
+Users are encouraged to use ALIAS expressions for full field name control.
+* Mostly compatible with Spark’s behavior except literal string handling.
+* Mostly backward compatible with current Datafusion’s behavior other than
+function name cases and parenthesis around operator expressions.
+
+###  Field name rules
+
+* All field names MUST not contain relation qualifier.
+  * Both `SELECT t1.id` and `SELECT id` SHOULD result in field name: `id`
+* Function names MUST be converted to lowercase.
+  * `SELECT AVG(c1)` SHOULD result in field name: `avg(c1)`
+* Literal string MUST not be wrapped with quotes or double quotes.
+  * `SELECT 'foo'` SHOULD result in field name: `foo`
+* Operator expressions MUST be wrapped with parentheses.
+  * `SELECT -2` SHOULD result in field name: `(- 2)`
+* Operator and operand MUST be separated by spaces.
+  * `SELECT 1+2` SHOULD result in field name: `(1 + 2)`
+* Function arguments MUST be separated by a comma `,` and a space.
+  * `SELECT f(c1,c2)` SHOULD result in field name: `f(c1, c2)`
+
+
+### Examples and comparison with other systems
+
+Data schema for test sample queries:
+
+```
+CREATE TABLE t1 (id INT, a VARCHAR(5));
+INSERT INTO t1 (id, a) VALUES (1, 'foo');
+INSERT INTO t1 (id, a) VALUES (2, 'bar');
+
+CREATE TABLE t2 (id INT, b VARCHAR(5));
+INSERT INTO t2 (id, b) VALUES (1, 'hello');
+INSERT INTO t2 (id, b) VALUES (2, 'world');
+```
+
+#### Projected columns
+
+Query:
+
+```
+SELECT t1.id, a, t2.id, b
+FROM t1
+JOIN t2 ON t1.id = t2.id
+```
+
+Datafusion Arrow record batches output:
+
+| id | a   | id | b     |
+|----|-----|----|-------|
+| 1  | foo | 1  | hello |
+| 2  | bar | 2  | world |
+
+
+Spark, MySQL 8 and PostgreSQL 13 output:
+
+| id | a   | id | b     |
+|----|-----|----|-------|
+| 1  | foo | 1  | hello |
+| 2  | bar | 2  | world |
+
+SQLite 3 output:
+
+| id | a   | b     |
+|----|-----|-------|
+| 1  | foo | hello |
+| 2  | bar | world |
+
+
+#### Function transformed columns
+
+Query:
+
+```
+SELECT ABS(t1.id), abs(-id) FROM t1;
+```
+
+Datafusion Arrow record batches output:
+
+| abs(id) | abs((- id)) |
+|---------|-------------|
+| 1       | 1           |
+| 2       | 2           |
+
+
+Spark output:
+
+| abs(id) | abs((- id)) |
+|---------|-------------|
+| 1       | 1           |
+| 2       | 2           |
+
+
+MySQL 8 output:
+
+| ABS(t1.id) | abs(-id) |
+|------------|----------|
+| 1          | 1        |
+| 2          | 2        |
+
+PostgreSQL 13 output:
+
+| abs | abs |
+|-----|-----|
+| 1   | 1   |
+| 2   | 2   |
+
+SQlite 3 output:
+
+| ABS(t1.id) | abs(-id) |
+|------------|----------|
+| 1          | 1        |
+| 2          | 2        |
+
+
+#### Function with operators
+
+Query:
+
+```
+SELECT t1.id + ABS(id), ABS(id * t1.id) FROM t1;
+```
+
+Datafusion Arrow record batches output:
+
+| id + abs(id) | abs(id * id) |
+|--------------|--------------|
+| 2            | 1            |
+| 4            | 4            |
+
+
+Spark output:
+
+| id + abs(id) | abs(id * id) |
+|--------------|--------------|
+| 2            | 1            |
+| 4            | 4            |
+
+MySQL 8 output:
+
+| t1.id + ABS(id) | ABS(id * t1.id) |
+|-----------------|-----------------|
+| 2               | 1               |
+| 4               | 4               |
+
+PostgreSQL output:
+
+| ?column? | abs |
+|----------|-----|
+| 2        | 1   |
+| 4        | 4   |
+
+SQLite output:
+
+| t1.id + ABS(id) | ABS(id * t1.id) |
+|-----------------|-----------------|
+| 2               | 1               |
+| 4               | 4               |
+
+
+#### Project literals
+
+Query:
+
+```
+SELECT 1, 2+5, 'foo_bar';
+```
+
+Datafusion Arrow record batches output:
+
+| 1 | (2 + 5) | foo_bar |
+|---|---------|---------|
+| 1 | 7       | foo_bar |
+
+
+Spark output:
+
+| 1 | (2 + 5) | foo_bar |
+|---|---------|---------|
+| 1 | 7       | foo_bar |
+
+MySQL output:
+
+| 1 | 2+5 | foo_bar |
+|---|-----|---------|
+| 1 | 7   | foo_bar |
+
+
+PostgreSQL output:
+
+| ?column? | ?column? | ?column? |
+|----------|----------|----------|
+| 1        | 7        | foo_bar  |
+
+
+SQLite 3 output:
+
+| 1 | 2+5 | 'foo_bar' |
+|---|-----|-----------|
+| 1 | 7   | foo_bar   |
+
+
+## Alternatives
+
+Postgres's behavior is too simple. It defaults to `?column?` as the column name
+in many cases, which makes output less readable than other designs.
+
+MySQL and SQLite preserve user query input as the field name. This adds extra
+implementation and runtime overhead with little gain for end uers.

Review comment:
       When I wrote this, I was mostly thinking about attaching the extra alias expression node and having to traverse this node multiple times during plan optimization. But honestly, I don't think this overhead matters in practice. It's just one more thing we need to do compared to not preserving the raw query input.
   
   Getting rid of the qualifier during physical planning is just one line change in the function that returns physical field name for a logical column.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #422: add output field name rfc

Posted by GitBox <gi...@apache.org>.

Dandandan commented on a change in pull request #422:
URL: https://github.com/apache/arrow-datafusion/pull/422#discussion_r638989781



##########
File path: docs/rfcs/output-field-name-semantic.md
##########
@@ -0,0 +1,236 @@
+# Datafusion output field name semantic
+
+Start Date: 2020-05-24
+
+## Summary
+
+Formally specify how Datafusion should construct its output field names based on
+provided user query.
+
+## Motivation
+
+By formalizing the output field name semantic, users will be able to access
+query output using consistent field names.
+
+## Detailed design
+
+The proposed semantic is chosen for the following reasons:
+
+* Ease of implementation, field names can be derived from physical expression
+without having to add extra logic to pass along arbitrary user provided input.
+Users are encouraged to use ALIAS expressions for full field name control.
+* Mostly compatible with Spark’s behavior except literal string handling.
+* Mostly backward compatible with current Datafusion’s behavior other than
+function name cases and parenthesis around operator expressions.
+
+###  Field name rules
+
+* All field names MUST not contain relation qualifier.
+  * Both `SELECT t1.id` and `SELECT id` SHOULD result in field name: `id`
+* Function names MUST be converted to lowercase.
+  * `SELECT AVG(c1)` SHOULD result in field name: `avg(c1)`

Review comment:
       (So it looks like this is a common practice). Confusing - yes - but maybe not a hige concern, as users might not often rely on this as it is quite unconventional.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #422: add output field name rfc

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #422:
URL: https://github.com/apache/arrow-datafusion/pull/422#issuecomment-850337390


   windows test failure is unrelated


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #422: add output field name rfc

Posted by GitBox <gi...@apache.org>.

Dandandan commented on a change in pull request #422:
URL: https://github.com/apache/arrow-datafusion/pull/422#discussion_r638987177



##########
File path: docs/rfcs/output-field-name-semantic.md
##########
@@ -0,0 +1,236 @@
+# Datafusion output field name semantic
+
+Start Date: 2020-05-24
+
+## Summary
+
+Formally specify how Datafusion should construct its output field names based on
+provided user query.
+
+## Motivation
+
+By formalizing the output field name semantic, users will be able to access
+query output using consistent field names.
+
+## Detailed design
+
+The proposed semantic is chosen for the following reasons:
+
+* Ease of implementation, field names can be derived from physical expression
+without having to add extra logic to pass along arbitrary user provided input.
+Users are encouraged to use ALIAS expressions for full field name control.
+* Mostly compatible with Spark’s behavior except literal string handling.
+* Mostly backward compatible with current Datafusion’s behavior other than
+function name cases and parenthesis around operator expressions.
+
+###  Field name rules
+
+* All field names MUST not contain relation qualifier.
+  * Both `SELECT t1.id` and `SELECT id` SHOULD result in field name: `id`
+* Function names MUST be converted to lowercase.
+  * `SELECT AVG(c1)` SHOULD result in field name: `avg(c1)`
+* Literal string MUST not be wrapped with quotes or double quotes.
+  * `SELECT 'foo'` SHOULD result in field name: `foo`
+* Operator expressions MUST be wrapped with parentheses.
+  * `SELECT -2` SHOULD result in field name: `(- 2)`
+* Operator and operand MUST be separated by spaces.
+  * `SELECT 1+2` SHOULD result in field name: `(1 + 2)`
+* Function arguments MUST be separated by a comma `,` and a space.
+  * `SELECT f(c1,c2)` SHOULD result in field name: `f(c1, c2)`
+
+
+### Examples and comparison with other systems
+
+Data schema for test sample queries:
+
+```
+CREATE TABLE t1 (id INT, a VARCHAR(5));
+INSERT INTO t1 (id, a) VALUES (1, 'foo');
+INSERT INTO t1 (id, a) VALUES (2, 'bar');
+
+CREATE TABLE t2 (id INT, b VARCHAR(5));
+INSERT INTO t2 (id, b) VALUES (1, 'hello');
+INSERT INTO t2 (id, b) VALUES (2, 'world');
+```
+
+#### Projected columns
+
+Query:
+
+```
+SELECT t1.id, a, t2.id, b
+FROM t1
+JOIN t2 ON t1.id = t2.id
+```
+
+Datafusion Arrow record batches output:
+
+| id | a   | id | b     |
+|----|-----|----|-------|
+| 1  | foo | 1  | hello |
+| 2  | bar | 2  | world |
+
+
+Spark, MySQL 8 and PostgreSQL 13 output:
+
+| id | a   | id | b     |
+|----|-----|----|-------|
+| 1  | foo | 1  | hello |
+| 2  | bar | 2  | world |
+
+SQLite 3 output:
+
+| id | a   | b     |
+|----|-----|-------|
+| 1  | foo | hello |
+| 2  | bar | world |
+
+
+#### Function transformed columns
+
+Query:
+
+```
+SELECT ABS(t1.id), abs(-id) FROM t1;
+```
+
+Datafusion Arrow record batches output:
+
+| abs(id) | abs((- id)) |
+|---------|-------------|
+| 1       | 1           |
+| 2       | 2           |
+
+
+Spark output:
+
+| abs(id) | abs((- id)) |
+|---------|-------------|
+| 1       | 1           |
+| 2       | 2           |
+
+
+MySQL 8 output:
+
+| ABS(t1.id) | abs(-id) |
+|------------|----------|
+| 1          | 1        |
+| 2          | 2        |
+
+PostgreSQL 13 output:
+
+| abs | abs |
+|-----|-----|
+| 1   | 1   |
+| 2   | 2   |
+
+SQlite 3 output:
+
+| ABS(t1.id) | abs(-id) |
+|------------|----------|
+| 1          | 1        |
+| 2          | 2        |
+
+
+#### Function with operators
+
+Query:
+
+```
+SELECT t1.id + ABS(id), ABS(id * t1.id) FROM t1;
+```
+
+Datafusion Arrow record batches output:
+
+| id + abs(id) | abs(id * id) |
+|--------------|--------------|
+| 2            | 1            |
+| 4            | 4            |
+
+
+Spark output:
+
+| id + abs(id) | abs(id * id) |
+|--------------|--------------|
+| 2            | 1            |
+| 4            | 4            |
+
+MySQL 8 output:
+
+| t1.id + ABS(id) | ABS(id * t1.id) |
+|-----------------|-----------------|
+| 2               | 1               |
+| 4               | 4               |
+
+PostgreSQL output:
+
+| ?column? | abs |
+|----------|-----|
+| 2        | 1   |
+| 4        | 4   |
+
+SQLite output:
+
+| t1.id + ABS(id) | ABS(id * t1.id) |
+|-----------------|-----------------|
+| 2               | 1               |
+| 4               | 4               |
+
+
+#### Project literals
+
+Query:
+
+```
+SELECT 1, 2+5, 'foo_bar';
+```
+
+Datafusion Arrow record batches output:
+
+| 1 | (2 + 5) | foo_bar |
+|---|---------|---------|
+| 1 | 7       | foo_bar |
+
+
+Spark output:
+
+| 1 | (2 + 5) | foo_bar |
+|---|---------|---------|
+| 1 | 7       | foo_bar |
+
+MySQL output:
+
+| 1 | 2+5 | foo_bar |
+|---|-----|---------|
+| 1 | 7   | foo_bar |
+
+
+PostgreSQL output:
+
+| ?column? | ?column? | ?column? |
+|----------|----------|----------|
+| 1        | 7        | foo_bar  |
+
+
+SQLite 3 output:
+
+| 1 | 2+5 | 'foo_bar' |
+|---|-----|-----------|
+| 1 | 7   | foo_bar   |
+
+
+## Alternatives
+
+Postgres's behavior is too simple. It defaults to `?column?` as the column name
+in many cases, which makes output less readable than other designs.
+
+MySQL and SQLite preserve user query input as the field name. This adds extra
+implementation and runtime overhead with little gain for end uers.

Review comment:
       What is the extra implementation / runtime overhead? I would say just printing the original (representation)  of expressions should be less work than getting rid of qualified names? Preserving casing is currently not something done by the parser - but qualified names are maintained.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] houqp commented on pull request #422: add output field name rfc

Posted by GitBox <gi...@apache.org>.

houqp commented on pull request #422:
URL: https://github.com/apache/arrow-datafusion/pull/422#issuecomment-849341833


   @alamb @jorgecarleitao @Dandandan I reorganized everything to better align with the specification model. Could you take another look to see if there is anything you would like me to change?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] jorgecarleitao commented on pull request #422: add output field name rfc

Posted by GitBox <gi...@apache.org>.

jorgecarleitao commented on pull request #422:
URL: https://github.com/apache/arrow-datafusion/pull/422#issuecomment-848039362


   So, a question that I would like to bring before committing to RFCs. I see two ways of approaching this: 
   
   1. the delta approach: each RFC may change previous RFCs, becoming the "latest" or a mixture of
   2. the specification approach: each change is a PR, and the combined result is the new "specification"
   
   Both have benefits and downsides.
   
   My opinion is that we should not use the RFC approach, and instead work under the "specification" model.
   
   The reason is that, in my opinion, RFCs are hard to follow because they were made in a specific moment in time and no longer updated. If updated, there is a new RFC that does that, which requires some form of consolidation (e.g. RFCs have the term "amends", "superseded by", "revoked", PEP also).
   
   My opinion is that, with PRs, git history and git blame, there is no need to store the "deltas" (RFCs) in the repository itself, and we should instead offer the consolidated, up-to-date picture of the specification.
   
   This was the rational of the original issue, at least.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb merged pull request #422: add output field name rfc

Posted by GitBox <gi...@apache.org>.

alamb merged pull request #422:
URL: https://github.com/apache/arrow-datafusion/pull/422


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] jorgecarleitao edited a comment on pull request #422: add output field name rfc

Posted by GitBox <gi...@apache.org>.

jorgecarleitao edited a comment on pull request #422:
URL: https://github.com/apache/arrow-datafusion/pull/422#issuecomment-848039362

So, a question that I would like to bring before committing to RFCs. I see two ways of approaching this:

1. the delta approach: each RFC may change previous RFCs, becoming the "latest" or a mixture of
2. the specification approach: each change is a PR, and the combined result is the new "specification"

Both have benefits and downsides.

My opinion is that we should not use the RFC approach, and instead work under the "specification" model.

The reason is that, in my opinion, RFCs are hard to follow because they were made in a specific moment in time and no longer updated. If updated, there is a new RFC that does that, which requires some form of consolidation (e.g. RFCs have the term "amends", "superseded by", "revoked", PEP also).

My opinion is that, with PRs, git history and git blame, there is no need to store the "deltas" (RFCs) in the repository itself, and we should instead offer the consolidated, up-to-date picture of the specification.

This was the rational of the original issue, at least.

TL;DR: an RFC is a "delta", the specification is the "state". An RFC/PR brings the state from the previous state to a new state. IMO we should have the specification on the repo and use PRs to track deltas (as opposed to having the deltas themselves in the repo)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on pull request #422: add output field name rfc

Posted by GitBox <gi...@apache.org>.

alamb commented on pull request #422:
URL: https://github.com/apache/arrow-datafusion/pull/422#issuecomment-849084170


   Looks like we just need a RAT (apache copyright statement) to get a clean CI run and merge it. Perhaps we can add some section to the developer's guide once implemented with the "here is how output names are created" with a link to this RFC for the rationale


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] houqp commented on pull request #422: add output field name rfc

Posted by GitBox <gi...@apache.org>.

houqp commented on pull request #422:
URL: https://github.com/apache/arrow-datafusion/pull/422#issuecomment-848552136


   @jorgecarleitao I agree with you, I also don't want this thing to get too formal and create unnecessary frictions whenever we need to update it. My original intent was a hybrid model of RFCs and Specs. Where we allow minor and backwards compatible changes to RFC directly (tracked through git), but require new RFCs for bigger changes. I will convert it to specs so people can walk in with the right expectation. We can come back to the RFC models in the future if we want to get serious about proposing changes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] houqp commented on a change in pull request #422: add output field name rfc

Posted by GitBox <gi...@apache.org>.

houqp commented on a change in pull request #422:
URL: https://github.com/apache/arrow-datafusion/pull/422#discussion_r639460465



##########
File path: docs/rfcs/output-field-name-semantic.md
##########
@@ -0,0 +1,236 @@
+# Datafusion output field name semantic
+
+Start Date: 2020-05-24
+
+## Summary
+
+Formally specify how Datafusion should construct its output field names based on
+provided user query.
+
+## Motivation
+
+By formalizing the output field name semantic, users will be able to access
+query output using consistent field names.
+
+## Detailed design
+
+The proposed semantic is chosen for the following reasons:
+
+* Ease of implementation, field names can be derived from physical expression
+without having to add extra logic to pass along arbitrary user provided input.
+Users are encouraged to use ALIAS expressions for full field name control.
+* Mostly compatible with Spark’s behavior except literal string handling.
+* Mostly backward compatible with current Datafusion’s behavior other than
+function name cases and parenthesis around operator expressions.
+
+###  Field name rules
+
+* All field names MUST not contain relation qualifier.
+  * Both `SELECT t1.id` and `SELECT id` SHOULD result in field name: `id`
+* Function names MUST be converted to lowercase.
+  * `SELECT AVG(c1)` SHOULD result in field name: `avg(c1)`

Review comment:
       @Dandandan I documented a survey of behavior from mysql/sqlite/postgres/spark in this doc as well, for example: https://github.com/houqp/arrow-datafusion/blob/qp_rfc/docs/rfcs/output-field-name-semantic.md#function-with-operators. Basically mysql and sqlite use the raw user query as the column name, postgres throws in the towel and just use `?column?` for everything while spark SQL constructs the column name based on the expression.
   
   I picked Spark's behavior because it's the one that is the closest to what we had at the time. But since you already implemented mysql and sqlite's behavior since then, i am happy to update the doc to account for that. In this case, we need two sets of rules, one for SQL queries, which is to just reuse what's provided in the query. The other one for dataframe queries, which is what I outlined in this doc.
   
   UPDATE: after taking a second look at #280, turns out the PR is closed due to the issue you mentioned above. I am now leaning back towards not preserving user provided names from query to keep things simple. It's one less thing to worry about and keeps the rules simple so we can apply the same set of rules for outputs produced from both SQL queries and dataframe queries. Let me know if you have a strong opinion on this though.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #422: add output field name rfc

Posted by GitBox <gi...@apache.org>.

Dandandan commented on a change in pull request #422:
URL: https://github.com/apache/arrow-datafusion/pull/422#discussion_r638980734



##########
File path: docs/rfcs/output-field-name-semantic.md
##########
@@ -0,0 +1,236 @@
+# Datafusion output field name semantic
+
+Start Date: 2020-05-24
+
+## Summary
+
+Formally specify how Datafusion should construct its output field names based on
+provided user query.
+
+## Motivation
+
+By formalizing the output field name semantic, users will be able to access
+query output using consistent field names.
+
+## Detailed design
+
+The proposed semantic is chosen for the following reasons:
+
+* Ease of implementation, field names can be derived from physical expression
+without having to add extra logic to pass along arbitrary user provided input.
+Users are encouraged to use ALIAS expressions for full field name control.
+* Mostly compatible with Spark’s behavior except literal string handling.
+* Mostly backward compatible with current Datafusion’s behavior other than
+function name cases and parenthesis around operator expressions.
+
+###  Field name rules
+
+* All field names MUST not contain relation qualifier.
+  * Both `SELECT t1.id` and `SELECT id` SHOULD result in field name: `id`
+* Function names MUST be converted to lowercase.
+  * `SELECT AVG(c1)` SHOULD result in field name: `avg(c1)`

Review comment:
       SQLite: ![image](https://user-images.githubusercontent.com/163737/119536416-5b02a000-bd89-11eb-9875-62918c2a8cc6.png)
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] jorgecarleitao commented on pull request #422: add output field name rfc

Posted by GitBox <gi...@apache.org>.

jorgecarleitao commented on pull request #422:
URL: https://github.com/apache/arrow-datafusion/pull/422#issuecomment-848607838


   Thanks @houqp ! I am 100% with you here.
   
   Wrt to the content itself, I am a big +1 here. We could discuss the finer print but my approval reads as: this is a fantastic first start; we may need some more iterations here; I would like to avoid committing to the exact wording at this point and instead allow further iterations on the existing text (i.e. without requiring another proposal / RFC, long text, and/or a way to consolidate texts.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] houqp commented on a change in pull request #422: add output field name rfc

Posted by GitBox <gi...@apache.org>.

houqp commented on a change in pull request #422:
URL: https://github.com/apache/arrow-datafusion/pull/422#discussion_r639461793



##########
File path: docs/rfcs/output-field-name-semantic.md
##########
@@ -0,0 +1,236 @@
+# Datafusion output field name semantic
+
+Start Date: 2020-05-24
+
+## Summary
+
+Formally specify how Datafusion should construct its output field names based on
+provided user query.
+
+## Motivation
+
+By formalizing the output field name semantic, users will be able to access
+query output using consistent field names.
+
+## Detailed design
+
+The proposed semantic is chosen for the following reasons:
+
+* Ease of implementation, field names can be derived from physical expression
+without having to add extra logic to pass along arbitrary user provided input.
+Users are encouraged to use ALIAS expressions for full field name control.
+* Mostly compatible with Spark’s behavior except literal string handling.
+* Mostly backward compatible with current Datafusion’s behavior other than
+function name cases and parenthesis around operator expressions.
+
+###  Field name rules
+
+* All field names MUST not contain relation qualifier.
+  * Both `SELECT t1.id` and `SELECT id` SHOULD result in field name: `id`
+* Function names MUST be converted to lowercase.
+  * `SELECT AVG(c1)` SHOULD result in field name: `avg(c1)`
+* Literal string MUST not be wrapped with quotes or double quotes.
+  * `SELECT 'foo'` SHOULD result in field name: `foo`
+* Operator expressions MUST be wrapped with parentheses.
+  * `SELECT -2` SHOULD result in field name: `(- 2)`

Review comment:
       This came from the rule below: `Operator and operand MUST be separated by spaces.`. I don't have a strong opinion on this. Picked this because this is what spark does today. We can also change the rule below to require no spaces between operator and operands. Which one do you prefer?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #422: add output field name rfc

Posted by GitBox <gi...@apache.org>.

Dandandan commented on a change in pull request #422:
URL: https://github.com/apache/arrow-datafusion/pull/422#discussion_r638981442



##########
File path: docs/rfcs/output-field-name-semantic.md
##########
@@ -0,0 +1,236 @@
+# Datafusion output field name semantic
+
+Start Date: 2020-05-24
+
+## Summary
+
+Formally specify how Datafusion should construct its output field names based on
+provided user query.
+
+## Motivation
+
+By formalizing the output field name semantic, users will be able to access
+query output using consistent field names.
+
+## Detailed design
+
+The proposed semantic is chosen for the following reasons:
+
+* Ease of implementation, field names can be derived from physical expression
+without having to add extra logic to pass along arbitrary user provided input.
+Users are encouraged to use ALIAS expressions for full field name control.
+* Mostly compatible with Spark’s behavior except literal string handling.
+* Mostly backward compatible with current Datafusion’s behavior other than
+function name cases and parenthesis around operator expressions.
+
+###  Field name rules
+
+* All field names MUST not contain relation qualifier.
+  * Both `SELECT t1.id` and `SELECT id` SHOULD result in field name: `id`
+* Function names MUST be converted to lowercase.
+  * `SELECT AVG(c1)` SHOULD result in field name: `avg(c1)`

Review comment:
       MariaDB:
   ![image](https://user-images.githubusercontent.com/163737/119536547-7d94b900-bd89-11eb-8d94-79c5dee5525f.png)
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org