Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/26 07:15:52 UTC

[GitHub] [arrow-datafusion] houqp commented on a change in pull request #422: add output field name rfc

houqp commented on a change in pull request #422:
URL: https://github.com/apache/arrow-datafusion/pull/422#discussion_r639460465



##########
File path: docs/rfcs/output-field-name-semantic.md
##########
@@ -0,0 +1,236 @@
+# DataFusion output field name semantics
+
+Start Date: 2020-05-24
+
+## Summary
+
+Formally specify how DataFusion should construct its output field names from
+the user-provided query.
+
+## Motivation
+
+By formalizing the output field name semantics, users will be able to access
+query output using consistent field names.
+
+## Detailed design
+
+The proposed semantics were chosen for the following reasons:
+
+* Ease of implementation: field names can be derived from physical expressions
+without extra logic to pass along arbitrary user-provided input. Users are
+encouraged to use ALIAS expressions for full control over field names.
+* Mostly compatible with Spark's behavior, except for literal string handling.
+* Mostly backward compatible with DataFusion's current behavior, apart from
+function name casing and parentheses around operator expressions.
+
+### Field name rules
+
+* All field names MUST NOT contain a relation qualifier.
+  * Both `SELECT t1.id` and `SELECT id` SHOULD result in field name: `id`
+* Function names MUST be converted to lowercase.
+  * `SELECT AVG(c1)` SHOULD result in field name: `avg(c1)`

Review comment:
       @Dandandan I documented a survey of mysql/sqlite/postgres/spark behavior in this doc as well, for example: https://github.com/houqp/arrow-datafusion/blob/qp_rfc/docs/rfcs/output-field-name-semantic.md#function-with-operators. In short, mysql and sqlite use the raw user query text as the column name, postgres throws in the towel and just uses `?column?` for everything, while Spark SQL constructs the column name from the expression.
   
   I picked Spark's behavior because it was the closest to what we had at the time. But since you have implemented mysql and sqlite's behavior since then, I am happy to update the doc to account for that. In that case we would need two sets of rules: one for SQL queries, which simply reuses the text provided in the query, and another for dataframe queries, which is what I outlined in this doc.
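
   To make the proposed dataframe-side rules concrete, here is a minimal illustrative sketch (not DataFusion's actual implementation; the function name and string-based approach are assumptions for illustration only) of how an output field name could be derived from an expression: relation qualifiers are stripped from column references, and function names are lowercased.

   ```python
   import re


   def output_field_name(expr: str) -> str:
       """Illustrative sketch of the proposed naming rules (hypothetical
       helper, not part of DataFusion): lowercase function names and strip
       relation qualifiers from plain column references."""
       expr = expr.strip()
       # Function call, e.g. "AVG(c1)": lowercase the name, recurse on the argument.
       m = re.fullmatch(r"(\w+)\((.*)\)", expr)
       if m:
           func, arg = m.groups()
           return f"{func.lower()}({output_field_name(arg)})"
       # Plain column reference: drop any "table." qualifier.
       return expr.split(".")[-1]


   # Examples matching the rules quoted in the diff above:
   # SELECT t1.id  -> "id"
   # SELECT AVG(c1) -> "avg(c1)"
   ```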




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org