Posted to issues@spark.apache.org by "Serge Rielau (Jira)" <ji...@apache.org> on 2022/10/17 18:09:00 UTC
[jira] [Created] (SPARK-40822) Use stable derived-column-alias algorithm, suitable for CREATE VIEW

Serge Rielau created SPARK-40822:
------------------------------------

             Summary: Use stable derived-column-alias algorithm, suitable for CREATE VIEW 
                 Key: SPARK-40822
                 URL: https://issues.apache.org/jira/browse/SPARK-40822
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 3.4.0
            Reporter: Serge Rielau


Spark has the ability to derive column aliases for expressions if no alias was provided by the user.
E.g.
CREATE TABLE T(c1 INT, c2 INT);
SELECT c1, `(c1 + 1)`, c3 FROM (SELECT c1, c1 + 1, c1 * c2 AS c3 FROM T);

This is a valuable feature. However, the current implementation works by pretty printing the expression from the logical plan. This has multiple downsides:
 * The derived names can be unintuitive, for example the brackets in `(c1 + 1)`, or outright ugly, such as:
SELECT `substr(hello, 1, 2147483647)` FROM (SELECT substr('hello', 1)) AS T;
 * We cannot guarantee stability across versions since the logical plan of an expression may change.

The latter is a major reason why we cannot allow CREATE VIEW without a column list except in "trivial" cases.

CREATE VIEW v AS SELECT c1, c1 + 1, c1 * c2 AS c3 FROM T;
Not allowed to create a permanent view `spark_catalog`.`default`.`v` without explicitly assigning an alias for expression (c1 + 1).

There are two ways we can go about fixing this:
 # Stop deriving column aliases from the expression. Instead, generate unique names such as `_col_1` based on their position in the select list. This is ugly and takes away the "nice" headers on result sets.
 # Move the derivation of the name upstream. That is, instead of pretty printing the logical plan, we pretty print the lexer output, or a sanitized version of the expression as typed.
The statement as typed is stable by definition. The lexer is stable because it has no reason to change. And if it ever did, we would have a better chance to manage the change.
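
For comparison, the first option could look roughly like the sketch below. It is only an illustration: the function name `positional_aliases` and the (expression, alias) input shape are hypothetical, and it assumes explicitly aliased items keep their alias while everything else is named after its 1-based position in the select list.

```python
def positional_aliases(select_list):
    """Name unaliased select-list items by position (hypothetical sketch).

    select_list: list of (expression_text, explicit_alias_or_None) pairs.
    """
    return [
        alias if alias is not None else f"_col_{position}"
        for position, (_expr, alias) in enumerate(select_list, start=1)
    ]


# SELECT c1 AS c1, c1 + 1, c1 * c2 AS c3 FROM T
print(positional_aliases([("c1", "c1"), ("c1 + 1", None), ("c1 * c2", "c3")]))
# prints ['c1', '_col_2', 'c3']
```

Names of this kind are stable, but they say nothing about the expression, which is the drawback noted for this option.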

In this feature we propose the following semantics:
 # If the column alias can be trivially derived (some of these can stack), do so:
 ** A (qualified) column reference => the unqualified column identifier
cat.sch.tab.col => col
 ** A field reference => the fieldname
struct.field1.field2 => field2
 ** A cast(column AS type) => column
cast(col1 AS INT) => col1
 ** A map lookup with literal key => keyname
map.key => key
map['key'] => key
 ** A parameterless function => unqualified function name
current_schema() => current_schema
 # Take the lexer tokens of the expression, eliminate comments, and append them.
foo(tab1.c1 + /* this is a plus*/
1) => `foo(tab1.c1+1)`
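
The two rules above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not Spark's implementation: the regex tokenizer is a hypothetical stand-in for the real lexer, the helper names (`tokenize`, `derive_alias`, and so on) are invented for this example, the cast rule is applied only when its operand is itself trivially derivable, and tokens are appended without any separator, as in the `foo(tab1.c1+1)` example.

```python
import re

# Hypothetical, simplified tokenizer; the real Spark lexer is far richer.
TOKEN_RE = re.compile(r"""
    (?P<comment>/\*.*?\*/|--[^\n]*)
  | (?P<ws>\s+)
  | (?P<string>'(?:[^']|'')*')
  | (?P<ident>`[^`]+`|[A-Za-z_][A-Za-z_0-9]*)
  | (?P<number>\d+(?:\.\d+)?)
  | (?P<op><=|>=|<>|!=|\|\||[-+*/%(),.\[\]<>=])
""", re.VERBOSE | re.DOTALL)


def tokenize(expr):
    """Return (kind, text) tokens, dropping comments and whitespace."""
    tokens, pos = [], 0
    while pos < len(expr):
        m = TOKEN_RE.match(expr, pos)
        if m is None:
            raise ValueError(f"unexpected character {expr[pos]!r} at {pos}")
        if m.lastgroup not in ("comment", "ws"):
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


def _is_name_chain(tokens):
    """True for ident(.ident)*, e.g. cat.sch.tab.col or struct.f1.f2."""
    if len(tokens) % 2 == 0:
        return False
    return all(
        kind == "ident" if i % 2 == 0 else text == "."
        for i, (kind, text) in enumerate(tokens)
    )


def _unquote(text):
    """Strip surrounding backquotes or single quotes, if present."""
    return text[1:-1] if text[0] in "`'" else text


def _paren_closes_at_end(tokens, open_idx):
    """True if the '(' at open_idx is matched by the very last token."""
    depth = 0
    for i in range(open_idx, len(tokens)):
        text = tokens[i][1]
        if text == "(":
            depth += 1
        elif text == ")":
            depth -= 1
            if depth == 0:
                return i == len(tokens) - 1
    return False


def _trivial(tokens):
    """Rule 1: return the trivially derived alias, or None if none applies."""
    texts = [text for _, text in tokens]
    # Column reference, field reference, or map.key: keep the last name part.
    if _is_name_chain(tokens):
        return _unquote(texts[-1])
    # map['key'] with a literal string key.
    if (len(tokens) >= 4 and texts[-3] == "[" and tokens[-2][0] == "string"
            and texts[-1] == "]" and _is_name_chain(tokens[:-3])):
        return _unquote(texts[-2])
    # Parameterless function call: unqualified function name.
    if (len(tokens) >= 3 and texts[-2:] == ["(", ")"]
            and _is_name_chain(tokens[:-2])):
        return _unquote(texts[-3])
    # cast(<operand> AS <type>): stack the rules on the operand.
    if (len(tokens) >= 6 and texts[0].lower() == "cast" and texts[1] == "("
            and _paren_closes_at_end(tokens, 1)):
        for i in range(3, len(tokens) - 1):
            if tokens[i][0] == "ident" and texts[i].lower() == "as":
                return _trivial(tokens[2:i])
    return None


def derive_alias(expr):
    """Rule 1 if it applies; otherwise rule 2: append the lexer tokens."""
    tokens = tokenize(expr)
    alias = _trivial(tokens)
    return alias if alias is not None else "".join(t for _, t in tokens)


print(derive_alias("cast(tab.col1 AS INT)"))   # prints col1
print(derive_alias("foo(tab1.c1 + /* this is a plus*/\n1)"))   # prints foo(tab1.c1+1)
```

Appending tokens with no separator runs keywords together with their neighbours whenever rule 2 sees them (for instance `a AS b` inside a larger expression), which is one reason the formatting details remain open.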

 

Of course we want this change under a config.
If the config is set, we can allow CREATE VIEW to exploit this and use the derived expressions.

PS: The exact mechanics of formatting the name are very much debatable (e.g. spaces between tokens, squeezing out comments, upper casing, preserving quotes or double quotes...).


