You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "MaxGekk (via GitHub)" <gi...@apache.org> on 2023/02/22 18:13:11 UTC

[GitHub] [spark] MaxGekk opened a new pull request, #40126: [WIP][SPARK-40822][SQL] Stable derived column aliases

MaxGekk opened a new pull request, #40126:
URL: https://github.com/apache/spark/pull/40126

   ### What changes were proposed in this pull request?
   In the PR, I propose to change auto-generation of column aliases (the case when an user doesn't assign any alias explicitly). Before the changes, Spark SQL generates such alias from `Expression` but this PR proposes to take the parse tree (output of lexer), and generate an alias using the term tokens from the tree.
   
   New helper function `ParserUtils.toExprAlias` takes a `ParseTree` from `Antlr4`, and converts it to a `String` using following simple rules:
   1. Adds a gap after every terminal node (`TerminalNodeImpl`) except of `(<[.`
   2. Removes a gap before `(), <>, []` and `,.` 
   
   For example, the sequence of tokens "(", "columnA", "+", "1", ")" is converted to the alias "(columnA + 1)"
   
   Closes #39332 
   
   ### Why are the changes needed?
   To improve user experience with Spark SQL. It is always best practice to name the result of any expressions in a queries select list,  if one plans to reference them later. This yields the most readable results and stability. However, sometimes queries are generated or we’re just lazy and trust in the auto generated names. The problem is that the auto-generated names are produced by pretty printing the expression tree which is, while “generally” readable, not meant to be stable across long durations of time. For example:
   ```sql
   spark-sql> DESC SELECT substring('hello', 5);
   substring(hello, 5, 2147483647)	string
   ```
   the auto-generated column alias `substring(hello, 5, 2147483647)` contains not-obvious elements.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes.
   
   ### How was this patch tested?
   By existing test suites.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #40126:
URL: https://github.com/apache/spark/pull/40126#discussion_r1119544380


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ResolveAliasesSuite.scala:
##########
@@ -88,4 +94,46 @@ class ResolveAliasesSuite extends AnalysisTest {
     checkAliasName(t1.select(DateSub(Literal(Date.valueOf("2021-01-18")), Literal(null))),
       "date_sub(DATE '2021-01-18', NULL)")
   }
+
+  test("SPARK-40822: Stable derived column aliases") {
+    withSQLConf(SQLConf.STABLE_DERIVED_COLUMN_ALIAS_ENABLED.key -> "true") {
+      Seq(
+        // Literals
+        "' 1'" -> "' 1'",
+        """"abc"""" -> """"abc"""",

Review Comment:
   It makes me think if we should always use the standard SQL syntax to generate the alias name, instead of using the original SQL string. The string literal accepts two quote symbols and seems better to normalize it when generating alias names. cc @srielau 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #40126:
URL: https://github.com/apache/spark/pull/40126#discussion_r1119544894


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ResolveAliasesSuite.scala:
##########
@@ -88,4 +94,46 @@ class ResolveAliasesSuite extends AnalysisTest {
     checkAliasName(t1.select(DateSub(Literal(Date.valueOf("2021-01-18")), Literal(null))),
       "date_sub(DATE '2021-01-18', NULL)")
   }
+
+  test("SPARK-40822: Stable derived column aliases") {
+    withSQLConf(SQLConf.STABLE_DERIVED_COLUMN_ALIAS_ENABLED.key -> "true") {
+      Seq(
+        // Literals
+        "' 1'" -> "' 1'",
+        """"abc"""" -> """"abc"""",
+        """'\t\n xyz \t\r'""" -> """'\t\n xyz \t\r'""",
+        "1L" -> "1L", "1S" -> "1S",
+        "date'-0001-1-28'" -> "date'-0001-1-28'",
+        "interval 3 year 1 month" -> "interval 3 year 1 month",
+        "x'00'" -> "x'00'",
+        // Preserve case
+        "CAST(1 as tinyint)" -> "CAST(1 as tinyint)",
+        // Brackets
+        "getbit(11L, 2 + 1)" -> "getbit(11L,2+1)",
+        "string(int(shiftleft(int(-1), 31))+1)" -> "string(int(shiftleft(int(-1),31))+1)",
+        "map(1, 'a') [ 5 ]" -> "map(1,'a')[5]",
+        // Preserve type
+        "CAST('123.a' AS long)" -> "CAST('123.a'AS long)",

Review Comment:
   this one is not very ideal



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #40126:
URL: https://github.com/apache/spark/pull/40126#discussion_r1138270078


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/ParserUtils.scala:
##########
@@ -259,4 +260,48 @@ object ParserUtils {
       }
     }
   }
+
+  /**
+   * Converts the given parse tree to an alias by

Review Comment:
   `Normalizes the expression parser tree to a SQL string which will be used to generate expression alias`



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/ParserUtils.scala:
##########
@@ -259,4 +260,48 @@ object ParserUtils {
       }
     }
   }
+
+  /**
+   * Converts the given parse tree to an alias by

Review Comment:
   `Normalizes the expression parser tree to a SQL string which will be used to generate the expression alias`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #40126:
URL: https://github.com/apache/spark/pull/40126#discussion_r1119545189


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ResolveAliasesSuite.scala:
##########
@@ -88,4 +94,46 @@ class ResolveAliasesSuite extends AnalysisTest {
     checkAliasName(t1.select(DateSub(Literal(Date.valueOf("2021-01-18")), Literal(null))),
       "date_sub(DATE '2021-01-18', NULL)")
   }
+
+  test("SPARK-40822: Stable derived column aliases") {
+    withSQLConf(SQLConf.STABLE_DERIVED_COLUMN_ALIAS_ENABLED.key -> "true") {
+      Seq(
+        // Literals
+        "' 1'" -> "' 1'",
+        """"abc"""" -> """"abc"""",
+        """'\t\n xyz \t\r'""" -> """'\t\n xyz \t\r'""",
+        "1L" -> "1L", "1S" -> "1S",
+        "date'-0001-1-28'" -> "date'-0001-1-28'",
+        "interval 3 year 1 month" -> "interval 3 year 1 month",
+        "x'00'" -> "x'00'",
+        // Preserve case
+        "CAST(1 as tinyint)" -> "CAST(1 as tinyint)",
+        // Brackets
+        "getbit(11L, 2 + 1)" -> "getbit(11L,2+1)",
+        "string(int(shiftleft(int(-1), 31))+1)" -> "string(int(shiftleft(int(-1),31))+1)",
+        "map(1, 'a') [ 5 ]" -> "map(1,'a')[5]",
+        // Preserve type
+        "CAST('123.a' AS long)" -> "CAST('123.a'AS long)",
+        // Spaces
+        "'1' = 1" -> "'1'=1",
+        "upper('a') = upper('A')" -> "upper('a')=upper('A')",
+        "FLOOR(5, 0)" -> "FLOOR(5,0)",
+        "-1" -> "-1",
+        "1 in (1.0)" -> "1 in(1.0)",

Review Comment:
   ditto



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a diff in pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #40126:
URL: https://github.com/apache/spark/pull/40126#discussion_r1142240653


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -488,16 +477,13 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
             case e if !e.resolved => u
             case g: Generator => MultiAlias(g, Nil)
             case c @ Cast(ne: NamedExpression, _, _, _) => Alias(c, ne.name)()
-            case e: ExtractValue =>
-              if (extractOnly(e)) {
-                Alias(e, toPrettySQL(e))()
-              } else {
-                Alias(e, toPrettySQL(e))(explicitMetadata = Some(metaForAutoGeneratedAlias))
-              }
             case e if optGenAliasFunc.isDefined =>
               Alias(child, optGenAliasFunc.get.apply(e))()
             case l: Literal => Alias(l, toPrettySQL(l))()
             case e =>
+              val metaForAutoGeneratedAlias = new MetadataBuilder()
+                .putString("__autoGeneratedAlias", "true")
+                .build()

Review Comment:
   done. Please, double check.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a diff in pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #40126:
URL: https://github.com/apache/spark/pull/40126#discussion_r1129047990


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -465,7 +465,20 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
   }
 
   /**
-   * Replaces [[UnresolvedAlias]]s with concrete aliases.
+   * Replaces [[UnresolvedAlias]]s with concrete aliases by applying the following rules:
+   *   1. Use the specified name of named expressions;
+   *   2. Derive a stable alias from the original normalized SQL text except of:
+   *     2.1. A multipart identifier (column, field, mapkey, ...) -> the right most identifier.
+   *          For example: a.b.c => c
+   *     2.2. A CAST, or try_cast -> the argument of the cast.
+   *          For example: CAST(c1 AS INT) => c1
+   *     2.3. A map lookup with a literal -> the map key.
+   *          For example: map[5] => 5
+   *     2.4. Squeezing out unwanted whitespace or comments.
+   *          For example: T.c1 + /* test */ foo( 5 ) => T.c1+foo(5)
+   *     2.5. Normalize SQL string literals when ANSI mode is enabled and the SQL config
+   *          `spark.sql.ansi.doubleQuotedIdentifiers` is set to `true`.
+   *          For example: "abc" => 'abc'

Review Comment:
   Updated the comment.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #40126:
URL: https://github.com/apache/spark/pull/40126#discussion_r1141983844


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -488,16 +477,13 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
             case e if !e.resolved => u
             case g: Generator => MultiAlias(g, Nil)
             case c @ Cast(ne: NamedExpression, _, _, _) => Alias(c, ne.name)()
-            case e: ExtractValue =>
-              if (extractOnly(e)) {
-                Alias(e, toPrettySQL(e))()
-              } else {
-                Alias(e, toPrettySQL(e))(explicitMetadata = Some(metaForAutoGeneratedAlias))
-              }
             case e if optGenAliasFunc.isDefined =>
               Alias(child, optGenAliasFunc.get.apply(e))()
             case l: Literal => Alias(l, toPrettySQL(l))()
             case e =>
+              val metaForAutoGeneratedAlias = new MetadataBuilder()
+                .putString("__autoGeneratedAlias", "true")
+                .build()

Review Comment:
   To be safe and not change DataFrame behavior, can we not add this metadata if the expression is extractOnly?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #40126:
URL: https://github.com/apache/spark/pull/40126#issuecomment-1460079053

   @cloud-fan @srielau Could you take a look at the PR, please.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #40126:
URL: https://github.com/apache/spark/pull/40126#discussion_r1129383356


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -490,15 +510,17 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
             case c @ Cast(ne: NamedExpression, _, _, _) => Alias(c, ne.name)()
             case e: ExtractValue =>
               if (extractOnly(e)) {
-                Alias(e, toPrettySQL(e))()
+                Alias(e, getAlias(e, optGenAliasFunc))()
               } else {
-                Alias(e, toPrettySQL(e))(explicitMetadata = Some(metaForAutoGeneratedAlias))
+                Alias(e, getAlias(e, optGenAliasFunc))(
+                  explicitMetadata = Some(metaForAutoGeneratedAlias))
               }
             case e if optGenAliasFunc.isDefined =>
               Alias(child, optGenAliasFunc.get.apply(e))()
-            case l: Literal => Alias(l, toPrettySQL(l))()
+            case l: Literal => Alias(l, getAlias(l, optGenAliasFunc))()
             case e =>
-              Alias(e, toPrettySQL(e))(explicitMetadata = Some(metaForAutoGeneratedAlias))

Review Comment:
   we don't need to change this line, because the `case e if optGenAliasFunc.isDefined =>` will be hit first, and `optGenAliasFunc` must be none when we reach here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #40126:
URL: https://github.com/apache/spark/pull/40126#discussion_r1129382014


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -490,15 +510,17 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
             case c @ Cast(ne: NamedExpression, _, _, _) => Alias(c, ne.name)()
             case e: ExtractValue =>
               if (extractOnly(e)) {
-                Alias(e, toPrettySQL(e))()
+                Alias(e, getAlias(e, optGenAliasFunc))()
               } else {
-                Alias(e, toPrettySQL(e))(explicitMetadata = Some(metaForAutoGeneratedAlias))
+                Alias(e, getAlias(e, optGenAliasFunc))(
+                  explicitMetadata = Some(metaForAutoGeneratedAlias))
               }
             case e if optGenAliasFunc.isDefined =>
               Alias(child, optGenAliasFunc.get.apply(e))()
-            case l: Literal => Alias(l, toPrettySQL(l))()
+            case l: Literal => Alias(l, getAlias(l, optGenAliasFunc))()

Review Comment:
   is `l.sql` a simpler solution?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a diff in pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #40126:
URL: https://github.com/apache/spark/pull/40126#discussion_r1132294867


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -490,15 +510,17 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
             case c @ Cast(ne: NamedExpression, _, _, _) => Alias(c, ne.name)()
             case e: ExtractValue =>
               if (extractOnly(e)) {
-                Alias(e, toPrettySQL(e))()
+                Alias(e, getAlias(e, optGenAliasFunc))()
               } else {
-                Alias(e, toPrettySQL(e))(explicitMetadata = Some(metaForAutoGeneratedAlias))
+                Alias(e, getAlias(e, optGenAliasFunc))(
+                  explicitMetadata = Some(metaForAutoGeneratedAlias))
               }
             case e if optGenAliasFunc.isDefined =>
               Alias(child, optGenAliasFunc.get.apply(e))()
-            case l: Literal => Alias(l, toPrettySQL(l))()
+            case l: Literal => Alias(l, getAlias(l, optGenAliasFunc))()
             case e =>
-              Alias(e, toPrettySQL(e))(explicitMetadata = Some(metaForAutoGeneratedAlias))

Review Comment:
   Yep, I over-protected myself to catch all cases of `toPrettySQL()`. Let me revert this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #40126:
URL: https://github.com/apache/spark/pull/40126#discussion_r1138274341


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/ParserUtils.scala:
##########
@@ -259,4 +260,48 @@ object ParserUtils {
       }
     }
   }
+
+  /**
+   * Converts the given parse tree to an alias by
+   * - Adding a gap between two terms when both contains only either letters or digits.
+   * - Upper casing keywords and numeric literals.
+   * - Converting double quoted string literals to single quoted.

Review Comment:
   this make the function unstable as it relies on a config, let's drop as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a diff in pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #40126:
URL: https://github.com/apache/spark/pull/40126#discussion_r1119846624


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala:
##########
@@ -669,17 +669,26 @@ class AstBuilder extends SqlBaseParserBaseVisitor[AnyRef] with SQLConfHelper wit
     )
   }
 
+  private def getAliasFunc(alias: => String): Option[Expression => String] = {
+    if (conf.getConf(SQLConf.STABLE_DERIVED_COLUMN_ALIAS_ENABLED)) {
+      Some(_ => alias)
+    } else {
+      None
+    }
+  }
+
   override def visitNamedExpressionSeq(
-      ctx: NamedExpressionSeqContext): Seq[Expression] = {
+      ctx: NamedExpressionSeqContext): Seq[(Expression, Option[Expression => String])] = {
     Option(ctx).toSeq
       .flatMap(_.namedExpression.asScala)
-      .map(typedVisit[Expression])
+      .map(ctx => (typedVisit[Expression](ctx), getAliasFunc(toExprAlias(ctx))))

Review Comment:
   Just in case, this function handle not only named expressions but others too. For example:
   ```sql
   select 1;
   ```
   is handled by `visitNamedExpressionSeq()` but not by `visitExpressionSeq()`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a diff in pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #40126:
URL: https://github.com/apache/spark/pull/40126#discussion_r1119864580


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ResolveAliasesSuite.scala:
##########
@@ -88,4 +94,46 @@ class ResolveAliasesSuite extends AnalysisTest {
     checkAliasName(t1.select(DateSub(Literal(Date.valueOf("2021-01-18")), Literal(null))),
       "date_sub(DATE '2021-01-18', NULL)")
   }
+
+  test("SPARK-40822: Stable derived column aliases") {
+    withSQLConf(SQLConf.STABLE_DERIVED_COLUMN_ALIAS_ENABLED.key -> "true") {
+      Seq(
+        // Literals
+        "' 1'" -> "' 1'",
+        """"abc"""" -> """"abc"""",
+        """'\t\n xyz \t\r'""" -> """'\t\n xyz \t\r'""",
+        "1L" -> "1L", "1S" -> "1S",
+        "date'-0001-1-28'" -> "date'-0001-1-28'",
+        "interval 3 year 1 month" -> "interval 3 year 1 month",
+        "x'00'" -> "x'00'",
+        // Preserve case
+        "CAST(1 as tinyint)" -> "CAST(1 as tinyint)",
+        // Brackets
+        "getbit(11L, 2 + 1)" -> "getbit(11L,2+1)",
+        "string(int(shiftleft(int(-1), 31))+1)" -> "string(int(shiftleft(int(-1),31))+1)",
+        "map(1, 'a') [ 5 ]" -> "map(1,'a')[5]",
+        // Preserve type
+        "CAST('123.a' AS long)" -> "CAST('123.a'AS long)",

Review Comment:
   yep, what would you propose? Detect all string literals, and handle them as all chars terms?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srielau commented on pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

Posted by "srielau (via GitHub)" <gi...@apache.org>.
srielau commented on PR #40126:
URL: https://github.com/apache/spark/pull/40126#issuecomment-1460318660

   I have lost track of the externally described rules. Do you have an updated summary?
   Especially since we recently discussed about what to do with whitespace
   Also, do we plan to leave this OFF by default? If we don't make this default it will never get adoption because most folk just won't care. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a diff in pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #40126:
URL: https://github.com/apache/spark/pull/40126#discussion_r1151062611


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ResolveAliasesSuite.scala:
##########
@@ -88,4 +94,46 @@ class ResolveAliasesSuite extends AnalysisTest {
     checkAliasName(t1.select(DateSub(Literal(Date.valueOf("2021-01-18")), Literal(null))),
       "date_sub(DATE '2021-01-18', NULL)")
   }
+
+  test("SPARK-40822: Stable derived column aliases") {
+    withSQLConf(SQLConf.STABLE_DERIVED_COLUMN_ALIAS_ENABLED.key -> "true") {
+      Seq(
+        // Literals
+        "' 1'" -> "' 1'",
+        """"abc"""" -> """"abc"""",
+        """'\t\n xyz \t\r'""" -> """'\t\n xyz \t\r'""",
+        "1l" -> "1L", "1S" -> "1S",
+        "date'-0001-1-28'" -> "date'-0001-1-28'",
+        "interval 3 year 1 month" -> "INTERVAL3YEAR1MONTH",
+        "x'00'" -> "x'00'",
+        // Preserve case
+        "CAST(1 as tinyint)" -> "CAST(1AStinyint)",

Review Comment:
   This PR https://github.com/apache/spark/pull/40565 does this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #40126:
URL: https://github.com/apache/spark/pull/40126#issuecomment-1452182391

   @srielau Could you take a look at the PR one more time, please.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a diff in pull request #40126: [WIP][SPARK-40822][SQL] Stable derived column aliases

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #40126:
URL: https://github.com/apache/spark/pull/40126#discussion_r1119115819


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ResolveAliasesSuite.scala:
##########
@@ -88,4 +94,46 @@ class ResolveAliasesSuite extends AnalysisTest {
     checkAliasName(t1.select(DateSub(Literal(Date.valueOf("2021-01-18")), Literal(null))),
       "date_sub(DATE '2021-01-18', NULL)")
   }
+
+  test("SPARK-40822: Stable derived column aliases") {

Review Comment:
   @cloud-fan @srielau I have extracted distinguish cases from the PR https://github.com/apache/spark/pull/39332 . If I see some missed cases, please, let me know. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a diff in pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #40126:
URL: https://github.com/apache/spark/pull/40126#discussion_r1121440514


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ResolveAliasesSuite.scala:
##########
@@ -88,4 +94,46 @@ class ResolveAliasesSuite extends AnalysisTest {
     checkAliasName(t1.select(DateSub(Literal(Date.valueOf("2021-01-18")), Literal(null))),
       "date_sub(DATE '2021-01-18', NULL)")
   }
+
+  test("SPARK-40822: Stable derived column aliases") {
+    withSQLConf(SQLConf.STABLE_DERIVED_COLUMN_ALIAS_ENABLED.key -> "true") {
+      Seq(
+        // Literals
+        "' 1'" -> "' 1'",
+        """"abc"""" -> """"abc"""",

Review Comment:
   How should I distinguish double quoted string literals from double quoted identifiers on this level? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #40126:
URL: https://github.com/apache/spark/pull/40126#discussion_r1129384626


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -490,15 +510,17 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
             case c @ Cast(ne: NamedExpression, _, _, _) => Alias(c, ne.name)()
             case e: ExtractValue =>

Review Comment:
   to match multipart identifier, we need to match a chain of `GetStructField` or `GetArrayStructField`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #40126:
URL: https://github.com/apache/spark/pull/40126#discussion_r1119542887


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -490,15 +494,17 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
             case c @ Cast(ne: NamedExpression, _, _, _) => Alias(c, ne.name)()
             case e: ExtractValue =>
               if (extractOnly(e)) {
-                Alias(e, toPrettySQL(e))()
+                Alias(e, getAlias(e, optGenAliasFunc))()

Review Comment:
   can we check where we use `metaForAutoGeneratedAlias` and see if we should add it here as well?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #40126:
URL: https://github.com/apache/spark/pull/40126#discussion_r1119543585


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala:
##########
@@ -669,17 +669,26 @@ class AstBuilder extends SqlBaseParserBaseVisitor[AnyRef] with SQLConfHelper wit
     )
   }
 
+  private def getAliasFunc(alias: => String): Option[Expression => String] = {
+    if (conf.getConf(SQLConf.STABLE_DERIVED_COLUMN_ALIAS_ENABLED)) {
+      Some(_ => alias)
+    } else {
+      None
+    }
+  }
+
   override def visitNamedExpressionSeq(
-      ctx: NamedExpressionSeqContext): Seq[Expression] = {
+      ctx: NamedExpressionSeqContext): Seq[(Expression, Option[Expression => String])] = {
     Option(ctx).toSeq
       .flatMap(_.namedExpression.asScala)
-      .map(typedVisit[Expression])
+      .map(ctx => (typedVisit[Expression](ctx), getAliasFunc(toExprAlias(ctx))))

Review Comment:
   it seems we can always create `UnresolvedAlias` here. `UnresolvedAlias` is a noop if its child is already a named expression.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a diff in pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #40126:
URL: https://github.com/apache/spark/pull/40126#discussion_r1119866910


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ResolveAliasesSuite.scala:
##########
@@ -88,4 +94,46 @@ class ResolveAliasesSuite extends AnalysisTest {
     checkAliasName(t1.select(DateSub(Literal(Date.valueOf("2021-01-18")), Literal(null))),
       "date_sub(DATE '2021-01-18', NULL)")
   }
+
+  test("SPARK-40822: Stable derived column aliases") {
+    withSQLConf(SQLConf.STABLE_DERIVED_COLUMN_ALIAS_ENABLED.key -> "true") {
+      Seq(
+        // Literals
+        "' 1'" -> "' 1'",
+        """"abc"""" -> """"abc"""",
+        """'\t\n xyz \t\r'""" -> """'\t\n xyz \t\r'""",
+        "1L" -> "1L", "1S" -> "1S",
+        "date'-0001-1-28'" -> "date'-0001-1-28'",
+        "interval 3 year 1 month" -> "interval 3 year 1 month",
+        "x'00'" -> "x'00'",
+        // Preserve case
+        "CAST(1 as tinyint)" -> "CAST(1 as tinyint)",
+        // Brackets
+        "getbit(11L, 2 + 1)" -> "getbit(11L,2+1)",
+        "string(int(shiftleft(int(-1), 31))+1)" -> "string(int(shiftleft(int(-1),31))+1)",
+        "map(1, 'a') [ 5 ]" -> "map(1,'a')[5]",
+        // Preserve type
+        "CAST('123.a' AS long)" -> "CAST('123.a'AS long)",
+        // Spaces
+        "'1' = 1" -> "'1'=1",
+        "upper('a') = upper('A')" -> "upper('a')=upper('A')",
+        "FLOOR(5, 0)" -> "FLOOR(5,0)",
+        "-1" -> "-1",
+        "1 in (1.0)" -> "1 in(1.0)",

Review Comment:
   > ditto
   
   This is different case from the previous one. How can I distinguish the two cases?
   ```floor(``` and ```in(```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a diff in pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #40126:
URL: https://github.com/apache/spark/pull/40126#discussion_r1123410550


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ResolveAliasesSuite.scala:
##########
@@ -88,4 +94,46 @@ class ResolveAliasesSuite extends AnalysisTest {
     checkAliasName(t1.select(DateSub(Literal(Date.valueOf("2021-01-18")), Literal(null))),
       "date_sub(DATE '2021-01-18', NULL)")
   }
+
+  test("SPARK-40822: Stable derived column aliases") {
+    withSQLConf(SQLConf.STABLE_DERIVED_COLUMN_ALIAS_ENABLED.key -> "true") {
+      Seq(
+        // Literals
+        "' 1'" -> "' 1'",
+        """"abc"""" -> """"abc"""",

Review Comment:
   I have added such conversion.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a diff in pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #40126:
URL: https://github.com/apache/spark/pull/40126#discussion_r1121428443


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -490,15 +494,17 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
             case c @ Cast(ne: NamedExpression, _, _, _) => Alias(c, ne.name)()
             case e: ExtractValue =>
               if (extractOnly(e)) {
-                Alias(e, toPrettySQL(e))()
+                Alias(e, getAlias(e, optGenAliasFunc))()

Review Comment:
   It is used in 2 places, and both places are covered by `getAlias`. Seems not need to add anything. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a diff in pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #40126:
URL: https://github.com/apache/spark/pull/40126#discussion_r1142239907


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ResolveAliasesSuite.scala:
##########
@@ -88,4 +94,46 @@ class ResolveAliasesSuite extends AnalysisTest {
     checkAliasName(t1.select(DateSub(Literal(Date.valueOf("2021-01-18")), Literal(null))),
       "date_sub(DATE '2021-01-18', NULL)")
   }
+
+  test("SPARK-40822: Stable derived column aliases") {
+    withSQLConf(SQLConf.STABLE_DERIVED_COLUMN_ALIAS_ENABLED.key -> "true") {
+      Seq(
+        // Literals
+        "' 1'" -> "' 1'",
+        """"abc"""" -> """"abc"""",
+        """'\t\n xyz \t\r'""" -> """'\t\n xyz \t\r'""",
+        "1l" -> "1L", "1S" -> "1S",
+        "date'-0001-1-28'" -> "date'-0001-1-28'",

Review Comment:
   I would like to convert types and their literal constructors in a separate PR if you don't mind.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #40126:
URL: https://github.com/apache/spark/pull/40126#issuecomment-1460335905

   > Do you have an updated summary?
   
   Yes, it is very easy to check - just look at the last commit.
   
   > Especially since we recently discussed about what to do with whitespace
   
   ? @srielau Please, look at the PR changes first of all.
   
   > Also, do we plan to leave this OFF by default?
   
   Yes, Serge could you look at the PR's description and changes before asking the questions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srielau commented on a diff in pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

Posted by "srielau (via GitHub)" <gi...@apache.org>.
srielau commented on code in PR #40126:
URL: https://github.com/apache/spark/pull/40126#discussion_r1124536818


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -465,7 +465,20 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
   }
 
   /**
-   * Replaces [[UnresolvedAlias]]s with concrete aliases.
+   * Replaces [[UnresolvedAlias]]s with concrete aliases by applying the following rules:
+   *   1. Use the specified name of named expressions;
+   *   2. Derive a stable alias from the original normalized SQL text except of:
+   *     2.1. A multipart identifier (column, field, mapkey, ...) -> the right most identifier.
+   *          For example: a.b.c => c
+   *     2.2. A CAST, or try_cast -> the argument of the cast.
+   *          For example: CAST(c1 AS INT) => c1
+   *     2.3. A map lookup with a literal -> the map key.
+   *          For example: map[5] => 5
+   *     2.4. Squeezing out unwanted whitespace or comments.
+   *          For example: T.c1 + /* test */ foo( 5 ) => T.c1+foo(5)
+   *     2.5. Normalize SQL string literals when ANSI mode is enabled and the SQL config
+   *          `spark.sql.ansi.doubleQuotedIdentifiers` is set to `true`.
+   *          For example: "abc" => 'abc'

Review Comment:
   This description seems incomplete. I see examples below with \t, \r \n in strings. We preserve them? I'm doubtful that's useful).
   Also you seem to have some rules around whitespace.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srielau commented on pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

Posted by "srielau (via GitHub)" <gi...@apache.org>.
srielau commented on PR #40126:
URL: https://github.com/apache/spark/pull/40126#issuecomment-1460908751

   > > Do you have an updated summary?
   > 
   > Yes, it is very easy to check - just look at the last commit.
   > 
   > > Especially since we recently discussed about what to do with whitespace
   > 
   > ? @srielau Please, look at the PR changes first of all.
   > 
   > > Also, do we plan to leave this OFF by default?
   > 
   > Yes, Serge could you look at the PR's description and changes before asking the questions.
   You sum up the root cause for a lot of our QA troubles right there....
   Please assure that the description of the PR is accurate and up to date (or have a pointer to a google-doc that does the same).
   We cannot reciew code against itself. We have to review it against a common understanding on what it is meant to do.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a diff in pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #40126:
URL: https://github.com/apache/spark/pull/40126#discussion_r1132295896


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -490,15 +510,17 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
             case c @ Cast(ne: NamedExpression, _, _, _) => Alias(c, ne.name)()
             case e: ExtractValue =>
               if (extractOnly(e)) {
-                Alias(e, toPrettySQL(e))()
+                Alias(e, getAlias(e, optGenAliasFunc))()
               } else {
-                Alias(e, toPrettySQL(e))(explicitMetadata = Some(metaForAutoGeneratedAlias))
+                Alias(e, getAlias(e, optGenAliasFunc))(
+                  explicitMetadata = Some(metaForAutoGeneratedAlias))
               }
             case e if optGenAliasFunc.isDefined =>
               Alias(child, optGenAliasFunc.get.apply(e))()
-            case l: Literal => Alias(l, toPrettySQL(l))()
+            case l: Literal => Alias(l, getAlias(l, optGenAliasFunc))()

Review Comment:
   The case above should catch the cases of auto-generated stable aliases. This line can be reverted back.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #40126:
URL: https://github.com/apache/spark/pull/40126#discussion_r1141990071


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ResolveAliasesSuite.scala:
##########
@@ -88,4 +94,46 @@ class ResolveAliasesSuite extends AnalysisTest {
     checkAliasName(t1.select(DateSub(Literal(Date.valueOf("2021-01-18")), Literal(null))),
       "date_sub(DATE '2021-01-18', NULL)")
   }
+
+  test("SPARK-40822: Stable derived column aliases") {
+    withSQLConf(SQLConf.STABLE_DERIVED_COLUMN_ALIAS_ENABLED.key -> "true") {
+      Seq(
+        // Literals
+        "' 1'" -> "' 1'",
+        """"abc"""" -> """"abc"""",
+        """'\t\n xyz \t\r'""" -> """'\t\n xyz \t\r'""",
+        "1l" -> "1L", "1S" -> "1S",
+        "date'-0001-1-28'" -> "date'-0001-1-28'",

Review Comment:
   we should make `date` a keyword. Are we going to do it in a followup?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #40126:
URL: https://github.com/apache/spark/pull/40126#discussion_r1141984097


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -2704,7 +2690,7 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
     }
 
     private def trimAlias(expr: NamedExpression): Expression = expr match {
-      case UnresolvedAlias(child, _) => child
+      case ua: UnresolvedAlias => ua.child

Review Comment:
   unnecessary change



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a diff in pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #40126:
URL: https://github.com/apache/spark/pull/40126#discussion_r1153316438


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ResolveAliasesSuite.scala:
##########
@@ -88,4 +94,46 @@ class ResolveAliasesSuite extends AnalysisTest {
     checkAliasName(t1.select(DateSub(Literal(Date.valueOf("2021-01-18")), Literal(null))),
       "date_sub(DATE '2021-01-18', NULL)")
   }
+
+  test("SPARK-40822: Stable derived column aliases") {
+    withSQLConf(SQLConf.STABLE_DERIVED_COLUMN_ALIAS_ENABLED.key -> "true") {
+      Seq(
+        // Literals
+        "' 1'" -> "' 1'",
+        """"abc"""" -> """"abc"""",
+        """'\t\n xyz \t\r'""" -> """'\t\n xyz \t\r'""",
+        "1l" -> "1L", "1S" -> "1S",
+        "date'-0001-1-28'" -> "date'-0001-1-28'",

Review Comment:
   This PR https://github.com/apache/spark/pull/40593 does the work.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] srielau commented on a diff in pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

Posted by "srielau (via GitHub)" <gi...@apache.org>.
srielau commented on code in PR #40126:
URL: https://github.com/apache/spark/pull/40126#discussion_r1120535170


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ResolveAliasesSuite.scala:
##########
@@ -88,4 +94,46 @@ class ResolveAliasesSuite extends AnalysisTest {
     checkAliasName(t1.select(DateSub(Literal(Date.valueOf("2021-01-18")), Literal(null))),
       "date_sub(DATE '2021-01-18', NULL)")
   }
+
+  test("SPARK-40822: Stable derived column aliases") {
+    withSQLConf(SQLConf.STABLE_DERIVED_COLUMN_ALIAS_ENABLED.key -> "true") {
+      Seq(
+        // Literals
+        "' 1'" -> "' 1'",
+        """"abc"""" -> """"abc"""",

Review Comment:
   You mean convert " to '  for String literals? +1
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #40126:
URL: https://github.com/apache/spark/pull/40126#discussion_r1129378877


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -465,7 +465,23 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
   }
 
   /**
-   * Replaces [[UnresolvedAlias]]s with concrete aliases.
+   * Replaces [[UnresolvedAlias]]s with concrete aliases by applying the following rules:
+   *   1. Use the specified name of named expressions;
+   *   2. Derive stable aliases from the lexer tree of the original SQL text by concatenating of
+   *      terms via a single space, and applying the additional rules:
+   *     2.0. Don't add a space between terms if at least one of them contains a char which is not
+   *          a letter and a digit.
+   *     2.1. Normalize SQL string literals when ANSI mode is enabled and the SQL config
+   *          `spark.sql.ansi.doubleQuotedIdentifiers` is set to `true`.
+   *          For example: "abc" => 'abc'
+   *     2.2. A multipart identifier (column, field, mapkey, ...) -> the right most identifier.
+   *          For example: a.b.c => c
+   *     2.3. A CAST, or try_cast -> the argument of the cast.
+   *          For example: CAST(c1 AS INT) => c1
+   *     2.4. A map lookup with a literal -> the map key.
+   *          For example: map[5] => 5

Review Comment:
   is this really what we want? what was the previous behavior?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #40126:
URL: https://github.com/apache/spark/pull/40126#discussion_r1138273636


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/ParserUtils.scala:
##########
@@ -259,4 +260,48 @@ object ParserUtils {
       }
     }
   }
+
+  /**
+   * Converts the given parse tree to an alias by
+   * - Adding a gap between two terms when both contains only either letters or digits.

Review Comment:
   let's drop this as it can go very complicated.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a diff in pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #40126:
URL: https://github.com/apache/spark/pull/40126#discussion_r1140503003


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ResolveAliasesSuite.scala:
##########
@@ -88,4 +94,46 @@ class ResolveAliasesSuite extends AnalysisTest {
     checkAliasName(t1.select(DateSub(Literal(Date.valueOf("2021-01-18")), Literal(null))),
       "date_sub(DATE '2021-01-18', NULL)")
   }
+
+  test("SPARK-40822: Stable derived column aliases") {
+    withSQLConf(SQLConf.STABLE_DERIVED_COLUMN_ALIAS_ENABLED.key -> "true") {
+      Seq(
+        // Literals
+        "' 1'" -> "' 1'",
+        """"abc"""" -> """"abc"""",
+        """'\t\n xyz \t\r'""" -> """'\t\n xyz \t\r'""",
+        "1l" -> "1L", "1S" -> "1S",
+        "date'-0001-1-28'" -> "date'-0001-1-28'",
+        "interval 3 year 1 month" -> "INTERVAL3YEAR1MONTH",

Review Comment:
   @cloud-fan As we discussed offline, I removed gaps between terms.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #40126:
URL: https://github.com/apache/spark/pull/40126#discussion_r1141990655


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ResolveAliasesSuite.scala:
##########
@@ -88,4 +94,46 @@ class ResolveAliasesSuite extends AnalysisTest {
     checkAliasName(t1.select(DateSub(Literal(Date.valueOf("2021-01-18")), Literal(null))),
       "date_sub(DATE '2021-01-18', NULL)")
   }
+
+  test("SPARK-40822: Stable derived column aliases") {
+    withSQLConf(SQLConf.STABLE_DERIVED_COLUMN_ALIAS_ENABLED.key -> "true") {
+      Seq(
+        // Literals
+        "' 1'" -> "' 1'",
+        """"abc"""" -> """"abc"""",
+        """'\t\n xyz \t\r'""" -> """'\t\n xyz \t\r'""",
+        "1l" -> "1L", "1S" -> "1S",
+        "date'-0001-1-28'" -> "date'-0001-1-28'",
+        "interval 3 year 1 month" -> "INTERVAL3YEAR1MONTH",
+        "x'00'" -> "x'00'",
+        // Preserve case
+        "CAST(1 as tinyint)" -> "CAST(1AStinyint)",

Review Comment:
   ditto, all types should be keyword.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a diff in pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #40126:
URL: https://github.com/apache/spark/pull/40126#discussion_r1142389921


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/ResolveAliasesSuite.scala:
##########
@@ -88,4 +94,46 @@ class ResolveAliasesSuite extends AnalysisTest {
     checkAliasName(t1.select(DateSub(Literal(Date.valueOf("2021-01-18")), Literal(null))),
       "date_sub(DATE '2021-01-18', NULL)")
   }
+
+  test("SPARK-40822: Stable derived column aliases") {
+    withSQLConf(SQLConf.STABLE_DERIVED_COLUMN_ALIAS_ENABLED.key -> "true") {
+      Seq(
+        // Literals
+        "' 1'" -> "' 1'",
+        """"abc"""" -> """"abc"""",
+        """'\t\n xyz \t\r'""" -> """'\t\n xyz \t\r'""",
+        "1l" -> "1L", "1S" -> "1S",
+        "date'-0001-1-28'" -> "date'-0001-1-28'",

Review Comment:
   I opened the JIRA https://issues.apache.org/jira/browse/SPARK-42873



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #40126:
URL: https://github.com/apache/spark/pull/40126#issuecomment-1477333100

   Merging to master. Thank you, @srielau and @cloud-fan for review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk closed pull request #40126: [SPARK-40822][SQL] Stable derived column aliases

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk closed pull request #40126: [SPARK-40822][SQL] Stable derived column aliases
URL: https://github.com/apache/spark/pull/40126


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org