Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/09/16 08:57:23 UTC

[GitHub] [spark] EnricoMi commented on pull request #37407: [SPARK-39876][SQL] Add UNPIVOT to SQL syntax

EnricoMi commented on PR #37407:
URL: https://github.com/apache/spark/pull/37407#issuecomment-1249106578

   @cloud-fan I have introduced the expression `UnpivotExpr` to replace the `(Seq[NamedExpression], Option[String])` tuple, which makes the code more readable.
   
   However, this changes behaviour and deviates from how projection behaves:
   
   ```scala
   spark.range(5).select(struct($"id").as("an")).select($"an.id").show()
   ```
   
   "an.id" gets alias "id":
   ```
   +---+
   | id|
   +---+
   |  0|
   |  1|
   |  2|
   |  3|
   |  4|
   +---+
   ```
   
   ```
   Project(UnresolvedAttribute("an.id"), plan)
     --> ResolveReferences rule -->
   Project(Alias(GetStructField(an#2.id), "id"), plan)
   ```
   
   ```scala
   spark.range(5).select(struct($"id").as("an")).unpivot(Array($"an.id"), Array($"an.id"), "col", "val").show()
   ```
   
   Before introducing `UnpivotExpr`, both ids and values get alias "id" (as in select / `Project`):
   ```
   +---+---+---+
   | id|col|val|
   +---+---+---+
   |  0| id|  0|
   |  1| id|  1|
   |  2| id|  2|
   |  3| id|  3|
   |  4| id|  4|
   +---+---+---+
   ```
   
   After introducing `UnpivotExpr`, id "an.id" gets alias "id", but value "an.id" does not get an alias and hence gets name "an.id":
   ```
   +---+-----+---+
   | id|  col|val|
   +---+-----+---+
   |  0|an.id|  0|
   |  1|an.id|  1|
   |  2|an.id|  2|
   |  3|an.id|  3|
   |  4|an.id|  4|
   +---+-----+---+
   ```
   
   Now that `UnpivotExpr` is the top-level expression, the inner `UnresolvedAttribute` / `GetStructField` no longer gets an alias:
   
   ```
   Unpivot(Seq(UnresolvedAttribute("an.id")), Seq(UnpivotExpr(Seq(UnresolvedAttribute("an.id")), ...)), ..., plan)
     --> ResolveReferences -->
   Unpivot(Seq(Alias(GetStructField(an#2.id), "id")), Seq(UnpivotExpr(Seq(GetStructField(an#2.id)), ...)), ..., plan)
   ```
   
   https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L1770
   
   The `CleanupAliases` rule is not the cause; the alias is already being removed inside `ResolveReferences`.
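   
   The aliasing difference described above can be illustrated with a toy model. This is only a sketch: the case classes below are simplified stand-ins that borrow Catalyst's names for readability, they are not Spark's classes, and `resolve` / `resolveTopLevel` are hypothetical helpers, not the actual `ResolveReferences` logic. The sketch assumes only that a resolved struct field access is aliased when it is itself the top-level expression:
   
   ```scala
   // Toy model only: these are NOT Spark's Catalyst classes.
   sealed trait Expr
   case class UnresolvedAttribute(name: String) extends Expr
   case class GetStructField(struct: String, field: String) extends Expr
   case class Alias(child: Expr, name: String) extends Expr
   case class UnpivotExpr(children: Seq[Expr]) extends Expr
   
   object AliasSketch {
     // Hypothetical helper: turn "an.id" into a struct field access,
     // recursing into UnpivotExpr children. No alias is added here.
     def resolve(e: Expr): Expr = e match {
       case UnresolvedAttribute(name) if name.contains(".") =>
         val dot = name.indexOf('.')
         GetStructField(name.substring(0, dot), name.substring(dot + 1))
       case UnpivotExpr(children) => UnpivotExpr(children.map(resolve))
       case other => other
     }
   
     // Hypothetical helper: alias the result only when the resolved
     // expression itself is a struct field access, i.e. it is top level.
     def resolveTopLevel(e: Expr): Expr = resolve(e) match {
       case field @ GetStructField(_, name) => Alias(field, name)
       case other => other
     }
   }
   ```
   
   In this model, a bare `UnresolvedAttribute("an.id")` comes back wrapped in `Alias(..., "id")`, whereas the same attribute nested inside `UnpivotExpr` comes back as a plain `GetStructField`, which mirrors the difference between the id and value columns shown above.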
   
   The only way to restore the old behaviour is to special-case `UnpivotExpr` in `QueryPlan.mapExpressions.recursiveTransform`:
   https://github.com/apache/spark/pull/37407/commits/9dd66b78ec817a53325d95900f18198dac9bc3b1#diff-ece55283a94dd23d3c04f8b9d8ae35937ccff67724be690ff30f76e9f8093c6eR211


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

