You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/03/14 16:47:02 UTC

[GitHub] [spark] minyyy opened a new pull request #35850: [SPARK-38529] Fix a bug that GeneratorNestedColumnAliasing is incorrectly applied to non-Explode generators.

minyyy opened a new pull request #35850:
URL: https://github.com/apache/spark/pull/35850


   ### What changes were proposed in this pull request?
   1. Explicitly return in GeneratorNestedColumnAliasing when the generator is not Explode.
   2. An off-hand code refactor to make the code clearer.
   
   ### Why are the changes needed?
   GeneratorNestedColumnAliasing does not handle other generators correctly. We only try to rewrite the generator for Explode but try to rewrite all ExtractValue expressions. This can cause inconsistency for non-Explode generators.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Unit tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #35850: [SPARK-38529][SQL] Fix a bug that GeneratorNestedColumnAliasing is incorrectly applied to non-Explode generators.

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #35850:
URL: https://github.com/apache/spark/pull/35850#issuecomment-1068184845


   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #35850: [SPARK-38529][SQL] Fix a bug that GeneratorNestedColumnAliasing is incorrectly applied to non-Explode generators.

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #35850:
URL: https://github.com/apache/spark/pull/35850#discussion_r827211803



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -321,6 +321,40 @@ object GeneratorNestedColumnAliasing {
     // need to prune nested columns through Project and under Generate. The difference is
     // when `nestedSchemaPruningEnabled` is on, nested columns will be pruned further at
     // file format readers if it is supported.
+
+    // There are [[ExtractValue]] expressions on or not on the output of the generator. Generator
+    // can also have different types:
+    // 1. For [[ExtractValue]]s not on the output of the generator, theoretically speaking, there
+    //    lots of expressions that we can push down, including non ExtractValues and GetArrayItem
+    //    and GetMapValue. But to be safe, we only handle GetStructField and GetArrayStructFields.
+    // 2. For [[ExtractValue]]s on the output of the generator, the situation depends on the type
+    //    of the generator expression.
+    //   2.1 Inline

Review comment:
       We don't actually handle `Inline`. `canPruneGenerator` wrongly lists `Stack` and `Inline`. Even they enter this rule, they won't be handled, I think. It's better to make `canPruneGenerator` consistent with the behavior.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] minyyy commented on a change in pull request #35850: [SPARK-38529][SQL] Fix a bug that GeneratorNestedColumnAliasing is incorrectly applied to non-Explode generators.

Posted by GitBox <gi...@apache.org>.
minyyy commented on a change in pull request #35850:
URL: https://github.com/apache/spark/pull/35850#discussion_r827201964



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -321,6 +321,40 @@ object GeneratorNestedColumnAliasing {
     // need to prune nested columns through Project and under Generate. The difference is
     // when `nestedSchemaPruningEnabled` is on, nested columns will be pruned further at
     // file format readers if it is supported.
+
+    // There are [[ExtractValue]] expressions on or not on the output of the generator. Generator
+    // can also have different types:
+    // 1. For [[ExtractValue]]s not on the output of the generator, theoretically speaking, there
+    //    lots of expressions that we can push down, including non ExtractValues and GetArrayItem
+    //    and GetMapValue. But to be safe, we only handle GetStructField and GetArrayStructFields.

Review comment:
       That's the next fix I will make. It is true that we don't limit ExtractValue types but it is incorrect. You can think about Explode and GetArrayItem, GetMapValue, GetArrayStructFields, none of the 3 expressions can be pushed down. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on pull request #35850: [SPARK-38529][SQL] Prevent GeneratorNestedColumnAliasing to be applied to non-Explode generators

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #35850:
URL: https://github.com/apache/spark/pull/35850#issuecomment-1068623483


   BTW, I will only merge this to master and not backport it. It is because previously user queries are not failed. I will avoid backpointing this to prevent possible regression. Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] minyyy commented on a change in pull request #35850: [SPARK-38529][SQL] Fix a bug that GeneratorNestedColumnAliasing is incorrectly applied to non-Explode generators.

Posted by GitBox <gi...@apache.org>.
minyyy commented on a change in pull request #35850:
URL: https://github.com/apache/spark/pull/35850#discussion_r827204456



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -348,6 +382,11 @@ object GeneratorNestedColumnAliasing {
         case _ =>
       }
 
+      g.generator match {
+        case _: ExplodeBase =>
+        case _ => return Some(pushedThrough)
+      }

Review comment:
       Discussed below.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] minyyy commented on a change in pull request #35850: [SPARK-38529][SQL] Prevent GeneratorNestedColumnAliasing to be applied to non-Explode generators

Posted by GitBox <gi...@apache.org>.
minyyy commented on a change in pull request #35850:
URL: https://github.com/apache/spark/pull/35850#discussion_r827503785



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -447,9 +478,6 @@ object GeneratorNestedColumnAliasing {
    */
   def canPruneGenerator(g: Generator): Boolean = g match {
     case _: Explode => true
-    case _: Stack => true
-    case _: PosExplode => true

Review comment:
       reverted.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #35850: [SPARK-38529][SQL] Fix a bug that GeneratorNestedColumnAliasing is incorrectly applied to non-Explode generators.

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #35850:
URL: https://github.com/apache/spark/pull/35850#discussion_r827238756



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -321,6 +321,40 @@ object GeneratorNestedColumnAliasing {
     // need to prune nested columns through Project and under Generate. The difference is
     // when `nestedSchemaPruningEnabled` is on, nested columns will be pruned further at
     // file format readers if it is supported.
+
+    // There are [[ExtractValue]] expressions on or not on the output of the generator. Generator
+    // can also have different types:
+    // 1. For [[ExtractValue]]s not on the output of the generator, theoretically speaking, there
+    //    lots of expressions that we can push down, including non ExtractValues and GetArrayItem
+    //    and GetMapValue. But to be safe, we only handle GetStructField and GetArrayStructFields.
+    // 2. For [[ExtractValue]]s on the output of the generator, the situation depends on the type
+    //    of the generator expression.
+    //   2.1 Inline

Review comment:
       I think it's okay to do. As they are not actually handled now, excluding them won't change the behavior and can prevent possible bugs in the future. When we can handle them in the rule, we can add them back explicitly.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya edited a comment on pull request #35850: [SPARK-38529][SQL] Prevent GeneratorNestedColumnAliasing to be applied to non-Explode generators

Posted by GitBox <gi...@apache.org>.
viirya edited a comment on pull request #35850:
URL: https://github.com/apache/spark/pull/35850#issuecomment-1068623483


   BTW, I will only merge this to master and not backport it. It is because previously user queries are not failed. So I change the JIRA from bug to improvement. I will avoid backpointing this to prevent possible regression. Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #35850: [SPARK-38529][SQL] Fix a bug that GeneratorNestedColumnAliasing is incorrectly applied to non-Explode generators.

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #35850:
URL: https://github.com/apache/spark/pull/35850#discussion_r827175844



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -321,6 +321,40 @@ object GeneratorNestedColumnAliasing {
     // need to prune nested columns through Project and under Generate. The difference is
     // when `nestedSchemaPruningEnabled` is on, nested columns will be pruned further at
     // file format readers if it is supported.
+
+    // There are [[ExtractValue]] expressions on or not on the output of the generator. Generator
+    // can also have different types:
+    // 1. For [[ExtractValue]]s not on the output of the generator, theoretically speaking, there
+    //    lots of expressions that we can push down, including non ExtractValues and GetArrayItem
+    //    and GetMapValue. But to be safe, we only handle GetStructField and GetArrayStructFields.

Review comment:
       Not sure if this is correct. We don't actually limit the `ExtractValue` types.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #35850: [SPARK-38529][SQL] Fix a bug that GeneratorNestedColumnAliasing is incorrectly applied to non-Explode generators.

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #35850:
URL: https://github.com/apache/spark/pull/35850#discussion_r827173664



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -348,6 +382,11 @@ object GeneratorNestedColumnAliasing {
         case _ =>
       }
 
+      g.generator match {
+        case _: ExplodeBase =>
+        case _ => return Some(pushedThrough)
+      }

Review comment:
       I think you can just remove `Inline` from `canPruneGenerator`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] minyyy commented on a change in pull request #35850: [SPARK-38529][SQL] Fix a bug that GeneratorNestedColumnAliasing is incorrectly applied to non-Explode generators.

Posted by GitBox <gi...@apache.org>.
minyyy commented on a change in pull request #35850:
URL: https://github.com/apache/spark/pull/35850#discussion_r827223220



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -321,6 +321,40 @@ object GeneratorNestedColumnAliasing {
     // need to prune nested columns through Project and under Generate. The difference is
     // when `nestedSchemaPruningEnabled` is on, nested columns will be pruned further at
     // file format readers if it is supported.
+
+    // There are [[ExtractValue]] expressions on or not on the output of the generator. Generator
+    // can also have different types:
+    // 1. For [[ExtractValue]]s not on the output of the generator, theoretically speaking, there
+    //    lots of expressions that we can push down, including non ExtractValues and GetArrayItem
+    //    and GetMapValue. But to be safe, we only handle GetStructField and GetArrayStructFields.
+    // 2. For [[ExtractValue]]s on the output of the generator, the situation depends on the type
+    //    of the generator expression.
+    //   2.1 Inline

Review comment:
       Yes, we do not actually support `Inline` and `Stack`. But you can see that we split the exprs to onGenerators and notOnGenerators and call regular NestedColumnAliasing for notOnGenerators. That optimization is valid and if we change the branch condition then we miss the regular NestedColumnAliasing optimization here?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] minyyy commented on a change in pull request #35850: [SPARK-38529][SQL] Fix a bug that GeneratorNestedColumnAliasing is incorrectly applied to non-Explode generators.

Posted by GitBox <gi...@apache.org>.
minyyy commented on a change in pull request #35850:
URL: https://github.com/apache/spark/pull/35850#discussion_r827224875



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -321,6 +321,40 @@ object GeneratorNestedColumnAliasing {
     // need to prune nested columns through Project and under Generate. The difference is
     // when `nestedSchemaPruningEnabled` is on, nested columns will be pruned further at
     // file format readers if it is supported.
+
+    // There are [[ExtractValue]] expressions on or not on the output of the generator. Generator
+    // can also have different types:
+    // 1. For [[ExtractValue]]s not on the output of the generator, theoretically speaking, there
+    //    lots of expressions that we can push down, including non ExtractValues and GetArrayItem
+    //    and GetMapValue. But to be safe, we only handle GetStructField and GetArrayStructFields.

Review comment:
       I will rephrase it - We do not support them for now. I agree that if we want we can support it if we add some expressions like `GetArrayMapValue`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #35850: [SPARK-38529][SQL] Prevent GeneratorNestedColumnAliasing to be applied to non-Explode generators

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #35850:
URL: https://github.com/apache/spark/pull/35850#discussion_r827565712



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -321,6 +321,38 @@ object GeneratorNestedColumnAliasing {
     // need to prune nested columns through Project and under Generate. The difference is
     // when `nestedSchemaPruningEnabled` is on, nested columns will be pruned further at
     // file format readers if it is supported.
+
+    // There are [[ExtractValue]] expressions on or not on the output of the generator. Generator
+    // can also have different types:
+    // 1. For [[ExtractValue]]s not on the output of the generator, theoretically speaking, there
+    //    lots of expressions that we can push down, including non ExtractValues and GetArrayItem
+    //    and GetMapValue. But to be safe, we only handle GetStructField and GetArrayStructFields.

Review comment:
       The first item looks a bit weird. For what `ExtractValue` can be pushed down, you can simply list them. I'm not sure "For [[ExtractValue]]s not on the output of the generator..." means. Do you mean on the output of generator?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #35850: [SPARK-38529][SQL] Fix a bug that GeneratorNestedColumnAliasing is incorrectly applied to non-Explode generators.

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #35850:
URL: https://github.com/apache/spark/pull/35850#discussion_r827216842



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -321,6 +321,40 @@ object GeneratorNestedColumnAliasing {
     // need to prune nested columns through Project and under Generate. The difference is
     // when `nestedSchemaPruningEnabled` is on, nested columns will be pruned further at
     // file format readers if it is supported.
+
+    // There are [[ExtractValue]] expressions on or not on the output of the generator. Generator
+    // can also have different types:
+    // 1. For [[ExtractValue]]s not on the output of the generator, theoretically speaking, there
+    //    lots of expressions that we can push down, including non ExtractValues and GetArrayItem
+    //    and GetMapValue. But to be safe, we only handle GetStructField and GetArrayStructFields.

Review comment:
       We have a check in the rule. For certain generator input, e.g. Map, they will not be handled now.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #35850: [SPARK-38529][SQL] Prevent GeneratorNestedColumnAliasing to be applied to non-Explode generators

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #35850:
URL: https://github.com/apache/spark/pull/35850#discussion_r827488245



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala
##########
@@ -812,6 +812,18 @@ class NestedColumnAliasingSuite extends SchemaPruningTest {
     val expected3 = contact.select($"name").rebalance($"name").select($"name.first").analyze
     comparePlans(optimized3, expected3)
   }
+
+  test("GeneratorNestedColumnAliasing does not pushdown for non-Explode") {

Review comment:
       Can you add JIRA ticket as prefix? E.g. `SPARK-38529: GeneratorNestedColumnAliasing does not pushdown for non-Explode`

##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala
##########
@@ -292,7 +292,7 @@ class NestedColumnAliasingSuite extends SchemaPruningTest {
     comparePlans(optimized, expected)
   }
 
-  test("Nested field pruning for Project and Generate") {
+  test("Nested field pruning for Project and Explode") {

Review comment:
       This is unnecessary change. Can you revert it back?

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -447,9 +478,6 @@ object GeneratorNestedColumnAliasing {
    */
   def canPruneGenerator(g: Generator): Boolean = g match {
     case _: Explode => true
-    case _: Stack => true
-    case _: PosExplode => true

Review comment:
       Do we need to remove `PosExplode`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] minyyy commented on a change in pull request #35850: [SPARK-38529] Fix a bug that GeneratorNestedColumnAliasing is incorrectly applied to non-Explode generators.

Posted by GitBox <gi...@apache.org>.
minyyy commented on a change in pull request #35850:
URL: https://github.com/apache/spark/pull/35850#discussion_r826170557



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -395,12 +434,11 @@ object GeneratorNestedColumnAliasing {
             val updatedGenerate = rewrittenG.copy(generatorOutput = updatedGeneratorOutput)
 
             // Replace nested column accessor with generator output.
-            val attrExprIdsOnGenerator = attrToExtractValuesOnGenerator.keys.map(_.exprId).toSet
+            val generateOutputAttrExprId = attrToExtractValuesOnGenerator.keys.map(_.exprId).head
+            val generateOutputAttr = updatedGenerate.output
+              .find(_.exprId == generateOutputAttrExprId).get

Review comment:
       I changed the .getOrElse to .get to make it clearer - we should never fail to find the generator output attribute, if we want to be safer, we can use a require to check and return here, but should not use getOrElse because IMO if we run into the "else" branch it means an error.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #35850: [SPARK-38529][SQL] Fix a bug that GeneratorNestedColumnAliasing is incorrectly applied to non-Explode generators.

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #35850:
URL: https://github.com/apache/spark/pull/35850#issuecomment-1067519500


   cc @viirya FYI


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] minyyy commented on a change in pull request #35850: [SPARK-38529][SQL] Fix a bug that GeneratorNestedColumnAliasing is incorrectly applied to non-Explode generators.

Posted by GitBox <gi...@apache.org>.
minyyy commented on a change in pull request #35850:
URL: https://github.com/apache/spark/pull/35850#discussion_r827234334



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -321,6 +321,40 @@ object GeneratorNestedColumnAliasing {
     // need to prune nested columns through Project and under Generate. The difference is
     // when `nestedSchemaPruningEnabled` is on, nested columns will be pruned further at
     // file format readers if it is supported.
+
+    // There are [[ExtractValue]] expressions on or not on the output of the generator. Generator
+    // can also have different types:
+    // 1. For [[ExtractValue]]s not on the output of the generator, theoretically speaking, there
+    //    lots of expressions that we can push down, including non ExtractValues and GetArrayItem
+    //    and GetMapValue. But to be safe, we only handle GetStructField and GetArrayStructFields.
+    // 2. For [[ExtractValue]]s on the output of the generator, the situation depends on the type
+    //    of the generator expression.
+    //   2.1 Inline

Review comment:
       I am fine with changing `canPruneGenerator` though. That's actually what I wanted to do. I was just conservative and did not want to change any existing traffic so I did not do it that way. If we agree that changing it is fine, let's do it that way.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #35850: [SPARK-38529][SQL] Prevent GeneratorNestedColumnAliasing to be applied to non-Explode generators

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #35850:
URL: https://github.com/apache/spark/pull/35850#discussion_r827566261



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -321,6 +321,38 @@ object GeneratorNestedColumnAliasing {
     // need to prune nested columns through Project and under Generate. The difference is
     // when `nestedSchemaPruningEnabled` is on, nested columns will be pruned further at
     // file format readers if it is supported.
+
+    // There are [[ExtractValue]] expressions on or not on the output of the generator. Generator
+    // can also have different types:
+    // 1. For [[ExtractValue]]s not on the output of the generator, theoretically speaking, there
+    //    lots of expressions that we can push down, including non ExtractValues and GetArrayItem
+    //    and GetMapValue. But to be safe, we only handle GetStructField and GetArrayStructFields.
+    // 2. For [[ExtractValue]]s on the output of the generator, the situation depends on the type
+    //    of the generator expression. *For now, we only support Explode*.
+    //   2.1 Inline
+    //       Inline takes an input of ARRAY<STRUCT<field1, field2>>, and returns an output of
+    //       STRUCT<field1, field2>, the output field can be directly accessed by name "field1".
+    //       In this case, we should not try to push down the ExtractValue expressions to the
+    //       input of the Inline. For example:
+    //       Project[field1.x AS x]
+    //       - Generate[ARRAY<STRUCT<field1: STRUCT<x: int>, field2:int>>, ..., field1, field2]
+    //       It is incorrect to push down the .x to the input of the Inline.
+    //       A valid field pruning would be to extract all the fields that are accessed by the
+    //       Project, and manually reconstruct an expression using those fields.
+    //   2.2 Explode
+    //       Explode takes an input of ARRAY<some_type> and returns an output of
+    //       STRUCT<col: some_type>. The default field name "col" can be overwritten.
+    //       If the input is MAP<key, value>, it returns STRUCT<key: key_type, value: value_type>.
+    //       For the array case, it is only valid to push down GetStructField. After push down,
+    //       the GetStructField becomes a GetArrayStructFields. Note that we cannot push down
+    //       GetArrayStructFields, since the pushed down expression will operate on an array of
+    //       array which is invalid.
+    //   2.3 Stack
+    //       Stack takes a sequence of expressions, and returns an output of
+    //       STRUCT<col0: some_type, col1: some_type, ...>
+    //       The push down is doable but more complicated in this case as the expression that
+    //       operates on the col_i of the output needs to pushed down to every (kn+i)-th input
+    //       expression where n is the total number of columns (or struct fields) of the output.

Review comment:
       For Stack and Inline, it looks verbose and doesn't help on reading the code. Can you just mention we only support Explode.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #35850: [SPARK-38529][SQL] Prevent GeneratorNestedColumnAliasing to be applied to non-Explode generators

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #35850:
URL: https://github.com/apache/spark/pull/35850#discussion_r827565230



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -321,6 +321,38 @@ object GeneratorNestedColumnAliasing {
     // need to prune nested columns through Project and under Generate. The difference is
     // when `nestedSchemaPruningEnabled` is on, nested columns will be pruned further at
     // file format readers if it is supported.
+
+    // There are [[ExtractValue]] expressions on or not on the output of the generator. Generator
+    // can also have different types:

Review comment:
       Can you rephrase the comment? It looks a bit weird. E.g., I'm not sure what the first sentence means/wants to mean.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] minyyy commented on a change in pull request #35850: [SPARK-38529] Fix a bug that GeneratorNestedColumnAliasing is incorrectly applied to non-Explode generators.

Posted by GitBox <gi...@apache.org>.
minyyy commented on a change in pull request #35850:
URL: https://github.com/apache/spark/pull/35850#discussion_r826169441



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -395,12 +434,11 @@ object GeneratorNestedColumnAliasing {
             val updatedGenerate = rewrittenG.copy(generatorOutput = updatedGeneratorOutput)
 
             // Replace nested column accessor with generator output.
-            val attrExprIdsOnGenerator = attrToExtractValuesOnGenerator.keys.map(_.exprId).toSet
+            val generateOutputAttrExprId = attrToExtractValuesOnGenerator.keys.map(_.exprId).head

Review comment:
       I changed the toSet and find way to toSet.head to make it clearer - we should only have one generator output here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #35850: [SPARK-38529][SQL] Fix a bug that GeneratorNestedColumnAliasing is incorrectly applied to non-Explode generators.

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #35850:
URL: https://github.com/apache/spark/pull/35850#discussion_r827207669



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -321,6 +321,40 @@ object GeneratorNestedColumnAliasing {
     // need to prune nested columns through Project and under Generate. The difference is
     // when `nestedSchemaPruningEnabled` is on, nested columns will be pruned further at
     // file format readers if it is supported.
+
+    // There are [[ExtractValue]] expressions on or not on the output of the generator. Generator
+    // can also have different types:
+    // 1. For [[ExtractValue]]s not on the output of the generator, theoretically speaking, there
+    //    lots of expressions that we can push down, including non ExtractValues and GetArrayItem
+    //    and GetMapValue. But to be safe, we only handle GetStructField and GetArrayStructFields.
+    // 2. For [[ExtractValue]]s on the output of the generator, the situation depends on the type
+    //    of the generator expression.
+    //   2.1 Inline
+    //       Inline takes an input of ARRAY<STRUCT<field1, field2>>, and returns an output of
+    //       STRUCT<field1, field2>, the output field can be directly accessed by name "field1".
+    //       In this case, we should not try to push down the ExtractValue expressions to the
+    //       input of the Inline. For example:
+    //       Project[field1.x AS x]
+    //       - Generate[ARRAY<STRUCT<field1: STRUCT<x: int>, field2:int>>, ..., field1, field2]
+    //       It is incorrect to push down the .x to the input of the Inline.
+    //       A valid field pruning would be to extract all the fields that are accessed by the
+    //       Project, and manually reconstruct an expression using those fields.
+    //   2.2 Explode
+    //       Explode takes an input of ARRAY<some_type> and returns an output of
+    //       STRUCT<col: some_type>. The default field name "col" can be overwritten.
+    //       If the input is MAP<key, value>, it returns STRUCT<key: key_type, value: value_type>.
+    //       For the array case, it is only valid to push down GetStructField. After push down,
+    //       the GetStructField becomes a GetArrayStructFields. Note that we cannot push down
+    //       GetArrayStructFields, since the pushed down expression will operate on an array of
+    //       array which is invalid.
+    //   2.3 Stack
+    //       Stack takes a sequence of expressions, and returns an output of
+    //       STRUCT<col0: some_type, col1: some_type, ...>
+    //       The push down is doable but more complicated in this case as the expression that
+    //       operates on the col_i of the output needs to pushed down to every (kn+i)-th input
+    //       expression where n is the total number of columns (or struct fields) of the output.
+
+    // To be safe, we only support

Review comment:
       Looks like a broken sentence.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] minyyy commented on a change in pull request #35850: [SPARK-38529][SQL] Fix a bug that GeneratorNestedColumnAliasing is incorrectly applied to non-Explode generators.

Posted by GitBox <gi...@apache.org>.
minyyy commented on a change in pull request #35850:
URL: https://github.com/apache/spark/pull/35850#discussion_r827204043



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -321,6 +321,40 @@ object GeneratorNestedColumnAliasing {
     // need to prune nested columns through Project and under Generate. The difference is
     // when `nestedSchemaPruningEnabled` is on, nested columns will be pruned further at
     // file format readers if it is supported.
+
+    // There are [[ExtractValue]] expressions on or not on the output of the generator. Generator
+    // can also have different types:
+    // 1. For [[ExtractValue]]s not on the output of the generator, theoretically speaking, there
+    //    lots of expressions that we can push down, including non ExtractValues and GetArrayItem
+    //    and GetMapValue. But to be safe, we only handle GetStructField and GetArrayStructFields.
+    // 2. For [[ExtractValue]]s on the output of the generator, the situation depends on the type
+    //    of the generator expression.
+    //   2.1 Inline

Review comment:
       The reason that I don't remove `Inline` from `canPruneGenerator` is that I don't want to change our existing behavior of the pushdown of exprs not on generators, if I change `canPruneGenerator` then less expressions enter this branch, we call NestedColumnAliasing for less expressions. But if you are fine with it, I can simply change to it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] viirya commented on a change in pull request #35850: [SPARK-38529][SQL] Fix a bug that GeneratorNestedColumnAliasing is incorrectly applied to non-Explode generators.

Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #35850:
URL: https://github.com/apache/spark/pull/35850#discussion_r827201260



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -321,6 +321,40 @@ object GeneratorNestedColumnAliasing {
     // need to prune nested columns through Project and under Generate. The difference is
     // when `nestedSchemaPruningEnabled` is on, nested columns will be pruned further at
     // file format readers if it is supported.
+
+    // There are [[ExtractValue]] expressions on or not on the output of the generator. Generator
+    // can also have different types:
+    // 1. For [[ExtractValue]]s not on the output of the generator, theoretically speaking, there
+    //    lots of expressions that we can push down, including non ExtractValues and GetArrayItem
+    //    and GetMapValue. But to be safe, we only handle GetStructField and GetArrayStructFields.
+    // 2. For [[ExtractValue]]s on the output of the generator, the situation depends on the type
+    //    of the generator expression.
+    //   2.1 Inline

Review comment:
       This seems correct. As Incline flattens the nested columns. There is no `ExtractValue` for `field1` anymore in the Project. You can just remove `Inline` from `canPruneGenerator`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] minyyy commented on pull request #35850: [SPARK-38529][SQL] Fix a bug that GeneratorNestedColumnAliasing is incorrectly applied to non-Explode generators.

Posted by GitBox <gi...@apache.org>.
minyyy commented on pull request #35850:
URL: https://github.com/apache/spark/pull/35850#issuecomment-1068547994


   Changed `canPruneGenerator` per the comment.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] minyyy commented on a change in pull request #35850: [SPARK-38529][SQL] Prevent GeneratorNestedColumnAliasing to be applied to non-Explode generators

Posted by GitBox <gi...@apache.org>.
minyyy commented on a change in pull request #35850:
URL: https://github.com/apache/spark/pull/35850#discussion_r827504435



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala
##########
@@ -292,7 +292,7 @@ class NestedColumnAliasingSuite extends SchemaPruningTest {
     comparePlans(optimized, expected)
   }
 
-  test("Nested field pruning for Project and Generate") {
+  test("Nested field pruning for Project and Explode") {

Review comment:
       Done.

##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala
##########
@@ -812,6 +812,18 @@ class NestedColumnAliasingSuite extends SchemaPruningTest {
     val expected3 = contact.select($"name").rebalance($"name").select($"name.first").analyze
     comparePlans(optimized3, expected3)
   }
+
+  test("GeneratorNestedColumnAliasing does not pushdown for non-Explode") {

Review comment:
       Done.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] minyyy commented on a change in pull request #35850: [SPARK-38529][SQL] Prevent GeneratorNestedColumnAliasing to be applied to non-Explode generators

Posted by GitBox <gi...@apache.org>.
minyyy commented on a change in pull request #35850:
URL: https://github.com/apache/spark/pull/35850#discussion_r828284529



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -321,6 +321,38 @@ object GeneratorNestedColumnAliasing {
     // need to prune nested columns through Project and under Generate. The difference is
     // when `nestedSchemaPruningEnabled` is on, nested columns will be pruned further at
     // file format readers if it is supported.
+
+    // There are [[ExtractValue]] expressions on or not on the output of the generator. Generator
+    // can also have different types:
+    // 1. For [[ExtractValue]]s not on the output of the generator, theoretically speaking, there
+    //    lots of expressions that we can push down, including non ExtractValues and GetArrayItem
+    //    and GetMapValue. But to be safe, we only handle GetStructField and GetArrayStructFields.

Review comment:
       ```
   val (attrToExtractValuesOnGenerator, attrToExtractValuesNotOnGenerator) =
     attrToExtractValues.partition { case (attr, _) =>
       attr.references.subsetOf(generatorOutputSet) }
   
   val pushedThrough = NestedColumnAliasing.rewritePlanWithAliases(
     plan, attrToExtractValuesNotOnGenerator)
   ```
   
   We call `NestedColumnAliasing.rewritePlanWithAliases` for `ExtractValue`s not on the output of the generators. The above code snippet is also the reason that I was hesitated to modify `canPruneGenerator`. If we modify `canPruneGenerator`, we miss the opportunity for `Inline` and `Stack` to use `NestedColumnAliasing.rewritePlanWithAliases` on `attrToExtractValuesNotOnGenerator`. My point in the comment was to note the difference of optimizations able to apply to `attrToExtractValuesOnGenerator` and `attrToExtractValuesNotOnGenerator`. The former one is predicate pushdown and there are lots of expressions that can be pushed down, the latter one is the real "GeneratorNestedColumnAliasing" and only `GetStructField` is supported.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org