Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/06/23 01:14:29 UTC

[GitHub] [spark] frankyin-factual opened a new pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions

frankyin-factual opened a new pull request #28898:
URL: https://github.com/apache/spark/pull/28898


   This PR fixes schema pruning not working through window functions. It is a fairly limited version that only intends to solve the issue within a `Filter` logical plan.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r447977410



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala
##########
@@ -493,6 +491,87 @@ class NestedColumnAliasingSuite extends SchemaPruningTest {
     comparePlans(optimized3, expected3)
   }
 
+  test("Nested field pruning for Window") {
+    val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+    val winExpr = windowExpr(RowNumber().toAggregateExpression(), spec)
+
+    val query1 = contact
+      .select($"name.first", winExpr.as('window))
+      .analyze
+    val optimized1 = Optimize.execute(query1)
+    val expected1 = contact
+      .select($"name.first", $"address", $"id")
+      .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+      .select($"first", $"window")
+      .analyze
+    comparePlans(optimized1, expected1)
+
+    val query2 = contact
+      .select($"name.first", winExpr.as('window))
+      .orderBy($"name.last".asc)
+      .analyze
+    val optimized2 = Optimize.execute(query2)
+    val aliases2 = collectGeneratedAliases(optimized2)
+    val expected2 = contact
+      .select($"name.first", $"address", $"id", $"name.last".as(aliases2(1)))
+      .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+      .select($"first", $"window", $"${aliases2(1)}".as(aliases2(0)))
+      .orderBy($"${aliases2(0)}".asc)
+      .select($"first", $"window")
+      .analyze
+    comparePlans(optimized2, expected2)
+  }
+
+  test("Nested field pruning for Filter") {
+    val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+    val winExpr = windowExpr(RowNumber().toAggregateExpression(), spec)
+    val query = contact.select($"name.first", winExpr.as('window))
+      .where($"window" === 1 && $"name.first" === "a")

Review comment:
       Done





[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-664068267







[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448040612



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -178,11 +195,16 @@ object NestedColumnAliasing {
             nestedFieldToAlias
               .map { case (nestedField, _) => totalFieldNum(nestedField.dataType) }
               .sum < totalFieldNum(attr.dataType)) {
-          Some(attr.exprId -> nestedFieldToAlias)
+          Some((attr.exprId, nestedFieldToAlias))
         } else {
           None
         }
       }
+      .groupBy(_._1) // To fix same ExprId mapped to different attribute instance

Review comment:
       This is not necessarily specific to the `Window` case, but adding `Window/Filter/Sort` makes this error surface more easily.
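
The `groupBy(_._1)` step in the hunk above can be illustrated with a minimal sketch in plain Scala. It does not use Spark's classes: `Attr`, the field names, and the merge after grouping are all assumptions made for illustration. It shows why keying by an id directly can drop entries when two attribute instances share the same id, and how grouping first avoids that.

```scala
// Toy model, NOT Spark's classes. `Attr` is a hypothetical stand-in for an
// attribute instance that carries an id (like ExprId) and a qualifier.
object GroupByExprIdSketch {
  final case class Attr(exprId: Long, qualifier: String)

  // (attribute instance, nested fields found under it).
  // The id 1 appears twice, on instances with different qualifiers.
  val perAttribute: Seq[(Attr, Seq[String])] = Seq(
    Attr(1L, "a") -> Seq("name.first"),
    Attr(1L, "b") -> Seq("name.last"),
    Attr(2L, "a") -> Seq("address.city")
  )

  // Building a Map keyed by id directly keeps only the last entry per key.
  val lossy: Map[Long, Seq[String]] =
    perAttribute.map { case (attr, fields) => attr.exprId -> fields }.toMap

  // Grouping by id first and then merging keeps every nested field.
  // (How the real rule merges the groups is not shown in the hunk; this
  // flatMap is an assumption.)
  val merged: Map[Long, Seq[String]] =
    perAttribute
      .map { case (attr, fields) => (attr.exprId, fields) }
      .groupBy(_._1)
      .map { case (id, entries) => id -> entries.flatMap(_._2) }
}
```

In this sketch `lossy(1L)` contains only one of the two nested fields, while `merged(1L)` contains both.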





[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652786996


   **[Test build #124822 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124822/testReport)** for PR 28898 at commit [`2bc6a1a`](https://github.com/apache/spark/commit/2bc6a1ae1f2fb7c657d95d5abd92615fdc95eaef).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656696370







[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652831500







[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-662282424







[GitHub] [spark] viirya commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448774792



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -172,13 +188,23 @@ object NestedColumnAliasing {
           (f, Alias(f, s"_gen_alias_${exprId.id}")(exprId, Seq.empty, None))
         }
 
+
+        // Do deduplication based on semanticEquals, and then sum.
+        val nestedFieldNum = nestedFieldToAlias
+          .foldLeft(Seq[ExtractValue]()) {
+            (unique, curr) => if (!unique.exists(curr._1.semanticEquals(_))) {
+              curr._1 +: unique
+            } else {
+              unique
+            }
+          }
+          .map { t => totalFieldNum(t.dataType)  }
+          .sum

Review comment:
       No, I mean this comment thread.
   
   I am not sure if you are aware of it, but the reason you need to deduplicate here is that semantically equal `ExtractValue`s are applied to attributes with different qualifiers. For example, there are two `name.first` expressions, but one refers to `name` with qualifier `a` and the other to `name` with qualifier `b`.
   
   I ran a test with your query after cleaning up all the qualifiers, and it works well.
   
   What I said in the comment above is that you select an arbitrary `ExtractValue` from these `ExtractValue`s with different qualifiers, but later we look into the map using a given `ExtractValue`. This might fail in the case where you select the `name.first` with qualifier `a` but later look up the map using the `name.first` with qualifier `b`.
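
The lookup concern described above can be sketched in plain Scala without Spark. Everything here is a toy model: `Field` is a hypothetical stand-in for an `ExtractValue` such as `name.first`, and its `semanticEquals` simply ignores the qualifier, which is an assumption made to mirror the behaviour discussed in this thread.

```scala
// Toy model, NOT Spark's classes.
object QualifierLookupSketch {
  final case class Field(qualifier: String, path: String) {
    // Hypothetical stand-in for semanticEquals: the qualifier is ignored.
    def semanticEquals(other: Field): Boolean = path == other.path
  }

  // Keep one representative per semantically equal group
  // (same shape as the foldLeft in the hunk above).
  def dedup(fields: Seq[Field]): Seq[Field] =
    fields.foldLeft(Seq.empty[Field]) { (unique, curr) =>
      if (!unique.exists(curr.semanticEquals(_))) curr +: unique else unique
    }

  val a: Field = Field("a", "name.first")
  val b: Field = Field("b", "name.first")

  val unique: Seq[Field] = dedup(Seq(a, b))

  // Alias map keyed only by the surviving representative.
  val aliases: Map[Field, String] = unique.map(f => f -> "_gen_alias_0").toMap

  // A plain Map lookup uses case-class equality, which includes the qualifier.
  val hitA: Option[String] = aliases.get(a)
  val hitB: Option[String] = aliases.get(b)

  // A lookup that goes through semanticEquals finds the alias either way.
  def semanticGet(f: Field): Option[String] =
    aliases.collectFirst { case (k, v) if k.semanticEquals(f) => v }
}
```

In this model the map ends up keyed by the `a`-qualified field only, so a plain lookup with the `b`-qualified field misses, while a lookup based on `semanticEquals` still succeeds.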
   
   





[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-657199034


   **[Test build #125714 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125714/testReport)** for PR 28898 at commit [`68cbfd2`](https://github.com/apache/spark/commit/68cbfd24f1f50266c5d1c5dfc24e29699f87c3e3).



[GitHub] [spark] frankyin-factual commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

frankyin-factual commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652668745


   Just did a rebase to squash all those commits. 



[GitHub] [spark] viirya commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r451691041



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala
##########
@@ -493,6 +491,156 @@ class NestedColumnAliasingSuite extends SchemaPruningTest {
     comparePlans(optimized3, expected3)
   }
 
+  test("Nested field pruning for Window") {
+    val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+    val winExpr = windowExpr(RowNumber(), spec)
+
+    val query1 = contact
+      .select($"name.first", winExpr.as('window))
+      .analyze
+    val optimized1 = Optimize.execute(query1)
+    val expected1 = contact
+      .select($"name.first", $"address", $"id")
+      .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+      .select($"first", $"window")
+      .analyze
+    comparePlans(optimized1, expected1)

Review comment:
       This passes even without pruning through the window. The query selects `name.first` directly on top of the relation, so I am not sure why you include it here...





[GitHub] [spark] frankyin-factual commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

frankyin-factual commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652840781


   > I think I came up with a better idea for the discussion, but it is too late today. I will send a PR against your change tomorrow.
   
   Thanks @viirya 



[GitHub] [spark] viirya commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r445623584



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -32,7 +32,9 @@ object NestedColumnAliasing {
 
   def unapply(plan: LogicalPlan): Option[LogicalPlan] = plan match {
     case Project(projectList, child)
-        if SQLConf.get.nestedSchemaPruningEnabled && canProjectPushThrough(child) =>
+        if SQLConf.get.nestedSchemaPruningEnabled &&
+          (canProjectPushThrough(child) ||
+            getChild(child).exists(canProjectPushThrough)) =>

Review comment:
       How about using my proposal at https://github.com/apache/spark/pull/28898#pullrequestreview-437202483?





[GitHub] [spark] dongjoon-hyun commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-662237766


   Thank you for pinging me, @frankyin-factual , @maropu , @viirya . I'll take a look at this PR.



[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions

Posted by GitBox <gi...@apache.org>.

frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r445309526



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -113,6 +113,11 @@ object NestedColumnAliasing {
     case _: Sample => true
     case _: RepartitionByExpression => true
     case _: Join => true
+    case x: Filter => x.child match {
+      case _: Window => true

Review comment:
       That won't work because it seems to cause an infinite loop in the optimizer. I get error messages about running out of max iterations.
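
Catalyst applies rule batches repeatedly until the plan stops changing, with a cap on the number of iterations. A minimal plain-Scala sketch of that idea follows. It is not Spark's `RuleExecutor`: the plan representation, both rules, and `runToFixedPoint` are hypothetical, and the oscillating rule is only one possible way a rule can keep firing, not a diagnosis of this PR.

```scala
// Toy model, NOT Spark's optimizer.
object FixedPointSketch {
  // A plan is just a list of operator names, outermost first.
  type Plan = List[String]

  // Well-behaved rule: collapses adjacent Projects and then stops firing.
  def collapseProjects(plan: Plan): Plan = plan match {
    case "Project" :: "Project" :: rest => collapseProjects("Project" :: rest)
    case head :: rest => head :: collapseProjects(rest)
    case Nil => Nil
  }

  // Misbehaving rule: swaps the top two operators every time it runs,
  // so the plan never settles.
  def swapTop(plan: Plan): Plan = plan match {
    case "Project" :: "Filter" :: rest => "Filter" :: "Project" :: rest
    case "Filter" :: "Project" :: rest => "Project" :: "Filter" :: rest
    case other => other
  }

  // Applies `rule` until the plan stops changing or the budget runs out.
  // Returns the last plan and whether a fixed point was reached.
  @annotation.tailrec
  def runToFixedPoint(rule: Plan => Plan, plan: Plan, remaining: Int): (Plan, Boolean) =
    if (remaining == 0) (plan, false)
    else {
      val next = rule(plan)
      if (next == plan) (plan, true)
      else runToFixedPoint(rule, next, remaining - 1)
    }
}
```

Running `collapseProjects` converges after a couple of passes, while `swapTop` exhausts any iteration budget, which is the kind of situation a "max iterations" message points to.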





[GitHub] [spark] viirya commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r451209212



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala
##########
@@ -493,6 +491,156 @@ class NestedColumnAliasingSuite extends SchemaPruningTest {
     comparePlans(optimized3, expected3)
   }
 
+  test("Nested field pruning for Window") {
+    val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+    val winExpr = windowExpr(RowNumber(), spec)
+
+    val query1 = contact
+      .select($"name.first", winExpr.as('window))
+      .analyze
+    val optimized1 = Optimize.execute(query1)
+    val expected1 = contact
+      .select($"name.first", $"address", $"id")
+      .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+      .select($"first", $"window")
+      .analyze
+    comparePlans(optimized1, expected1)
+
+    val query2 = contact
+      .select($"name.first", winExpr.as('window))
+      .orderBy($"name.last".asc)
+      .analyze
+    val optimized2 = Optimize.execute(query2)
+    val aliases2 = collectGeneratedAliases(optimized2)
+    val expected2 = contact
+      .select($"name.first", $"address", $"id", $"name.last".as(aliases2(1)))
+      .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+      .select($"first", $"window", $"${aliases2(1)}".as(aliases2(0)))
+      .orderBy($"${aliases2(0)}".asc)
+      .select($"first", $"window")
+      .analyze
+    comparePlans(optimized2, expected2)
+  }
+
+  test("Nested field pruning for Filter") {

Review comment:
       It is not only for Filter. Maybe `Nested field pruning for Filter with other operators`?





[GitHub] [spark] frankyin-factual commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions

Posted by GitBox <gi...@apache.org>.

frankyin-factual commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649224393


   > btw, could you follow the PR template? https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE
   
   Just updated the PR. 



[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656076133







[GitHub] [spark] SparkQA removed a comment on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649153764


   **[Test build #124507 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124507/testReport)** for PR 28898 at commit [`b1cad9a`](https://github.com/apache/spark/commit/b1cad9ad759c3e1d2ef9efd0f9390c6924e412df).



[GitHub] [spark] maropu commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

maropu commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-655365105


   retest this please



[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448795376



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -152,7 +168,7 @@ object NestedColumnAliasing {
     val exclusiveAttrSet = AttributeSet(exclusiveAttrs ++ otherRootReferences)
     val aliasSub = nestedFieldReferences.asInstanceOf[Seq[ExtractValue]]
       .filter(!_.references.subsetOf(exclusiveAttrSet))
-      .groupBy(_.references.head)
+      .groupBy(t => (t.references.head.exprId, t.references.head.dataType))

Review comment:
       Just added. 





[GitHub] [spark] maropu commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r446499394



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala
##########
@@ -493,6 +491,58 @@ class NestedColumnAliasingSuite extends SchemaPruningTest {
     comparePlans(optimized3, expected3)
   }
 
+  test("Nested field pruning for window functions") {
+    val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+    val winExpr = windowExpr(RowNumber().toAggregateExpression(), spec)
+    val query = contact.select($"name.first", winExpr.as('window))
+      .where($"window" === 1 && $"name.first" === "a")
+      .analyze
+    val optimized = Optimize.execute(query)
+    val aliases = collectGeneratedAliases(optimized)
+    val expected = contact
+      .select($"name.first", $"address", $"id", $"name.first".as(aliases(1)))
+      .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+      .select($"first", $"${aliases(1)}".as(aliases(0)), $"window")
+      .where($"window" === 1 && $"${aliases(0)}" === "a")
+      .select($"first", $"window")
+      .analyze
+    comparePlans(optimized, expected)
+  }
+
+  test("Nested field pruning for orderBy") {

Review comment:
       Why did you add separate tests for orderBy/sortBy? They produce the same plan node, `Sort`.
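
The point about `orderBy` and `sortBy` can be sketched with a toy plan model in plain Scala. The classes below are hypothetical stand-ins, not Catalyst's; the sketch assumes, as in the Catalyst test DSL, that the two helpers build the same `Sort` node and differ only in a `global` flag, and that a push-through check matches on node type alone.

```scala
// Toy model, NOT Catalyst's classes.
object SortDslSketch {
  sealed trait Plan
  case object Relation extends Plan
  final case class Sort(order: Seq[String], global: Boolean, child: Plan) extends Plan

  // Both helpers build a Sort node; only the `global` flag differs.
  def orderBy(child: Plan, order: String*): Plan = Sort(order, global = true, child)
  def sortBy(child: Plan, order: String*): Plan = Sort(order, global = false, child)

  // Hypothetical check in the spirit of canProjectPushThrough:
  // it looks at the node type only, so the flag does not matter.
  def canPushThrough(plan: Plan): Boolean = plan match {
    case _: Sort => true
    case _ => false
  }
}
```

Under these assumptions a rule that matches on `Sort` treats both plans the same way, so one test exercising `Sort` covers both helpers.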





[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649311526







[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448040082



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -178,11 +195,16 @@ object NestedColumnAliasing {
             nestedFieldToAlias
               .map { case (nestedField, _) => totalFieldNum(nestedField.dataType) }
               .sum < totalFieldNum(attr.dataType)) {
-          Some(attr.exprId -> nestedFieldToAlias)
+          Some((attr.exprId, nestedFieldToAlias))

Review comment:
       It's related to the comment below. https://github.com/apache/spark/pull/28898#discussion_r447950809





[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-655368737







[GitHub] [spark] viirya commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448772028



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -152,7 +168,7 @@ object NestedColumnAliasing {
     val exclusiveAttrSet = AttributeSet(exclusiveAttrs ++ otherRootReferences)
     val aliasSub = nestedFieldReferences.asInstanceOf[Seq[ExtractValue]]
       .filter(!_.references.subsetOf(exclusiveAttrSet))
-      .groupBy(_.references.head)
+      .groupBy(t => (t.references.head.exprId, t.references.head.dataType))

Review comment:
       I don't think `exprId` will have collisions. `exprId` should be unique. If there is a collision, something is wrong somewhere in the query plan...





[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656922596


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125617/
   Test FAILed.



[GitHub] [spark] SparkQA removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-664068067


   **[Test build #126598 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126598/testReport)** for PR 28898 at commit [`a0b8d07`](https://github.com/apache/spark/commit/a0b8d070f027460dd1e5fdbd7dc35d0440450b0a).



[GitHub] [spark] maropu commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448692070



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -176,9 +192,16 @@ object NestedColumnAliasing {
         // By default, ColumnPruning rule uses `attr` already.
         if (nestedFieldToAlias.nonEmpty &&
             nestedFieldToAlias
-              .map { case (nestedField, _) => totalFieldNum(nestedField.dataType) }
-              .sum < totalFieldNum(attr.dataType)) {
-          Some(attr.exprId -> nestedFieldToAlias)
+              .foldLeft(Seq[ExtractValue]()) {
+                (unique, curr) => if (!unique.exists(curr._1.semanticEquals(_))) {
+                  curr._1 +: unique
+                } else {
+                  unique
+                }
+              }
+              .map { t => totalFieldNum(t.dataType)  }
+              .sum < totalFieldNum(attr._2)) {

Review comment:
       I think it's hard to read this part. Could you pull the left-hand value out of the `if` condition?
   ```
   val xxx = yyy
   if (xxx < totalFieldNum(attr._2)) { ...
   ```





[GitHub] [spark] viirya commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448783457



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -152,7 +168,7 @@ object NestedColumnAliasing {
     val exclusiveAttrSet = AttributeSet(exclusiveAttrs ++ otherRootReferences)
     val aliasSub = nestedFieldReferences.asInstanceOf[Seq[ExtractValue]]
       .filter(!_.references.subsetOf(exclusiveAttrSet))
-      .groupBy(_.references.head)
+      .groupBy(t => (t.references.head.exprId, t.references.head.dataType))

Review comment:
       Ok. Can you add a comment here?

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -172,13 +188,23 @@ object NestedColumnAliasing {
           (f, Alias(f, s"_gen_alias_${exprId.id}")(exprId, Seq.empty, None))
         }
 
+
+        // Do deduplication based on semanticEquals, and then sum.
+        val nestedFieldNum = nestedFieldToAlias
+          .foldLeft(Seq[ExtractValue]()) {
+            (unique, curr) => if (!unique.exists(curr._1.semanticEquals(_))) {
+              curr._1 +: unique
+            } else {
+              unique
+            }
+          }
+          .map { t => totalFieldNum(t.dataType)  }
+          .sum

Review comment:
       I see. The downside I can see is that you will have duplicate aliases for semantically equal `ExtractValue`s, e.g. two aliases for two `name.first`s.
   
   It may not be a big deal here, but the optimized query plan may end up with multiple `_gen_alias_xxx` that refer to the same `ExtractValue`.
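
The point above can be illustrated with a minimal, self-contained sketch. It does not use Catalyst: `FieldRef` is a hypothetical stand-in for `ExtractValue`, its `semanticEquals` is an assumed case-insensitive path comparison, and `width` stands in for `totalFieldNum(dataType)`. It only shows that the `foldLeft` deduplication changes the field count, while a per-occurrence alias list still carries one entry for each duplicate.

```scala
// Hypothetical stand-in for Catalyst's ExtractValue; not a Spark class.
final case class FieldRef(path: String, width: Int) {
  // Assumed notion of semantic equality for this sketch only.
  def semanticEquals(other: FieldRef): Boolean =
    path.equalsIgnoreCase(other.path)
}

object DedupSketch {
  // Keep one representative of each semantically distinct reference.
  def dedup(refs: Seq[FieldRef]): Seq[FieldRef] =
    refs.foldLeft(Seq.empty[FieldRef]) { (unique, curr) =>
      if (unique.exists(curr.semanticEquals)) unique else curr +: unique
    }

  // Sum the widths of the distinct references only.
  def distinctWidth(refs: Seq[FieldRef]): Int =
    dedup(refs).map(_.width).sum

  // One alias per occurrence, so semantic duplicates still get separate aliases.
  def aliases(refs: Seq[FieldRef]): Seq[String] =
    refs.zipWithIndex.map { case (_, i) => s"_gen_alias_$i" }

  def main(args: Array[String]): Unit = {
    val refs = Seq(
      FieldRef("name.first", 1),
      FieldRef("NAME.FIRST", 1),
      FieldRef("name.last", 1))
    println(s"distinct width: ${distinctWidth(refs)}")
    println(s"aliases: ${aliases(refs).mkString(", ")}")
  }
}
```

If the duplicate aliases matter, one option would be to build the alias map from the deduplicated sequence as well; whether that fits the rule here is not something this sketch settles.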





[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-657005554


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125654/
   Test FAILed.



[GitHub] [spark] maropu commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r446508750



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -39,6 +39,22 @@ object NestedColumnAliasing {
           NestedColumnAliasing.replaceToAliases(plan, nestedFieldToAlias, attrToAliases)
       }
 
+    /**
+     * This is to solve a `LogicalPlan` like `Project`->`Filter`->`Window`.
+     * In this case, `Window` can be plan that is `canProjectPushThrough`.
+     * By adding this, it allows nested columns to be passed onto next stages.
+     * Currently, not adding `Filter` into `canProjectPushThrough` due to
+     * infinitely loop in optimizers during the predicate push-down rule.
+     */

Review comment:
       btw, do you know why the optimizer hits this issue? I think it's better to find the root cause for future work, if possible.





[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-650853830







[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-662265862


   **[Test build #126305 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126305/testReport)** for PR 28898 at commit [`ab10486`](https://github.com/apache/spark/commit/ab10486293fd729ba76fdee9ba3661b4d265571d).



[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652169651


   **[Test build #124691 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124691/testReport)** for PR 28898 at commit [`c95633e`](https://github.com/apache/spark/commit/c95633e7a242e622900c596c03dd7d3c06441732).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448639798



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -178,11 +195,16 @@ object NestedColumnAliasing {
             nestedFieldToAlias
               .map { case (nestedField, _) => totalFieldNum(nestedField.dataType) }
               .sum < totalFieldNum(attr.dataType)) {
-          Some(attr.exprId -> nestedFieldToAlias)
+          Some((attr.exprId, nestedFieldToAlias))
         } else {
           None
         }
       }
+      .groupBy(_._1) // To fix same ExprId mapped to different attribute instance

Review comment:
       Simplified the logic above to do grouping by `ExprId` and updated the deduplication logic. 
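
As a rough sketch of what grouping by `ExprId` buys, assuming a hypothetical `Attr` case class rather than Catalyst's attribute types: two instances that share an `exprId` but differ in another field (here `nullable`, chosen only for illustration) are distinct map keys, so grouping by the attribute itself yields two groups, while grouping by `exprId` merges them.

```scala
// Hypothetical stand-in for an attribute reference; not a Catalyst class.
final case class Attr(exprId: Long, name: String, nullable: Boolean)

object GroupByExprIdSketch {
  // Merge the nested-field lists of all attribute instances sharing an exprId.
  def mergeByExprId(pairs: Seq[(Attr, Seq[String])]): Map[Long, Seq[String]] =
    pairs
      .groupBy { case (attr, _) => attr.exprId }
      .map { case (id, group) => id -> group.flatMap(_._2).distinct }

  def main(args: Array[String]): Unit = {
    val a1 = Attr(1L, "name", nullable = true)
    val a2 = Attr(1L, "name", nullable = false) // same exprId, different instance
    val pairs = Seq(a1 -> Seq("name.first"), a2 -> Seq("name.first", "name.last"))

    println(s"groups keyed by attribute: ${pairs.groupBy(_._1).size}")
    println(s"groups keyed by exprId: ${mergeByExprId(pairs).size}")
  }
}
```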





[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649991400







[GitHub] [spark] frankyin-factual commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions

Posted by GitBox <gi...@apache.org>.

frankyin-factual commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-650025446


   How do I retest? Looks like it failed for a random reason. 



[GitHub] [spark] viirya commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r451278529



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala
##########
@@ -493,6 +491,156 @@ class NestedColumnAliasingSuite extends SchemaPruningTest {
     comparePlans(optimized3, expected3)
   }
 
+  test("Nested field pruning for Window") {
+    val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+    val winExpr = windowExpr(RowNumber(), spec)
+
+    val query1 = contact
+      .select($"name.first", winExpr.as('window))
+      .analyze
+    val optimized1 = Optimize.execute(query1)
+    val expected1 = contact
+      .select($"name.first", $"address", $"id")
+      .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+      .select($"first", $"window")
+      .analyze
+    comparePlans(optimized1, expected1)

Review comment:
       But I ran this test on current master and it passed.





[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-650692254







[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652826043


   Merged build finished. Test FAILed.



[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652788243







[GitHub] [spark] frankyin-factual commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions

Posted by GitBox <gi...@apache.org>.

frankyin-factual commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649269967


   Let me see if I can get a more generalized solution out today. 



[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649942950


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124524/
   Test FAILed.



[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-655200055


   **[Test build #125268 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125268/testReport)** for PR 28898 at commit [`a0998ae`](https://github.com/apache/spark/commit/a0998ae4166efff59b62971e98945e201809fa15).



[GitHub] [spark] SparkQA removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-657154729


   **[Test build #125699 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125699/testReport)** for PR 28898 at commit [`da85920`](https://github.com/apache/spark/commit/da859203e91a0bc90b017a1557bcf3646733982a).



[GitHub] [spark] maropu commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions

Posted by GitBox <gi...@apache.org>.

maropu commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649199639


   btw, could you follow the PR template? https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE



[GitHub] [spark] maropu commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r446499494



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala
##########
@@ -493,6 +491,58 @@ class NestedColumnAliasingSuite extends SchemaPruningTest {
     comparePlans(optimized3, expected3)
   }
 
+  test("Nested field pruning for window functions") {

Review comment:
       `Nested field pruning for window functions` -> `Nested field pruning for Window`





[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656075862


   **[Test build #125466 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125466/testReport)** for PR 28898 at commit [`a7e885a`](https://github.com/apache/spark/commit/a7e885a3c5f09f9ca623777bdabcd05e664f3774).



[GitHub] [spark] maropu commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

maropu commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-655200009


   Looks okay. cc: @viirya @dongjoon-hyun 



[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r453294196



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaPruningSuite.scala
##########
@@ -460,6 +460,60 @@ abstract class SchemaPruningSuite
     checkAnswer(query4, Row(2, null) :: Row(2, 4) :: Nil)
   }
 
+  testSchemaPruning("select nested field in window function") {
+    val windowSql =
+      """
+        |with contact_rank as (
+        |  select row_number() over (partition by address order by id desc) as rank,
+        |  contacts.*
+        |  from contacts
+        |)
+        |select name.first, rank from contact_rank
+        |where name.first = 'Jane' AND rank = 1
+        |""".stripMargin
+    val query1 = sql(windowSql)
+    checkScan(query1, "struct<id:int,name:struct<first:string>,address:string>")
+    checkAnswer(query1, Row("Jane", 1) :: Nil)

Review comment:
       Changed





[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652788243







[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-663364305







[GitHub] [spark] gatorsmile commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions

Posted by GitBox <gi...@apache.org>.

gatorsmile commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649087502


   cc @viirya @cloud-fan 



[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-657904652


   **[Test build #125795 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125795/testReport)** for PR 28898 at commit [`2637974`](https://github.com/apache/spark/commit/2637974650f232a6aad83e0e4a3e1fdf03def401).



[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652831500







[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649937189







[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-650079629


   **[Test build #124540 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124540/testReport)** for PR 28898 at commit [`652c77f`](https://github.com/apache/spark/commit/652c77fdbbfa468271e783e1492f72f4785c9880).



[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-650853830







[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652170847







[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649942946







[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-653012314


   Merged build finished. Test FAILed.



[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649937189







[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans

Posted by GitBox <gi...@apache.org>.

frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r458548001



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala
##########
@@ -493,6 +491,144 @@ class NestedColumnAliasingSuite extends SchemaPruningTest {
     comparePlans(optimized3, expected3)
   }
 
+  test("Nested field pruning for Window") {
+    val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+    val winExpr = windowExpr(RowNumber(), spec)
+    val query = contact
+      .select($"name.first", winExpr.as('window))
+      .orderBy($"name.last".asc)
+      .analyze
+    val optimized = Optimize.execute(query)
+    val aliases = collectGeneratedAliases(optimized)
+    val expected = contact
+      .select($"name.first", $"address", $"id", $"name.last".as(aliases(1)))
+      .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+      .select($"first", $"window", $"${aliases(1)}".as(aliases(0)))
+      .orderBy($"${aliases(0)}".asc)
+      .select($"first", $"window")
+      .analyze
+    comparePlans(optimized, expected)
+  }
+
+  test("Nested field pruning for Filter with other operators") {

Review comment:
       Yep, I will include them in the test names. 





[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448718366



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -152,7 +168,7 @@ object NestedColumnAliasing {
     val exclusiveAttrSet = AttributeSet(exclusiveAttrs ++ otherRootReferences)
     val aliasSub = nestedFieldReferences.asInstanceOf[Seq[ExtractValue]]
       .filter(!_.references.subsetOf(exclusiveAttrSet))
-      .groupBy(_.references.head)
+      .groupBy(t => (t.references.head.exprId, t.references.head.dataType))

Review comment:
       Take a look at this query: 
   ```
   with contact_rank as (
     select row_number() over (partition by address order by id desc) as rank,
     contacts.*
     from contacts
     order by name.last, name.first
   )
   select name.first, rank from contact_rank
   ```
   ```
   nestedFieldReferences.asInstanceOf[Seq[ExtractValue]]
         .filter(!_.references.subsetOf(exclusiveAttrSet))
         .groupBy(_.references.head)
   ``` 
   returns:
   ```
   0 = {Tuple2@15534} "(name#46,List(name#46.first))"
   1 = {Tuple2@15535} "(name#46,List(name#46.last, name#46.first))"
   ```
   
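   Both groups are keyed by `name#46`, which is the same attribute of type `AttributeReference`; the only field that differs between the two keys is `qualifier`. Grouping by `exprId` and `dataType` instead puts all of these references into a single group, hence this change. 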
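
A minimal sketch of the grouping behaviour described above, written without Spark. `Attr` and `Field` are hypothetical stand-ins for Catalyst's `AttributeReference` and `ExtractValue`, and the qualifier values are invented for illustration; the sketch only assumes that equality on the attribute takes `qualifier` into account.

```scala
// Hypothetical stand-in for an attribute reference. Case-class equality
// compares every field, so `qualifier` is part of the grouping key.
final case class Attr(name: String, exprId: Long, dataType: String, qualifier: Seq[String])

// Hypothetical stand-in for a nested field access such as name#46.first.
final case class Field(root: Attr, path: String)

object GroupByDemo {
  private val structType = "struct<first:string,last:string>"

  // The same attribute (exprId 46) seen with two different qualifiers.
  // The qualifier values are assumptions made for this example.
  val viaCte: Attr   = Attr("name", 46L, structType, Seq("contact_rank"))
  val viaTable: Attr = Attr("name", 46L, structType, Seq("contacts"))

  val refs: Seq[Field] = Seq(
    Field(viaCte, "first"),
    Field(viaTable, "last"),
    Field(viaTable, "first"))

  // Keyed by the attribute itself: the qualifier splits the references.
  val byAttribute: Map[Attr, Seq[Field]] = refs.groupBy(_.root)

  // Keyed by (exprId, dataType): every reference lands in the same group.
  val byIdAndType: Map[(Long, String), Seq[Field]] =
    refs.groupBy(f => (f.root.exprId, f.root.dataType))

  def main(args: Array[String]): Unit = {
    println(s"groups keyed by attribute: ${byAttribute.size}")
    println(s"groups keyed by (exprId, dataType): ${byIdAndType.size}")
  }
}
```

With these inputs the first grouping yields two entries and the second yields one, which mirrors the two `name#46` tuples shown in the comment.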





[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-650080284







[GitHub] [spark] SparkQA removed a comment on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649912241


   **[Test build #124525 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124525/testReport)** for PR 28898 at commit [`4c705bd`](https://github.com/apache/spark/commit/4c705bd5e7cbeae2603afe799a338e068c35923c).



[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656520292


   Merged build finished. Test FAILed.



[GitHub] [spark] SparkQA removed a comment on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-650079629


   **[Test build #124540 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124540/testReport)** for PR 28898 at commit [`652c77f`](https://github.com/apache/spark/commit/652c77fdbbfa468271e783e1492f72f4785c9880).



[GitHub] [spark] maropu commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions

Posted by GitBox <gi...@apache.org>.

maropu commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649884335


   This is not a bugfix, so we will merge this commit only into master (v3.1.0).



[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r451223742



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala
##########
@@ -493,6 +491,156 @@ class NestedColumnAliasingSuite extends SchemaPruningTest {
     comparePlans(optimized3, expected3)
   }
 
+  test("Nested field pruning for Window") {
+    val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+    val winExpr = windowExpr(RowNumber(), spec)
+
+    val query1 = contact
+      .select($"name.first", winExpr.as('window))
+      .analyze
+    val optimized1 = Optimize.execute(query1)
+    val expected1 = contact
+      .select($"name.first", $"address", $"id")
+      .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+      .select($"first", $"window")
+      .analyze
+    comparePlans(optimized1, expected1)

Review comment:
       It is pruning, just not generating aliases. 
   https://github.com/apache/spark/pull/28898/files/a0998ae4166efff59b62971e98945e201809fa15#diff-d87f0060a604ac3e7149bd108fd548a5R505
   This line shows that it selects only `first` rather than the whole `name` column. 





[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-655680974







[GitHub] [spark] frankyin-factual commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

frankyin-factual commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-655199003


   @viirya @maropu Rebased with the current master. 



[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649990865


   **[Test build #124525 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124525/testReport)** for PR 28898 at commit [`4c705bd`](https://github.com/apache/spark/commit/4c705bd5e7cbeae2603afe799a338e068c35923c).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649234400







[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649942879


   **[Test build #124524 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124524/testReport)** for PR 28898 at commit [`3c8cf11`](https://github.com/apache/spark/commit/3c8cf110b19bc5d0c9e89a8a031e6e4a557aa1b3).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652830838


   **[Test build #124878 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124878/testReport)** for PR 28898 at commit [`2bd84a4`](https://github.com/apache/spark/commit/2bd84a4c7bdd4096712327784351695d53bf704c).



[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-655817273


   **[Test build #125392 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125392/testReport)** for PR 28898 at commit [`a7e885a`](https://github.com/apache/spark/commit/a7e885a3c5f09f9ca623777bdabcd05e664f3774).
    * This patch **fails PySpark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656497633







[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656948958


   **[Test build #125654 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125654/testReport)** for PR 28898 at commit [`da85920`](https://github.com/apache/spark/commit/da859203e91a0bc90b017a1557bcf3646733982a).



[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-657154729


   **[Test build #125699 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125699/testReport)** for PR 28898 at commit [`da85920`](https://github.com/apache/spark/commit/da859203e91a0bc90b017a1557bcf3646733982a).



[GitHub] [spark] SparkQA removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-650692166


   **[Test build #124592 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124592/testReport)** for PR 28898 at commit [`acce8c5`](https://github.com/apache/spark/commit/acce8c5d8d51bae5f981e56a8811f075cb07d214).



[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448705642



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -176,9 +192,16 @@ object NestedColumnAliasing {
         // By default, ColumnPruning rule uses `attr` already.
         if (nestedFieldToAlias.nonEmpty &&
             nestedFieldToAlias
-              .map { case (nestedField, _) => totalFieldNum(nestedField.dataType) }
-              .sum < totalFieldNum(attr.dataType)) {
-          Some(attr.exprId -> nestedFieldToAlias)
+              .foldLeft(Seq[ExtractValue]()) {
+                (unique, curr) => if (!unique.exists(curr._1.semanticEquals(_))) {
+                  curr._1 +: unique
+                } else {
+                  unique
+                }
+              }
+              .map { t => totalFieldNum(t.dataType)  }
+              .sum < totalFieldNum(attr._2)) {

Review comment:
       Updated. 





[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-657178726


   **[Test build #125699 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125699/testReport)** for PR 28898 at commit [`da85920`](https://github.com/apache/spark/commit/da859203e91a0bc90b017a1557bcf3646733982a).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] dongjoon-hyun edited a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun edited a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-662257917







[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-657178823







[GitHub] [spark] dongjoon-hyun commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-664067577


   Retest this please.



[GitHub] [spark] SparkQA removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-663378385


   **[Test build #126477 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126477/testReport)** for PR 28898 at commit [`a0b8d07`](https://github.com/apache/spark/commit/a0b8d070f027460dd1e5fdbd7dc35d0440450b0a).



[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-650708345


   Merged build finished. Test FAILed.



[GitHub] [spark] maropu commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

maropu commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656074434


   retest this please



[GitHub] [spark] viirya commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r445321841



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -113,6 +113,11 @@ object NestedColumnAliasing {
     case _: Sample => true
     case _: RepartitionByExpression => true
     case _: Join => true
+    case x: Filter => x.child match {
+      case _: Window => true

Review comment:
       I see, it is due to the predicate pushdown rule. I think we need a general solution, as @maropu said.





[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r451738919



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala
##########
@@ -493,6 +491,156 @@ class NestedColumnAliasingSuite extends SchemaPruningTest {
     comparePlans(optimized3, expected3)
   }
 
+  test("Nested field pruning for Window") {
+    val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+    val winExpr = windowExpr(RowNumber(), spec)
+
+    val query1 = contact
+      .select($"name.first", winExpr.as('window))
+      .analyze
+    val optimized1 = Optimize.execute(query1)
+    val expected1 = contact
+      .select($"name.first", $"address", $"id")
+      .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+      .select($"first", $"window")
+      .analyze
+    comparePlans(optimized1, expected1)

Review comment:
       Good point. I removed that test.





[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448792946



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -172,13 +188,23 @@ object NestedColumnAliasing {
           (f, Alias(f, s"_gen_alias_${exprId.id}")(exprId, Seq.empty, None))
         }
 
+
+        // Do deduplication based on semanticEquals, and then sum.
+        val nestedFieldNum = nestedFieldToAlias
+          .foldLeft(Seq[ExtractValue]()) {
+            (unique, curr) => if (!unique.exists(curr._1.semanticEquals(_))) {
+              curr._1 +: unique
+            } else {
+              unique
+            }
+          }
+          .map { t => totalFieldNum(t.dataType)  }
+          .sum

Review comment:
       I agree, but I think this is outside this PR's scope; I want to iterate on this safely first.
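
For readers following the diff above: the pruning check compares the summed field count of the referenced nested fields against the total field count of the root attribute, so a duplicated reference would inflate the sum. The following is a minimal sketch of the "deduplicate by semantic equality, then sum" step in plain Scala. It does not use Spark; `Field`, its `semanticEquals`, and `uniqueFieldCount` are hypothetical stand-ins for `ExtractValue`, Catalyst's `semanticEquals`, and the PR's inline fold.

```scala
// Hypothetical stand-in for a nested-field extraction such as `name.first`.
// The qualifier is treated as cosmetic: two fields are "semantically equal"
// when they point at the same path. This is an assumption of the sketch.
final case class Field(path: String, qualifier: String, fieldCount: Int) {
  def semanticEquals(other: Field): Boolean = path == other.path
}

object DedupSketch {
  // Keep one representative per semantically distinct field, then sum the counts.
  def uniqueFieldCount(fields: Seq[Field]): Int =
    fields
      .foldLeft(Seq.empty[Field]) { (unique, curr) =>
        if (unique.exists(_.semanticEquals(curr))) unique else curr +: unique
      }
      .map(_.fieldCount)
      .sum
}
```

The fold is quadratic in the number of references, which is probably acceptable for the handful of nested fields a single plan node touches; a hash-based approach would need a canonical form to hash on.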





[GitHub] [spark] viirya commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448712484



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -152,7 +168,7 @@ object NestedColumnAliasing {
     val exclusiveAttrSet = AttributeSet(exclusiveAttrs ++ otherRootReferences)
     val aliasSub = nestedFieldReferences.asInstanceOf[Seq[ExtractValue]]
       .filter(!_.references.subsetOf(exclusiveAttrSet))
-      .groupBy(_.references.head)
+      .groupBy(t => (t.references.head.exprId, t.references.head.dataType))

Review comment:
       Why is this change needed?





[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652044031







[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-657994219







[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656497633


   Merged build finished. Test FAILed.



[GitHub] [spark] maropu commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448035514



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -178,11 +195,16 @@ object NestedColumnAliasing {
             nestedFieldToAlias
               .map { case (nestedField, _) => totalFieldNum(nestedField.dataType) }
               .sum < totalFieldNum(attr.dataType)) {
-          Some(attr.exprId -> nestedFieldToAlias)
+          Some((attr.exprId, nestedFieldToAlias))

Review comment:
       ?





[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649991400


   Merged build finished. Test FAILed.



[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656496831


   **[Test build #125536 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125536/testReport)** for PR 28898 at commit [`a7e885a`](https://github.com/apache/spark/commit/a7e885a3c5f09f9ca623777bdabcd05e664f3774).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r458545107



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala
##########
@@ -493,6 +491,144 @@ class NestedColumnAliasingSuite extends SchemaPruningTest {
     comparePlans(optimized3, expected3)
   }
 
+  test("Nested field pruning for Window") {
+    val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+    val winExpr = windowExpr(RowNumber(), spec)
+    val query = contact
+      .select($"name.first", winExpr.as('window))
+      .orderBy($"name.last".asc)
+      .analyze
+    val optimized = Optimize.execute(query)
+    val aliases = collectGeneratedAliases(optimized)
+    val expected = contact
+      .select($"name.first", $"address", $"id", $"name.last".as(aliases(1)))
+      .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+      .select($"first", $"window", $"${aliases(1)}".as(aliases(0)))
+      .orderBy($"${aliases(0)}".asc)
+      .select($"first", $"window")
+      .analyze
+    comparePlans(optimized, expected)
+  }
+
+  test("Nested field pruning for Filter with other operators") {
+    val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+    val winExpr = windowExpr(RowNumber(), spec)
+    val query1 = contact.select($"name.first", winExpr.as('window))
+      .where($"window" === 1 && $"name.first" === "a")
+      .analyze
+    val optimized1 = Optimize.execute(query1)
+    val aliases1 = collectGeneratedAliases(optimized1)
+    val expected1 = contact
+      .select($"name.first", $"address", $"id", $"name.first".as(aliases1(1)))
+      .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+      .select($"first", $"${aliases1(1)}".as(aliases1(0)), $"window")
+      .where($"window" === 1 && $"${aliases1(0)}" === "a")
+      .select($"first", $"window")
+      .analyze
+    comparePlans(optimized1, expected1)
+
+    val query2 = contact.sortBy($"name.first".asc)
+      .where($"name.first" === "a")
+      .select($"name.first")
+      .analyze
+    val optimized2 = Optimize.execute(query2)
+    val aliases2 = collectGeneratedAliases(optimized2)
+    val expected2 = contact
+      .select($"name.first".as(aliases2(1)))
+      .sortBy($"${aliases2(1)}".asc)
+      .select($"${aliases2(1)}".as(aliases2(0)))
+      .where($"${aliases2(0)}" === "a")
+      .select($"${aliases2(0)}".as("first"))
+      .analyze
+    comparePlans(optimized2, expected2)

Review comment:
       Shall we move this test case into `test("Nested field pruning for Sort")`?





[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r453294078



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaPruningSuite.scala
##########
@@ -460,6 +460,60 @@ abstract class SchemaPruningSuite
     checkAnswer(query4, Row(2, null) :: Row(2, 4) :: Nil)
   }
 
+  testSchemaPruning("select nested field in window function") {
+    val windowSql =
+      """
+        |with contact_rank as (
+        |  select row_number() over (partition by address order by id desc) as rank,
+        |  contacts.*
+        |  from contacts
+        |)
+        |select name.first, rank from contact_rank
+        |where name.first = 'Jane' AND rank = 1
+        |""".stripMargin
+    val query1 = sql(windowSql)
+    checkScan(query1, "struct<id:int,name:struct<first:string>,address:string>")
+    checkAnswer(query1, Row("Jane", 1) :: Nil)
+  }
+
+  testSchemaPruning("select nested field in window function and then order by") {
+    val windowSql =
+      """
+        |with contact_rank as (
+        |  select row_number() over (partition by address order by id desc) as rank,
+        |  contacts.*
+        |  from contacts
+        |  order by name.last, name.first
+        |)
+        |select name.first, rank from contact_rank
+        |""".stripMargin
+    val query1 = sql(windowSql)
+    checkScan(query1, "struct<id:int,name:struct<first:string,last:string>,address:string>")
+    checkAnswer(query1,
+      Row("Jane", 1) ::
+        Row("John", 1) ::
+        Row("Janet", 1) ::
+        Row("Jim", 1) :: Nil)
+  }
+
+  testSchemaPruning("select nested field in Sort") {
+    val query1 = sql("select name.first, name.last from contacts order by name.first, name.last")
+    checkScan(query1, "struct<name:struct<first:string,last:string>>")
+    checkAnswer(query1,
+      Row("Jane", "Doe") ::
+        Row("Janet", "Jones") ::
+        Row("Jim", "Jones") ::
+        Row("John", "Doe") :: Nil)
+
+    val query2 = sql("select name.first, name.last from contacts sort by name.first, name.last")
+    checkScan(query2, "struct<name:struct<first:string,last:string>>")
+    checkAnswer(query1,

Review comment:
       Good catch. `sort by` is a local sort, so I also updated this test to repartition first, which makes the results more predictable.
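
To illustrate why a local sort gives layout-dependent results, here is a minimal sketch in plain Scala, with no Spark involved. Partitions are modelled as nested sequences, and `sortWithinPartitions` and `globalSort` are hypothetical helper names. Collapsing to a single partition is only one way to make the local sort deterministic; the exact repartitioning used in the updated test is not shown in this message.

```scala
object LocalSortSketch {
  // Models a local sort: each partition is sorted independently,
  // then partitions are read back in their original order.
  def sortWithinPartitions(partitions: Seq[Seq[String]]): Seq[String] =
    partitions.flatMap(_.sorted)

  // Models a global sort: all rows are ordered together.
  def globalSort(partitions: Seq[Seq[String]]): Seq[String] =
    partitions.flatten.sorted
}
```

With two partitions the two functions can disagree, because the local sort never compares rows across partitions. With all rows in one partition they always agree.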





[GitHub] [spark] viirya commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r445935219



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -39,6 +39,14 @@ object NestedColumnAliasing {
           NestedColumnAliasing.replaceToAliases(plan, nestedFieldToAlias, attrToAliases)
       }
 
+    case Project(projectList, Filter(condition, child))

Review comment:
       I think we had better leave a comment explaining this case.





[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-657230025







[GitHub] [spark] viirya commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448743319



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -152,7 +168,7 @@ object NestedColumnAliasing {
     val exclusiveAttrSet = AttributeSet(exclusiveAttrs ++ otherRootReferences)
     val aliasSub = nestedFieldReferences.asInstanceOf[Seq[ExtractValue]]
       .filter(!_.references.subsetOf(exclusiveAttrSet))
-      .groupBy(_.references.head)
+      .groupBy(t => (t.references.head.exprId, t.references.head.dataType))

Review comment:
       I see. I think we can simply clear the `qualifier` of the `AttributeReference` returned by `collectRootReferenceAndExtractValue`.
   
   We don't need the `qualifier`. We just use `exprId` and `dataType`.
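
The two options discussed here can be compared with a minimal sketch in plain Scala. `Attr` is a hypothetical model, not Spark's `AttributeReference`, and the sketch assumes the grouping problem comes from references that are identical except for their qualifier.

```scala
// Hypothetical model of an attribute reference (not Spark's class).
final case class Attr(name: String, exprId: Long, dataType: String, qualifier: Seq[String])

object GroupingSketch {
  // Grouping on the whole value splits references that differ only by qualifier.
  def groupWhole(refs: Seq[Attr]): Map[Attr, Seq[Attr]] =
    refs.groupBy(identity)

  // Option A (the diff's approach): group on the parts that matter.
  def groupByKey(refs: Seq[Attr]): Map[(Long, String), Seq[Attr]] =
    refs.groupBy(a => (a.exprId, a.dataType))

  // Option B (the suggestion above): drop the qualifier first, then group as before.
  def groupNormalised(refs: Seq[Attr]): Map[Attr, Seq[Attr]] =
    refs.map(_.copy(qualifier = Seq.empty)).groupBy(identity)
}
```

Option B keeps the map keyed by the attribute itself, so code that reads the key downstream would not need to change. Option A changes the key type to a tuple.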





[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r458548850



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -39,6 +39,20 @@ object NestedColumnAliasing {
           NestedColumnAliasing.replaceToAliases(plan, nestedFieldToAlias, attrToAliases)
       }
 
+    /**
+     * This pattern is needed to support [[Filter]] plan cases like
+     * [[Project]]->[[Filter]]->listed plan in `canProjectPushThrough` (e.g., [[Window]]).
+     * The reason why we don't simply add [[Filter]] in `canProjectPushThrough` is that
+     * the optimizer can hit an infinite loop during the [[PushDownPredicates]] rule.
+     */
+    case Project(projectList, Filter(condition, child))

Review comment:
       BTW, this looks logically a little odd to me because the second pattern is narrower than the first. In Scala, we usually put more specific patterns first, and `case Project(projectList, Filter(condition, child))` is more specific than the previous pattern, `case Project(projectList, child)`. Can we swap this case (line 48) with the previous one (line 34)? Would that break anything?
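
The ordering question can be illustrated with a toy model in plain Scala. None of the types below are Spark's: `ToyPlan`, `ToyFilter`, `ToyProject`, and the `canPushThrough` predicate are hypothetical, and the sketch assumes the general case carries a guard, as the doc comment's mention of `canProjectPushThrough` suggests.

```scala
// Toy plan ADT; these are not Spark's logical plan classes.
sealed trait ToyPlan
case object ToyRelation extends ToyPlan
final case class ToyFilter(condition: String, child: ToyPlan) extends ToyPlan
final case class ToyProject(columns: Seq[String], child: ToyPlan) extends ToyPlan

object MatchOrderSketch {
  // General (guarded) case first, Filter-specific case second.
  def generalFirst(plan: ToyPlan, canPushThrough: ToyPlan => Boolean): String = plan match {
    case ToyProject(_, child) if canPushThrough(child) => "general"
    case ToyProject(_, ToyFilter(_, _)) => "filter-specific"
    case _ => "no-match"
  }

  // Filter-specific case first, general (guarded) case second.
  def specificFirst(plan: ToyPlan, canPushThrough: ToyPlan => Boolean): String = plan match {
    case ToyProject(_, ToyFilter(_, _)) => "filter-specific"
    case ToyProject(_, child) if canPushThrough(child) => "general"
    case _ => "no-match"
  }
}
```

In this model the two orders differ only when the guard accepts a Filter child. If the guard rejects Filter, as the doc comment says `canProjectPushThrough` does, swapping the cases changes readability but not behaviour.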





[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656497640


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125536/
   Test FAILed.



[GitHub] [spark] maropu commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r445512898



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -32,7 +32,9 @@ object NestedColumnAliasing {
 
   def unapply(plan: LogicalPlan): Option[LogicalPlan] = plan match {
     case Project(projectList, child)
-        if SQLConf.get.nestedSchemaPruningEnabled && canProjectPushThrough(child) =>
+        if SQLConf.get.nestedSchemaPruningEnabled &&
+          (canProjectPushThrough(child) ||
+            getChild(child).exists(canProjectPushThrough)) =>

Review comment:
       Btw, I feel the title and the PR description are not accurate because this PR is not only for supporting the window case in nested pruning. Could you make them clearer?





[GitHub] [spark] viirya commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448753272



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -172,13 +188,23 @@ object NestedColumnAliasing {
           (f, Alias(f, s"_gen_alias_${exprId.id}")(exprId, Seq.empty, None))
         }
 
+
+        // Do deduplication based on semanticEquals, and then sum.
+        val nestedFieldNum = nestedFieldToAlias
+          .foldLeft(Seq[ExtractValue]()) {
+            (unique, curr) => if (!unique.exists(curr._1.semanticEquals(_))) {
+              curr._1 +: unique
+            } else {
+              unique
+            }
+          }
+          .map { t => totalFieldNum(t.dataType)  }
+          .sum

Review comment:
       The problem with your approach is that which one is selected from the `ExtractValue`s with different qualifiers is non-deterministic.
   
   Later, when we query the `nestedFieldToAlias` map, you might fail to find the corresponding item in the map because of the qualifier difference.
   
   I think the safer approach is to clean up all qualifiers.
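
   To illustrate the concern, here is a minimal sketch with a toy `FieldRef` class (a hypothetical stand-in for `ExtractValue`; its `semanticEquals` and the `removeQualifier` helper are simplified assumptions, not the Catalyst implementations). A fold that deduplicates by semantic equality keeps whichever element it sees first, so the surviving qualifier depends on input order:

   ```scala
   object DedupSketch {
     // Toy nested-field reference; NOT Spark's ExtractValue.
     case class FieldRef(qualifier: Seq[String], path: String) {
       // Hypothetical analogue of semanticEquals: the qualifier is ignored.
       def semanticEquals(other: FieldRef): Boolean = path == other.path
     }

     // Keep one representative per equivalence class; the first one seen wins.
     def dedup(refs: Seq[FieldRef]): Seq[FieldRef] =
       refs.foldLeft(Seq.empty[FieldRef]) { (unique, curr) =>
         if (unique.exists(_.semanticEquals(curr))) unique else unique :+ curr
       }

     // Normalizing first makes the representative independent of input order.
     def removeQualifier(r: FieldRef): FieldRef = r.copy(qualifier = Seq.empty)
   }
   ```

   With two references to `name.first` that differ only in qualifier, `dedup` returns a different representative depending on which comes first, while mapping `removeQualifier` over the input before deduplicating yields the same result either way.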

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -172,13 +188,23 @@ object NestedColumnAliasing {
           (f, Alias(f, s"_gen_alias_${exprId.id}")(exprId, Seq.empty, None))
         }
 
+
+        // Do deduplication based on semanticEquals, and then sum.
+        val nestedFieldNum = nestedFieldToAlias
+          .foldLeft(Seq[ExtractValue]()) {
+            (unique, curr) => if (!unique.exists(curr._1.semanticEquals(_))) {
+              curr._1 +: unique
+            } else {
+              unique
+            }
+          }
+          .map { t => totalFieldNum(t.dataType)  }
+          .sum

Review comment:
       Oh, I think this is also due to the different `qualifier`s in the `ExtractValue`s. As above, I think we can just clean up the `qualifier`.

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -172,13 +188,23 @@ object NestedColumnAliasing {
           (f, Alias(f, s"_gen_alias_${exprId.id}")(exprId, Seq.empty, None))
         }
 
+
+        // Do deduplication based on semanticEquals, and then sum.
+        val nestedFieldNum = nestedFieldToAlias
+          .foldLeft(Seq[ExtractValue]()) {
+            (unique, curr) => if (!unique.exists(curr._1.semanticEquals(_))) {
+              curr._1 +: unique
+            } else {
+              unique
+            }
+          }
+          .map { t => totalFieldNum(t.dataType)  }
+          .sum

Review comment:
       Like:
   
   ```scala
   
   def removeQualifier(f: ExtractValue): ExtractValue = {
     f.transform {
       case a: AttributeReference => a.withQualifier(Seq.empty)
     }.asInstanceOf[ExtractValue]
   }
   ```
   
   ```scala
   val dedupNestedFields = nestedFields.filter {
     case e @ (_: GetStructField | _: GetArrayStructFields) =>
       val child = e.children.head
       nestedFields.forall(f => child.find(_.semanticEquals(f)).isEmpty)
     case _ => true
   }.map(removeQualifier)
   ```
   
   And when we need to query the map, we do:
   ```scala
   nestedFieldToAlias.contains(removeQualifier(f))
   ```
   
   or 
   
   ```scala
   nestedFieldToAlias(removeQualifier(f)).toAttribute
   ```
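   
   As a self-contained illustration of the lookup side, the sketch below uses a toy `FieldRef` key and a hypothetical `buildAliases` helper (neither is Spark API; the alias names only mimic the `_gen_alias_` style). It shows that a map keyed by qualifier-free references misses on a raw lookup and hits once the probe is normalized the same way:
   
   ```scala
   object QualifierLookup {
     // Toy key type; NOT Spark's ExtractValue or AttributeReference.
     case class FieldRef(qualifier: Seq[String], path: String)

     def removeQualifier(r: FieldRef): FieldRef = r.copy(qualifier = Seq.empty)

     // Build the alias map from normalized (qualifier-free) keys.
     def buildAliases(refs: Seq[FieldRef]): Map[FieldRef, String] =
       refs.map(removeQualifier).distinct.zipWithIndex.map {
         case (ref, i) => ref -> s"_gen_alias_$i"
       }.toMap

     // Always normalize the probe before querying the map.
     def lookup(aliases: Map[FieldRef, String], r: FieldRef): Option[String] =
       aliases.get(removeQualifier(r))
   }
   ```
   
   The design point is that normalization has to happen on both sides: once when the map is built and again on every query.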

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -152,7 +168,7 @@ object NestedColumnAliasing {
     val exclusiveAttrSet = AttributeSet(exclusiveAttrs ++ otherRootReferences)
     val aliasSub = nestedFieldReferences.asInstanceOf[Seq[ExtractValue]]
       .filter(!_.references.subsetOf(exclusiveAttrSet))
-      .groupBy(_.references.head)
+      .groupBy(t => (t.references.head.exprId, t.references.head.dataType))

Review comment:
       We don't need to change `collectRootReferenceAndExtractValue`.
   
   ```scala
   val aliasSub = nestedFieldReferences.asInstanceOf[Seq[ExtractValue]]
     .filter(!_.references.subsetOf(exclusiveAttrSet))
     .groupBy(_.references.head.withQualifier(Seq.empty))
     ...
   ```





[GitHub] [spark] frankyin-factual edited a comment on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions

Posted by GitBox <gi...@apache.org>.

frankyin-factual edited a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-650024534


   Jenkins, retest this please



[GitHub] [spark] SparkQA removed a comment on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649954378


   **[Test build #124529 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124529/testReport)** for PR 28898 at commit [`652c77f`](https://github.com/apache/spark/commit/652c77fdbbfa468271e783e1492f72f4785c9880).



[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-657993700


   **[Test build #125795 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125795/testReport)** for PR 28898 at commit [`2637974`](https://github.com/apache/spark/commit/2637974650f232a6aad83e0e4a3e1fdf03def401).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] SparkQA removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-655200055







[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-653012320


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124882/
   Test FAILed.



[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652842493







[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652044031







[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-662241141







[GitHub] [spark] viirya commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r454012695



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaPruningSuite.scala
##########
@@ -460,6 +460,67 @@ abstract class SchemaPruningSuite
     checkAnswer(query4, Row(2, null) :: Row(2, 4) :: Nil)
   }
 
+  testSchemaPruning("select nested field in window function") {
+    val windowSql =
+      """
+        |with contact_rank as (
+        |  select row_number() over (partition by address order by id desc) as rank,
+        |  contacts.*
+        |  from contacts
+        |)
+        |select name.first, rank from contact_rank
+        |where name.first = 'Jane' AND rank = 1
+        |""".stripMargin
+    val query = sql(windowSql)
+    checkScan(query, "struct<id:int,name:struct<first:string>,address:string>")
+    checkAnswer(query, Row("Jane", 1) :: Nil)
+  }
+
+  testSchemaPruning("select nested field in window function and then order by") {
+    val windowSql =
+      """
+        |with contact_rank as (
+        |  select row_number() over (partition by address order by id desc) as rank,
+        |  contacts.*
+        |  from contacts
+        |  order by name.last, name.first
+        |)
+        |select name.first, rank from contact_rank
+        |""".stripMargin
+    val query = sql(windowSql)
+    checkScan(query, "struct<id:int,name:struct<first:string,last:string>,address:string>")
+    checkAnswer(query,
+      Row("Jane", 1) ::
+        Row("John", 1) ::
+        Row("Janet", 1) ::
+        Row("Jim", 1) :: Nil)
+  }
+
+  testSchemaPruning("select nested field in Sort") {
+    val query1 = sql("select name.first, name.last from contacts order by name.first, name.last")
+    checkScan(query1, "struct<name:struct<first:string,last:string>>")
+    checkAnswer(query1,
+      Row("Jane", "Doe") ::
+        Row("Janet", "Jones") ::
+        Row("Jim", "Jones") ::
+        Row("John", "Doe") :: Nil)
+
+    // Create a repartitioned view because `SORT BY` is a local sort
+    sql("select * from contacts").repartition(1).createOrReplaceTempView("tmp_contacts")
+    val sortBySql =

Review comment:
       Could you wrap this in `withTempView`? Why can't we use `contacts`? Does a local sort make any difference to the test here?
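
   For context, the point of a `withTempView`-style wrapper is cleanup: the view is dropped even when the test body throws. A minimal sketch of that loan pattern follows, using a toy in-memory registry (`views` and this `withTempView` are hypothetical illustrations, not the Spark test utility itself):

   ```scala
   object TempViewSketch {
     import scala.collection.mutable

     // Toy registry standing in for a session's temporary views.
     val views: mutable.Set[String] = mutable.Set.empty

     // Run `body`, then drop the named views whether or not `body` throws.
     def withTempView[T](names: String*)(body: => T): T =
       try body
       finally names.foreach(n => views.remove(n))
   }
   ```

   A test would register `tmp_contacts` inside the block and rely on the `finally` clause to remove it, so later tests never see a leftover view.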





[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r458547437



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala
##########
@@ -493,6 +491,144 @@ class NestedColumnAliasingSuite extends SchemaPruningTest {
     comparePlans(optimized3, expected3)
   }
 
+  test("Nested field pruning for Window") {
+    val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+    val winExpr = windowExpr(RowNumber(), spec)
+    val query = contact
+      .select($"name.first", winExpr.as('window))
+      .orderBy($"name.last".asc)
+      .analyze
+    val optimized = Optimize.execute(query)
+    val aliases = collectGeneratedAliases(optimized)
+    val expected = contact
+      .select($"name.first", $"address", $"id", $"name.last".as(aliases(1)))
+      .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+      .select($"first", $"window", $"${aliases(1)}".as(aliases(0)))
+      .orderBy($"${aliases(0)}".asc)
+      .select($"first", $"window")
+      .analyze
+    comparePlans(optimized, expected)
+  }
+
+  test("Nested field pruning for Filter with other operators") {
+    val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+    val winExpr = windowExpr(RowNumber(), spec)
+    val query1 = contact.select($"name.first", winExpr.as('window))
+      .where($"window" === 1 && $"name.first" === "a")
+      .analyze
+    val optimized1 = Optimize.execute(query1)
+    val aliases1 = collectGeneratedAliases(optimized1)
+    val expected1 = contact
+      .select($"name.first", $"address", $"id", $"name.first".as(aliases1(1)))
+      .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+      .select($"first", $"${aliases1(1)}".as(aliases1(0)), $"window")
+      .where($"window" === 1 && $"${aliases1(0)}" === "a")
+      .select($"first", $"window")
+      .analyze
+    comparePlans(optimized1, expected1)
+
+    val query2 = contact.sortBy($"name.first".asc)
+      .where($"name.first" === "a")
+      .select($"name.first")
+      .analyze
+    val optimized2 = Optimize.execute(query2)
+    val aliases2 = collectGeneratedAliases(optimized2)
+    val expected2 = contact
+      .select($"name.first".as(aliases2(1)))
+      .sortBy($"${aliases2(1)}".asc)
+      .select($"${aliases2(1)}".as(aliases2(0)))
+      .where($"${aliases2(0)}" === "a")
+      .select($"${aliases2(0)}".as("first"))
+      .analyze
+    comparePlans(optimized2, expected2)
+
+    val query3 = contact.distribute($"name.first")(100)
+      .where($"name.first" === "a")
+      .select($"name.first")
+      .analyze
+    val optimized3 = Optimize.execute(query3)
+    val aliases3 = collectGeneratedAliases(optimized3)
+    val expected3 = contact
+      .select($"name.first".as(aliases3(1)))
+      .distribute($"${aliases3(1)}")(100)
+      .select($"${aliases3(1)}".as(aliases3(0)))
+      .where($"${aliases3(0)}" === "a")
+      .select($"${aliases3(0)}".as("first"))
+      .analyze
+    comparePlans(optimized3, expected3)
+
+    val department = LocalRelation(
+      'depID.int,
+      'personID.string)
+    val query4 = contact.join(department, condition = Some($"id" === $"depID"))
+      .where($"name.first" === "a")
+      .select($"name.first")
+      .analyze
+    val optimized4 = Optimize.execute(query4)
+    val aliases4 = collectGeneratedAliases(optimized4)
+    val expected4 = contact
+      .select($"id", $"name.first".as(aliases4(1)))
+      .join(department.select('depID), condition = Some($"id" === $"depID"))
+      .select($"${aliases4(1)}".as(aliases4(0)))
+      .where($"${aliases4(0)}" === "a")
+      .select($"${aliases4(0)}".as("first"))
+      .analyze
+    comparePlans(optimized4, expected4)
+
+    def runTest(basePlan: LogicalPlan => LogicalPlan): Unit = {
+      val query = basePlan(contact)
+        .where($"name.first" === "a")
+        .select($"name.first")
+        .analyze
+      val optimized = Optimize.execute(query)
+      val aliases = collectGeneratedAliases(optimized)
+      val expected = basePlan(contact
+        .select($"name.first".as(aliases(0))))
+        .where($"${aliases(0)}" === "a")
+        .select($"${aliases(0)}".as("first"))
+        .analyze
+      comparePlans(optimized, expected)
+    }
+    Seq(
+      (plan: LogicalPlan) => plan.limit(100),
+      (plan: LogicalPlan) => plan.repartition(100),
+      (plan: LogicalPlan) => Sample(0.0, 0.6, false, 11L, plan)).foreach {  base =>
+        runTest(base)
+      }

Review comment:
       This tests the combination of `Filter -> Sample/GlobalLimit/LocalLimit/Repartition`, which is why it sits under this test name: it covers `Filter` combined with other children that can be pushed through.





[GitHub] [spark] dongjoon-hyun edited a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun edited a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-662257917


   @frankyin-factual . Thank you for updating. In general, it's a nice improvement contribution.
   - For the test cases, you may have a different idea. 
   - For the `unapply` pattern stuff, I believe we need more comments on that code path because it looks logically suspicious.
   
   I'll review again tomorrow with fresh eyes, and build and test more by myself. That helps me review. (For me, it's 11 PM since I'm in the PST timezone.)



[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r454033847



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaPruningSuite.scala
##########
@@ -460,6 +460,67 @@ abstract class SchemaPruningSuite
     checkAnswer(query4, Row(2, null) :: Row(2, 4) :: Nil)
   }
 
+  testSchemaPruning("select nested field in window function") {
+    val windowSql =
+      """
+        |with contact_rank as (
+        |  select row_number() over (partition by address order by id desc) as rank,
+        |  contacts.*
+        |  from contacts
+        |)
+        |select name.first, rank from contact_rank
+        |where name.first = 'Jane' AND rank = 1
+        |""".stripMargin
+    val query = sql(windowSql)
+    checkScan(query, "struct<id:int,name:struct<first:string>,address:string>")
+    checkAnswer(query, Row("Jane", 1) :: Nil)
+  }
+
+  testSchemaPruning("select nested field in window function and then order by") {
+    val windowSql =
+      """
+        |with contact_rank as (
+        |  select row_number() over (partition by address order by id desc) as rank,
+        |  contacts.*
+        |  from contacts
+        |  order by name.last, name.first
+        |)
+        |select name.first, rank from contact_rank
+        |""".stripMargin
+    val query = sql(windowSql)
+    checkScan(query, "struct<id:int,name:struct<first:string,last:string>,address:string>")
+    checkAnswer(query,
+      Row("Jane", 1) ::
+        Row("John", 1) ::
+        Row("Janet", 1) ::
+        Row("Jim", 1) :: Nil)
+  }
+
+  testSchemaPruning("select nested field in Sort") {
+    val query1 = sql("select name.first, name.last from contacts order by name.first, name.last")
+    checkScan(query1, "struct<name:struct<first:string,last:string>>")
+    checkAnswer(query1,
+      Row("Jane", "Doe") ::
+        Row("Janet", "Jones") ::
+        Row("Jim", "Jones") ::
+        Row("John", "Doe") :: Nil)
+
+    // Create a repartitioned view because `SORT BY` is a local sort
+    sql("select * from contacts").repartition(1).createOrReplaceTempView("tmp_contacts")
+    val sortBySql =

Review comment:
       Updated. Thanks. 





[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649907836







[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652826054


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124842/
   Test FAILed.



[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-650708222


   **[Test build #124592 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124592/testReport)** for PR 28898 at commit [`acce8c5`](https://github.com/apache/spark/commit/acce8c5d8d51bae5f981e56a8811f075cb07d214).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649311526







[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649907836







[GitHub] [spark] maropu commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r446499537



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala
##########
@@ -493,6 +491,58 @@ class NestedColumnAliasingSuite extends SchemaPruningTest {
     comparePlans(optimized3, expected3)
   }
 
+  test("Nested field pruning for window functions") {
+    val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+    val winExpr = windowExpr(RowNumber().toAggregateExpression(), spec)
+    val query = contact.select($"name.first", winExpr.as('window))
+      .where($"window" === 1 && $"name.first" === "a")
+      .analyze
+    val optimized = Optimize.execute(query)
+    val aliases = collectGeneratedAliases(optimized)
+    val expected = contact
+      .select($"name.first", $"address", $"id", $"name.first".as(aliases(1)))
+      .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+      .select($"first", $"${aliases(1)}".as(aliases(0)), $"window")
+      .where($"window" === 1 && $"${aliases(0)}" === "a")
+      .select($"first", $"window")
+      .analyze
+    comparePlans(optimized, expected)
+  }
+
+  test("Nested field pruning for orderBy") {

Review comment:
       Actually, this should be named `Nested field pruning for Sort`.





[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-653010073


   **[Test build #124882 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124882/testReport)** for PR 28898 at commit [`b5292bd`](https://github.com/apache/spark/commit/b5292bd9b602e32e7f460a2b1b69ddf0f3633bf3).
    * This patch **fails PySpark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448759360



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -172,13 +188,23 @@ object NestedColumnAliasing {
           (f, Alias(f, s"_gen_alias_${exprId.id}")(exprId, Seq.empty, None))
         }
 
+
+        // Do deduplication based on semanticEquals, and then sum.
+        val nestedFieldNum = nestedFieldToAlias
+          .foldLeft(Seq[ExtractValue]()) {
+            (unique, curr) => if (!unique.exists(curr._1.semanticEquals(_))) {
+              curr._1 +: unique
+            } else {
+              unique
+            }
+          }
+          .map { t => totalFieldNum(t.dataType)  }
+          .sum

Review comment:
       I think you may have replied on the wrong comment thread, but for that `groupBy`, only `exprId` and `dataType` are referenced, so the situation you described shouldn't matter.
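
For readers following the `groupBy` discussion, here is a minimal sketch of how the choice of grouping key can change the result. It uses toy case classes (`Attr`, `FieldAccess`), not Spark's `AttributeReference` or `ExtractValue`, and it assumes, purely for illustration, that attribute equality also compares a qualifier. The thread itself does not state the motivation for the key change.

```scala
// Toy stand-in for a Catalyst attribute; NOT Spark's AttributeReference.
final case class Attr(name: String, exprId: Long, dataType: String, qualifier: Seq[String])

// Toy stand-in for a nested field access such as `name.first`.
final case class FieldAccess(root: Attr, field: String)

object GroupByKeyDemo {
  // The same logical column, once with a qualifier and once without (hypothetical data).
  val qualified: Attr = Attr("name", 1L, "struct<first:string,last:string>", Seq("contact"))
  val bare: Attr = qualified.copy(qualifier = Seq.empty)

  val accesses: Seq[FieldAccess] =
    Seq(FieldAccess(qualified, "first"), FieldAccess(bare, "last"))

  // Case-class equality includes the qualifier, so the two roots form separate groups.
  val byAttr: Map[Attr, Seq[FieldAccess]] = accesses.groupBy(_.root)

  // Keying on (exprId, dataType) ignores the qualifier, so both accesses share one group.
  val byIdAndType: Map[(Long, String), Seq[FieldAccess]] =
    accesses.groupBy(a => (a.root.exprId, a.root.dataType))
}
```

In this toy setup, grouping by the whole attribute splits the accesses into two groups, while the `(exprId, dataType)` key keeps them together.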





[GitHub] [spark] SparkQA removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652734341


   **[Test build #124842 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124842/testReport)** for PR 28898 at commit [`d043e38`](https://github.com/apache/spark/commit/d043e3846432981ac7f9b85cb4515499ab6f0118).



[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-664068067


   **[Test build #126598 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126598/testReport)** for PR 28898 at commit [`a0b8d07`](https://github.com/apache/spark/commit/a0b8d070f027460dd1e5fdbd7dc35d0440450b0a).



[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448793072



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -152,7 +168,7 @@ object NestedColumnAliasing {
     val exclusiveAttrSet = AttributeSet(exclusiveAttrs ++ otherRootReferences)
     val aliasSub = nestedFieldReferences.asInstanceOf[Seq[ExtractValue]]
       .filter(!_.references.subsetOf(exclusiveAttrSet))
-      .groupBy(_.references.head)
+      .groupBy(t => (t.references.head.exprId, t.references.head.dataType))

Review comment:
       Sure. 





[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652841875


   **[Test build #124882 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124882/testReport)** for PR 28898 at commit [`b5292bd`](https://github.com/apache/spark/commit/b5292bd9b602e32e7f460a2b1b69ddf0f3633bf3).



[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649976637


   **[Test build #124527 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124527/testReport)** for PR 28898 at commit [`ab39d24`](https://github.com/apache/spark/commit/ab39d245660c16c0c11d0a37f73f84f74afd7951).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656520197


   **[Test build #125565 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125565/testReport)** for PR 28898 at commit [`a7e885a`](https://github.com/apache/spark/commit/a7e885a3c5f09f9ca623777bdabcd05e664f3774).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656944368







[GitHub] [spark] SparkQA removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-662240866







[GitHub] [spark] viirya commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448715891



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -172,13 +188,23 @@ object NestedColumnAliasing {
           (f, Alias(f, s"_gen_alias_${exprId.id}")(exprId, Seq.empty, None))
         }
 
+
+        // Do deduplication based on semanticEquals, and then sum.
+        val nestedFieldNum = nestedFieldToAlias
+          .foldLeft(Seq[ExtractValue]()) {
+            (unique, curr) => if (!unique.exists(curr._1.semanticEquals(_))) {
+              curr._1 +: unique
+            } else {
+              unique
+            }
+          }
+          .map { t => totalFieldNum(t.dataType)  }
+          .sum

Review comment:
       Why do we need to deduplicate again?
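
As context for the quoted hunk, the fold keeps an element only when no already-kept element is semantically equal to it, and then sums a per-element count. Below is a minimal sketch of that pattern with a toy `Field` type. Its `semanticEquals` ignores letter case purely as a stand-in for Catalyst's notion of semantic equality, and `fieldCount` is a hypothetical substitute for `totalFieldNum(t.dataType)`.

```scala
// Toy expression type; NOT Spark's ExtractValue.
final case class Field(path: String, fieldCount: Int) {
  // Illustrative equivalence: two paths are "the same" if they differ only by case.
  def semanticEquals(other: Field): Boolean = path.equalsIgnoreCase(other.path)
}

object DedupDemo {
  // Keep `curr` only if nothing already collected is semantically equal to it.
  def dedup(fields: Seq[Field]): Seq[Field] =
    fields.foldLeft(Seq.empty[Field]) { (unique, curr) =>
      if (unique.exists(u => curr.semanticEquals(u))) unique else curr +: unique
    }

  val fields: Seq[Field] =
    Seq(Field("name.first", 1), Field("NAME.FIRST", 1), Field("address", 3))

  // Sum the per-field counts after deduplication.
  val total: Int = dedup(fields).map(_.fieldCount).sum
}
```

Structural `distinct` would keep all three toy fields here, because `"name.first"` and `"NAME.FIRST"` are different strings. The fold treats them as one, so the duplicate is not counted twice.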





[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-662266294







[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions

Posted by GitBox <gi...@apache.org>.

frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r445903857



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -32,7 +32,9 @@ object NestedColumnAliasing {
 
   def unapply(plan: LogicalPlan): Option[LogicalPlan] = plan match {
     case Project(projectList, child)
-        if SQLConf.get.nestedSchemaPruningEnabled && canProjectPushThrough(child) =>
+        if SQLConf.get.nestedSchemaPruningEnabled &&
+          (canProjectPushThrough(child) ||
+            getChild(child).exists(canProjectPushThrough)) =>

Review comment:
       Yeah, I will update this PR later tonight. 
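
The guard in the quoted hunk matches a `Project` whose child is either directly push-through-able or has such a node one level further down. A minimal sketch of that matching shape follows, using a toy plan ADT rather than Catalyst's `LogicalPlan`. The bodies of `canProjectPushThrough` and `getChild` are assumptions made for illustration; the PR's actual definitions are not shown in this message.

```scala
// Toy logical-plan ADT; NOT Catalyst's LogicalPlan hierarchy.
sealed trait Plan
case object Relation extends Plan
final case class Window(child: Plan) extends Plan
final case class Filter(child: Plan) extends Plan
final case class Project(child: Plan) extends Plan

object GuardDemo {
  // Assumption: only Window is treated as push-through-able in this sketch.
  def canProjectPushThrough(p: Plan): Boolean = p match {
    case _: Window => true
    case _         => false
  }

  // Assumption: getChild looks one level below a Filter and nothing else.
  def getChild(p: Plan): Option[Plan] = p match {
    case Filter(c) => Some(c)
    case _         => None
  }

  // Same guard shape as the hunk: the direct child, or the node under a Filter.
  def matches(p: Plan): Boolean = p match {
    case Project(child)
        if canProjectPushThrough(child) ||
          getChild(child).exists(canProjectPushThrough) => true
    case _ => false
  }
}
```

With these toy definitions, `Project(Window(...))` and `Project(Filter(Window(...)))` both match, while `Project(Filter(Relation))` does not.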





[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-657199334







[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649976746


   Merged build finished. Test FAILed.



[GitHub] [spark] frankyin-factual commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

frankyin-factual commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656523393


   This might be a record for how fast the test is failing. :(



[GitHub] [spark] SparkQA removed a comment on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649907459


   **[Test build #124524 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124524/testReport)** for PR 28898 at commit [`3c8cf11`](https://github.com/apache/spark/commit/3c8cf110b19bc5d0c9e89a8a031e6e4a557aa1b3).



[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656944368







[GitHub] [spark] maropu commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r446065768



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -39,6 +39,14 @@ object NestedColumnAliasing {
           NestedColumnAliasing.replaceToAliases(plan, nestedFieldToAlias, attrToAliases)
       }
 
+    case Project(projectList, Filter(condition, child))

Review comment:
       +1





[GitHub] [spark] dongjoon-hyun commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-662240643


   Retest this please.



[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-657904956







[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-655334501







[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649234400







[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649310567


   **[Test build #124511 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124511/testReport)** for PR 28898 at commit [`ef21b63`](https://github.com/apache/spark/commit/ef21b6352a825c3c779f8fd5ffa6025ec77d372e).



[GitHub] [spark] maropu commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

maropu commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-657154379


   retest this please



[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-647849466


   Can one of the admins verify this patch?



[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-657199334







[GitHub] [spark] maropu closed pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans

Posted by GitBox <gi...@apache.org>.

maropu closed pull request #28898:
URL: https://github.com/apache/spark/pull/28898


   



[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652734686







[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649954884







[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-662240866


   **[Test build #126296 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126296/testReport)** for PR 28898 at commit [`2637974`](https://github.com/apache/spark/commit/2637974650f232a6aad83e0e4a3e1fdf03def401).



[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652842493







[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r458548850



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -39,6 +39,20 @@ object NestedColumnAliasing {
           NestedColumnAliasing.replaceToAliases(plan, nestedFieldToAlias, attrToAliases)
       }
 
+    /**
+     * This pattern is needed to support [[Filter]] plan cases like
+     * [[Project]]->[[Filter]]->listed plan in `canProjectPushThrough` (e.g., [[Window]]).
+     * The reason why we don't simply add [[Filter]] in `canProjectPushThrough` is that
+     * the optimizer can hit an infinite loop during the [[PushDownPredicates]] rule.
+     */
+    case Project(projectList, Filter(condition, child))

Review comment:
       BTW, this is logically a little odd because the second pattern is narrower than the first one. In Scala, we usually put more specific patterns first, and `case Project(projectList, Filter(condition, child))` is more specific than the previous pattern `case Project(projectList, child)`. Can we switch this case (line 48) with the previous case (line 34)? Does that break something?
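
To illustrate the ordering question, here is a minimal sketch using toy plan classes. Everything in it (`PatternOrderSketch`, the toy `Plan` nodes, `canPushThrough`) is a hypothetical stand-in rather than Catalyst's actual code. It assumes, as the doc comment in the diff says, that `Filter` is not accepted by the push-through check, so the guard on the general pattern rejects a `Filter` child and the match falls through to the narrower case.

```scala
object PatternOrderSketch {
  // Toy plan nodes; hypothetical stand-ins for Catalyst's LogicalPlan classes.
  sealed trait Plan
  case object Relation extends Plan
  case class Window(child: Plan) extends Plan
  case class Filter(child: Plan) extends Plan
  case class Project(child: Plan) extends Plan

  // Stand-in for `canProjectPushThrough`: Window is accepted, Filter is not.
  def canPushThrough(p: Plan): Boolean = p match {
    case _: Window => true
    case _ => false
  }

  // General pattern first: a Filter child fails the guard, so the
  // narrower Project(Filter(_)) case below is still reachable.
  def generalFirst(p: Plan): String = p match {
    case Project(child) if canPushThrough(child) => "project-child"
    case Project(Filter(child)) if canPushThrough(child) => "project-filter-child"
    case _ => "no-match"
  }

  // Specific pattern first: same result in this toy model, but the
  // reader no longer has to reason about the guard to see reachability.
  def specificFirst(p: Plan): String = p match {
    case Project(Filter(child)) if canPushThrough(child) => "project-filter-child"
    case Project(child) if canPushThrough(child) => "project-child"
    case _ => "no-match"
  }
}
```

In this toy model both orders select the same branch for every input, so swapping the cases would be a readability change only. Whether that also holds for the real rule depends on the actual guards, which this sketch does not reproduce.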





[GitHub] [spark] maropu commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r446499982



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaPruningSuite.scala
##########
@@ -460,6 +460,40 @@ abstract class SchemaPruningSuite
     checkAnswer(query4, Row(2, null) :: Row(2, 4) :: Nil)
   }
 
+  testSchemaPruning("select nested field in window function") {
+    val windowSql =
+      """
+        |with contact_rank as (
+        |  select row_number() over (partition by address order by id desc) as __rank,

Review comment:
       nit: `__rank` -> `rank`





[GitHub] [spark] maropu commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r445251813



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -113,6 +113,11 @@ object NestedColumnAliasing {
     case _: Sample => true
     case _: RepartitionByExpression => true
     case _: Join => true
+    case x: Filter => x.child match {
+      case _: Window => true

Review comment:
       Why do we need this? Can't we support the window case with only the change `case _: Window => true`?





[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652043485


   **[Test build #124691 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124691/testReport)** for PR 28898 at commit [`c95633e`](https://github.com/apache/spark/commit/c95633e7a242e622900c596c03dd7d3c06441732).



[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-650192029







[GitHub] [spark] maropu commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r445903069



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -32,7 +32,9 @@ object NestedColumnAliasing {
 
   def unapply(plan: LogicalPlan): Option[LogicalPlan] = plan match {
     case Project(projectList, child)
-        if SQLConf.get.nestedSchemaPruningEnabled && canProjectPushThrough(child) =>
+        if SQLConf.get.nestedSchemaPruningEnabled &&
+          (canProjectPushThrough(child) ||
+            getChild(child).exists(canProjectPushThrough)) =>

Review comment:
       > How about using my proposal at #28898 (review)?
   
   If we cannot, then yes, I think we need special handling for `Filter`, as @viirya suggested above.





[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-655331929


   Merged build finished. Test FAILed.



[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r451224944



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala
##########
@@ -493,6 +491,156 @@ class NestedColumnAliasingSuite extends SchemaPruningTest {
     comparePlans(optimized3, expected3)
   }
 
+  test("Nested field pruning for Window") {
+    val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+    val winExpr = windowExpr(RowNumber(), spec)
+
+    val query1 = contact
+      .select($"name.first", winExpr.as('window))
+      .analyze
+    val optimized1 = Optimize.execute(query1)
+    val expected1 = contact
+      .select($"name.first", $"address", $"id")
+      .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+      .select($"first", $"window")
+      .analyze
+    comparePlans(optimized1, expected1)
+
+    val query2 = contact
+      .select($"name.first", winExpr.as('window))
+      .orderBy($"name.last".asc)
+      .analyze
+    val optimized2 = Optimize.execute(query2)
+    val aliases2 = collectGeneratedAliases(optimized2)
+    val expected2 = contact
+      .select($"name.first", $"address", $"id", $"name.last".as(aliases2(1)))
+      .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+      .select($"first", $"window", $"${aliases2(1)}".as(aliases2(0)))
+      .orderBy($"${aliases2(0)}".asc)
+      .select($"first", $"window")
+      .analyze
+    comparePlans(optimized2, expected2)
+  }
+
+  test("Nested field pruning for Filter") {

Review comment:
       Name changed. 





[GitHub] [spark] maropu commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions

Posted by GitBox <gi...@apache.org>.

maropu commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649152732


   ok to test



[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-655334501


   Merged build finished. Test FAILed.



[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-655633032


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125337/
   Test FAILed.



[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656520292







[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-650708349


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124592/
   Test FAILed.



[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-663533813


   **[Test build #126477 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126477/testReport)** for PR 28898 at commit [`a0b8d07`](https://github.com/apache/spark/commit/a0b8d070f027460dd1e5fdbd7dc35d0440450b0a).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656514679







[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-663534557







[GitHub] [spark] viirya commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r445935343



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala
##########
@@ -493,6 +491,58 @@ class NestedColumnAliasingSuite extends SchemaPruningTest {
     comparePlans(optimized3, expected3)
   }
 
+  test("Nested field pruning for window functions") {
+    val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+    val winExpr = windowExpr(RowNumber().toAggregateExpression(), spec)
+    val query1 = contact.select($"name.first", winExpr.as('window))
+      .where($"window" === 1 && $"name.first" === "a")
+      .analyze
+    val optimized1 = Optimize.execute(query1)
+    val aliases1 = collectGeneratedAliases(optimized1)
+    val expected1 = contact
+      .select($"name.first", $"address", $"id", $"name.first".as(aliases1(1)))
+      .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+      .select($"first", $"${aliases1(1)}".as(aliases1(0)), $"window")
+      .where($"window" === 1 && $"${aliases1(0)}" === "a")
+      .select($"first", $"window")
+      .analyze
+    comparePlans(optimized1, expected1)
+  }
+
+  test("Nested field pruning for orderBy") {
+    val query1 = contact.select($"name.first", $"name.last")
+      .orderBy($"name.first".asc, $"name.last".asc)
+      .analyze
+    val optimized1 = Optimize.execute(query1)
+    val aliases1 = collectGeneratedAliases(optimized1)
+    val expected1 = contact
+      .select($"name.first",
+        $"name.last",
+        $"name.first".as(aliases1(0)),
+        $"name.last".as(aliases1(1)))
+      .orderBy($"${aliases1(0)}".asc, $"${aliases1(1)}".asc)
+      .select($"first", $"last")
+      .analyze
+    comparePlans(optimized1, expected1)
+  }
+
+  test("Nested field pruning for sirtBy") {

Review comment:
       Do you mean sortBy?





[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448806690



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -152,7 +168,12 @@ object NestedColumnAliasing {
     val exclusiveAttrSet = AttributeSet(exclusiveAttrs ++ otherRootReferences)
     val aliasSub = nestedFieldReferences.asInstanceOf[Seq[ExtractValue]]
       .filter(!_.references.subsetOf(exclusiveAttrSet))
-      .groupBy(_.references.head)
+      .groupBy(t => (t.references.head.exprId, t.references.head.dataType))
+      // The above groupBy is to avoid situation like the following situation.
+      // For example, `exprIdA -> List(a, b)` and  `exprIdA -> List(c, d)`

Review comment:
       Just pushed another comment. 





[GitHub] [spark] viirya commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448774792



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -172,13 +188,23 @@ object NestedColumnAliasing {
           (f, Alias(f, s"_gen_alias_${exprId.id}")(exprId, Seq.empty, None))
         }
 
+
+        // Do deduplication based on semanticEquals, and then sum.
+        val nestedFieldNum = nestedFieldToAlias
+          .foldLeft(Seq[ExtractValue]()) {
+            (unique, curr) => if (!unique.exists(curr._1.semanticEquals(_))) {
+              curr._1 +: unique
+            } else {
+              unique
+            }
+          }
+          .map { t => totalFieldNum(t.dataType)  }
+          .sum

Review comment:
       No, I mean this comment thread.
   
   I am not sure if you are aware of it: the reason you need to deduplicate here is that semantically equal `ExtractValue`s are applied to attributes with different qualifiers, e.g. there are two `name.first`, one referring to `name` with qualifier `a` and the other to `name` with qualifier `b`.
   
   I ran a test using your query with all qualifiers cleaned up as I showed, and it works well.
   
   What I said in the comment above is that you select an arbitrary `ExtractValue` from these `ExtractValue`s with different qualifiers, but later we look into the map using a given `ExtractValue`. A case might fail where you select the `name.first` with qualifier `a` but later look up the map using the `name.first` with qualifier `b`.
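
The lookup concern can be sketched with a toy model. The names below (`QualifierLookupSketch`, `Attr`, `GetField`, `canonical`) are hypothetical and only mimic the idea of an attribute whose equality includes its qualifier; they are not Spark's `AttributeReference` or `ExtractValue`. The sketch deduplicates by a qualifier-insensitive `semanticEquals`, keeps the first representative, and then shows that a map keyed by that representative misses a lookup made with the other qualifier.

```scala
object QualifierLookupSketch {
  // Toy attribute: exprId identifies the column, qualifier is the table alias.
  case class Attr(name: String, exprId: Long, qualifier: Option[String])

  // Toy nested-field access such as `name.first`.
  case class GetField(root: Attr, field: String) {
    // Toy semantic equality: ignores the qualifier.
    def semanticEquals(other: GetField): Boolean =
      root.exprId == other.root.exprId && field == other.field

    // Qualifier-free form, usable as a stable map key.
    def canonical: GetField = copy(root = root.copy(qualifier = None))
  }

  val viaA = GetField(Attr("name", 1L, Some("a")), "first")
  val viaB = GetField(Attr("name", 1L, Some("b")), "first")

  // Deduplicate on semanticEquals, keeping the first representative (viaA).
  val unique: Seq[GetField] =
    Seq(viaA, viaB).foldLeft(Seq.empty[GetField]) { (acc, cur) =>
      if (acc.exists(_.semanticEquals(cur))) acc else acc :+ cur
    }

  // Keyed by the arbitrary representative: structural equality sees the qualifier.
  val aliasByField: Map[GetField, String] =
    unique.map(f => f -> "_gen_alias_0").toMap

  // Keyed by the qualifier-free form instead.
  val aliasByCanonical: Map[GetField, String] =
    unique.map(f => f.canonical -> "_gen_alias_0").toMap
}
```

With these definitions, `aliasByField.get(viaB)` comes back empty even though `viaB` is semantically the same field as the stored key, while `aliasByCanonical.get(viaB.canonical)` finds the alias. Stripping qualifiers before building the map is one way to avoid the mismatch in this toy setting.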
   
   





[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-647849466


   Can one of the admins verify this patch?



[GitHub] [spark] viirya commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r445623584



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -32,7 +32,9 @@ object NestedColumnAliasing {
 
   def unapply(plan: LogicalPlan): Option[LogicalPlan] = plan match {
     case Project(projectList, child)
-        if SQLConf.get.nestedSchemaPruningEnabled && canProjectPushThrough(child) =>
+        if SQLConf.get.nestedSchemaPruningEnabled &&
+          (canProjectPushThrough(child) ||
+            getChild(child).exists(canProjectPushThrough)) =>

Review comment:
       How about using my proposal?

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -32,7 +32,9 @@ object NestedColumnAliasing {
 
   def unapply(plan: LogicalPlan): Option[LogicalPlan] = plan match {
     case Project(projectList, child)
-        if SQLConf.get.nestedSchemaPruningEnabled && canProjectPushThrough(child) =>
+        if SQLConf.get.nestedSchemaPruningEnabled &&
+          (canProjectPushThrough(child) ||
+            getChild(child).exists(canProjectPushThrough)) =>

Review comment:
       I don't think this is the correct fix. It would push through a child that should not be pushed through.
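
A small sketch of why the grandchild check is too permissive, again with toy classes. `GrandchildCheckSketch`, the toy `Plan` nodes, and this `getChild` are hypothetical (the real `getChild` is not shown in the diff), and `Aggregate` is used only as a stand-in for any operator the push-through check does not accept.

```scala
object GrandchildCheckSketch {
  // Toy plan nodes; hypothetical stand-ins for Catalyst's LogicalPlan classes.
  sealed trait Plan { def children: Seq[Plan] }
  case object Relation extends Plan { val children: Seq[Plan] = Seq.empty }
  case class Window(child: Plan) extends Plan { def children: Seq[Plan] = Seq(child) }
  case class Filter(child: Plan) extends Plan { def children: Seq[Plan] = Seq(child) }
  case class Aggregate(child: Plan) extends Plan { def children: Seq[Plan] = Seq(child) }

  // Stand-in for `canProjectPushThrough`: only Window is accepted here.
  def canPushThrough(p: Plan): Boolean = p.isInstanceOf[Window]

  // Assumed behaviour of `getChild`: return the single child, if any.
  def getChild(p: Plan): Option[Plan] = p.children.headOption

  // Shape of the condition in the diff: accepts ANY operator sitting
  // on top of a push-through-able grandchild.
  def looseCheck(child: Plan): Boolean =
    canPushThrough(child) || getChild(child).exists(canPushThrough)

  // Narrower alternative: look through Filter only.
  def filterOnlyCheck(child: Plan): Boolean = child match {
    case Filter(grandchild) => canPushThrough(grandchild)
    case other => canPushThrough(other)
  }
}
```

In this model `looseCheck(Aggregate(Window(Relation)))` is true although `Aggregate` itself is not accepted, whereas `filterOnlyCheck` rejects it and still accepts `Filter(Window(Relation))`.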





[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-663534557







[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652825077


   **[Test build #124842 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124842/testReport)** for PR 28898 at commit [`d043e38`](https://github.com/apache/spark/commit/d043e3846432981ac7f9b85cb4515499ab6f0118).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448713879



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -152,7 +168,7 @@ object NestedColumnAliasing {
     val exclusiveAttrSet = AttributeSet(exclusiveAttrs ++ otherRootReferences)
     val aliasSub = nestedFieldReferences.asInstanceOf[Seq[ExtractValue]]
       .filter(!_.references.subsetOf(exclusiveAttrSet))
-      .groupBy(_.references.head)
+      .groupBy(t => (t.references.head.exprId, t.references.head.dataType))

Review comment:
       https://github.com/apache/spark/pull/28898#discussion_r447950809
   
   So basically the same expression can carry different table aliases, which causes a collision during map insertion.
   
   This change groups by the `exprId`, which is the unique identifier.
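
The collision described above can be sketched without Spark. The following is a minimal, self-contained model and not Catalyst code: `Attr`, `FieldAccess` and `GroupByExprIdSketch` are hypothetical stand-ins, and the only assumption carried over from the comment is that two references to the same column can differ in their table qualifier while sharing one `exprId`.

```scala
// Simplified stand-in for a Catalyst attribute; NOT Spark's actual class.
final case class Attr(name: String, exprId: Long, dataType: String, qualifier: Seq[String])

// Simplified stand-in for a nested field extraction such as `name.first`.
final case class FieldAccess(root: Attr, field: String)

object GroupByExprIdSketch {
  // The same column, once seen through the table and once through a CTE alias.
  val viaTable: Attr = Attr("name", 1L, "struct<first:string,last:string>", Seq("contacts"))
  val viaCte: Attr = viaTable.copy(qualifier = Seq("contact_rank"))

  val accesses: Seq[FieldAccess] =
    Seq(FieldAccess(viaTable, "first"), FieldAccess(viaCte, "last"))

  // Keying on the whole attribute splits one column into two groups,
  // because the qualifier takes part in case-class equality.
  val byAttribute: Map[Attr, Seq[FieldAccess]] = accesses.groupBy(_.root)

  // Keying on (exprId, dataType) keeps both accesses in a single group.
  val byExprId: Map[(Long, String), Seq[FieldAccess]] =
    accesses.groupBy(a => (a.root.exprId, a.root.dataType))

  def main(args: Array[String]): Unit = {
    println(s"groups keyed by attribute: ${byAttribute.size}")
    println(s"groups keyed by (exprId, dataType): ${byExprId.size}")
  }
}
```

In this model the attribute-keyed map ends up with two entries for what is really one column, while the `(exprId, dataType)` key yields one.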





[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-662241141







[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r454021579



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaPruningSuite.scala
##########
@@ -460,6 +460,67 @@ abstract class SchemaPruningSuite
     checkAnswer(query4, Row(2, null) :: Row(2, 4) :: Nil)
   }
 
+  testSchemaPruning("select nested field in window function") {
+    val windowSql =
+      """
+        |with contact_rank as (
+        |  select row_number() over (partition by address order by id desc) as rank,
+        |  contacts.*
+        |  from contacts
+        |)
+        |select name.first, rank from contact_rank
+        |where name.first = 'Jane' AND rank = 1
+        |""".stripMargin
+    val query = sql(windowSql)
+    checkScan(query, "struct<id:int,name:struct<first:string>,address:string>")
+    checkAnswer(query, Row("Jane", 1) :: Nil)
+  }
+
+  testSchemaPruning("select nested field in window function and then order by") {
+    val windowSql =
+      """
+        |with contact_rank as (
+        |  select row_number() over (partition by address order by id desc) as rank,
+        |  contacts.*
+        |  from contacts
+        |  order by name.last, name.first
+        |)
+        |select name.first, rank from contact_rank
+        |""".stripMargin
+    val query = sql(windowSql)
+    checkScan(query, "struct<id:int,name:struct<first:string,last:string>,address:string>")
+    checkAnswer(query,
+      Row("Jane", 1) ::
+        Row("John", 1) ::
+        Row("Janet", 1) ::
+        Row("Jim", 1) :: Nil)
+  }
+
+  testSchemaPruning("select nested field in Sort") {
+    val query1 = sql("select name.first, name.last from contacts order by name.first, name.last")
+    checkScan(query1, "struct<name:struct<first:string,last:string>>")
+    checkAnswer(query1,
+      Row("Jane", "Doe") ::
+        Row("Janet", "Jones") ::
+        Row("Jim", "Jones") ::
+        Row("John", "Doe") :: Nil)
+
+    // Create a repartitioned view because `SORT BY` is a local sort
+    sql("select * from contacts").repartition(1).createOrReplaceTempView("tmp_contacts")
+    val sortBySql =

Review comment:
       Yeah, because it's a per-partition sort, the result isn't exactly predictable. With `repartition(1)`, we can make sure this test isn't flaky.
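
Why a single partition makes the result predictable can be illustrated with plain Scala collections. This is only a sketch: `LocalSortSketch` and `sortWithinPartitions` are hypothetical names, and a partition is modeled as an ordinary `Seq` rather than anything from Spark.

```scala
object LocalSortSketch {
  // A per-partition sort orders each partition on its own; the overall
  // output is just the sorted partitions placed one after another.
  def sortWithinPartitions(partitions: Seq[Seq[String]]): Seq[String] =
    partitions.flatMap(_.sorted)

  val names: Seq[String] = Seq("John", "Jane", "Jim", "Janet")

  // One possible split of the rows across two partitions.
  val twoPartitions: Seq[Seq[String]] = Seq(Seq("John", "Jim"), Seq("Jane", "Janet"))

  // The same rows collapsed into one partition.
  val onePartition: Seq[Seq[String]] = Seq(names)

  def main(args: Array[String]): Unit = {
    println(sortWithinPartitions(twoPartitions)) // depends on how rows were split
    println(sortWithinPartitions(onePartition))  // same as a global sort
  }
}
```

With several partitions the concatenated output depends on how the rows happened to be distributed, so an ordered comparison of the result can fail intermittently. With one partition the local sort coincides with a global sort.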





[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-662282323







[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r458544989



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala
##########
@@ -493,6 +491,144 @@ class NestedColumnAliasingSuite extends SchemaPruningTest {
     comparePlans(optimized3, expected3)
   }
 
+  test("Nested field pruning for Window") {
+    val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+    val winExpr = windowExpr(RowNumber(), spec)
+    val query = contact
+      .select($"name.first", winExpr.as('window))
+      .orderBy($"name.last".asc)
+      .analyze
+    val optimized = Optimize.execute(query)
+    val aliases = collectGeneratedAliases(optimized)
+    val expected = contact
+      .select($"name.first", $"address", $"id", $"name.last".as(aliases(1)))
+      .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+      .select($"first", $"window", $"${aliases(1)}".as(aliases(0)))
+      .orderBy($"${aliases(0)}".asc)
+      .select($"first", $"window")
+      .analyze
+    comparePlans(optimized, expected)
+  }
+
+  test("Nested field pruning for Filter with other operators") {
+    val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+    val winExpr = windowExpr(RowNumber(), spec)
+    val query1 = contact.select($"name.first", winExpr.as('window))
+      .where($"window" === 1 && $"name.first" === "a")
+      .analyze
+    val optimized1 = Optimize.execute(query1)
+    val aliases1 = collectGeneratedAliases(optimized1)
+    val expected1 = contact
+      .select($"name.first", $"address", $"id", $"name.first".as(aliases1(1)))
+      .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+      .select($"first", $"${aliases1(1)}".as(aliases1(0)), $"window")
+      .where($"window" === 1 && $"${aliases1(0)}" === "a")
+      .select($"first", $"window")
+      .analyze
+    comparePlans(optimized1, expected1)

Review comment:
       Shall we move this test case into `test("Nested field pruning for Window")`?





[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649936846


   **[Test build #124527 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124527/testReport)** for PR 28898 at commit [`ab39d24`](https://github.com/apache/spark/commit/ab39d245660c16c0c11d0a37f73f84f74afd7951).



[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-657005551







[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656520305


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125565/
   Test FAILed.



[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-650014778


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124529/
   Test FAILed.



[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-657229858


   **[Test build #125714 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125714/testReport)** for PR 28898 at commit [`68cbfd2`](https://github.com/apache/spark/commit/68cbfd24f1f50266c5d1c5dfc24e29699f87c3e3).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-653006001







[GitHub] [spark] viirya commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r453280237



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaPruningSuite.scala
##########
@@ -460,6 +460,60 @@ abstract class SchemaPruningSuite
     checkAnswer(query4, Row(2, null) :: Row(2, 4) :: Nil)
   }
 
+  testSchemaPruning("select nested field in window function") {
+    val windowSql =
+      """
+        |with contact_rank as (
+        |  select row_number() over (partition by address order by id desc) as rank,
+        |  contacts.*
+        |  from contacts
+        |)
+        |select name.first, rank from contact_rank
+        |where name.first = 'Jane' AND rank = 1
+        |""".stripMargin
+    val query1 = sql(windowSql)
+    checkScan(query1, "struct<id:int,name:struct<first:string>,address:string>")
+    checkAnswer(query1, Row("Jane", 1) :: Nil)
+  }
+
+  testSchemaPruning("select nested field in window function and then order by") {
+    val windowSql =
+      """
+        |with contact_rank as (
+        |  select row_number() over (partition by address order by id desc) as rank,
+        |  contacts.*
+        |  from contacts
+        |  order by name.last, name.first
+        |)
+        |select name.first, rank from contact_rank
+        |""".stripMargin
+    val query1 = sql(windowSql)
+    checkScan(query1, "struct<id:int,name:struct<first:string,last:string>,address:string>")
+    checkAnswer(query1,

Review comment:
       query1 -> query





[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-663364305







[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-662266294







[GitHub] [spark] maropu commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r445902694



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -32,7 +32,9 @@ object NestedColumnAliasing {
 
   def unapply(plan: LogicalPlan): Option[LogicalPlan] = plan match {
     case Project(projectList, child)
-        if SQLConf.get.nestedSchemaPruningEnabled && canProjectPushThrough(child) =>
+        if SQLConf.get.nestedSchemaPruningEnabled &&
+          (canProjectPushThrough(child) ||
+            getChild(child).exists(canProjectPushThrough)) =>

Review comment:
       > That won’t work because it seems causing an infinite loop in optimizer. It gives me error messages like running out of max iterations.
   >> I see, it is due to predicate pushdown rule.
   
   I haven't looked into it, but can't we fix the infinite loop caused by the predicate pushdown rule? If we can put `Filter` in `canProjectPushThrough`, that looks like the best option.





[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-650014768


   Merged build finished. Test FAILed.



[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448779163



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -172,13 +188,23 @@ object NestedColumnAliasing {
           (f, Alias(f, s"_gen_alias_${exprId.id}")(exprId, Seq.empty, None))
         }
 
+
+        // Do deduplication based on semanticEquals, and then sum.
+        val nestedFieldNum = nestedFieldToAlias
+          .foldLeft(Seq[ExtractValue]()) {
+            (unique, curr) => if (!unique.exists(curr._1.semanticEquals(_))) {
+              curr._1 +: unique
+            } else {
+              unique
+            }
+          }
+          .map { t => totalFieldNum(t.dataType)  }
+          .sum

Review comment:
       No, I mean you probably misread the code here. This map still returns `ExprId->Seq(name.last, name.first, name.first)` because I didn't change the lookup map here. All I changed is how the fields are counted, so that it won't fall into the `else` branch, which is the case where all leaf nodes are covered and no schema pruning is required.
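
The counting change can be sketched in isolation. What follows is a simplified model, not the Catalyst implementation: `DataType`, `StructType`, `Access` and `dedup` are hypothetical stand-ins, path equality stands in for `semanticEquals`, and the comparison against the root's leaf count is an assumption about how the surrounding code decides whether pruning is worthwhile.

```scala
object DedupFieldCountSketch {
  // Minimal type model; NOT Spark's DataType hierarchy.
  sealed trait DataType
  case object StringType extends DataType
  final case class StructType(fields: Seq[(String, DataType)]) extends DataType

  // Number of leaf fields reachable from a type.
  def totalFieldNum(dt: DataType): Int = dt match {
    case StructType(fields) => fields.map { case (_, t) => totalFieldNum(t) }.sum
    case _ => 1
  }

  // A nested field access; equal paths stand in for semantically equal expressions.
  final case class Access(path: String, dataType: DataType)

  // Keep one representative per distinct access, as the foldLeft in the diff does.
  def dedup(accesses: Seq[Access]): Seq[Access] =
    accesses.foldLeft(Seq.empty[Access]) { (unique, curr) =>
      if (unique.exists(_.path == curr.path)) unique else curr +: unique
    }

  val nameType: DataType =
    StructType(Seq("first" -> StringType, "middle" -> StringType, "last" -> StringType))

  // `name.first` appears twice, e.g. once in a projection and once in a filter.
  val accesses: Seq[Access] = Seq(
    Access("name.last", StringType),
    Access("name.first", StringType),
    Access("name.first", StringType))

  val naiveCount: Int = accesses.map(a => totalFieldNum(a.dataType)).sum
  val dedupedCount: Int = dedup(accesses).map(a => totalFieldNum(a.dataType)).sum
  val rootCount: Int = totalFieldNum(nameType)

  def main(args: Array[String]): Unit = {
    println(s"naive=$naiveCount deduped=$dedupedCount root=$rootCount")
  }
}
```

Counting the repeated `name.first` twice makes the accessed fields look as numerous as the leaves of the root struct, so pruning would be skipped. After deduplication the count drops below the root's leaf count, and the unused `middle` field can still be pruned.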





[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-662282432


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/126296/
   Test FAILed.



[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-653001957


   **[Test build #124878 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124878/testReport)** for PR 28898 at commit [`2bd84a4`](https://github.com/apache/spark/commit/2bd84a4c7bdd4096712327784351695d53bf704c).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] frankyin-factual commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans

Posted by GitBox <gi...@apache.org>.

frankyin-factual commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-662254203


   > "Allow nested schema pruning thru window/sort/filter plans" looks like a bit of an over-claim to me. Technically, this PR doesn't support all general `Filter` plans, does it? IIUC, this PR only handles `Filter(_, child)` where `child` satisfies `canProjectPushThrough`. In other words, if `child` is not of that type, this PR cannot push into the `Filter` plan.
   > 
   > Although the PR description is correct in mentioning `Project->Filter->[any node can be pruned]`, it would be better to avoid the misleading PR title. You can focus on `Window/Sort` in the PR title, and the PR description can still mention your contribution on `Project->Filter->[any node can be pruned]`.
   
   Just changed the PR title to not have `Filter` in it.



[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-655368737







[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-653012314







[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649502299







[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-655680784


   **[Test build #125392 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125392/testReport)** for PR 28898 at commit [`a7e885a`](https://github.com/apache/spark/commit/a7e885a3c5f09f9ca623777bdabcd05e664f3774).



[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r458550049



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala
##########
@@ -493,6 +491,144 @@ class NestedColumnAliasingSuite extends SchemaPruningTest {
     comparePlans(optimized3, expected3)
   }
 
+  test("Nested field pruning for Window") {
+    val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+    val winExpr = windowExpr(RowNumber(), spec)
+    val query = contact
+      .select($"name.first", winExpr.as('window))
+      .orderBy($"name.last".asc)
+      .analyze
+    val optimized = Optimize.execute(query)
+    val aliases = collectGeneratedAliases(optimized)
+    val expected = contact
+      .select($"name.first", $"address", $"id", $"name.last".as(aliases(1)))
+      .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+      .select($"first", $"window", $"${aliases(1)}".as(aliases(0)))
+      .orderBy($"${aliases(0)}".asc)
+      .select($"first", $"window")
+      .analyze
+    comparePlans(optimized, expected)
+  }
+
+  test("Nested field pruning for Filter with other operators") {
+    val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+    val winExpr = windowExpr(RowNumber(), spec)
+    val query1 = contact.select($"name.first", winExpr.as('window))
+      .where($"window" === 1 && $"name.first" === "a")
+      .analyze
+    val optimized1 = Optimize.execute(query1)
+    val aliases1 = collectGeneratedAliases(optimized1)
+    val expected1 = contact
+      .select($"name.first", $"address", $"id", $"name.first".as(aliases1(1)))
+      .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+      .select($"first", $"${aliases1(1)}".as(aliases1(0)), $"window")
+      .where($"window" === 1 && $"${aliases1(0)}" === "a")
+      .select($"first", $"window")
+      .analyze
+    comparePlans(optimized1, expected1)
+
+    val query2 = contact.sortBy($"name.first".asc)
+      .where($"name.first" === "a")
+      .select($"name.first")
+      .analyze
+    val optimized2 = Optimize.execute(query2)
+    val aliases2 = collectGeneratedAliases(optimized2)
+    val expected2 = contact
+      .select($"name.first".as(aliases2(1)))
+      .sortBy($"${aliases2(1)}".asc)
+      .select($"${aliases2(1)}".as(aliases2(0)))
+      .where($"${aliases2(0)}" === "a")
+      .select($"${aliases2(0)}".as("first"))
+      .analyze
+    comparePlans(optimized2, expected2)
+
+    val query3 = contact.distribute($"name.first")(100)
+      .where($"name.first" === "a")
+      .select($"name.first")
+      .analyze
+    val optimized3 = Optimize.execute(query3)
+    val aliases3 = collectGeneratedAliases(optimized3)
+    val expected3 = contact
+      .select($"name.first".as(aliases3(1)))
+      .distribute($"${aliases3(1)}")(100)
+      .select($"${aliases3(1)}".as(aliases3(0)))
+      .where($"${aliases3(0)}" === "a")
+      .select($"${aliases3(0)}".as("first"))
+      .analyze
+    comparePlans(optimized3, expected3)
+
+    val department = LocalRelation(
+      'depID.int,
+      'personID.string)
+    val query4 = contact.join(department, condition = Some($"id" === $"depID"))
+      .where($"name.first" === "a")
+      .select($"name.first")
+      .analyze
+    val optimized4 = Optimize.execute(query4)
+    val aliases4 = collectGeneratedAliases(optimized4)
+    val expected4 = contact
+      .select($"id", $"name.first".as(aliases4(1)))
+      .join(department.select('depID), condition = Some($"id" === $"depID"))
+      .select($"${aliases4(1)}".as(aliases4(0)))
+      .where($"${aliases4(0)}" === "a")
+      .select($"${aliases4(0)}".as("first"))
+      .analyze
+    comparePlans(optimized4, expected4)
+
+    def runTest(basePlan: LogicalPlan => LogicalPlan): Unit = {
+      val query = basePlan(contact)
+        .where($"name.first" === "a")
+        .select($"name.first")
+        .analyze
+      val optimized = Optimize.execute(query)
+      val aliases = collectGeneratedAliases(optimized)
+      val expected = basePlan(contact
+        .select($"name.first".as(aliases(0))))
+        .where($"${aliases(0)}" === "a")
+        .select($"${aliases(0)}".as("first"))
+        .analyze
+      comparePlans(optimized, expected)
+    }
+    Seq(
+      (plan: LogicalPlan) => plan.limit(100),
+      (plan: LogicalPlan) => plan.repartition(100),
+      (plan: LogicalPlan) => Sample(0.0, 0.6, false, 11L, plan)).foreach {  base =>
+        runTest(base)
+      }

Review comment:
       I know that, but this PR doesn't support `Filter` completely. I believe we had better collect these simple test case additions there.





[GitHub] [spark] frankyin-factual commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans

Posted by GitBox <gi...@apache.org>.

frankyin-factual commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-662266123


   @dongjoon-hyun Thanks for reviewing this late. 
   For the test cases, I think it might be better to group all the `Project->Filter->[any node can be pruned]` cases together because it is the newly introduced path. 
   



[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans

Posted by GitBox <gi...@apache.org>.

frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r458550558



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala
##########
@@ -493,6 +491,144 @@ class NestedColumnAliasingSuite extends SchemaPruningTest {
     comparePlans(optimized3, expected3)
   }
 
+  test("Nested field pruning for Window") {
+    val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+    val winExpr = windowExpr(RowNumber(), spec)
+    val query = contact
+      .select($"name.first", winExpr.as('window))
+      .orderBy($"name.last".asc)
+      .analyze
+    val optimized = Optimize.execute(query)
+    val aliases = collectGeneratedAliases(optimized)
+    val expected = contact
+      .select($"name.first", $"address", $"id", $"name.last".as(aliases(1)))
+      .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+      .select($"first", $"window", $"${aliases(1)}".as(aliases(0)))
+      .orderBy($"${aliases(0)}".asc)
+      .select($"first", $"window")
+      .analyze
+    comparePlans(optimized, expected)
+  }
+
+  test("Nested field pruning for Filter with other operators") {

Review comment:
       It looks like listing all the operations would make for a lengthy line, so I used `supported operators`.





[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r451393487



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala
##########
@@ -493,6 +491,156 @@ class NestedColumnAliasingSuite extends SchemaPruningTest {
     comparePlans(optimized3, expected3)
   }
 
+  test("Nested field pruning for Window") {
+    val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+    val winExpr = windowExpr(RowNumber(), spec)
+
+    val query1 = contact
+      .select($"name.first", winExpr.as('window))
+      .analyze
+    val optimized1 = Optimize.execute(query1)
+    val expected1 = contact
+      .select($"name.first", $"address", $"id")
+      .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+      .select($"first", $"window")
+      .analyze
+    comparePlans(optimized1, expected1)

Review comment:
       Exactly, this isn't improved in this PR, as it has always worked. I'm not sure whether we want to remove this test.





[GitHub] [spark] maropu commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans

Posted by GitBox <gi...@apache.org>.

maropu commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-664715448


   Thanks for the check, @dongjoon-hyun and @viirya! I checked it again and I have no more comments. Merged to master.



[GitHub] [spark] frankyin-factual commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

frankyin-factual commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-658954329


   @dongjoon-hyun friendly bump




[GitHub] [spark] SparkQA removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-657199034


   **[Test build #125714 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125714/testReport)** for PR 28898 at commit [`68cbfd2`](https://github.com/apache/spark/commit/68cbfd24f1f50266c5d1c5dfc24e29699f87c3e3).



[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-650190831


   **[Test build #124540 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124540/testReport)** for PR 28898 at commit [`652c77f`](https://github.com/apache/spark/commit/652c77fdbbfa468271e783e1492f72f4785c9880).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r458548850



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -39,6 +39,20 @@ object NestedColumnAliasing {
           NestedColumnAliasing.replaceToAliases(plan, nestedFieldToAlias, attrToAliases)
       }
 
+    /**
+     * This pattern is needed to support [[Filter]] plan cases like
+     * [[Project]]->[[Filter]]->listed plan in `canProjectPushThrough` (e.g., [[Window]]).
+     * The reason why we don't simply add [[Filter]] in `canProjectPushThrough` is that
+     * the optimizer can hit an infinite loop during the [[PushDownPredicates]] rule.
+     */
+    case Project(projectList, Filter(condition, child))

Review comment:
       BTW, it's logically a little weird to me because the second pattern looks narrower than the first pattern. In Scala, we usually put specific patterns first. I'm saying that `case Project(projectList, Filter(condition, child))` is more specific than the previous pattern `case Project(projectList, child)`. Can we switch this case (line 48) and the previous case (line 34)? Or does it break something? If switching the two patterns breaks something, it might be worth mentioning.
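
The ordering question can be explored with a small, self-contained sketch. It uses toy plan nodes rather than Catalyst classes, and `canPushThrough` is a hypothetical stand-in for `canProjectPushThrough` that, as the Scaladoc in the diff describes, does not list `Filter`. Under that assumption the two guards never both hold for the same plan, so either ordering should select the same branch and the swap is mainly a readability matter. Whether this carries over to the real rule depends on its actual guards.

```scala
object PatternOrderSketch {
  // Toy plan nodes; illustrative stand-ins, not Catalyst classes.
  sealed trait Plan
  case object Relation extends Plan
  case class Window(child: Plan) extends Plan
  case class Filter(child: Plan) extends Plan
  case class Project(child: Plan) extends Plan

  // Hypothetical analogue of `canProjectPushThrough`; Filter is deliberately absent.
  def canPushThrough(p: Plan): Boolean = p match {
    case _: Window => true
    case _ => false
  }

  // General pattern first, as in the diff under review.
  def generalFirst(p: Plan): String = p match {
    case Project(child) if canPushThrough(child) => "project-over-pushable"
    case Project(Filter(child)) if canPushThrough(child) => "project-over-filter"
    case _ => "no-match"
  }

  // Specific pattern first, as the review comment suggests.
  def specificFirst(p: Plan): String = p match {
    case Project(Filter(child)) if canPushThrough(child) => "project-over-filter"
    case Project(child) if canPushThrough(child) => "project-over-pushable"
    case _ => "no-match"
  }
}
```

In this sketch the `Project(Filter(...))` case stays reachable in `generalFirst` only because the first guard rejects `Filter`; if `Filter` were ever added to the pushable set, the general case would shadow the specific one.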







[GitHub] [spark] viirya commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

viirya commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-650852190


   retest this please




[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-657005354


   **[Test build #125654 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125654/testReport)** for PR 28898 at commit [`da85920`](https://github.com/apache/spark/commit/da859203e91a0bc90b017a1557bcf3646733982a).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] maropu commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r446499675



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -39,6 +39,22 @@ object NestedColumnAliasing {
           NestedColumnAliasing.replaceToAliases(plan, nestedFieldToAlias, attrToAliases)
       }
 
+    /**
+     * This is to solve a `LogicalPlan` like `Project`->`Filter`->`Window`.
+     * In this case, `Window` can be plan that is `canProjectPushThrough`.
+     * By adding this, it allows nested columns to be passed onto next stages.
+     * Currently, not adding `Filter` into `canProjectPushThrough` due to
+     * infinitely loop in optimizers during the predicate push-down rule.
+     */
+

Review comment:
       nit: remove this unnecessary blank line.






[GitHub] [spark] maropu commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448035806



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -178,11 +195,16 @@ object NestedColumnAliasing {
             nestedFieldToAlias
               .map { case (nestedField, _) => totalFieldNum(nestedField.dataType) }
               .sum < totalFieldNum(attr.dataType)) {
-          Some(attr.exprId -> nestedFieldToAlias)
+          Some((attr.exprId, nestedFieldToAlias))
         } else {
           None
         }
       }
+      .groupBy(_._1) // To fix same ExprId mapped to different attribute instance

Review comment:
       Do you mean this fix is only for the Window case?
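
For context on what a `groupBy(_._1)` step can buy here, the following is a minimal sketch, not the PR's code. The rest of the chain is not visible in this hunk, so the merge step is an assumption. A `Long` stands in for an `ExprId` and a `String` for a nested-field alias; both are hypothetical simplifications. It shows that building a map directly from pairs with a repeated key keeps only the last pair, while grouping by the key first keeps every entry.

```scala
object GroupByExprIdSketch {
  // Hypothetical stand-ins: a Long plays the role of an ExprId, a String that of an alias.
  type Id = Long

  val perAttribute: Seq[(Id, Seq[String])] = Seq(
    (1L, Seq("name.first")),
    (1L, Seq("name.last")), // same id reached through a different attribute instance
    (2L, Seq("address.city"))
  )

  // Converting straight to a Map keeps only the last pair for a duplicated key.
  val lossy: Map[Id, Seq[String]] = perAttribute.toMap

  // Grouping by the key and concatenating the values keeps every entry.
  val merged: Map[Id, Seq[String]] =
    perAttribute.groupBy(_._1).map { case (id, entries) => id -> entries.flatMap(_._2) }
}
```

Nothing in the sketch is specific to `Window`; any plan in which the same id shows up on more than one attribute instance would hit the same collision.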





[GitHub] [spark] maropu commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r445510771



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -32,7 +32,9 @@ object NestedColumnAliasing {
 
   def unapply(plan: LogicalPlan): Option[LogicalPlan] = plan match {
     case Project(projectList, child)
-        if SQLConf.get.nestedSchemaPruningEnabled && canProjectPushThrough(child) =>
+        if SQLConf.get.nestedSchemaPruningEnabled &&
+          (canProjectPushThrough(child) ||
+            getChild(child).exists(canProjectPushThrough)) =>

Review comment:
       Is this correct? I'm not 100% sure that this matching case can handle this condition: `Project->[*Any* logical unary node]->[Logical node that can be pushed through]`. Anyway, we need more tests for this change.
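
The concern can be made concrete with a toy model. This is a sketch under assumptions: the node classes are illustrative, and `getChild` is assumed to return the single child of a unary node, since its definition is not shown in this hunk. With that reading, a guard of the form `canPushThrough(child) || getChild(child).exists(canPushThrough)` accepts any unary node sitting between the `Project` and a pushable node, not just `Filter`.

```scala
object BroadGuardSketch {
  // Toy plan nodes; illustrative stand-ins, not Catalyst classes.
  sealed trait Plan { def children: Seq[Plan] }
  case object Relation extends Plan { val children: Seq[Plan] = Nil }
  case class Window(child: Plan) extends Plan { def children: Seq[Plan] = Seq(child) }
  case class Filter(child: Plan) extends Plan { def children: Seq[Plan] = Seq(child) }
  case class Aggregate(child: Plan) extends Plan { def children: Seq[Plan] = Seq(child) }

  // Hypothetical analogue of `canProjectPushThrough`.
  def canPushThrough(p: Plan): Boolean = p.isInstanceOf[Window]

  // Assumed shape of the helper: the only child of a unary node, if there is one.
  def getChild(p: Plan): Option[Plan] = p.children match {
    case Seq(only) => Some(only)
    case _ => None
  }

  // Guard in the style of the diff: the direct child or its single child is pushable.
  def broadGuard(child: Plan): Boolean =
    canPushThrough(child) || getChild(child).exists(canPushThrough)

  // Narrower alternative that only looks through Filter.
  def filterOnlyGuard(child: Plan): Boolean = child match {
    case Filter(grandChild) => canPushThrough(grandChild)
    case other => canPushThrough(other)
  }
}
```

Here `broadGuard` also accepts an `Aggregate` over a `Window`, which `filterOnlyGuard` rejects. Whether such a match would be harmful in the real rule is exactly what additional tests would need to establish.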





[GitHub] [spark] SparkQA removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652669311


   **[Test build #124822 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124822/testReport)** for PR 28898 at commit [`2bc6a1a`](https://github.com/apache/spark/commit/2bc6a1ae1f2fb7c657d95d5abd92615fdc95eaef).





[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r447948500



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -178,11 +195,16 @@ object NestedColumnAliasing {
             nestedFieldToAlias
               .map { case (nestedField, _) => totalFieldNum(nestedField.dataType) }
               .sum < totalFieldNum(attr.dataType)) {
-          Some(attr.exprId -> nestedFieldToAlias)
+          Some((attr.exprId, nestedFieldToAlias))

Review comment:
       https://github.com/apache/spark/pull/28898/files#diff-957112380b0a2ef014abc8227d0b70acR479-R496





[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649912241


   **[Test build #124525 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124525/testReport)** for PR 28898 at commit [`4c705bd`](https://github.com/apache/spark/commit/4c705bd5e7cbeae2603afe799a338e068c35923c).



[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-655330636







[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions

Posted by GitBox <gi...@apache.org>.

frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r445255141



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -113,6 +113,11 @@ object NestedColumnAliasing {
     case _: Sample => true
     case _: RepartitionByExpression => true
     case _: Join => true
+    case x: Filter => x.child match {
+      case _: Window => true

Review comment:
       It looks like the plan is `Project -> Filter -> Window`. If we only add `case _: Window => true`, the projection aliasing won't be available at the `Window` stage, so it can't be passed on to the later stages described in the ticket.





[GitHub] [spark] viirya commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r445294440



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -113,6 +113,11 @@ object NestedColumnAliasing {
     case _: Sample => true
     case _: RepartitionByExpression => true
     case _: Join => true
+    case x: Filter => x.child match {
+      case _: Window => true

Review comment:
       Then why not just add `case _: Filter => true`, if you want the project to be pushed through `Filter`?
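
The two options discussed in this thread can be compared with a small standalone sketch. It uses toy stand-ins rather than Catalyst's real plan classes, and the names `Plan`, `PushThroughSketch`, `narrow` and `general` are hypothetical. The `case _ => false` fallback inside the nested match is an assumption, since the diff context above does not show it.

```scala
// Toy stand-ins for logical plan nodes; these are NOT Spark's real classes.
sealed trait Plan
final case class Relation(name: String) extends Plan
final case class Sample(child: Plan) extends Plan
final case class Window(child: Plan) extends Plan
final case class Filter(child: Plan) extends Plan

object PushThroughSketch {
  // Variant from the diff: a Filter qualifies only when it sits directly on a Window.
  def narrow(plan: Plan): Boolean = plan match {
    case _: Sample => true
    case f: Filter => f.child match {
      case _: Window => true
      case _ => false // assumed fallback, not visible in the diff
    }
    case _ => false
  }

  // Variant suggested in review: any Filter qualifies, whatever its child is.
  def general(plan: Plan): Boolean = plan match {
    case _: Sample => true
    case _: Filter => true
    case _ => false
  }
}
```

Under these assumptions, the narrow variant accepts `Filter(Window(...))` but rejects `Filter(Sample(...))`, while the general variant accepts both.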





[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-655633024


   Merged build finished. Test FAILed.



[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652669795







[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656434653


   **[Test build #125536 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125536/testReport)** for PR 28898 at commit [`a7e885a`](https://github.com/apache/spark/commit/a7e885a3c5f09f9ca623777bdabcd05e664f3774).



[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656228054


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125466/
   Test FAILed.



[GitHub] [spark] SparkQA removed a comment on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions

Posted by GitBox <gi...@apache.org>.

SparkQA removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649936846


   **[Test build #124527 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124527/testReport)** for PR 28898 at commit [`ab39d24`](https://github.com/apache/spark/commit/ab39d245660c16c0c11d0a37f73f84f74afd7951).



[GitHub] [spark] frankyin-factual commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

frankyin-factual commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-660563330


   @maropu @viirya @dongjoon-hyun friendly bump 



[GitHub] [spark] maropu commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions

Posted by GitBox <gi...@apache.org>.

maropu commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-650077585


   retest this please



[GitHub] [spark] frankyin-factual commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions

Posted by GitBox <gi...@apache.org>.

frankyin-factual commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649310142


   Just pushed a generalized solution. 



[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448704588



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -176,9 +192,16 @@ object NestedColumnAliasing {
         // By default, ColumnPruning rule uses `attr` already.
         if (nestedFieldToAlias.nonEmpty &&
             nestedFieldToAlias
-              .map { case (nestedField, _) => totalFieldNum(nestedField.dataType) }
-              .sum < totalFieldNum(attr.dataType)) {
-          Some(attr.exprId -> nestedFieldToAlias)
+              .foldLeft(Seq[ExtractValue]()) {
+                (unique, curr) => if (!unique.exists(curr._1.semanticEquals(_))) {
+                  curr._1 +: unique
+                } else {
+                  unique
+                }
+              }
+              .map { t => totalFieldNum(t.dataType)  }
+              .sum < totalFieldNum(attr._2)) {

Review comment:
       Sure. 
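
The `foldLeft` in the diff above deduplicates nested fields by semantic equality before summing their field counts, presumably so that a field referenced twice is not counted twice when deciding whether pruning saves anything. A minimal standalone sketch of that idea follows. `FieldRef`, its case-insensitive `semanticEquals`, and `worthPruning` are hypothetical stand-ins, not Catalyst's `ExtractValue` or the actual rule.

```scala
// Toy stand-in for a nested-field reference; NOT Catalyst's ExtractValue.
final case class FieldRef(path: String, numLeafFields: Int) {
  // Hypothetical equivalence for illustration: paths compare case-insensitively.
  def semanticEquals(other: FieldRef): Boolean = path.equalsIgnoreCase(other.path)
}

object DedupSketch {
  // Keep only the first reference of each semantic-equivalence class.
  // Entries are prepended, so the result is in reverse encounter order.
  def dedup(refs: Seq[FieldRef]): Seq[FieldRef] =
    refs.foldLeft(Seq.empty[FieldRef]) { (unique, curr) =>
      if (unique.exists(curr.semanticEquals)) unique else curr +: unique
    }

  // Pruning pays off only if the distinct referenced fields cover
  // fewer leaf fields than the whole attribute has.
  def worthPruning(refs: Seq[FieldRef], totalFields: Int): Boolean =
    refs.nonEmpty && dedup(refs).map(_.numLeafFields).sum < totalFields
}
```

With a three-field struct and the references `name.first`, `NAME.FIRST` and `name.last`, a plain sum counts three fields and would skip pruning, whereas the deduplicated sum counts two and allows it.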





[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652170847







[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions

Posted by GitBox <gi...@apache.org>.

frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r445331787



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -113,6 +113,11 @@ object NestedColumnAliasing {
     case _: Sample => true
     case _: RepartitionByExpression => true
     case _: Join => true
+    case x: Filter => x.child match {
+      case _: Window => true

Review comment:
       Do you have a sample query that produces `Project->Filter->Sample`? I've been trying to come up with a query that generates this plan. 





[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions

Posted by GitBox <gi...@apache.org>.

SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-650014689


   **[Test build #124529 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124529/testReport)** for PR 28898 at commit [`652c77f`](https://github.com/apache/spark/commit/652c77fdbbfa468271e783e1492f72f4785c9880).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-664068267







[GitHub] [spark] maropu commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

maropu commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656433112


   retest this please



[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656434949







[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans

Posted by GitBox <gi...@apache.org>.

AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-664132483







[GitHub] [spark] maropu commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans

Posted by GitBox <gi...@apache.org>.

maropu commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r446499910



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala
##########
@@ -493,6 +491,58 @@ class NestedColumnAliasingSuite extends SchemaPruningTest {
     comparePlans(optimized3, expected3)
   }
 
+  test("Nested field pruning for window functions") {
+    val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+    val winExpr = windowExpr(RowNumber().toAggregateExpression(), spec)
+    val query = contact.select($"name.first", winExpr.as('window))
+      .where($"window" === 1 && $"name.first" === "a")
+      .analyze
+    val optimized = Optimize.execute(query)
+    val aliases = collectGeneratedAliases(optimized)
+    val expected = contact
+      .select($"name.first", $"address", $"id", $"name.first".as(aliases(1)))
+      .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+      .select($"first", $"${aliases(1)}".as(aliases(0)), $"window")
+      .where($"window" === 1 && $"${aliases(0)}" === "a")

Review comment:
       Just a suggestion: could you remove this `where` from this test, then add a separate unit test like `test("Nested field pruning for Filter") {` for exhaustive filter tests?



