Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/06/23 01:14:29 UTC
[GitHub] [spark] frankyin-factual opened a new pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
frankyin-factual opened a new pull request #28898:
URL: https://github.com/apache/spark/pull/28898
This fixes schema pruning not working through window functions. It is a fairly limited version that only intends to solve the issue within a `Filter` logical plan.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r447977410
##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala
##########
@@ -493,6 +491,87 @@ class NestedColumnAliasingSuite extends SchemaPruningTest {
comparePlans(optimized3, expected3)
}
+ test("Nested field pruning for Window") {
+ val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+ val winExpr = windowExpr(RowNumber().toAggregateExpression(), spec)
+
+ val query1 = contact
+ .select($"name.first", winExpr.as('window))
+ .analyze
+ val optimized1 = Optimize.execute(query1)
+ val expected1 = contact
+ .select($"name.first", $"address", $"id")
+ .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+ .select($"first", $"window")
+ .analyze
+ comparePlans(optimized1, expected1)
+
+ val query2 = contact
+ .select($"name.first", winExpr.as('window))
+ .orderBy($"name.last".asc)
+ .analyze
+ val optimized2 = Optimize.execute(query2)
+ val aliases2 = collectGeneratedAliases(optimized2)
+ val expected2 = contact
+ .select($"name.first", $"address", $"id", $"name.last".as(aliases2(1)))
+ .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+ .select($"first", $"window", $"${aliases2(1)}".as(aliases2(0)))
+ .orderBy($"${aliases2(0)}".asc)
+ .select($"first", $"window")
+ .analyze
+ comparePlans(optimized2, expected2)
+ }
+
+ test("Nested field pruning for Filter") {
+ val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+ val winExpr = windowExpr(RowNumber().toAggregateExpression(), spec)
+ val query = contact.select($"name.first", winExpr.as('window))
+ .where($"window" === 1 && $"name.first" === "a")
Review comment:
Done
[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-664068267
[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448040612
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -178,11 +195,16 @@ object NestedColumnAliasing {
nestedFieldToAlias
.map { case (nestedField, _) => totalFieldNum(nestedField.dataType) }
.sum < totalFieldNum(attr.dataType)) {
- Some(attr.exprId -> nestedFieldToAlias)
+ Some((attr.exprId, nestedFieldToAlias))
} else {
None
}
}
+ .groupBy(_._1) // To fix same ExprId mapped to different attribute instance
Review comment:
Not necessarily only for the `Window` case, but by adding `Window/Filter/Sort`, this error can be surfaced more easily.
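
To illustrate why grouping by the first tuple element matters, here is a minimal sketch in plain Scala. It assumes the per-attribute results are `(exprId, aliases)` pairs and that entries sharing an id should be merged; the merge step after `groupBy` is not shown in the diff above, so it is an assumption, and `ExprId`, the field names, and `MergeByExprIdDemo` are hypothetical stand-ins rather than Spark code.

```scala
object MergeByExprIdDemo {
  // Stand-in for Catalyst's ExprId; a plain Long is enough for the sketch.
  type ExprId = Long

  // Two entries share id 1 because the same attribute was seen through
  // two different attribute instances (assumed scenario).
  val perAttribute: Seq[(ExprId, Seq[String])] = Seq(
    1L -> Seq("name.first"),
    1L -> Seq("name.last"),
    2L -> Seq("address.city")
  )

  // Building a Map directly keeps only the last entry for each key.
  val lossy: Map[ExprId, Seq[String]] = perAttribute.toMap

  // Grouping by id first lets us merge all entries for the same key.
  val merged: Map[ExprId, Seq[String]] =
    perAttribute.groupBy(_._1).map { case (id, entries) =>
      id -> entries.flatMap(_._2)
    }

  def main(args: Array[String]): Unit = {
    println(lossy(1L))
    println(merged(1L))
  }
}
```

Without the grouping step, one of the two entries for id 1 would be silently dropped when the pairs are turned into a map.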
[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652786996
**[Test build #124822 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124822/testReport)** for PR 28898 at commit [`2bc6a1a`](https://github.com/apache/spark/commit/2bc6a1ae1f2fb7c657d95d5abd92615fdc95eaef).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656696370
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652831500
[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-662282424
[GitHub] [spark] viirya commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448774792
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -172,13 +188,23 @@ object NestedColumnAliasing {
(f, Alias(f, s"_gen_alias_${exprId.id}")(exprId, Seq.empty, None))
}
+
+ // Do deduplication based on semanticEquals, and then sum.
+ val nestedFieldNum = nestedFieldToAlias
+ .foldLeft(Seq[ExtractValue]()) {
+ (unique, curr) => if (!unique.exists(curr._1.semanticEquals(_))) {
+ curr._1 +: unique
+ } else {
+ unique
+ }
+ }
+ .map { t => totalFieldNum(t.dataType) }
+ .sum
Review comment:
No, I mean this comment thread.
I am not sure if you are aware of it. The reason you need to deduplicate here is that semantically equal `ExtractValue`s are applied to attributes with different qualifiers, e.g. there are two `name.first`, but one refers to `name` with qualifier `a` and the other refers to `name` with qualifier `b`.
I ran a test using your query with all qualifiers cleaned up, and it works well.
And what I said in the comment above is that you select an arbitrary `ExtractValue` from these `ExtractValue`s with different qualifiers, but later we look into the map using a given `ExtractValue`. A case might fail where you select the `name.first` with qualifier `a`, but later look up the map using the `name.first` with qualifier `b`.
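
The lookup problem described above can be sketched with a toy model in plain Scala. `Attr`, `FieldAccess`, and the alias name are hypothetical stand-ins, not Catalyst classes; the sketch assumes that ordinary equality takes the qualifier into account while `semanticEquals` ignores it.

```scala
// Toy stand-ins for an attribute and a nested field access (hypothetical).
final case class Attr(exprId: Long, name: String, qualifier: Seq[String])

final case class FieldAccess(child: Attr, field: String) {
  // Mimics a semantic comparison: same attribute id and field, qualifier ignored.
  def semanticEquals(other: FieldAccess): Boolean =
    child.exprId == other.child.exprId && field == other.field
}

object QualifierLookupDemo {
  val viaA: FieldAccess = FieldAccess(Attr(1L, "name", Seq("a")), "first")
  val viaB: FieldAccess = FieldAccess(Attr(1L, "name", Seq("b")), "first")

  // Deduplicate by semanticEquals, keeping the first representative seen.
  val deduped: Seq[FieldAccess] =
    Seq(viaA, viaB).foldLeft(Seq.empty[FieldAccess]) { (unique, curr) =>
      if (unique.exists(_.semanticEquals(curr))) unique else unique :+ curr
    }

  // The alias map is keyed by whichever representative survived.
  val aliasMap: Map[FieldAccess, String] =
    deduped.map(f => f -> "_gen_alias_0").toMap

  // A lookup that compares semantically instead of structurally.
  def semanticLookup(key: FieldAccess): Option[String] =
    aliasMap.collectFirst { case (k, v) if k.semanticEquals(key) => v }

  def main(args: Array[String]): Unit = {
    println(aliasMap.get(viaA))   // key with qualifier `a` is present
    println(aliasMap.get(viaB))   // key with qualifier `b` misses
    println(semanticLookup(viaB)) // semantic lookup still finds the alias
  }
}
```

In this model the map only answers for the representative that was kept, so either the keys or the lookup has to be normalized the same way the deduplication is.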
[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-657199034
**[Test build #125714 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125714/testReport)** for PR 28898 at commit [`68cbfd2`](https://github.com/apache/spark/commit/68cbfd24f1f50266c5d1c5dfc24e29699f87c3e3).
[GitHub] [spark] frankyin-factual commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
frankyin-factual commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652668745
Just did a rebase to squash all those commits.
[GitHub] [spark] viirya commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r451691041
##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala
##########
@@ -493,6 +491,156 @@ class NestedColumnAliasingSuite extends SchemaPruningTest {
comparePlans(optimized3, expected3)
}
+ test("Nested field pruning for Window") {
+ val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+ val winExpr = windowExpr(RowNumber(), spec)
+
+ val query1 = contact
+ .select($"name.first", winExpr.as('window))
+ .analyze
+ val optimized1 = Optimize.execute(query1)
+ val expected1 = contact
+ .select($"name.first", $"address", $"id")
+ .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+ .select($"first", $"window")
+ .analyze
+ comparePlans(optimized1, expected1)
Review comment:
It passed without pruning through window. The query selects `name.first` on top of the relation. So I am not sure why you include it here...
[GitHub] [spark] frankyin-factual commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
frankyin-factual commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652840781
> I think I came up with a better idea for the discussion. But it is too late. I will send out a PR to your change tomorrow.
Thanks @viirya
[GitHub] [spark] viirya commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r445623584
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -32,7 +32,9 @@ object NestedColumnAliasing {
def unapply(plan: LogicalPlan): Option[LogicalPlan] = plan match {
case Project(projectList, child)
- if SQLConf.get.nestedSchemaPruningEnabled && canProjectPushThrough(child) =>
+ if SQLConf.get.nestedSchemaPruningEnabled &&
+ (canProjectPushThrough(child) ||
+ getChild(child).exists(canProjectPushThrough)) =>
Review comment:
How about using my proposal at https://github.com/apache/spark/pull/28898#pullrequestreview-437202483?
[GitHub] [spark] dongjoon-hyun commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-662237766
Thank you for pinging me, @frankyin-factual , @maropu , @viirya . I'll take a look at this PR.
[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
Posted by GitBox <gi...@apache.org>.
frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r445309526
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -113,6 +113,11 @@ object NestedColumnAliasing {
case _: Sample => true
case _: RepartitionByExpression => true
case _: Join => true
+ case x: Filter => x.child match {
+ case _: Window => true
Review comment:
That won't work because it seems to cause an infinite loop in the optimizer. It gives me error messages about running out of max iterations.
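
As a rough illustration of how a non-converging rule surfaces as a max-iterations message, here is a toy fixed-point loop in plain Scala. It is not Spark's rule executor and does not reproduce the actual cause in this PR; `runToFixedPoint`, the list-of-strings plan, and both rules are hypothetical.

```scala
object FixedPointDemo {
  // A toy "plan": operator names from the top of the tree down to the relation.
  type Plan = List[String]

  // Reapply `rule` until the plan stops changing, or give up after
  // `maxIterations` passes.
  @annotation.tailrec
  def runToFixedPoint(plan: Plan, maxIterations: Int)(rule: Plan => Plan): Either[String, Plan] =
    if (maxIterations <= 0) Left("max iterations reached")
    else {
      val next = rule(plan)
      if (next == plan) Right(plan)
      else runToFixedPoint(next, maxIterations - 1)(rule)
    }

  val original: Plan = List("Project", "Filter", "Window", "Relation")

  // Converges: pushes a Project below the Filter once, then no longer matches.
  def pushOnce(plan: Plan): Plan = plan match {
    case "Project" :: "Filter" :: "Window" :: rest =>
      "Project" :: "Filter" :: "Project" :: "Window" :: rest
    case other => other
  }

  // Never converges: one case pushes the Project down, the other undoes it,
  // so the plan flips between two shapes on every pass.
  def oscillating(plan: Plan): Plan = plan match {
    case "Project" :: "Filter" :: "Window" :: rest =>
      "Project" :: "Filter" :: "Project" :: "Window" :: rest
    case "Project" :: "Filter" :: "Project" :: "Window" :: rest =>
      "Project" :: "Filter" :: "Window" :: rest
    case other => other
  }

  def main(args: Array[String]): Unit = {
    println(runToFixedPoint(original, 100)(pushOnce))
    println(runToFixedPoint(original, 100)(oscillating))
  }
}
```

The second call exhausts its iteration budget because every pass changes the plan, which is the general shape of the failure reported here.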
[GitHub] [spark] viirya commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r451209212
##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala
##########
@@ -493,6 +491,156 @@ class NestedColumnAliasingSuite extends SchemaPruningTest {
comparePlans(optimized3, expected3)
}
+ test("Nested field pruning for Window") {
+ val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+ val winExpr = windowExpr(RowNumber(), spec)
+
+ val query1 = contact
+ .select($"name.first", winExpr.as('window))
+ .analyze
+ val optimized1 = Optimize.execute(query1)
+ val expected1 = contact
+ .select($"name.first", $"address", $"id")
+ .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+ .select($"first", $"window")
+ .analyze
+ comparePlans(optimized1, expected1)
+
+ val query2 = contact
+ .select($"name.first", winExpr.as('window))
+ .orderBy($"name.last".asc)
+ .analyze
+ val optimized2 = Optimize.execute(query2)
+ val aliases2 = collectGeneratedAliases(optimized2)
+ val expected2 = contact
+ .select($"name.first", $"address", $"id", $"name.last".as(aliases2(1)))
+ .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+ .select($"first", $"window", $"${aliases2(1)}".as(aliases2(0)))
+ .orderBy($"${aliases2(0)}".asc)
+ .select($"first", $"window")
+ .analyze
+ comparePlans(optimized2, expected2)
+ }
+
+ test("Nested field pruning for Filter") {
Review comment:
It is not only for Filter. Maybe `Nested field pruning for Filter with other operators`?
[GitHub] [spark] frankyin-factual commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
Posted by GitBox <gi...@apache.org>.
frankyin-factual commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649224393
> btw, could you follow the PR template? https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE
Just updated the PR.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656076133
[GitHub] [spark] SparkQA removed a comment on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649153764
**[Test build #124507 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124507/testReport)** for PR 28898 at commit [`b1cad9a`](https://github.com/apache/spark/commit/b1cad9ad759c3e1d2ef9efd0f9390c6924e412df).
[GitHub] [spark] maropu commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-655365105
retest this please
[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448795376
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -152,7 +168,7 @@ object NestedColumnAliasing {
val exclusiveAttrSet = AttributeSet(exclusiveAttrs ++ otherRootReferences)
val aliasSub = nestedFieldReferences.asInstanceOf[Seq[ExtractValue]]
.filter(!_.references.subsetOf(exclusiveAttrSet))
- .groupBy(_.references.head)
+ .groupBy(t => (t.references.head.exprId, t.references.head.dataType))
Review comment:
Just added.
[GitHub] [spark] maropu commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r446499394
##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala
##########
@@ -493,6 +491,58 @@ class NestedColumnAliasingSuite extends SchemaPruningTest {
comparePlans(optimized3, expected3)
}
+ test("Nested field pruning for window functions") {
+ val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+ val winExpr = windowExpr(RowNumber().toAggregateExpression(), spec)
+ val query = contact.select($"name.first", winExpr.as('window))
+ .where($"window" === 1 && $"name.first" === "a")
+ .analyze
+ val optimized = Optimize.execute(query)
+ val aliases = collectGeneratedAliases(optimized)
+ val expected = contact
+ .select($"name.first", $"address", $"id", $"name.first".as(aliases(1)))
+ .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+ .select($"first", $"${aliases(1)}".as(aliases(0)), $"window")
+ .where($"window" === 1 && $"${aliases(0)}" === "a")
+ .select($"first", $"window")
+ .analyze
+ comparePlans(optimized, expected)
+ }
+
+ test("Nested field pruning for orderBy") {
Review comment:
Why did you add separate tests for orderBy/sortBy? They have the same plan, Sort.
[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649311526
[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448040082
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -178,11 +195,16 @@ object NestedColumnAliasing {
nestedFieldToAlias
.map { case (nestedField, _) => totalFieldNum(nestedField.dataType) }
.sum < totalFieldNum(attr.dataType)) {
- Some(attr.exprId -> nestedFieldToAlias)
+ Some((attr.exprId, nestedFieldToAlias))
Review comment:
It's related to the comment below. https://github.com/apache/spark/pull/28898#discussion_r447950809
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-655368737
[GitHub] [spark] viirya commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448772028
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -152,7 +168,7 @@ object NestedColumnAliasing {
val exclusiveAttrSet = AttributeSet(exclusiveAttrs ++ otherRootReferences)
val aliasSub = nestedFieldReferences.asInstanceOf[Seq[ExtractValue]]
.filter(!_.references.subsetOf(exclusiveAttrSet))
- .groupBy(_.references.head)
+ .groupBy(t => (t.references.head.exprId, t.references.head.dataType))
Review comment:
I don't think `exprId` will have a collision. `exprId` should be unique. If there is a collision, something is wrong somewhere in the query plan...
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656922596
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125617/
[GitHub] [spark] SparkQA removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-664068067
**[Test build #126598 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126598/testReport)** for PR 28898 at commit [`a0b8d07`](https://github.com/apache/spark/commit/a0b8d070f027460dd1e5fdbd7dc35d0440450b0a).
[GitHub] [spark] maropu commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448692070
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -176,9 +192,16 @@ object NestedColumnAliasing {
// By default, ColumnPruning rule uses `attr` already.
if (nestedFieldToAlias.nonEmpty &&
nestedFieldToAlias
- .map { case (nestedField, _) => totalFieldNum(nestedField.dataType) }
- .sum < totalFieldNum(attr.dataType)) {
- Some(attr.exprId -> nestedFieldToAlias)
+ .foldLeft(Seq[ExtractValue]()) {
+ (unique, curr) => if (!unique.exists(curr._1.semanticEquals(_))) {
+ curr._1 +: unique
+ } else {
+ unique
+ }
+ }
+ .map { t => totalFieldNum(t.dataType) }
+ .sum < totalFieldNum(attr._2)) {
Review comment:
I think it's hard to read this part. Could you pull the left-hand value out of the `if` condition?
```
val xxx = yyy
if (xxx < totalFieldNum(attr._2)) { ...
```
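A minimal sketch of the suggested shape, using toy stand-ins rather than Catalyst classes: `NestedField`, `PruningCheck`, `dedup`, and `shouldPrune` are hypothetical names, and the case-insensitive `semanticEquals` below only imitates the real method. The deduplicated sum is bound to a `val` first, so the `if`-style condition stays a single comparison.

```scala
// Hypothetical stand-in for an ExtractValue: a field path plus the number
// of leaf fields under it. Not Spark's actual class.
final case class NestedField(path: String, fieldCount: Int) {
  // Imitates semanticEquals by comparing paths case-insensitively (an assumption).
  def semanticEquals(other: NestedField): Boolean =
    path.equalsIgnoreCase(other.path)
}

object PruningCheck {
  // Keep one representative per semantically distinct field.
  def dedup(fields: Seq[NestedField]): Seq[NestedField] =
    fields.foldLeft(Seq.empty[NestedField]) { (unique, curr) =>
      if (unique.exists(curr.semanticEquals)) unique else curr +: unique
    }

  // Pruning pays off only when the accessed fields cover fewer leaf fields
  // than the whole attribute.
  def shouldPrune(fields: Seq[NestedField], totalFieldsInAttr: Int): Boolean = {
    // The left-hand value is computed up front instead of inside the condition.
    val nestedFieldNum = dedup(fields).map(_.fieldCount).sum
    fields.nonEmpty && nestedFieldNum < totalFieldsInAttr
  }
}
```

With two references to the same field, the deduplicated count is 1, so an attribute with 2 leaf fields still qualifies for pruning; without deduplication the sum would be 2 and the check would fail.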
[GitHub] [spark] viirya commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448783457
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -152,7 +168,7 @@ object NestedColumnAliasing {
val exclusiveAttrSet = AttributeSet(exclusiveAttrs ++ otherRootReferences)
val aliasSub = nestedFieldReferences.asInstanceOf[Seq[ExtractValue]]
.filter(!_.references.subsetOf(exclusiveAttrSet))
- .groupBy(_.references.head)
+ .groupBy(t => (t.references.head.exprId, t.references.head.dataType))
Review comment:
Ok. Can you add a comment here?
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -172,13 +188,23 @@ object NestedColumnAliasing {
(f, Alias(f, s"_gen_alias_${exprId.id}")(exprId, Seq.empty, None))
}
+
+ // Do deduplication based on semanticEquals, and then sum.
+ val nestedFieldNum = nestedFieldToAlias
+ .foldLeft(Seq[ExtractValue]()) {
+ (unique, curr) => if (!unique.exists(curr._1.semanticEquals(_))) {
+ curr._1 +: unique
+ } else {
+ unique
+ }
+ }
+ .map { t => totalFieldNum(t.dataType) }
+ .sum
Review comment:
I see. The downside I can see is that you will have duplicate aliases for semantically identical `ExtractValue`s, e.g. two aliases for two `name.first`s.
It may not be a big deal here, but the optimized query plan may end up with multiple gen_alias_xxx that refer to the same `ExtractValue`.
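One way to avoid the duplicate aliases described above is to deduplicate before generating aliases, not only before summing. The sketch below uses toy types to show the idea; `FieldRef`, `GenAlias`, and `aliasOncePerField` are hypothetical names, the equality check is an assumed stand-in for `semanticEquals`, and this is not necessarily what the PR ended up doing.

```scala
// Hypothetical stand-in for an ExtractValue reference.
final case class FieldRef(path: String) {
  // Assumed stand-in for semanticEquals.
  def semanticEquals(other: FieldRef): Boolean =
    path.equalsIgnoreCase(other.path)
}

// Hypothetical alias record: the field and its generated name.
final case class GenAlias(field: FieldRef, name: String)

object AliasDedup {
  // Deduplicate first (keeping first-seen order), then create exactly one
  // alias per semantically distinct field.
  def aliasOncePerField(refs: Seq[FieldRef]): Seq[GenAlias] = {
    val unique = refs.foldLeft(Vector.empty[FieldRef]) { (seen, curr) =>
      if (seen.exists(_.semanticEquals(curr))) seen else seen :+ curr
    }
    unique.zipWithIndex.map { case (f, i) => GenAlias(f, s"_gen_alias_$i") }
  }
}
```

Given two `name.first` references and one `name.last`, this yields two aliases rather than three.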
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-657005554
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125654/
[GitHub] [spark] maropu commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r446508750
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -39,6 +39,22 @@ object NestedColumnAliasing {
NestedColumnAliasing.replaceToAliases(plan, nestedFieldToAlias, attrToAliases)
}
+ /**
+ * This is to solve a `LogicalPlan` like `Project`->`Filter`->`Window`.
+ * In this case, `Window` can be plan that is `canProjectPushThrough`.
+ * By adding this, it allows nested columns to be passed onto next stages.
+ * Currently, not adding `Filter` into `canProjectPushThrough` due to
+ * infinitely loop in optimizers during the predicate push-down rule.
+ */
Review comment:
btw, do you know why the optimizer hits this issue? I think it's better to find the root cause for future work, if possible.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-650853830
[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-662265862
**[Test build #126305 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126305/testReport)** for PR 28898 at commit [`ab10486`](https://github.com/apache/spark/commit/ab10486293fd729ba76fdee9ba3661b4d265571d).
[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652169651
**[Test build #124691 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124691/testReport)** for PR 28898 at commit [`c95633e`](https://github.com/apache/spark/commit/c95633e7a242e622900c596c03dd7d3c06441732).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448639798
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -178,11 +195,16 @@ object NestedColumnAliasing {
nestedFieldToAlias
.map { case (nestedField, _) => totalFieldNum(nestedField.dataType) }
.sum < totalFieldNum(attr.dataType)) {
- Some(attr.exprId -> nestedFieldToAlias)
+ Some((attr.exprId, nestedFieldToAlias))
} else {
None
}
}
+ .groupBy(_._1) // To fix same ExprId mapped to different attribute instance
Review comment:
Simplified the logic above to do grouping by `ExprId` and updated the deduplication logic.
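To illustrate why grouping by `ExprId` differs from grouping by the attribute itself, here is a small sketch with toy types. `ExprId`, `Attr`, and `GroupByExprId` below are simplified, hypothetical stand-ins (not Catalyst's classes), and the assumption is that two attribute instances can share an id while differing in some other property.

```scala
// Hypothetical stand-ins: an expression id and an attribute that carries it.
final case class ExprId(id: Long)
final case class Attr(name: String, exprId: ExprId)

object GroupByExprId {
  // Group nested-field paths by the root attribute's ExprId, so that attribute
  // instances sharing an id land in one group even if they are not equal.
  def group(fields: Seq[(Attr, String)]): Map[ExprId, Seq[String]] =
    fields
      .groupBy { case (attr, _) => attr.exprId }
      .map { case (id, pairs) => id -> pairs.map(_._2) }
}
```

Grouping `Attr("name", ExprId(1))` and `Attr("NAME", ExprId(1))` by the attribute would give two groups, because the case classes are unequal; grouping by `exprId` gives one group containing both nested fields.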
[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649991400
[GitHub] [spark] frankyin-factual commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions
Posted by GitBox <gi...@apache.org>.
frankyin-factual commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-650025446
How do I retest? Looks like it failed for a random reason.
[GitHub] [spark] viirya commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r451278529
##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala
##########
@@ -493,6 +491,156 @@ class NestedColumnAliasingSuite extends SchemaPruningTest {
comparePlans(optimized3, expected3)
}
+ test("Nested field pruning for Window") {
+ val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+ val winExpr = windowExpr(RowNumber(), spec)
+
+ val query1 = contact
+ .select($"name.first", winExpr.as('window))
+ .analyze
+ val optimized1 = Optimize.execute(query1)
+ val expected1 = contact
+ .select($"name.first", $"address", $"id")
+ .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+ .select($"first", $"window")
+ .analyze
+ comparePlans(optimized1, expected1)
Review comment:
But I ran this test in current master and it passed.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-650692254
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652826043
Merged build finished. Test FAILed.
[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652788243
[GitHub] [spark] frankyin-factual commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
Posted by GitBox <gi...@apache.org>.
frankyin-factual commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649269967
Let me see if I can get a more generalized solution out today.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649942950
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124524/
[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-655200055
**[Test build #125268 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125268/testReport)** for PR 28898 at commit [`a0998ae`](https://github.com/apache/spark/commit/a0998ae4166efff59b62971e98945e201809fa15).
[GitHub] [spark] SparkQA removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-657154729
**[Test build #125699 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125699/testReport)** for PR 28898 at commit [`da85920`](https://github.com/apache/spark/commit/da859203e91a0bc90b017a1557bcf3646733982a).
[GitHub] [spark] maropu commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649199639
btw, could you follow the PR template? https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE
[GitHub] [spark] maropu commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r446499494
##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala
##########
@@ -493,6 +491,58 @@ class NestedColumnAliasingSuite extends SchemaPruningTest {
comparePlans(optimized3, expected3)
}
+ test("Nested field pruning for window functions") {
Review comment:
`Nested field pruning for window functions` -> `Nested field pruning for Window`
[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656075862
**[Test build #125466 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125466/testReport)** for PR 28898 at commit [`a7e885a`](https://github.com/apache/spark/commit/a7e885a3c5f09f9ca623777bdabcd05e664f3774).
[GitHub] [spark] maropu commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-655200009
Looks okay. cc: @viirya @dongjoon-hyun
[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r453294196
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaPruningSuite.scala
##########
@@ -460,6 +460,60 @@ abstract class SchemaPruningSuite
checkAnswer(query4, Row(2, null) :: Row(2, 4) :: Nil)
}
+ testSchemaPruning("select nested field in window function") {
+ val windowSql =
+ """
+ |with contact_rank as (
+ | select row_number() over (partition by address order by id desc) as rank,
+ | contacts.*
+ | from contacts
+ |)
+ |select name.first, rank from contact_rank
+ |where name.first = 'Jane' AND rank = 1
+ |""".stripMargin
+ val query1 = sql(windowSql)
+ checkScan(query1, "struct<id:int,name:struct<first:string>,address:string>")
+ checkAnswer(query1, Row("Jane", 1) :: Nil)
Review comment:
Changed
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652788243
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-663364305
[GitHub] [spark] gatorsmile commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
Posted by GitBox <gi...@apache.org>.
gatorsmile commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649087502
cc @viirya @cloud-fan
[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-657904652
**[Test build #125795 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125795/testReport)** for PR 28898 at commit [`2637974`](https://github.com/apache/spark/commit/2637974650f232a6aad83e0e4a3e1fdf03def401).
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652831500
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649937189
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-650079629
**[Test build #124540 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124540/testReport)** for PR 28898 at commit [`652c77f`](https://github.com/apache/spark/commit/652c77fdbbfa468271e783e1492f72f4785c9880).
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-650853830
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652170847
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649942946
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-653012314
Merged build finished. Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649937189
----------------------------------------------------------------
[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans
Posted by GitBox <gi...@apache.org>.
frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r458548001
##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala
##########
@@ -493,6 +491,144 @@ class NestedColumnAliasingSuite extends SchemaPruningTest {
comparePlans(optimized3, expected3)
}
+ test("Nested field pruning for Window") {
+ val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+ val winExpr = windowExpr(RowNumber(), spec)
+ val query = contact
+ .select($"name.first", winExpr.as('window))
+ .orderBy($"name.last".asc)
+ .analyze
+ val optimized = Optimize.execute(query)
+ val aliases = collectGeneratedAliases(optimized)
+ val expected = contact
+ .select($"name.first", $"address", $"id", $"name.last".as(aliases(1)))
+ .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+ .select($"first", $"window", $"${aliases(1)}".as(aliases(0)))
+ .orderBy($"${aliases(0)}".asc)
+ .select($"first", $"window")
+ .analyze
+ comparePlans(optimized, expected)
+ }
+
+ test("Nested field pruning for Filter with other operators") {
Review comment:
Yep, I will include them in the test names.
----------------------------------------------------------------
[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448718366
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -152,7 +168,7 @@ object NestedColumnAliasing {
val exclusiveAttrSet = AttributeSet(exclusiveAttrs ++ otherRootReferences)
val aliasSub = nestedFieldReferences.asInstanceOf[Seq[ExtractValue]]
.filter(!_.references.subsetOf(exclusiveAttrSet))
- .groupBy(_.references.head)
+ .groupBy(t => (t.references.head.exprId, t.references.head.dataType))
Review comment:
Take a look at this query:
```
with contact_rank as (
select row_number() over (partition by address order by id desc) as rank,
contacts.*
from contacts
order by name.last, name.first
)
select name.first, rank from contact_rank
```
```
nestedFieldReferences.asInstanceOf[Seq[ExtractValue]]
.filter(!_.references.subsetOf(exclusiveAttrSet))
.groupBy(_.references.head)
```
returns:
```
0 = {Tuple2@15534} "(name#46,List(name#46.first))"
1 = {Tuple2@15535} "(name#46,List(name#46.last, name#46.first))"
```
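In both groups, `name#46` is the same attribute of type `AttributeReference`; the only field that differs between the two keys is `qualifier`. Grouping by `exprId` and `dataType` instead of the whole attribute merges them into a single group, which is why this change is needed.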
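To illustrate the grouping problem outside of Catalyst, here is a minimal sketch in plain Scala. `Attr` and `FieldAccess` are hypothetical stand-ins, not Spark's `AttributeReference` or `ExtractValue`, and the `contact_rank` qualifier is an assumed value; the only point is that case-class equality includes `qualifier`, so grouping by the whole attribute splits one column into two keys.

```scala
// Stand-in for an attribute reference; NOT Spark's AttributeReference.
// Case-class equality compares every field, including `qualifier`.
final case class Attr(name: String, exprId: Long, dataType: String, qualifier: Seq[String])

// Stand-in for a nested field access such as name#46.first.
final case class FieldAccess(root: Attr, field: String)

object GroupingSketch {
  // The same column (exprId 46), seen once without and once with a qualifier.
  val bare: Attr = Attr("name", 46L, "struct<first:string,last:string>", Nil)
  val qualified: Attr = bare.copy(qualifier = Seq("contact_rank")) // assumed qualifier

  val accesses: Seq[FieldAccess] = Seq(
    FieldAccess(bare, "first"),
    FieldAccess(qualified, "last"),
    FieldAccess(qualified, "first"))

  // Grouping by the whole attribute yields two groups for one column.
  val byAttr: Map[Attr, Seq[FieldAccess]] = accesses.groupBy(_.root)

  // Grouping by (exprId, dataType) ignores the qualifier and yields one group.
  val byIdAndType: Map[(Long, String), Seq[FieldAccess]] =
    accesses.groupBy(a => (a.root.exprId, a.root.dataType))
}
```

In this sketch `byAttr` has two keys while `byIdAndType` has one key holding all three accesses, which mirrors the two `name#46` tuples shown in the debugger output above.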
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-650080284
----------------------------------------------------------------
[GitHub] [spark] SparkQA removed a comment on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649912241
**[Test build #124525 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124525/testReport)** for PR 28898 at commit [`4c705bd`](https://github.com/apache/spark/commit/4c705bd5e7cbeae2603afe799a338e068c35923c).
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656520292
Merged build finished. Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] SparkQA removed a comment on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-650079629
**[Test build #124540 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124540/testReport)** for PR 28898 at commit [`652c77f`](https://github.com/apache/spark/commit/652c77fdbbfa468271e783e1492f72f4785c9880).
----------------------------------------------------------------
[GitHub] [spark] maropu commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649884335
This is not a bugfix, so we will merge this commit only into master (v3.1.0).
----------------------------------------------------------------
[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r451223742
##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala
##########
@@ -493,6 +491,156 @@ class NestedColumnAliasingSuite extends SchemaPruningTest {
comparePlans(optimized3, expected3)
}
+ test("Nested field pruning for Window") {
+ val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+ val winExpr = windowExpr(RowNumber(), spec)
+
+ val query1 = contact
+ .select($"name.first", winExpr.as('window))
+ .analyze
+ val optimized1 = Optimize.execute(query1)
+ val expected1 = contact
+ .select($"name.first", $"address", $"id")
+ .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+ .select($"first", $"window")
+ .analyze
+ comparePlans(optimized1, expected1)
Review comment:
It is pruning, but without generating aliases.
https://github.com/apache/spark/pull/28898/files/a0998ae4166efff59b62971e98945e201809fa15#diff-d87f0060a604ac3e7149bd108fd548a5R505
This line shows that it selects only `first` rather than the whole `name` column.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-655680974
----------------------------------------------------------------
[GitHub] [spark] frankyin-factual commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
frankyin-factual commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-655199003
@viirya @maropu Rebased onto the current master.
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649990865
**[Test build #124525 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124525/testReport)** for PR 28898 at commit [`4c705bd`](https://github.com/apache/spark/commit/4c705bd5e7cbeae2603afe799a338e068c35923c).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649234400
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649942879
**[Test build #124524 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124524/testReport)** for PR 28898 at commit [`3c8cf11`](https://github.com/apache/spark/commit/3c8cf110b19bc5d0c9e89a8a031e6e4a557aa1b3).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652830838
**[Test build #124878 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124878/testReport)** for PR 28898 at commit [`2bd84a4`](https://github.com/apache/spark/commit/2bd84a4c7bdd4096712327784351695d53bf704c).
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-655817273
**[Test build #125392 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125392/testReport)** for PR 28898 at commit [`a7e885a`](https://github.com/apache/spark/commit/a7e885a3c5f09f9ca623777bdabcd05e664f3774).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656497633
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656948958
**[Test build #125654 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125654/testReport)** for PR 28898 at commit [`da85920`](https://github.com/apache/spark/commit/da859203e91a0bc90b017a1557bcf3646733982a).
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-657154729
**[Test build #125699 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125699/testReport)** for PR 28898 at commit [`da85920`](https://github.com/apache/spark/commit/da859203e91a0bc90b017a1557bcf3646733982a).
----------------------------------------------------------------
[GitHub] [spark] SparkQA removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-650692166
**[Test build #124592 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124592/testReport)** for PR 28898 at commit [`acce8c5`](https://github.com/apache/spark/commit/acce8c5d8d51bae5f981e56a8811f075cb07d214).
----------------------------------------------------------------
[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448705642
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -176,9 +192,16 @@ object NestedColumnAliasing {
// By default, ColumnPruning rule uses `attr` already.
if (nestedFieldToAlias.nonEmpty &&
nestedFieldToAlias
- .map { case (nestedField, _) => totalFieldNum(nestedField.dataType) }
- .sum < totalFieldNum(attr.dataType)) {
- Some(attr.exprId -> nestedFieldToAlias)
+ .foldLeft(Seq[ExtractValue]()) {
+ (unique, curr) => if (!unique.exists(curr._1.semanticEquals(_))) {
+ curr._1 +: unique
+ } else {
+ unique
+ }
+ }
+ .map { t => totalFieldNum(t.dataType) }
+ .sum < totalFieldNum(attr._2)) {
Review comment:
Updated.
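For readers following the hunk above: the `foldLeft` keeps one representative per set of semantically equal nested fields before the field counts are summed. Below is a minimal plain-Scala sketch of that de-duplication. `Extract`, its `semanticEquals`, the alias names, and the three-field struct size are hypothetical stand-ins rather than Catalyst's types or values.

```scala
// Stand-in for a nested field extraction; NOT Spark's ExtractValue.
// `alias` models a cosmetic difference that semantic comparison ignores.
final case class Extract(path: String, fieldCount: Int, alias: String) {
  def semanticEquals(other: Extract): Boolean = path == other.path
}

object DedupSketch {
  // Keep the first element of every semantic-equality class.
  def uniqueBySemantics(fields: Seq[Extract]): Seq[Extract] =
    fields.foldLeft(Seq.empty[Extract]) { (unique, curr) =>
      if (unique.exists(curr.semanticEquals)) unique else curr +: unique
    }

  // `name.first` is requested twice under different (assumed) aliases.
  val fields: Seq[Extract] = Seq(
    Extract("name.first", 1, "alias_a"),
    Extract("name.last", 1, "alias_b"),
    Extract("name.first", 1, "alias_c"))

  val unique: Seq[Extract] = uniqueBySemantics(fields)

  // Hypothetical parent struct with three leaf fields.
  val totalFieldCount: Int = 3

  // Aliasing only pays off when fewer fields are requested than exist.
  val worthAliasing: Boolean = unique.map(_.fieldCount).sum < totalFieldCount
  val worthAliasingWithoutDedup: Boolean = fields.map(_.fieldCount).sum < totalFieldCount
}
```

Without the de-duplication step the repeated `name.first` would be counted twice, the sum would reach the struct's full field count, and the comparison would wrongly conclude that pruning gains nothing.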
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-657178726
**[Test build #125699 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125699/testReport)** for PR 28898 at commit [`da85920`](https://github.com/apache/spark/commit/da859203e91a0bc90b017a1557bcf3646733982a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
----------------------------------------------------------------
[GitHub] [spark] dongjoon-hyun edited a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-662257917
[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-657178823
[GitHub] [spark] dongjoon-hyun commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-664067577
Retest this please.
[GitHub] [spark] SparkQA removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-663378385
**[Test build #126477 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126477/testReport)** for PR 28898 at commit [`a0b8d07`](https://github.com/apache/spark/commit/a0b8d070f027460dd1e5fdbd7dc35d0440450b0a).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-650708345
Merged build finished. Test FAILed.
[GitHub] [spark] maropu commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656074434
retest this please
[GitHub] [spark] viirya commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r445321841
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -113,6 +113,11 @@ object NestedColumnAliasing {
case _: Sample => true
case _: RepartitionByExpression => true
case _: Join => true
+ case x: Filter => x.child match {
+ case _: Window => true
Review comment:
I see, it is due to the predicate pushdown rule. I think we need a general solution, as @maropu said.
[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r451738919
##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala
##########
@@ -493,6 +491,156 @@ class NestedColumnAliasingSuite extends SchemaPruningTest {
comparePlans(optimized3, expected3)
}
+ test("Nested field pruning for Window") {
+ val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+ val winExpr = windowExpr(RowNumber(), spec)
+
+ val query1 = contact
+ .select($"name.first", winExpr.as('window))
+ .analyze
+ val optimized1 = Optimize.execute(query1)
+ val expected1 = contact
+ .select($"name.first", $"address", $"id")
+ .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+ .select($"first", $"window")
+ .analyze
+ comparePlans(optimized1, expected1)
Review comment:
Good point. Removed that test.
[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448792946
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -172,13 +188,23 @@ object NestedColumnAliasing {
(f, Alias(f, s"_gen_alias_${exprId.id}")(exprId, Seq.empty, None))
}
+
+ // Do deduplication based on semanticEquals, and then sum.
+ val nestedFieldNum = nestedFieldToAlias
+ .foldLeft(Seq[ExtractValue]()) {
+ (unique, curr) => if (!unique.exists(curr._1.semanticEquals(_))) {
+ curr._1 +: unique
+ } else {
+ unique
+ }
+ }
+ .map { t => totalFieldNum(t.dataType) }
+ .sum
Review comment:
I agree, but I think this is outside this PR's scope; I first want to iterate on this safely.
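For readers following the thread, here is a minimal sketch of why duplicates matter before the sum. It assumes a toy type model: `Type`, `Atomic`, `Struct`, and `worthPruning` are hypothetical names, and only `totalFieldNum` is borrowed from the diff (this is not its real implementation). The sketch also deduplicates with plain equality via `distinct`, whereas the diff uses `semanticEquals`.

```scala
object FieldCountSketch {
  // Toy type model; not Spark's DataType hierarchy.
  sealed trait Type
  case object Atomic extends Type
  case class Struct(fields: Map[String, Type]) extends Type

  // Counts leaf fields, in the spirit of `totalFieldNum` from the diff.
  def totalFieldNum(t: Type): Int = t match {
    case Atomic => 1
    case Struct(fields) => fields.values.toSeq.map(totalFieldNum).sum
  }

  // Pruning pays off only if the distinct accessed fields cover fewer leaves than the column.
  def worthPruning(accessed: Seq[(String, Type)], column: Type): Boolean =
    accessed.distinct.map { case (_, t) => totalFieldNum(t) }.sum < totalFieldNum(column)
}
```

With a three-field struct, accessing `first` twice and `last` once counts as two leaves after deduplication, so the check passes; without deduplication the sum would reach three and pruning would be skipped.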
[GitHub] [spark] viirya commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448712484
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -152,7 +168,7 @@ object NestedColumnAliasing {
val exclusiveAttrSet = AttributeSet(exclusiveAttrs ++ otherRootReferences)
val aliasSub = nestedFieldReferences.asInstanceOf[Seq[ExtractValue]]
.filter(!_.references.subsetOf(exclusiveAttrSet))
- .groupBy(_.references.head)
+ .groupBy(t => (t.references.head.exprId, t.references.head.dataType))
Review comment:
Why is this change needed?
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652044031
[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-657994219
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656497633
Merged build finished. Test FAILed.
[GitHub] [spark] maropu commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448035514
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -178,11 +195,16 @@ object NestedColumnAliasing {
nestedFieldToAlias
.map { case (nestedField, _) => totalFieldNum(nestedField.dataType) }
.sum < totalFieldNum(attr.dataType)) {
- Some(attr.exprId -> nestedFieldToAlias)
+ Some((attr.exprId, nestedFieldToAlias))
Review comment:
?
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649991400
Merged build finished. Test FAILed.
[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656496831
**[Test build #125536 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125536/testReport)** for PR 28898 at commit [`a7e885a`](https://github.com/apache/spark/commit/a7e885a3c5f09f9ca623777bdabcd05e664f3774).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r458545107
##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala
##########
@@ -493,6 +491,144 @@ class NestedColumnAliasingSuite extends SchemaPruningTest {
comparePlans(optimized3, expected3)
}
+ test("Nested field pruning for Window") {
+ val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+ val winExpr = windowExpr(RowNumber(), spec)
+ val query = contact
+ .select($"name.first", winExpr.as('window))
+ .orderBy($"name.last".asc)
+ .analyze
+ val optimized = Optimize.execute(query)
+ val aliases = collectGeneratedAliases(optimized)
+ val expected = contact
+ .select($"name.first", $"address", $"id", $"name.last".as(aliases(1)))
+ .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+ .select($"first", $"window", $"${aliases(1)}".as(aliases(0)))
+ .orderBy($"${aliases(0)}".asc)
+ .select($"first", $"window")
+ .analyze
+ comparePlans(optimized, expected)
+ }
+
+ test("Nested field pruning for Filter with other operators") {
+ val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+ val winExpr = windowExpr(RowNumber(), spec)
+ val query1 = contact.select($"name.first", winExpr.as('window))
+ .where($"window" === 1 && $"name.first" === "a")
+ .analyze
+ val optimized1 = Optimize.execute(query1)
+ val aliases1 = collectGeneratedAliases(optimized1)
+ val expected1 = contact
+ .select($"name.first", $"address", $"id", $"name.first".as(aliases1(1)))
+ .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+ .select($"first", $"${aliases1(1)}".as(aliases1(0)), $"window")
+ .where($"window" === 1 && $"${aliases1(0)}" === "a")
+ .select($"first", $"window")
+ .analyze
+ comparePlans(optimized1, expected1)
+
+ val query2 = contact.sortBy($"name.first".asc)
+ .where($"name.first" === "a")
+ .select($"name.first")
+ .analyze
+ val optimized2 = Optimize.execute(query2)
+ val aliases2 = collectGeneratedAliases(optimized2)
+ val expected2 = contact
+ .select($"name.first".as(aliases2(1)))
+ .sortBy($"${aliases2(1)}".asc)
+ .select($"${aliases2(1)}".as(aliases2(0)))
+ .where($"${aliases2(0)}" === "a")
+ .select($"${aliases2(0)}".as("first"))
+ .analyze
+ comparePlans(optimized2, expected2)
Review comment:
Shall we move this test case into `test("Nested field pruning for Sort")`?
[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r453294078
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaPruningSuite.scala
##########
@@ -460,6 +460,60 @@ abstract class SchemaPruningSuite
checkAnswer(query4, Row(2, null) :: Row(2, 4) :: Nil)
}
+ testSchemaPruning("select nested field in window function") {
+ val windowSql =
+ """
+ |with contact_rank as (
+ | select row_number() over (partition by address order by id desc) as rank,
+ | contacts.*
+ | from contacts
+ |)
+ |select name.first, rank from contact_rank
+ |where name.first = 'Jane' AND rank = 1
+ |""".stripMargin
+ val query1 = sql(windowSql)
+ checkScan(query1, "struct<id:int,name:struct<first:string>,address:string>")
+ checkAnswer(query1, Row("Jane", 1) :: Nil)
+ }
+
+ testSchemaPruning("select nested field in window function and then order by") {
+ val windowSql =
+ """
+ |with contact_rank as (
+ | select row_number() over (partition by address order by id desc) as rank,
+ | contacts.*
+ | from contacts
+ | order by name.last, name.first
+ |)
+ |select name.first, rank from contact_rank
+ |""".stripMargin
+ val query1 = sql(windowSql)
+ checkScan(query1, "struct<id:int,name:struct<first:string,last:string>,address:string>")
+ checkAnswer(query1,
+ Row("Jane", 1) ::
+ Row("John", 1) ::
+ Row("Janet", 1) ::
+ Row("Jim", 1) :: Nil)
+ }
+
+ testSchemaPruning("select nested field in Sort") {
+ val query1 = sql("select name.first, name.last from contacts order by name.first, name.last")
+ checkScan(query1, "struct<name:struct<first:string,last:string>>")
+ checkAnswer(query1,
+ Row("Jane", "Doe") ::
+ Row("Janet", "Jones") ::
+ Row("Jim", "Jones") ::
+ Row("John", "Doe") :: Nil)
+
+ val query2 = sql("select name.first, name.last from contacts sort by name.first, name.last")
+ checkScan(query2, "struct<name:struct<first:string,last:string>>")
+ checkAnswer(query1,
Review comment:
Good catch. `sort by` is a local sort, so I also updated this test to repartition first to make the results more predictable.
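A small sketch of why `sort by` results depend on partitioning, using plain Scala collections in place of Spark partitions. The sample rows and the helper names (`localSort`, `globalSort`, `repartitionToOne`) are illustrative assumptions, not Spark APIs, and collapsing to a single partition is just one way to make the outcome deterministic; it is not necessarily what the updated test does.

```scala
object LocalSortSketch {
  // Each inner Seq models one partition; the rows are made-up sample data.
  val partitions: Seq[Seq[String]] = Seq(Seq("John", "Jane"), Seq("Jim", "Janet"))

  // Models `sort by`: every partition is sorted independently.
  def localSort(parts: Seq[Seq[String]]): Seq[String] = parts.flatMap(_.sorted)

  // Models `order by`: one total ordering over all rows.
  def globalSort(parts: Seq[Seq[String]]): Seq[String] = parts.flatten.sorted

  // Models repartitioning so that all rows land in one partition before the local sort.
  def repartitionToOne(parts: Seq[Seq[String]]): Seq[Seq[String]] = Seq(parts.flatten)
}
```

In this model the local sort only matches the global sort once the rows share a partition, which is why a test that checks exact row order needs to control partitioning first.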
[GitHub] [spark] viirya commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions
Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r445935219
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -39,6 +39,14 @@ object NestedColumnAliasing {
NestedColumnAliasing.replaceToAliases(plan, nestedFieldToAlias, attrToAliases)
}
+ case Project(projectList, Filter(condition, child))
Review comment:
I think we had better leave a few comments explaining this case.
[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-657230025
[GitHub] [spark] viirya commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448743319
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -152,7 +168,7 @@ object NestedColumnAliasing {
val exclusiveAttrSet = AttributeSet(exclusiveAttrs ++ otherRootReferences)
val aliasSub = nestedFieldReferences.asInstanceOf[Seq[ExtractValue]]
.filter(!_.references.subsetOf(exclusiveAttrSet))
- .groupBy(_.references.head)
+ .groupBy(t => (t.references.head.exprId, t.references.head.dataType))
Review comment:
I see. I think we can simply clean up the `qualifier` of the returned `AttributeReference` in `collectRootReferenceAndExtractValue`.
We don't need the `qualifier`; we only use `exprId` and `dataType`.
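A toy sketch of the grouping issue under discussion. `Attr` and `NestedField` are hypothetical case classes standing in for Catalyst's attribute and extract-value expressions; the only assumption carried over is that attribute equality takes the qualifier into account.

```scala
object QualifierGroupingSketch {
  // Toy model of an attribute reference; case-class equality includes `qualifier`.
  case class Attr(exprId: Long, dataType: String, qualifier: Seq[String])
  // Toy model of a nested field access such as `name.first`.
  case class NestedField(root: Attr, field: String)

  val fields: Seq[NestedField] = Seq(
    NestedField(Attr(1L, "struct<first:string,last:string>", Seq("contacts")), "first"),
    NestedField(Attr(1L, "struct<first:string,last:string>", Seq.empty), "last")
  )

  // Grouping on the whole attribute splits one column into two groups.
  val byAttr = fields.groupBy(_.root)
  // Grouping on (exprId, dataType), as in the diff, keeps them together.
  val byIdAndType = fields.groupBy(f => (f.root.exprId, f.root.dataType))
  // Dropping the qualifier first has the same effect and keeps an attribute as the key.
  val byUnqualified = fields.groupBy(_.root.copy(qualifier = Seq.empty))
}
```

Both alternatives collapse the two references into one group; stripping the qualifier has the side benefit that the group key is still an attribute rather than a tuple.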
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r458548850
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -39,6 +39,20 @@ object NestedColumnAliasing {
NestedColumnAliasing.replaceToAliases(plan, nestedFieldToAlias, attrToAliases)
}
+ /**
+ * This pattern is needed to support [[Filter]] plan cases like
+ * [[Project]]->[[Filter]]->listed plan in `canProjectPushThrough` (e.g., [[Window]]).
+ * The reason why we don't simply add [[Filter]] in `canProjectPushThrough` is that
+ * the optimizer can hit an infinite loop during the [[PushDownPredicates]] rule.
+ */
+ case Project(projectList, Filter(condition, child))
Review comment:
BTW, this is logically a little odd to me because the second pattern is narrower than the first. In Scala, we usually put the more specific patterns first: `case Project(projectList, Filter(condition, child))` is more specific than the previous pattern, `case Project(projectList, child)`. Can we swap this case (line 48) with the previous case (line 34)? Would that break anything?
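To illustrate the ordering concern, here is a minimal sketch with toy plan classes. `Plan`, `Leaf`, `Filter`, `Project`, and `canPushThrough` are hypothetical stand-ins, not the Catalyst classes or the real `canProjectPushThrough`. Scala tries `case` clauses top to bottom, so with the general pattern first, the narrower `Project(_, Filter(...))` case is reached only because the guard on the first case rejects `Filter`.

```scala
object PatternOrderSketch {
  // Toy stand-ins for logical plan nodes; not the Catalyst classes.
  sealed trait Plan
  case class Leaf(name: String) extends Plan
  case class Filter(condition: String, child: Plan) extends Plan
  case class Project(fields: Seq[String], child: Plan) extends Plan

  // Hypothetical guard playing the role of `canProjectPushThrough`.
  def canPushThrough(p: Plan): Boolean = p match {
    case _: Leaf => true
    case _ => false
  }

  // General pattern first: the Filter case runs only when the first guard fails.
  def generalFirst(p: Plan): String = p match {
    case Project(_, child) if canPushThrough(child) => "project-over-child"
    case Project(_, Filter(_, child)) if canPushThrough(child) => "project-over-filter"
    case _ => "no-match"
  }

  // Specific pattern first: the result no longer depends on what the guard rejects.
  def specificFirst(p: Plan): String = p match {
    case Project(_, Filter(_, child)) if canPushThrough(child) => "project-over-filter"
    case Project(_, child) if canPushThrough(child) => "project-over-child"
    case _ => "no-match"
  }
}
```

In this toy both orderings give the same answers, which suggests the swap is a readability change as long as the guard keeps excluding `Filter`; whether that holds for the real rule is for the PR to confirm.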
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656497640
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125536/
Test FAILed.
[GitHub] [spark] maropu commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r445512898
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -32,7 +32,9 @@ object NestedColumnAliasing {
def unapply(plan: LogicalPlan): Option[LogicalPlan] = plan match {
case Project(projectList, child)
- if SQLConf.get.nestedSchemaPruningEnabled && canProjectPushThrough(child) =>
+ if SQLConf.get.nestedSchemaPruningEnabled &&
+ (canProjectPushThrough(child) ||
+ getChild(child).exists(canProjectPushThrough)) =>
Review comment:
Btw, I feel the title and the PR description are not accurate because this PR is not only for supporting the window case in nested pruning. Could you make them clearer?
[GitHub] [spark] viirya commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448753272
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -172,13 +188,23 @@ object NestedColumnAliasing {
(f, Alias(f, s"_gen_alias_${exprId.id}")(exprId, Seq.empty, None))
}
+
+ // Do deduplication based on semanticEquals, and then sum.
+ val nestedFieldNum = nestedFieldToAlias
+ .foldLeft(Seq[ExtractValue]()) {
+ (unique, curr) => if (!unique.exists(curr._1.semanticEquals(_))) {
+ curr._1 +: unique
+ } else {
+ unique
+ }
+ }
+ .map { t => totalFieldNum(t.dataType) }
+ .sum
Review comment:
The problem with your approach is that which one is selected from the `ExtractValue`s with different qualifiers is non-deterministic.
Later, when we query the `nestedFieldToAlias` map, you might fail to find the corresponding item in the map because of the qualifier difference.
I think the safer approach is to clean up all qualifiers.
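A toy sketch of the lookup failure described here. `Attr`, `NestedField`, and this `removeQualifier` are hypothetical simplifications, not the Catalyst types; the alias name is made up.

```scala
object QualifierLookupSketch {
  // Toy model: case-class equality (and hashing) includes `qualifier`.
  case class Attr(exprId: Long, qualifier: Seq[String])
  case class NestedField(root: Attr, field: String)

  // Hypothetical normalization that strips the qualifier from the root attribute.
  def removeQualifier(f: NestedField): NestedField =
    f.copy(root = f.root.copy(qualifier = Seq.empty))

  val qualified = NestedField(Attr(1L, Seq("contacts")), "first")
  val unqualified = NestedField(Attr(1L, Seq.empty), "first")

  // Map keyed by whichever variant happened to survive deduplication.
  val rawMap: Map[NestedField, String] = Map(qualified -> "_gen_alias_0")
  // Map keyed by the normalized form.
  val cleanMap: Map[NestedField, String] = Map(removeQualifier(qualified) -> "_gen_alias_0")
}
```

Looking up `unqualified` in `rawMap` misses even though both values refer to the same field, while normalizing both the keys and the query makes the lookup independent of which variant was kept.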
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -172,13 +188,23 @@ object NestedColumnAliasing {
(f, Alias(f, s"_gen_alias_${exprId.id}")(exprId, Seq.empty, None))
}
+
+ // Do deduplication based on semanticEquals, and then sum.
+ val nestedFieldNum = nestedFieldToAlias
+ .foldLeft(Seq[ExtractValue]()) {
+ (unique, curr) => if (!unique.exists(curr._1.semanticEquals(_))) {
+ curr._1 +: unique
+ } else {
+ unique
+ }
+ }
+ .map { t => totalFieldNum(t.dataType) }
+ .sum
Review comment:
Oh, I think it is also due to the different `qualifier`s in the `ExtractValue`s. As above, I think we can just clean up the `qualifier`.
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -172,13 +188,23 @@ object NestedColumnAliasing {
(f, Alias(f, s"_gen_alias_${exprId.id}")(exprId, Seq.empty, None))
}
+
+ // Do deduplication based on semanticEquals, and then sum.
+ val nestedFieldNum = nestedFieldToAlias
+ .foldLeft(Seq[ExtractValue]()) {
+ (unique, curr) => if (!unique.exists(curr._1.semanticEquals(_))) {
+ curr._1 +: unique
+ } else {
+ unique
+ }
+ }
+ .map { t => totalFieldNum(t.dataType) }
+ .sum
Review comment:
Like:
```scala
def removeQualifier(f: ExtractValue): ExtractValue = {
f.transform {
case a: AttributeReference => a.withQualifier(Seq.empty)
}.asInstanceOf[ExtractValue]
}
```
```scala
val dedupNestedFields = nestedFields.filter {
case e @ (_: GetStructField | _: GetArrayStructFields) =>
val child = e.children.head
nestedFields.forall(f => child.find(_.semanticEquals(f)).isEmpty)
case _ => true
}.map(removeQualifier)
```
And when we need to query the map, we do:
```scala
nestedFieldToAlias.contains(removeQualifier(f))
```
or
```scala
nestedFieldToAlias(removeQualifier(f)).toAttribute
```
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -152,7 +168,7 @@ object NestedColumnAliasing {
val exclusiveAttrSet = AttributeSet(exclusiveAttrs ++ otherRootReferences)
val aliasSub = nestedFieldReferences.asInstanceOf[Seq[ExtractValue]]
.filter(!_.references.subsetOf(exclusiveAttrSet))
- .groupBy(_.references.head)
+ .groupBy(t => (t.references.head.exprId, t.references.head.dataType))
Review comment:
We don't need to change `collectRootReferenceAndExtractValue`.
```scala
val aliasSub = nestedFieldReferences.asInstanceOf[Seq[ExtractValue]]
.filter(!_.references.subsetOf(exclusiveAttrSet))
.groupBy(_.references.head.withQualifier(Seq.empty))
...
```
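To compare the two grouping keys discussed here, a minimal sketch in plain Scala follows. It is not Catalyst code: `Attr` and `Field` are hypothetical stand-ins, and `withQualifier` is modeled as a simple copy.

```scala
object GroupBySketch {
  // Toy stand-in for AttributeReference; dataType is a plain string here.
  final case class Attr(name: String, exprId: Long, dataType: String, qualifier: Seq[String]) {
    def withQualifier(q: Seq[String]): Attr = copy(qualifier = q)
  }
  final case class Field(root: Attr, fieldName: String)

  val fields: Seq[Field] = Seq(
    Field(Attr("name", 1L, "struct", Seq("contacts")), "first"),
    Field(Attr("name", 1L, "struct", Seq.empty), "last"),
    Field(Attr("address", 2L, "struct", Seq.empty), "city"))

  // Grouping by the raw root splits the two "name" fields apart.
  val byRawRoot = fields.groupBy(_.root)

  // Both alternatives from the discussion put them in one group.
  val byIdAndType = fields.groupBy(f => (f.root.exprId, f.root.dataType))
  val byCleanRoot = fields.groupBy(_.root.withQualifier(Seq.empty))
}
```

In this model both keys give the same partitioning; the qualifier-stripped key keeps an attribute as the map key, so downstream code that expects an attribute there would not need to change.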
[GitHub] [spark] frankyin-factual edited a comment on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions
Posted by GitBox <gi...@apache.org>.
frankyin-factual edited a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-650024534
Jenkins, retest this please
[GitHub] [spark] SparkQA removed a comment on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649954378
**[Test build #124529 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124529/testReport)** for PR 28898 at commit [`652c77f`](https://github.com/apache/spark/commit/652c77fdbbfa468271e783e1492f72f4785c9880).
[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-657993700
**[Test build #125795 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125795/testReport)** for PR 28898 at commit [`2637974`](https://github.com/apache/spark/commit/2637974650f232a6aad83e0e4a3e1fdf03def401).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-655200055
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-653012320
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124882/
Test FAILed.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652842493
[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652044031
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-662241141
[GitHub] [spark] viirya commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r454012695
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaPruningSuite.scala
##########
@@ -460,6 +460,67 @@ abstract class SchemaPruningSuite
checkAnswer(query4, Row(2, null) :: Row(2, 4) :: Nil)
}
+ testSchemaPruning("select nested field in window function") {
+ val windowSql =
+ """
+ |with contact_rank as (
+ | select row_number() over (partition by address order by id desc) as rank,
+ | contacts.*
+ | from contacts
+ |)
+ |select name.first, rank from contact_rank
+ |where name.first = 'Jane' AND rank = 1
+ |""".stripMargin
+ val query = sql(windowSql)
+ checkScan(query, "struct<id:int,name:struct<first:string>,address:string>")
+ checkAnswer(query, Row("Jane", 1) :: Nil)
+ }
+
+ testSchemaPruning("select nested field in window function and then order by") {
+ val windowSql =
+ """
+ |with contact_rank as (
+ | select row_number() over (partition by address order by id desc) as rank,
+ | contacts.*
+ | from contacts
+ | order by name.last, name.first
+ |)
+ |select name.first, rank from contact_rank
+ |""".stripMargin
+ val query = sql(windowSql)
+ checkScan(query, "struct<id:int,name:struct<first:string,last:string>,address:string>")
+ checkAnswer(query,
+ Row("Jane", 1) ::
+ Row("John", 1) ::
+ Row("Janet", 1) ::
+ Row("Jim", 1) :: Nil)
+ }
+
+ testSchemaPruning("select nested field in Sort") {
+ val query1 = sql("select name.first, name.last from contacts order by name.first, name.last")
+ checkScan(query1, "struct<name:struct<first:string,last:string>>")
+ checkAnswer(query1,
+ Row("Jane", "Doe") ::
+ Row("Janet", "Jones") ::
+ Row("Jim", "Jones") ::
+ Row("John", "Doe") :: Nil)
+
+ // Create a repartitioned view because `SORT BY` is a local sort
+ sql("select * from contacts").repartition(1).createOrReplaceTempView("tmp_contacts")
+ val sortBySql =
Review comment:
Wrap this with `withTempView`? Why can't we use `contacts`? Does a local sort make any difference to the test here?
[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r458547437
##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala
##########
@@ -493,6 +491,144 @@ class NestedColumnAliasingSuite extends SchemaPruningTest {
comparePlans(optimized3, expected3)
}
+ test("Nested field pruning for Window") {
+ val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+ val winExpr = windowExpr(RowNumber(), spec)
+ val query = contact
+ .select($"name.first", winExpr.as('window))
+ .orderBy($"name.last".asc)
+ .analyze
+ val optimized = Optimize.execute(query)
+ val aliases = collectGeneratedAliases(optimized)
+ val expected = contact
+ .select($"name.first", $"address", $"id", $"name.last".as(aliases(1)))
+ .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+ .select($"first", $"window", $"${aliases(1)}".as(aliases(0)))
+ .orderBy($"${aliases(0)}".asc)
+ .select($"first", $"window")
+ .analyze
+ comparePlans(optimized, expected)
+ }
+
+ test("Nested field pruning for Filter with other operators") {
+ val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+ val winExpr = windowExpr(RowNumber(), spec)
+ val query1 = contact.select($"name.first", winExpr.as('window))
+ .where($"window" === 1 && $"name.first" === "a")
+ .analyze
+ val optimized1 = Optimize.execute(query1)
+ val aliases1 = collectGeneratedAliases(optimized1)
+ val expected1 = contact
+ .select($"name.first", $"address", $"id", $"name.first".as(aliases1(1)))
+ .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+ .select($"first", $"${aliases1(1)}".as(aliases1(0)), $"window")
+ .where($"window" === 1 && $"${aliases1(0)}" === "a")
+ .select($"first", $"window")
+ .analyze
+ comparePlans(optimized1, expected1)
+
+ val query2 = contact.sortBy($"name.first".asc)
+ .where($"name.first" === "a")
+ .select($"name.first")
+ .analyze
+ val optimized2 = Optimize.execute(query2)
+ val aliases2 = collectGeneratedAliases(optimized2)
+ val expected2 = contact
+ .select($"name.first".as(aliases2(1)))
+ .sortBy($"${aliases2(1)}".asc)
+ .select($"${aliases2(1)}".as(aliases2(0)))
+ .where($"${aliases2(0)}" === "a")
+ .select($"${aliases2(0)}".as("first"))
+ .analyze
+ comparePlans(optimized2, expected2)
+
+ val query3 = contact.distribute($"name.first")(100)
+ .where($"name.first" === "a")
+ .select($"name.first")
+ .analyze
+ val optimized3 = Optimize.execute(query3)
+ val aliases3 = collectGeneratedAliases(optimized3)
+ val expected3 = contact
+ .select($"name.first".as(aliases3(1)))
+ .distribute($"${aliases3(1)}")(100)
+ .select($"${aliases3(1)}".as(aliases3(0)))
+ .where($"${aliases3(0)}" === "a")
+ .select($"${aliases3(0)}".as("first"))
+ .analyze
+ comparePlans(optimized3, expected3)
+
+ val department = LocalRelation(
+ 'depID.int,
+ 'personID.string)
+ val query4 = contact.join(department, condition = Some($"id" === $"depID"))
+ .where($"name.first" === "a")
+ .select($"name.first")
+ .analyze
+ val optimized4 = Optimize.execute(query4)
+ val aliases4 = collectGeneratedAliases(optimized4)
+ val expected4 = contact
+ .select($"id", $"name.first".as(aliases4(1)))
+ .join(department.select('depID), condition = Some($"id" === $"depID"))
+ .select($"${aliases4(1)}".as(aliases4(0)))
+ .where($"${aliases4(0)}" === "a")
+ .select($"${aliases4(0)}".as("first"))
+ .analyze
+ comparePlans(optimized4, expected4)
+
+ def runTest(basePlan: LogicalPlan => LogicalPlan): Unit = {
+ val query = basePlan(contact)
+ .where($"name.first" === "a")
+ .select($"name.first")
+ .analyze
+ val optimized = Optimize.execute(query)
+ val aliases = collectGeneratedAliases(optimized)
+ val expected = basePlan(contact
+ .select($"name.first".as(aliases(0))))
+ .where($"${aliases(0)}" === "a")
+ .select($"${aliases(0)}".as("first"))
+ .analyze
+ comparePlans(optimized, expected)
+ }
+ Seq(
+ (plan: LogicalPlan) => plan.limit(100),
+ (plan: LogicalPlan) => plan.repartition(100),
+ (plan: LogicalPlan) => Sample(0.0, 0.6, false, 11L, plan)).foreach { base =>
+ runTest(base)
+ }
Review comment:
This tests the combination `Filter -> Sample/GlobalLimit/LocalLimit/Repartition`, which is why it sits under this test name: it covers `Filter` combined with other children that can be pushed through.
[GitHub] [spark] dongjoon-hyun edited a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-662257917
@frankyin-factual, thank you for updating. In general, it's a nice improvement.
- For the test cases, you may have a different idea.
- For the `unapply` pattern, I believe we need more comments on that code path because it looks logically suspicious.
I'll review again tomorrow with fresh eyes, and build and test more myself. That helps me review. (For me, it's 11 PM since I'm in the PST timezone.)
[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r454033847
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaPruningSuite.scala
##########
@@ -460,6 +460,67 @@ abstract class SchemaPruningSuite
checkAnswer(query4, Row(2, null) :: Row(2, 4) :: Nil)
}
+ testSchemaPruning("select nested field in window function") {
+ val windowSql =
+ """
+ |with contact_rank as (
+ | select row_number() over (partition by address order by id desc) as rank,
+ | contacts.*
+ | from contacts
+ |)
+ |select name.first, rank from contact_rank
+ |where name.first = 'Jane' AND rank = 1
+ |""".stripMargin
+ val query = sql(windowSql)
+ checkScan(query, "struct<id:int,name:struct<first:string>,address:string>")
+ checkAnswer(query, Row("Jane", 1) :: Nil)
+ }
+
+ testSchemaPruning("select nested field in window function and then order by") {
+ val windowSql =
+ """
+ |with contact_rank as (
+ | select row_number() over (partition by address order by id desc) as rank,
+ | contacts.*
+ | from contacts
+ | order by name.last, name.first
+ |)
+ |select name.first, rank from contact_rank
+ |""".stripMargin
+ val query = sql(windowSql)
+ checkScan(query, "struct<id:int,name:struct<first:string,last:string>,address:string>")
+ checkAnswer(query,
+ Row("Jane", 1) ::
+ Row("John", 1) ::
+ Row("Janet", 1) ::
+ Row("Jim", 1) :: Nil)
+ }
+
+ testSchemaPruning("select nested field in Sort") {
+ val query1 = sql("select name.first, name.last from contacts order by name.first, name.last")
+ checkScan(query1, "struct<name:struct<first:string,last:string>>")
+ checkAnswer(query1,
+ Row("Jane", "Doe") ::
+ Row("Janet", "Jones") ::
+ Row("Jim", "Jones") ::
+ Row("John", "Doe") :: Nil)
+
+ // Create a repartitioned view because `SORT BY` is a local sort
+ sql("select * from contacts").repartition(1).createOrReplaceTempView("tmp_contacts")
+ val sortBySql =
Review comment:
Updated. Thanks.
[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649907836
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652826054
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124842/
Test FAILed.
[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-650708222
**[Test build #124592 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124592/testReport)** for PR 28898 at commit [`acce8c5`](https://github.com/apache/spark/commit/acce8c5d8d51bae5f981e56a8811f075cb07d214).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649311526
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649907836
[GitHub] [spark] maropu commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r446499537
##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala
##########
@@ -493,6 +491,58 @@ class NestedColumnAliasingSuite extends SchemaPruningTest {
comparePlans(optimized3, expected3)
}
+ test("Nested field pruning for window functions") {
+ val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+ val winExpr = windowExpr(RowNumber().toAggregateExpression(), spec)
+ val query = contact.select($"name.first", winExpr.as('window))
+ .where($"window" === 1 && $"name.first" === "a")
+ .analyze
+ val optimized = Optimize.execute(query)
+ val aliases = collectGeneratedAliases(optimized)
+ val expected = contact
+ .select($"name.first", $"address", $"id", $"name.first".as(aliases(1)))
+ .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+ .select($"first", $"${aliases(1)}".as(aliases(0)), $"window")
+ .where($"window" === 1 && $"${aliases(0)}" === "a")
+ .select($"first", $"window")
+ .analyze
+ comparePlans(optimized, expected)
+ }
+
+ test("Nested field pruning for orderBy") {
Review comment:
Actually, `Nested field pruning for Sort`
[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-653010073
**[Test build #124882 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124882/testReport)** for PR 28898 at commit [`b5292bd`](https://github.com/apache/spark/commit/b5292bd9b602e32e7f460a2b1b69ddf0f3633bf3).
* This patch **fails PySpark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448759360
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -172,13 +188,23 @@ object NestedColumnAliasing {
(f, Alias(f, s"_gen_alias_${exprId.id}")(exprId, Seq.empty, None))
}
+
+ // Do deduplication based on semanticEquals, and then sum.
+ val nestedFieldNum = nestedFieldToAlias
+ .foldLeft(Seq[ExtractValue]()) {
+ (unique, curr) => if (!unique.exists(curr._1.semanticEquals(_))) {
+ curr._1 +: unique
+ } else {
+ unique
+ }
+ }
+ .map { t => totalFieldNum(t.dataType) }
+ .sum
Review comment:
I think you might have replied to the wrong comment, but for that `groupBy`, only `exprId` and `dataType` are referenced. So the situation you described shouldn't matter.
[GitHub] [spark] SparkQA removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652734341
**[Test build #124842 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124842/testReport)** for PR 28898 at commit [`d043e38`](https://github.com/apache/spark/commit/d043e3846432981ac7f9b85cb4515499ab6f0118).
[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-664068067
**[Test build #126598 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126598/testReport)** for PR 28898 at commit [`a0b8d07`](https://github.com/apache/spark/commit/a0b8d070f027460dd1e5fdbd7dc35d0440450b0a).
[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448793072
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -152,7 +168,7 @@ object NestedColumnAliasing {
val exclusiveAttrSet = AttributeSet(exclusiveAttrs ++ otherRootReferences)
val aliasSub = nestedFieldReferences.asInstanceOf[Seq[ExtractValue]]
.filter(!_.references.subsetOf(exclusiveAttrSet))
- .groupBy(_.references.head)
+ .groupBy(t => (t.references.head.exprId, t.references.head.dataType))
Review comment:
Sure.
[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652841875
**[Test build #124882 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124882/testReport)** for PR 28898 at commit [`b5292bd`](https://github.com/apache/spark/commit/b5292bd9b602e32e7f460a2b1b69ddf0f3633bf3).
[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649976637
**[Test build #124527 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124527/testReport)** for PR 28898 at commit [`ab39d24`](https://github.com/apache/spark/commit/ab39d245660c16c0c11d0a37f73f84f74afd7951).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656520197
**[Test build #125565 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125565/testReport)** for PR 28898 at commit [`a7e885a`](https://github.com/apache/spark/commit/a7e885a3c5f09f9ca623777bdabcd05e664f3774).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656944368
[GitHub] [spark] SparkQA removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-662240866
[GitHub] [spark] viirya commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448715891
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -172,13 +188,23 @@ object NestedColumnAliasing {
(f, Alias(f, s"_gen_alias_${exprId.id}")(exprId, Seq.empty, None))
}
+
+ // Do deduplication based on semanticEquals, and then sum.
+ val nestedFieldNum = nestedFieldToAlias
+ .foldLeft(Seq[ExtractValue]()) {
+ (unique, curr) => if (!unique.exists(curr._1.semanticEquals(_))) {
+ curr._1 +: unique
+ } else {
+ unique
+ }
+ }
+ .map { t => totalFieldNum(t.dataType) }
+ .sum
Review comment:
Why do we need to deduplicate again?
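For reference, the fold in the hunk above can be sketched in isolation. This assumes a hypothetical `Field` class whose `semanticEquals` ignores a cosmetic `label`; it is not Catalyst's `ExtractValue`, and `width` merely stands in for whatever `totalFieldNum` would return:

```scala
// Hypothetical expression type; semanticEquals ignores the cosmetic `label`.
final case class Field(path: String, label: String, width: Int) {
  def semanticEquals(other: Field): Boolean = path == other.path
}

object DedupDemo {
  val fields: Seq[Field] = Seq(
    Field("name.first", "a", 1),
    Field("name.first", "b", 1), // semantically equal to the first entry
    Field("name.last", "c", 1)
  )

  // Keep an element only if no semantically equal one was kept already.
  val unique: Seq[Field] = fields.foldLeft(Seq.empty[Field]) { (acc, curr) =>
    if (acc.exists(curr.semanticEquals)) acc else curr +: acc
  }

  // Sum a per-element size over the deduplicated elements.
  val total: Int = unique.map(_.width).sum
}
```

In this sketch `fields.distinct` would keep all three entries because plain equality also compares `label`, so a fold over `semanticEquals` only matters if semantically equal duplicates can still be present at that point.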
[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-662266294
[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
Posted by GitBox <gi...@apache.org>.
frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r445903857
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -32,7 +32,9 @@ object NestedColumnAliasing {
def unapply(plan: LogicalPlan): Option[LogicalPlan] = plan match {
case Project(projectList, child)
- if SQLConf.get.nestedSchemaPruningEnabled && canProjectPushThrough(child) =>
+ if SQLConf.get.nestedSchemaPruningEnabled &&
+ (canProjectPushThrough(child) ||
+ getChild(child).exists(canProjectPushThrough)) =>
Review comment:
Yeah, I will update this PR later tonight.
[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-657199334
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649976746
Merged build finished. Test FAILed.
[GitHub] [spark] frankyin-factual commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
frankyin-factual commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656523393
This might be a record for how fast the test is failing. :(
[GitHub] [spark] SparkQA removed a comment on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649907459
**[Test build #124524 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124524/testReport)** for PR 28898 at commit [`3c8cf11`](https://github.com/apache/spark/commit/3c8cf110b19bc5d0c9e89a8a031e6e4a557aa1b3).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656944368
[GitHub] [spark] maropu commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions
Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r446065768
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -39,6 +39,14 @@ object NestedColumnAliasing {
NestedColumnAliasing.replaceToAliases(plan, nestedFieldToAlias, attrToAliases)
}
+ case Project(projectList, Filter(condition, child))
Review comment:
+1
[GitHub] [spark] dongjoon-hyun commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-662240643
Retest this please.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-657904956
[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-655334501
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649234400
[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649310567
**[Test build #124511 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124511/testReport)** for PR 28898 at commit [`ef21b63`](https://github.com/apache/spark/commit/ef21b6352a825c3c779f8fd5ffa6025ec77d372e).
[GitHub] [spark] maropu commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-657154379
retest this please
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-647849466
Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-657199334
[GitHub] [spark] maropu closed pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans
Posted by GitBox <gi...@apache.org>.
maropu closed pull request #28898:
URL: https://github.com/apache/spark/pull/28898
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652734686
[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649954884
[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-662240866
**[Test build #126296 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126296/testReport)** for PR 28898 at commit [`2637974`](https://github.com/apache/spark/commit/2637974650f232a6aad83e0e4a3e1fdf03def401).
[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652842493
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r458548850
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -39,6 +39,20 @@ object NestedColumnAliasing {
NestedColumnAliasing.replaceToAliases(plan, nestedFieldToAlias, attrToAliases)
}
+ /**
+ * This pattern is needed to support [[Filter]] plan cases like
+ * [[Project]]->[[Filter]]->listed plan in `canProjectPushThrough` (e.g., [[Window]]).
+ * The reason why we don't simply add [[Filter]] in `canProjectPushThrough` is that
+ * the optimizer can hit an infinite loop during the [[PushDownPredicates]] rule.
+ */
+ case Project(projectList, Filter(condition, child))
Review comment:
BTW, this is logically a little odd because the second pattern is narrower than the first. In Scala, we usually put the more specific patterns first: `case Project(projectList, Filter(condition, child))` is more specific than the previous pattern `case Project(projectList, child)`. Can we swap this case (line 48) with the previous case (line 34)? Would that break anything?
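The ordering question can be explored with a small sketch. It assumes toy plan nodes and a hypothetical `canPush` guard standing in for `canProjectPushThrough`; none of this is Spark's actual code:

```scala
// Toy plan nodes for illustration only; these are not Spark's LogicalPlan classes.
sealed trait Plan
case object Leaf extends Plan
final case class Filter(child: Plan) extends Plan
final case class Project(child: Plan) extends Plan

object MatchOrderDemo {
  // General (guarded) pattern first, specific pattern second.
  def generalFirst(p: Plan, canPush: Plan => Boolean): String = p match {
    case Project(child) if canPush(child) => "general"
    case Project(Filter(_))               => "specific"
    case _                                => "other"
  }

  // Specific pattern first, general (guarded) pattern second.
  def specificFirst(p: Plan, canPush: Plan => Boolean): String = p match {
    case Project(Filter(_))               => "specific"
    case Project(child) if canPush(child) => "general"
    case _                                => "other"
  }

  val plan: Plan = Project(Filter(Leaf))

  // A guard that never accepts Filter, mirroring the code comment's statement
  // that Filter is deliberately not listed in canProjectPushThrough.
  val rejectsFilter: Plan => Boolean = {
    case Filter(_) => false
    case _         => true
  }
}
```

In this toy model the two orderings differ only when the guard accepts a `Filter` child. With a guard that rejects `Filter`, both functions return "specific" for `Project(Filter(Leaf))`, which suggests (but does not prove for the real rule) that swapping the cases would be behavior-preserving.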
[GitHub] [spark] maropu commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r446499982
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaPruningSuite.scala
##########
@@ -460,6 +460,40 @@ abstract class SchemaPruningSuite
checkAnswer(query4, Row(2, null) :: Row(2, 4) :: Nil)
}
+ testSchemaPruning("select nested field in window function") {
+ val windowSql =
+ """
+ |with contact_rank as (
+ | select row_number() over (partition by address order by id desc) as __rank,
Review comment:
nit: `__rank` -> `rank`
[GitHub] [spark] maropu commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r445251813
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -113,6 +113,11 @@ object NestedColumnAliasing {
case _: Sample => true
case _: RepartitionByExpression => true
case _: Join => true
+ case x: Filter => x.child match {
+ case _: Window => true
Review comment:
Why do we need this? Can't we support the window case with just the change `case _: Window => true`?
[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652043485
**[Test build #124691 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124691/testReport)** for PR 28898 at commit [`c95633e`](https://github.com/apache/spark/commit/c95633e7a242e622900c596c03dd7d3c06441732).
[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-650192029
[GitHub] [spark] maropu commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r445903069
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -32,7 +32,9 @@ object NestedColumnAliasing {
def unapply(plan: LogicalPlan): Option[LogicalPlan] = plan match {
case Project(projectList, child)
- if SQLConf.get.nestedSchemaPruningEnabled && canProjectPushThrough(child) =>
+ if SQLConf.get.nestedSchemaPruningEnabled &&
+ (canProjectPushThrough(child) ||
+ getChild(child).exists(canProjectPushThrough)) =>
Review comment:
> How about using my proposal at #28898 (review)?
If we cannot, then yes, I think we need special handling for `Filter`, as @viirya suggested above.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-655331929
Merged build finished. Test FAILed.
[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r451224944
##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala
##########
@@ -493,6 +491,156 @@ class NestedColumnAliasingSuite extends SchemaPruningTest {
comparePlans(optimized3, expected3)
}
+ test("Nested field pruning for Window") {
+ val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+ val winExpr = windowExpr(RowNumber(), spec)
+
+ val query1 = contact
+ .select($"name.first", winExpr.as('window))
+ .analyze
+ val optimized1 = Optimize.execute(query1)
+ val expected1 = contact
+ .select($"name.first", $"address", $"id")
+ .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+ .select($"first", $"window")
+ .analyze
+ comparePlans(optimized1, expected1)
+
+ val query2 = contact
+ .select($"name.first", winExpr.as('window))
+ .orderBy($"name.last".asc)
+ .analyze
+ val optimized2 = Optimize.execute(query2)
+ val aliases2 = collectGeneratedAliases(optimized2)
+ val expected2 = contact
+ .select($"name.first", $"address", $"id", $"name.last".as(aliases2(1)))
+ .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+ .select($"first", $"window", $"${aliases2(1)}".as(aliases2(0)))
+ .orderBy($"${aliases2(0)}".asc)
+ .select($"first", $"window")
+ .analyze
+ comparePlans(optimized2, expected2)
+ }
+
+ test("Nested field pruning for Filter") {
Review comment:
Name changed.
[GitHub] [spark] maropu commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649152732
ok to test
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-655334501
Merged build finished. Test FAILed.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-655633032
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125337/
[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656520292
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-650708349
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124592/
[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-663533813
**[Test build #126477 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/126477/testReport)** for PR 28898 at commit [`a0b8d07`](https://github.com/apache/spark/commit/a0b8d070f027460dd1e5fdbd7dc35d0440450b0a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656514679
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-663534557
[GitHub] [spark] viirya commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions
Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r445935343
##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala
##########
@@ -493,6 +491,58 @@ class NestedColumnAliasingSuite extends SchemaPruningTest {
comparePlans(optimized3, expected3)
}
+ test("Nested field pruning for window functions") {
+ val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+ val winExpr = windowExpr(RowNumber().toAggregateExpression(), spec)
+ val query1 = contact.select($"name.first", winExpr.as('window))
+ .where($"window" === 1 && $"name.first" === "a")
+ .analyze
+ val optimized1 = Optimize.execute(query1)
+ val aliases1 = collectGeneratedAliases(optimized1)
+ val expected1 = contact
+ .select($"name.first", $"address", $"id", $"name.first".as(aliases1(1)))
+ .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+ .select($"first", $"${aliases1(1)}".as(aliases1(0)), $"window")
+ .where($"window" === 1 && $"${aliases1(0)}" === "a")
+ .select($"first", $"window")
+ .analyze
+ comparePlans(optimized1, expected1)
+ }
+
+ test("Nested field pruning for orderBy") {
+ val query1 = contact.select($"name.first", $"name.last")
+ .orderBy($"name.first".asc, $"name.last".asc)
+ .analyze
+ val optimized1 = Optimize.execute(query1)
+ val aliases1 = collectGeneratedAliases(optimized1)
+ val expected1 = contact
+ .select($"name.first",
+ $"name.last",
+ $"name.first".as(aliases1(0)),
+ $"name.last".as(aliases1(1)))
+ .orderBy($"${aliases1(0)}".asc, $"${aliases1(1)}".asc)
+ .select($"first", $"last")
+ .analyze
+ comparePlans(optimized1, expected1)
+ }
+
+ test("Nested field pruning for sirtBy") {
Review comment:
Do you mean sortBy?
[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448806690
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -152,7 +168,12 @@ object NestedColumnAliasing {
val exclusiveAttrSet = AttributeSet(exclusiveAttrs ++ otherRootReferences)
val aliasSub = nestedFieldReferences.asInstanceOf[Seq[ExtractValue]]
.filter(!_.references.subsetOf(exclusiveAttrSet))
- .groupBy(_.references.head)
+ .groupBy(t => (t.references.head.exprId, t.references.head.dataType))
+ // The above groupBy is to avoid situation like the following situation.
+ // For example, `exprIdA -> List(a, b)` and `exprIdA -> List(c, d)`
Review comment:
Just pushed another comment.
[GitHub] [spark] viirya commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448774792
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -172,13 +188,23 @@ object NestedColumnAliasing {
(f, Alias(f, s"_gen_alias_${exprId.id}")(exprId, Seq.empty, None))
}
+
+ // Do deduplication based on semanticEquals, and then sum.
+ val nestedFieldNum = nestedFieldToAlias
+ .foldLeft(Seq[ExtractValue]()) {
+ (unique, curr) => if (!unique.exists(curr._1.semanticEquals(_))) {
+ curr._1 +: unique
+ } else {
+ unique
+ }
+ }
+ .map { t => totalFieldNum(t.dataType) }
+ .sum
Review comment:
No, I mean this comment thread.
I am not sure whether you are aware of it: the reason you need to deduplicate here is that semantically equal `ExtractValue`s are applied to attributes with different qualifiers. For example, there are two `name.first` expressions, but one refers to `name` with qualifier `a` and the other to `name` with qualifier `b`.
I ran a test using your query with all qualifiers cleaned up, as I showed, and it works well.
What I said in the comment above is that you select an arbitrary `ExtractValue` from the ones that differ only in qualifier, but later we look into the map using a given `ExtractValue`. This can fail when you select the `name.first` with qualifier `a` but later look up the map using the `name.first` with qualifier `b`.
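For illustration only (this is not code from the PR): a minimal sketch of the lookup failure described above, using simplified stand-ins for the Catalyst classes. `Attr`, `GetField`, `canonical`, and the alias name are hypothetical; the sketch only assumes that attribute equality takes the qualifier into account while semantic equality does not.

```scala
// Simplified stand-ins for Catalyst classes; every name here is hypothetical.
object QualifierLookupSketch {
  // An attribute identified by its exprId; the qualifier is only how it was referenced.
  final case class Attr(name: String, exprId: Long, qualifier: Seq[String])

  // A nested field access such as `name.first`.
  final case class GetField(child: Attr, field: String) {
    // Analogue of semanticEquals: the qualifier is ignored.
    def semanticEquals(other: GetField): Boolean =
      child.exprId == other.child.exprId && field == other.field

    // Analogue of a canonicalized form: the same expression with the qualifier stripped.
    def canonical: GetField = copy(child = child.copy(qualifier = Nil))
  }

  val viaA: GetField = GetField(Attr("name", 1L, Seq("a")), "first")
  val viaB: GetField = GetField(Attr("name", 1L, Seq("b")), "first")

  // Deduplicating by semanticEquals keeps one arbitrary representative (here viaA).
  val aliasByRawKey: Map[GetField, String] = Map(viaA -> "_gen_alias_1")

  // Looking up with the other representative misses, because case-class
  // equality still compares the qualifier.
  val rawLookup: Option[String] = aliasByRawKey.get(viaB)

  // Keying and looking up by the canonical form makes both spellings hit.
  val aliasByCanonicalKey: Map[GetField, String] = Map(viaA.canonical -> "_gen_alias_1")
  val canonicalLookup: Option[String] = aliasByCanonicalKey.get(viaB.canonical)
}
```

In this sketch the two expressions are semantically equal, yet the raw map lookup returns `None`, while the lookup on canonical keys finds the alias.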
[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-647849466
Can one of the admins verify this patch?
[GitHub] [spark] viirya commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r445623584
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -32,7 +32,9 @@ object NestedColumnAliasing {
def unapply(plan: LogicalPlan): Option[LogicalPlan] = plan match {
case Project(projectList, child)
- if SQLConf.get.nestedSchemaPruningEnabled && canProjectPushThrough(child) =>
+ if SQLConf.get.nestedSchemaPruningEnabled &&
+ (canProjectPushThrough(child) ||
+ getChild(child).exists(canProjectPushThrough)) =>
Review comment:
How about using my proposal?
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -32,7 +32,9 @@ object NestedColumnAliasing {
def unapply(plan: LogicalPlan): Option[LogicalPlan] = plan match {
case Project(projectList, child)
- if SQLConf.get.nestedSchemaPruningEnabled && canProjectPushThrough(child) =>
+ if SQLConf.get.nestedSchemaPruningEnabled &&
+ (canProjectPushThrough(child) ||
+ getChild(child).exists(canProjectPushThrough)) =>
Review comment:
I don't think this is the correct fix. It will push through a child that should not be pushed through.
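For illustration only (not code from the PR): a small sketch of the concern, using a toy plan type. `Opaque` is a hypothetical operator that is not on the push-through list, and `looseCheck` assumes that `getChild` returns the child's own children, which the diff above does not show.

```scala
// Toy logical plan; all names are hypothetical stand-ins, not Catalyst classes.
object PushThroughSketch {
  sealed trait Plan { def children: Seq[Plan] }
  case object Relation extends Plan { val children: Seq[Plan] = Nil }
  final case class Filter(child: Plan) extends Plan { val children: Seq[Plan] = Seq(child) }
  final case class Window(child: Plan) extends Plan { val children: Seq[Plan] = Seq(child) }
  // An operator a Project must NOT be pushed through.
  final case class Opaque(child: Plan) extends Plan { val children: Seq[Plan] = Seq(child) }

  // Operators a Project may be pushed through.
  def canProjectPushThrough(p: Plan): Boolean = p match {
    case _: Window => true
    case _ => false
  }

  // Shape of the criticized condition: accept when the child OR any of its
  // children qualifies, regardless of what the child itself is.
  def looseCheck(child: Plan): Boolean =
    canProjectPushThrough(child) || child.children.exists(canProjectPushThrough)

  // Narrower alternative: only a Filter directly above a Window gets special handling.
  def strictCheck(child: Plan): Boolean = child match {
    case Filter(_: Window) => true
    case other => canProjectPushThrough(other)
  }
}
```

Here `looseCheck(Opaque(Window(Relation)))` is true even though `Opaque` is not eligible, while `strictCheck` rejects it and still accepts `Filter(Window(Relation))`.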
[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-663534557
[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652825077
**[Test build #124842 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124842/testReport)** for PR 28898 at commit [`d043e38`](https://github.com/apache/spark/commit/d043e3846432981ac7f9b85cb4515499ab6f0118).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448713879
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -152,7 +168,7 @@ object NestedColumnAliasing {
val exclusiveAttrSet = AttributeSet(exclusiveAttrs ++ otherRootReferences)
val aliasSub = nestedFieldReferences.asInstanceOf[Seq[ExtractValue]]
.filter(!_.references.subsetOf(exclusiveAttrSet))
- .groupBy(_.references.head)
+ .groupBy(t => (t.references.head.exprId, t.references.head.dataType))
Review comment:
https://github.com/apache/spark/pull/28898#discussion_r447950809
Basically, the same expression can appear with different table aliases, which causes collisions during map insertion.
This groups by the `exprId`, which is the unique identifier.
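For illustration only (not code from the PR): a minimal sketch of why the grouping key matters, using hypothetical stand-in types. The diff above groups by `(exprId, dataType)`; the sketch uses `exprId` alone to keep it short.

```scala
// Hypothetical stand-ins for Catalyst attributes and nested field accesses.
object GroupByExprIdSketch {
  final case class Attr(name: String, exprId: Long, qualifier: Seq[String])
  final case class GetField(root: Attr, field: String)

  // The same underlying column (exprId 1) referenced through two table aliases.
  val refs: Seq[GetField] = Seq(
    GetField(Attr("name", 1L, Seq("a")), "first"),
    GetField(Attr("name", 1L, Seq("b")), "last"))

  // Grouping by the whole attribute yields one group per qualifier.
  val byAttribute: Map[Attr, Seq[GetField]] = refs.groupBy(_.root)

  // Grouping by exprId yields a single group for the one underlying column.
  val byExprId: Map[Long, Seq[GetField]] = refs.groupBy(_.root.exprId)
}
```

With the attribute as key there are two groups for what is really one column; with `exprId` as key both field accesses land in the same group.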
[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-662241141
[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r454021579
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaPruningSuite.scala
##########
@@ -460,6 +460,67 @@ abstract class SchemaPruningSuite
checkAnswer(query4, Row(2, null) :: Row(2, 4) :: Nil)
}
+ testSchemaPruning("select nested field in window function") {
+ val windowSql =
+ """
+ |with contact_rank as (
+ | select row_number() over (partition by address order by id desc) as rank,
+ | contacts.*
+ | from contacts
+ |)
+ |select name.first, rank from contact_rank
+ |where name.first = 'Jane' AND rank = 1
+ |""".stripMargin
+ val query = sql(windowSql)
+ checkScan(query, "struct<id:int,name:struct<first:string>,address:string>")
+ checkAnswer(query, Row("Jane", 1) :: Nil)
+ }
+
+ testSchemaPruning("select nested field in window function and then order by") {
+ val windowSql =
+ """
+ |with contact_rank as (
+ | select row_number() over (partition by address order by id desc) as rank,
+ | contacts.*
+ | from contacts
+ | order by name.last, name.first
+ |)
+ |select name.first, rank from contact_rank
+ |""".stripMargin
+ val query = sql(windowSql)
+ checkScan(query, "struct<id:int,name:struct<first:string,last:string>,address:string>")
+ checkAnswer(query,
+ Row("Jane", 1) ::
+ Row("John", 1) ::
+ Row("Janet", 1) ::
+ Row("Jim", 1) :: Nil)
+ }
+
+ testSchemaPruning("select nested field in Sort") {
+ val query1 = sql("select name.first, name.last from contacts order by name.first, name.last")
+ checkScan(query1, "struct<name:struct<first:string,last:string>>")
+ checkAnswer(query1,
+ Row("Jane", "Doe") ::
+ Row("Janet", "Jones") ::
+ Row("Jim", "Jones") ::
+ Row("John", "Doe") :: Nil)
+
+ // Create a repartitioned view because `SORT BY` is a local sort
+ sql("select * from contacts").repartition(1).createOrReplaceTempView("tmp_contacts")
+ val sortBySql =
Review comment:
Yes. `SORT BY` sorts within each partition, so the overall result order is not fully predictable. Calling `repartition(1)` first ensures this test isn't flaky.
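For illustration only (not code from the PR): a sketch with plain Scala collections that mimics a per-partition sort. The partition layout is made up; the point is that the concatenated order depends on how rows are distributed unless everything sits in one partition.

```scala
// Models a local (per-partition) sort with plain collections; no Spark involved.
object SortWithinPartitionsSketch {
  // Sort each partition independently, then concatenate the partitions in order.
  def localSort(partitions: Seq[Seq[String]]): Seq[String] =
    partitions.flatMap(_.sorted)

  // A hypothetical layout of the four first names across two partitions.
  val twoPartitions: Seq[Seq[String]] = Seq(Seq("John", "Jane"), Seq("Jim", "Janet"))

  // The same rows collapsed into a single partition, as repartition(1) would do.
  val onePartition: Seq[Seq[String]] = Seq(twoPartitions.flatten)

  val sortedAcrossTwo: Seq[String] = localSort(twoPartitions)
  val sortedInOne: Seq[String] = localSort(onePartition)
}
```

`sortedAcrossTwo` is only sorted within each half, whereas `sortedInOne` is fully ordered, which is what makes the expected answer in the test deterministic.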
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-662282323
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r458544989
##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala
##########
@@ -493,6 +491,144 @@ class NestedColumnAliasingSuite extends SchemaPruningTest {
comparePlans(optimized3, expected3)
}
+ test("Nested field pruning for Window") {
+ val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+ val winExpr = windowExpr(RowNumber(), spec)
+ val query = contact
+ .select($"name.first", winExpr.as('window))
+ .orderBy($"name.last".asc)
+ .analyze
+ val optimized = Optimize.execute(query)
+ val aliases = collectGeneratedAliases(optimized)
+ val expected = contact
+ .select($"name.first", $"address", $"id", $"name.last".as(aliases(1)))
+ .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+ .select($"first", $"window", $"${aliases(1)}".as(aliases(0)))
+ .orderBy($"${aliases(0)}".asc)
+ .select($"first", $"window")
+ .analyze
+ comparePlans(optimized, expected)
+ }
+
+ test("Nested field pruning for Filter with other operators") {
+ val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+ val winExpr = windowExpr(RowNumber(), spec)
+ val query1 = contact.select($"name.first", winExpr.as('window))
+ .where($"window" === 1 && $"name.first" === "a")
+ .analyze
+ val optimized1 = Optimize.execute(query1)
+ val aliases1 = collectGeneratedAliases(optimized1)
+ val expected1 = contact
+ .select($"name.first", $"address", $"id", $"name.first".as(aliases1(1)))
+ .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+ .select($"first", $"${aliases1(1)}".as(aliases1(0)), $"window")
+ .where($"window" === 1 && $"${aliases1(0)}" === "a")
+ .select($"first", $"window")
+ .analyze
+ comparePlans(optimized1, expected1)
Review comment:
Shall we move this test case into `test("Nested field pruning for Window")`?
[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649936846
**[Test build #124527 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124527/testReport)** for PR 28898 at commit [`ab39d24`](https://github.com/apache/spark/commit/ab39d245660c16c0c11d0a37f73f84f74afd7951).
[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-657005551
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656520305
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125565/
Test FAILed.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-650014778
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124529/
Test FAILed.
[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-657229858
**[Test build #125714 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125714/testReport)** for PR 28898 at commit [`68cbfd2`](https://github.com/apache/spark/commit/68cbfd24f1f50266c5d1c5dfc24e29699f87c3e3).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-653006001
[GitHub] [spark] viirya commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r453280237
##########
File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaPruningSuite.scala
##########
@@ -460,6 +460,60 @@ abstract class SchemaPruningSuite
checkAnswer(query4, Row(2, null) :: Row(2, 4) :: Nil)
}
+ testSchemaPruning("select nested field in window function") {
+ val windowSql =
+ """
+ |with contact_rank as (
+ | select row_number() over (partition by address order by id desc) as rank,
+ | contacts.*
+ | from contacts
+ |)
+ |select name.first, rank from contact_rank
+ |where name.first = 'Jane' AND rank = 1
+ |""".stripMargin
+ val query1 = sql(windowSql)
+ checkScan(query1, "struct<id:int,name:struct<first:string>,address:string>")
+ checkAnswer(query1, Row("Jane", 1) :: Nil)
+ }
+
+ testSchemaPruning("select nested field in window function and then order by") {
+ val windowSql =
+ """
+ |with contact_rank as (
+ | select row_number() over (partition by address order by id desc) as rank,
+ | contacts.*
+ | from contacts
+ | order by name.last, name.first
+ |)
+ |select name.first, rank from contact_rank
+ |""".stripMargin
+ val query1 = sql(windowSql)
+ checkScan(query1, "struct<id:int,name:struct<first:string,last:string>,address:string>")
+ checkAnswer(query1,
Review comment:
query1 -> query
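
The scan schemas asserted by `checkScan` above can be reproduced with a small standalone sketch. This is not Spark's implementation: `DataType`, `Struct`, `render` and `prune` are hypothetical names, and the `contacts` schema is a simplified assumption (the real test table has more columns). It only shows how a set of requested dotted paths narrows a nested schema.

```scala
object SchemaPruneSketch {
  // Toy schema model; these are NOT Spark's DataType classes.
  sealed trait DataType
  case class Atomic(name: String) extends DataType
  case class Struct(fields: Seq[(String, DataType)]) extends DataType

  // Render in the `struct<a:int,b:struct<...>>` style used by checkScan.
  def render(dt: DataType): String = dt match {
    case Atomic(n) => n
    case Struct(fs) =>
      fs.map { case (n, t) => s"$n:${render(t)}" }.mkString("struct<", ",", ">")
  }

  // Keep only the fields reachable through the requested paths.
  // A path that stops at a struct keeps that struct whole.
  def prune(schema: Struct, paths: Seq[Seq[String]]): Struct = {
    val kept = schema.fields.flatMap { case (name, dt) =>
      // Remainders of every requested path that starts with this field.
      val rests = paths.collect { case head +: rest if head == name => rest }
      if (rests.isEmpty) None
      else dt match {
        case s: Struct if rests.forall(_.nonEmpty) => Some(name -> prune(s, rests))
        case other => Some(name -> other)
      }
    }
    Struct(kept)
  }

  // Assumed, simplified shape of the `contacts` table.
  val contacts: Struct = Struct(Seq(
    "id" -> Atomic("int"),
    "name" -> Struct(Seq(
      "first" -> Atomic("string"),
      "middle" -> Atomic("string"),
      "last" -> Atomic("string"))),
    "address" -> Atomic("string"),
    "pets" -> Atomic("int")))
}
```

With the paths `id`, `address` and `name.first` (the columns the window query touches), `render(prune(contacts, ...))` yields the first schema string asserted above; adding `name.last` yields the second.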
[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-663364305
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-662266294
[GitHub] [spark] maropu commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r445902694
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -32,7 +32,9 @@ object NestedColumnAliasing {
def unapply(plan: LogicalPlan): Option[LogicalPlan] = plan match {
case Project(projectList, child)
- if SQLConf.get.nestedSchemaPruningEnabled && canProjectPushThrough(child) =>
+ if SQLConf.get.nestedSchemaPruningEnabled &&
+ (canProjectPushThrough(child) ||
+ getChild(child).exists(canProjectPushThrough)) =>
Review comment:
> That won’t work because it seems to cause an infinite loop in the optimizer. It gives me error messages about running out of max iterations.
>> I see, it is due to the predicate pushdown rule.
I haven't looked into it, but can't we fix the infinite loop caused by the predicate pushdown rule? If we can put `Filter` in `canProjectPushThrough`, that looks like the best option.
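
As a rough illustration of the failure mode discussed above, here is a minimal sketch of a fixed-point rule loop that never settles. It is not Catalyst code: the toy `Plan` type, both rules and `runToFixedPoint` are hypothetical, and the assumption that the loop comes from a fresh alias being generated on every pass is a guess, not something the thread confirms.

```scala
import scala.annotation.tailrec

object FixedPointSketch {
  // Toy logical plan; these are NOT Catalyst classes.
  sealed trait Plan
  case object Relation extends Plan
  case class Project(aliasId: Int, child: Plan) extends Plan
  case class Filter(child: Plan) extends Plan

  type Rule = Plan => Plan

  // Hypothetical aliasing rule: moves the Project below the Filter and,
  // by assumption, mints a fresh alias id each time it fires.
  val pushProjectBelowFilter: Rule = {
    case Project(id, Filter(c)) => Filter(Project(id + 1, c))
    case other => other
  }

  // Hypothetical predicate pushdown: moves the Filter back below the Project.
  val pushFilterBelowProject: Rule = {
    case Filter(Project(id, c)) => Project(id, Filter(c))
    case other => other
  }

  // Apply all rules once per iteration until the plan stops changing.
  // Right(plan) = converged; Left(plan) = iteration budget exhausted.
  @tailrec
  def runToFixedPoint(rules: Seq[Rule], plan: Plan, remaining: Int): Either[Plan, Plan] =
    if (remaining == 0) Left(plan)
    else {
      val next = rules.foldLeft(plan)((p, rule) => rule(p))
      if (next == plan) Right(plan)
      else runToFixedPoint(rules, next, remaining - 1)
    }
}
```

Running both rules on `Project(0, Filter(Relation))` returns a `Left` because every pass restores the same shape with a new alias id, so the plans never compare equal. Running either rule on its own converges.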
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-650014768
Merged build finished. Test FAILed.
[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448779163
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -172,13 +188,23 @@ object NestedColumnAliasing {
(f, Alias(f, s"_gen_alias_${exprId.id}")(exprId, Seq.empty, None))
}
+
+ // Do deduplication based on semanticEquals, and then sum.
+ val nestedFieldNum = nestedFieldToAlias
+ .foldLeft(Seq[ExtractValue]()) {
+ (unique, curr) => if (!unique.exists(curr._1.semanticEquals(_))) {
+ curr._1 +: unique
+ } else {
+ unique
+ }
+ }
+ .map { t => totalFieldNum(t.dataType) }
+ .sum
Review comment:
No, I think you misread the code here. This map still returns `ExprId->Seq(name.last, name.first, name.first)` because I didn't change the lookup map. All I changed is how the field number is counted, so that it won't trigger the `else` branch, which is the case where all leaf fields are already covered and no schema pruning is required.
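
The counting change can be sketched outside Spark as follows. This is a simplified stand-in under stated assumptions: `FieldAccess`, its `semanticEquals`, and `shouldPrune` are hypothetical, `leafCount` plays the role of `totalFieldNum(dataType)`, and `name` is assumed to have three leaf fields (`first`, `middle`, `last`).

```scala
object DedupCountSketch {
  // Toy stand-in for a nested field access such as `name.first`.
  case class FieldAccess(path: String, leafCount: Int) {
    // Hypothetical "semantic" equality: ignore case and surrounding whitespace.
    def semanticEquals(other: FieldAccess): Boolean =
      path.trim.equalsIgnoreCase(other.path.trim)
  }

  // Keep one representative of each semantically distinct access,
  // mirroring the foldLeft in the diff above.
  def dedup(accesses: Seq[FieldAccess]): Seq[FieldAccess] =
    accesses.foldLeft(Seq.empty[FieldAccess]) { (unique, curr) =>
      if (unique.exists(curr.semanticEquals)) unique else curr +: unique
    }

  // Leaf fields actually referenced, counting duplicates once.
  def usedLeafCount(accesses: Seq[FieldAccess]): Int =
    dedup(accesses).map(_.leafCount).sum

  // Pruning only pays off when some leaf of the struct is never referenced.
  def shouldPrune(accesses: Seq[FieldAccess], totalLeaves: Int): Boolean =
    usedLeafCount(accesses) < totalLeaves
}
```

For `Seq(name.last, name.first, name.first)` a plain sum gives 3, which equals the assumed leaf count of `name` and would wrongly suggest every leaf is used. The deduplicated count is 2, so pruning still applies.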
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-662282432
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/126296/
Test FAILed.
[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-653001957
**[Test build #124878 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124878/testReport)** for PR 28898 at commit [`2bd84a4`](https://github.com/apache/spark/commit/2bd84a4c7bdd4096712327784351695d53bf704c).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] frankyin-factual commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans
Posted by GitBox <gi...@apache.org>.
frankyin-factual commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-662254203
> "Allow nested schema pruning thru window/sort/filter plans" looks like a bit of an over-claim to me. Technically, this PR doesn't support all general `Filter` plans, does it? IIUC, this PR only handles `Filter(_, child)` where `canProjectPushThrough(child)` is true. In other words, if `child` is not one of those node types, this PR cannot push through the `Filter` plan.
>
> Although the PR description is correct in mentioning `Project->Filter->[any node can be pruned]`, it would be better to avoid the misleading PR title. You can focus on `Window/Sort` in the PR title, and the PR description can still cover your contribution on `Project->Filter->[any node can be pruned]`.
I just changed the PR title so it no longer mentions `Filter`.
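
The distinction raised in the quoted comment can be made concrete with a small sketch. It uses a toy plan type rather than Catalyst's `LogicalPlan`; the node set accepted by this `canProjectPushThrough` and the `matches` helper are assumptions chosen for illustration, not the actual whitelist in the PR.

```scala
object PatternSketch {
  // Toy logical plan; these are NOT Catalyst classes.
  sealed trait Plan
  case class Relation(name: String) extends Plan
  case class Project(child: Plan) extends Plan
  case class Filter(child: Plan) extends Plan
  case class Window(child: Plan) extends Plan
  case class Sort(child: Plan) extends Plan
  case class Aggregate(child: Plan) extends Plan

  // Assumed whitelist of nodes a projection may be pushed through.
  def canProjectPushThrough(plan: Plan): Boolean = plan match {
    case _: Window | _: Sort => true
    case _ => false
  }

  // Shapes the rule would fire on: Project over a whitelisted node, or
  // Project over a Filter whose own child is whitelisted.
  def matches(plan: Plan): Boolean = plan match {
    case Project(child) if canProjectPushThrough(child) => true
    case Project(Filter(grandChild)) if canProjectPushThrough(grandChild) => true
    case _ => false
  }
}
```

Under these assumptions `Project(Filter(Sort(...)))` matches, but `Project(Filter(Relation(...)))` and `Project(Filter(Aggregate(...)))` do not, which is why a title claiming general `Filter` support would overstate the change.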
[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-655368737
[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-653012314
[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649502299
[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-655680784
**[Test build #125392 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125392/testReport)** for PR 28898 at commit [`a7e885a`](https://github.com/apache/spark/commit/a7e885a3c5f09f9ca623777bdabcd05e664f3774).
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r458550049
##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala
##########
@@ -493,6 +491,144 @@ class NestedColumnAliasingSuite extends SchemaPruningTest {
comparePlans(optimized3, expected3)
}
+ test("Nested field pruning for Window") {
+ val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+ val winExpr = windowExpr(RowNumber(), spec)
+ val query = contact
+ .select($"name.first", winExpr.as('window))
+ .orderBy($"name.last".asc)
+ .analyze
+ val optimized = Optimize.execute(query)
+ val aliases = collectGeneratedAliases(optimized)
+ val expected = contact
+ .select($"name.first", $"address", $"id", $"name.last".as(aliases(1)))
+ .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+ .select($"first", $"window", $"${aliases(1)}".as(aliases(0)))
+ .orderBy($"${aliases(0)}".asc)
+ .select($"first", $"window")
+ .analyze
+ comparePlans(optimized, expected)
+ }
+
+ test("Nested field pruning for Filter with other operators") {
+ val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+ val winExpr = windowExpr(RowNumber(), spec)
+ val query1 = contact.select($"name.first", winExpr.as('window))
+ .where($"window" === 1 && $"name.first" === "a")
+ .analyze
+ val optimized1 = Optimize.execute(query1)
+ val aliases1 = collectGeneratedAliases(optimized1)
+ val expected1 = contact
+ .select($"name.first", $"address", $"id", $"name.first".as(aliases1(1)))
+ .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+ .select($"first", $"${aliases1(1)}".as(aliases1(0)), $"window")
+ .where($"window" === 1 && $"${aliases1(0)}" === "a")
+ .select($"first", $"window")
+ .analyze
+ comparePlans(optimized1, expected1)
+
+ val query2 = contact.sortBy($"name.first".asc)
+ .where($"name.first" === "a")
+ .select($"name.first")
+ .analyze
+ val optimized2 = Optimize.execute(query2)
+ val aliases2 = collectGeneratedAliases(optimized2)
+ val expected2 = contact
+ .select($"name.first".as(aliases2(1)))
+ .sortBy($"${aliases2(1)}".asc)
+ .select($"${aliases2(1)}".as(aliases2(0)))
+ .where($"${aliases2(0)}" === "a")
+ .select($"${aliases2(0)}".as("first"))
+ .analyze
+ comparePlans(optimized2, expected2)
+
+ val query3 = contact.distribute($"name.first")(100)
+ .where($"name.first" === "a")
+ .select($"name.first")
+ .analyze
+ val optimized3 = Optimize.execute(query3)
+ val aliases3 = collectGeneratedAliases(optimized3)
+ val expected3 = contact
+ .select($"name.first".as(aliases3(1)))
+ .distribute($"${aliases3(1)}")(100)
+ .select($"${aliases3(1)}".as(aliases3(0)))
+ .where($"${aliases3(0)}" === "a")
+ .select($"${aliases3(0)}".as("first"))
+ .analyze
+ comparePlans(optimized3, expected3)
+
+ val department = LocalRelation(
+ 'depID.int,
+ 'personID.string)
+ val query4 = contact.join(department, condition = Some($"id" === $"depID"))
+ .where($"name.first" === "a")
+ .select($"name.first")
+ .analyze
+ val optimized4 = Optimize.execute(query4)
+ val aliases4 = collectGeneratedAliases(optimized4)
+ val expected4 = contact
+ .select($"id", $"name.first".as(aliases4(1)))
+ .join(department.select('depID), condition = Some($"id" === $"depID"))
+ .select($"${aliases4(1)}".as(aliases4(0)))
+ .where($"${aliases4(0)}" === "a")
+ .select($"${aliases4(0)}".as("first"))
+ .analyze
+ comparePlans(optimized4, expected4)
+
+ def runTest(basePlan: LogicalPlan => LogicalPlan): Unit = {
+ val query = basePlan(contact)
+ .where($"name.first" === "a")
+ .select($"name.first")
+ .analyze
+ val optimized = Optimize.execute(query)
+ val aliases = collectGeneratedAliases(optimized)
+ val expected = basePlan(contact
+ .select($"name.first".as(aliases(0))))
+ .where($"${aliases(0)}" === "a")
+ .select($"${aliases(0)}".as("first"))
+ .analyze
+ comparePlans(optimized, expected)
+ }
+ Seq(
+ (plan: LogicalPlan) => plan.limit(100),
+ (plan: LogicalPlan) => plan.repartition(100),
+ (plan: LogicalPlan) => Sample(0.0, 0.6, false, 11L, plan)).foreach { base =>
+ runTest(base)
+ }
Review comment:
I know that, but this PR doesn't support `Filter` completely. I believe we had better collect these simple test case additions there.
[GitHub] [spark] frankyin-factual commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans
Posted by GitBox <gi...@apache.org>.
frankyin-factual commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-662266123
@dongjoon-hyun Thanks for reviewing this late.
For the test cases, I think it might be better to group all the `Project->Filter->[any node can be pruned]` cases together because it is the newly introduced path.
[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans
Posted by GitBox <gi...@apache.org>.
frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r458550558
##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala
##########
@@ -493,6 +491,144 @@ class NestedColumnAliasingSuite extends SchemaPruningTest {
comparePlans(optimized3, expected3)
}
+ test("Nested field pruning for Window") {
+ val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+ val winExpr = windowExpr(RowNumber(), spec)
+ val query = contact
+ .select($"name.first", winExpr.as('window))
+ .orderBy($"name.last".asc)
+ .analyze
+ val optimized = Optimize.execute(query)
+ val aliases = collectGeneratedAliases(optimized)
+ val expected = contact
+ .select($"name.first", $"address", $"id", $"name.last".as(aliases(1)))
+ .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+ .select($"first", $"window", $"${aliases(1)}".as(aliases(0)))
+ .orderBy($"${aliases(0)}".asc)
+ .select($"first", $"window")
+ .analyze
+ comparePlans(optimized, expected)
+ }
+
+ test("Nested field pruning for Filter with other operators") {
Review comment:
If I listed all the operators, the test name would become a lengthy line, so I used `supported operators` instead.
[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r451393487
##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala
##########
@@ -493,6 +491,156 @@ class NestedColumnAliasingSuite extends SchemaPruningTest {
comparePlans(optimized3, expected3)
}
+ test("Nested field pruning for Window") {
+ val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+ val winExpr = windowExpr(RowNumber(), spec)
+
+ val query1 = contact
+ .select($"name.first", winExpr.as('window))
+ .analyze
+ val optimized1 = Optimize.execute(query1)
+ val expected1 = contact
+ .select($"name.first", $"address", $"id")
+ .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+ .select($"first", $"window")
+ .analyze
+ comparePlans(optimized1, expected1)
Review comment:
Exactly, this case isn’t improved by this PR because it already worked before. I'm not sure whether we want to remove this test.
[GitHub] [spark] maropu commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans
Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-664715448
Thanks for the check, @dongjoon-hyun and @viirya! I checked it again and I have no more comments. Merged to master.
[GitHub] [spark] frankyin-factual commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
frankyin-factual commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-658954329
@dongjoon-hyun friendly bump
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-650977869
----------------------------------------------------------------
[GitHub] [spark] SparkQA removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-657199034
**[Test build #125714 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125714/testReport)** for PR 28898 at commit [`68cbfd2`](https://github.com/apache/spark/commit/68cbfd24f1f50266c5d1c5dfc24e29699f87c3e3).
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-650190831
**[Test build #124540 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124540/testReport)** for PR 28898 at commit [`652c77f`](https://github.com/apache/spark/commit/652c77fdbbfa468271e783e1492f72f4785c9880).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
----------------------------------------------------------------
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans
Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r458548850
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -39,6 +39,20 @@ object NestedColumnAliasing {
NestedColumnAliasing.replaceToAliases(plan, nestedFieldToAlias, attrToAliases)
}
+ /**
+ * This pattern is needed to support [[Filter]] plan cases like
+ * [[Project]]->[[Filter]]->listed plan in `canProjectPushThrough` (e.g., [[Window]]).
+ * The reason why we don't simply add [[Filter]] in `canProjectPushThrough` is that
+ * the optimizer can hit an infinite loop during the [[PushDownPredicates]] rule.
+ */
+ case Project(projectList, Filter(condition, child))
Review comment:
BTW, this looks logically a little odd to me, because the second pattern is narrower than the first. In Scala, we usually put specific patterns first. That is, `case Project(projectList, Filter(condition, child))` is more specific than the previous pattern `case Project(projectList, child)`. Can we switch this case (line 48) and the previous case (line 34)? Or does that break something? If switching the two patterns breaks something, it might be worth mentioning.
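For illustration only (not code from the PR): a minimal sketch of why the order of the two cases may not change behaviour here. It assumes, as the doc comment in the hunk says, that `Filter` is not accepted by `canProjectPushThrough`, so the guard on the general pattern fails for a `Filter` child and matching falls through to the specific one. The node classes and `canPushThrough` below are hypothetical stand-ins, not Catalyst types.

```scala
// Toy plan nodes; hypothetical stand-ins, not Catalyst classes.
sealed trait Plan
case object Relation extends Plan
case class Window(child: Plan) extends Plan
case class Filter(child: Plan) extends Plan
case class Project(child: Plan) extends Plan

object PatternOrder {
  // Stand-in for `canProjectPushThrough`: Filter is deliberately not listed.
  def canPushThrough(p: Plan): Boolean = p match {
    case _: Window => true
    case _ => false
  }

  // General pattern first, as in the hunk under review.
  def generalFirst(p: Plan): String = p match {
    case Project(child) if canPushThrough(child) => "direct"
    case Project(Filter(child)) if canPushThrough(child) => "through-filter"
    case _ => "no-match"
  }

  // Specific pattern first, as the review comment suggests.
  def specificFirst(p: Plan): String = p match {
    case Project(Filter(child)) if canPushThrough(child) => "through-filter"
    case Project(child) if canPushThrough(child) => "direct"
    case _ => "no-match"
  }
}
```

Under that assumption both orderings agree on every input in this toy model. They would diverge only if `Filter` were itself accepted by the guard, in which case the general pattern would shadow the specific one.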
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-662282323
----------------------------------------------------------------
[GitHub] [spark] SparkQA removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-655680784
**[Test build #125392 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125392/testReport)** for PR 28898 at commit [`a7e885a`](https://github.com/apache/spark/commit/a7e885a3c5f09f9ca623777bdabcd05e664f3774).
----------------------------------------------------------------
[GitHub] [spark] viirya commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-650852190
retest this please
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-655818014
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-657005354
**[Test build #125654 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125654/testReport)** for PR 28898 at commit [`da85920`](https://github.com/apache/spark/commit/da859203e91a0bc90b017a1557bcf3646733982a).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
----------------------------------------------------------------
[GitHub] [spark] maropu commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r446499675
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -39,6 +39,22 @@ object NestedColumnAliasing {
NestedColumnAliasing.replaceToAliases(plan, nestedFieldToAlias, attrToAliases)
}
+ /**
+ * This handles a `LogicalPlan` shaped like `Project`->`Filter`->`Window`.
+ * In this case, `Window` can be any plan accepted by `canProjectPushThrough`.
+ * Adding this case allows nested columns to be passed on to later stages.
+ * Currently, `Filter` is not added to `canProjectPushThrough` because doing so
+ * causes an infinite loop in the optimizer during the predicate push-down rule.
+ */
+
Review comment:
nit: remove this unnecessary blank.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649153978
----------------------------------------------------------------
[GitHub] [spark] maropu commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448035806
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -178,11 +195,16 @@ object NestedColumnAliasing {
nestedFieldToAlias
.map { case (nestedField, _) => totalFieldNum(nestedField.dataType) }
.sum < totalFieldNum(attr.dataType)) {
- Some(attr.exprId -> nestedFieldToAlias)
+ Some((attr.exprId, nestedFieldToAlias))
} else {
None
}
}
+ .groupBy(_._1) // To fix same ExprId mapped to different attribute instance
Review comment:
Do you mean this fix is only for the Window case?
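For illustration only (not code from the PR): a minimal sketch of what grouping by key buys when the same id appears more than once in a sequence of pairs. Here an `Int` stands in for an `ExprId` and a `String` for a nested-field alias. The hunk does not show what follows `.groupBy(_._1)`, so flattening each group's values is an assumption.

```scala
object MergeByKey {
  // Collects all values that share a key instead of letting a later
  // entry overwrite an earlier one. Int and String are hypothetical
  // stand-ins for ExprId and alias entries.
  def merge(entries: Seq[(Int, Seq[String])]): Map[Int, Seq[String]] =
    entries
      .groupBy(_._1)
      .map { case (id, group) => id -> group.flatMap(_._2) }
}
```

By contrast, calling `toMap` directly on such a sequence keeps only the last entry for a repeated key, which is the kind of silent loss a grouping step avoids.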
----------------------------------------------------------------
[GitHub] [spark] maropu commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r445510771
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -32,7 +32,9 @@ object NestedColumnAliasing {
def unapply(plan: LogicalPlan): Option[LogicalPlan] = plan match {
case Project(projectList, child)
- if SQLConf.get.nestedSchemaPruningEnabled && canProjectPushThrough(child) =>
+ if SQLConf.get.nestedSchemaPruningEnabled &&
+ (canProjectPushThrough(child) ||
+ getChild(child).exists(canProjectPushThrough)) =>
Review comment:
Is this correct? I'm not 100% sure that this matching case can handle this condition: `Project->[*Any* logical unary node]->[Logical node that can be pushed through]`. Anyway, we need more tests for this change.
----------------------------------------------------------------
[GitHub] [spark] SparkQA removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652669311
**[Test build #124822 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124822/testReport)** for PR 28898 at commit [`2bc6a1a`](https://github.com/apache/spark/commit/2bc6a1ae1f2fb7c657d95d5abd92615fdc95eaef).
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-650708345
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-655200336
----------------------------------------------------------------
[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r447948500
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -178,11 +195,16 @@ object NestedColumnAliasing {
nestedFieldToAlias
.map { case (nestedField, _) => totalFieldNum(nestedField.dataType) }
.sum < totalFieldNum(attr.dataType)) {
- Some(attr.exprId -> nestedFieldToAlias)
+ Some((attr.exprId, nestedFieldToAlias))
Review comment:
https://github.com/apache/spark/pull/28898/files#diff-957112380b0a2ef014abc8227d0b70acR479-R496
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649912241
**[Test build #124525 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124525/testReport)** for PR 28898 at commit [`4c705bd`](https://github.com/apache/spark/commit/4c705bd5e7cbeae2603afe799a338e068c35923c).
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-655330636
----------------------------------------------------------------
[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
Posted by GitBox <gi...@apache.org>.
frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r445255141
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -113,6 +113,11 @@ object NestedColumnAliasing {
case _: Sample => true
case _: RepartitionByExpression => true
case _: Join => true
+ case x: Filter => x.child match {
+ case _: Window => true
Review comment:
Looks like the plan is `Project -> Filter -> Window`. If we only add `case _: Window => true`, the projection aliasing won't be available at the `Window` stage and can't be passed on to the later stages described in the ticket.
----------------------------------------------------------------
[GitHub] [spark] viirya commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
Posted by GitBox <gi...@apache.org>.
viirya commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r445294440
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -113,6 +113,11 @@ object NestedColumnAliasing {
case _: Sample => true
case _: RepartitionByExpression => true
case _: Join => true
+ case x: Filter => x.child match {
+ case _: Window => true
Review comment:
Then just add `case _: Filter => true`, if you want to let the project be pushed through `Filter`?
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-655633024
Merged build finished. Test FAILed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652669795
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656434653
**[Test build #125536 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125536/testReport)** for PR 28898 at commit [`a7e885a`](https://github.com/apache/spark/commit/a7e885a3c5f09f9ca623777bdabcd05e664f3774).
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656228054
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/125466/
----------------------------------------------------------------
[GitHub] [spark] SparkQA removed a comment on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649936846
**[Test build #124527 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124527/testReport)** for PR 28898 at commit [`ab39d24`](https://github.com/apache/spark/commit/ab39d245660c16c0c11d0a37f73f84f74afd7951).
----------------------------------------------------------------
[GitHub] [spark] frankyin-factual commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
frankyin-factual commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-660563330
@maropu @viirya @dongjoon-hyun friendly bump
----------------------------------------------------------------
[GitHub] [spark] maropu commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions
Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-650077585
retest this please
----------------------------------------------------------------
[GitHub] [spark] frankyin-factual commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
Posted by GitBox <gi...@apache.org>.
frankyin-factual commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-649310142
Just pushed a generalized solution.
----------------------------------------------------------------
[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r448704588
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -176,9 +192,16 @@ object NestedColumnAliasing {
// By default, ColumnPruning rule uses `attr` already.
if (nestedFieldToAlias.nonEmpty &&
nestedFieldToAlias
- .map { case (nestedField, _) => totalFieldNum(nestedField.dataType) }
- .sum < totalFieldNum(attr.dataType)) {
- Some(attr.exprId -> nestedFieldToAlias)
+ .foldLeft(Seq[ExtractValue]()) {
+ (unique, curr) => if (!unique.exists(curr._1.semanticEquals(_))) {
+ curr._1 +: unique
+ } else {
+ unique
+ }
+ }
+ .map { t => totalFieldNum(t.dataType) }
+ .sum < totalFieldNum(attr._2)) {
Review comment:
Sure.
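For readers following the hunk above: the change appears to de-duplicate semantically equal nested-field references before summing their field counts, presumably so that a field referenced more than once is not counted twice when deciding whether pruning is worthwhile. Below is a minimal, self-contained sketch of that idea. `FieldRef`, `PruningCheck`, `dedupe`, and `worthPruning` are hypothetical stand-ins invented for illustration; they are not Spark's `ExtractValue`, `totalFieldNum`, or any real Catalyst API, and the case-insensitive comparison merely imitates what `semanticEquals` does for real expressions.

```scala
// Hypothetical stand-in for a nested-field expression; not a Catalyst class.
final case class FieldRef(path: String, fieldCount: Int) {
  // Assumption: imitate semanticEquals with a case-insensitive path comparison.
  def semanticEquals(other: FieldRef): Boolean =
    path.equalsIgnoreCase(other.path)
}

object PruningCheck {
  // Keep only the first occurrence of each semantically equal reference,
  // using the same foldLeft shape as the diff.
  def dedupe(refs: Seq[FieldRef]): Seq[FieldRef] =
    refs.foldLeft(Seq.empty[FieldRef]) { (unique, curr) =>
      if (unique.exists(curr.semanticEquals)) unique else curr +: unique
    }

  // Pruning is considered worthwhile only when the distinct accessed fields
  // cover fewer leaf fields than the whole column has.
  def worthPruning(refs: Seq[FieldRef], totalFields: Int): Boolean =
    refs.nonEmpty && dedupe(refs).map(_.fieldCount).sum < totalFields
}
```

With a hypothetical three-field struct and the references `name.first`, `NAME.FIRST`, and `name.last`, a plain sum gives 3, which is not less than 3, so pruning would be skipped. After de-duplication the sum is 2 and the check passes.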
[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-652170847
[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
Posted by GitBox <gi...@apache.org>.
frankyin-factual commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r445331787
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##########
@@ -113,6 +113,11 @@ object NestedColumnAliasing {
case _: Sample => true
case _: RepartitionByExpression => true
case _: Join => true
+ case x: Filter => x.child match {
+ case _: Window => true
Review comment:
Do you have a sample query that produces `Project->Filter->Sample`? I've been trying to come up with a query that generates this plan.
[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window/sort functions
Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-650014689
**[Test build #124529 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124529/testReport)** for PR 28898 at commit [`652c77f`](https://github.com/apache/spark/commit/652c77fdbbfa468271e783e1492f72f4785c9880).
* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-664068267
[GitHub] [spark] maropu commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
maropu commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656433112
retest this please
[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-656434949
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28898:
URL: https://github.com/apache/spark/pull/28898#issuecomment-664132483
[GitHub] [spark] maropu commented on a change in pull request #28898: [SPARK-32059][SQL] Allow nested schema pruning thru window/sort/filter plans
Posted by GitBox <gi...@apache.org>.
maropu commented on a change in pull request #28898:
URL: https://github.com/apache/spark/pull/28898#discussion_r446499910
##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasingSuite.scala
##########
@@ -493,6 +491,58 @@ class NestedColumnAliasingSuite extends SchemaPruningTest {
comparePlans(optimized3, expected3)
}
+ test("Nested field pruning for window functions") {
+ val spec = windowSpec($"address" :: Nil, $"id".asc :: Nil, UnspecifiedFrame)
+ val winExpr = windowExpr(RowNumber().toAggregateExpression(), spec)
+ val query = contact.select($"name.first", winExpr.as('window))
+ .where($"window" === 1 && $"name.first" === "a")
+ .analyze
+ val optimized = Optimize.execute(query)
+ val aliases = collectGeneratedAliases(optimized)
+ val expected = contact
+ .select($"name.first", $"address", $"id", $"name.first".as(aliases(1)))
+ .window(Seq(winExpr.as("window")), Seq($"address"), Seq($"id".asc))
+ .select($"first", $"${aliases(1)}".as(aliases(0)), $"window")
+ .where($"window" === 1 && $"${aliases(0)}" === "a")
Review comment:
Just a suggestion: could you remove this `where` from this test, then add a separate unit test such as `test("Nested field pruning for Filter")` for exhaustive filter tests?