You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "wForget (via GitHub)" <gi...@apache.org> on 2023/11/17 07:57:38 UTC

[PR] [SPARK-45943] Determine table stats for `ReplaceData.groupFilterCondition` [spark]

wForget opened a new pull request, #43867:
URL: https://github.com/apache/spark/pull/43867

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   Handle `ReplaceData#groupFilterCondition` generated by `RewriteMergeIntoTable` in `DetermineTableStats`.
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   `MergeIntoTable#sourceTable` is used for `ReplaceData#groupFilterCondition` in `RewriteMergeIntoTable`, SourceTable in  `ReplaceData#groupFilterCondition` is resolved and will not be applied to `DetermineTableStats` through `ResolveSubquery#resolveSubQueries`. So, when there is a hive table without stats in `MergeIntoTable#sourceTable`, IllegalStateException will occur.
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   No
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   If benchmark tests were added, please run the benchmarks in GitHub Actions for the consistent environment, and the instructions could accord to: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
   -->
   
   added test case
   
   ### Was this patch authored or co-authored using generative AI tooling?
   <!--
   If generative AI tooling has been used in the process of authoring this patch, please include the
   phrase: 'Generated-by: ' followed by the name of the tool and its version.
   If no, write 'No'.
   Please refer to the [ASF Generative Tooling Guidance](https://www.apache.org/legal/generative-tooling.html) for details.
   -->
   
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Move DetermineTableStats to resolution rules [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #43867:
URL: https://github.com/apache/spark/pull/43867#issuecomment-1832848180

   thanks, merging to master/3.5!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Determine table stats for `ReplaceData.groupFilterCondition` [spark]

Posted by "wForget (via GitHub)" <gi...@apache.org>.
wForget commented on code in PR #43867:
URL: https://github.com/apache/spark/pull/43867#discussion_r1398855143


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -2529,19 +2529,19 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
      */
     private def resolveSubQueries(plan: LogicalPlan, outer: LogicalPlan): LogicalPlan = {
       plan.transformAllExpressionsWithPruning(_.containsPattern(PLAN_EXPRESSION), ruleId) {
-        case s @ ScalarSubquery(sub, _, exprId, _, _, _) if !sub.resolved =>
+        case s @ ScalarSubquery(sub, _, exprId, _, _, _) if !sub.analyzed =>
           resolveSubQuery(s, outer)(ScalarSubquery(_, _, exprId))

Review Comment:
   > I added `checkAnalysis` in `resolveSubQuery`, is it ok?
   
   It doesn't seem to work, the test cases failed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Determine table stats for `ReplaceData.groupFilterCondition` [spark]

Posted by "parthchandra (via GitHub)" <gi...@apache.org>.
parthchandra commented on PR #43867:
URL: https://github.com/apache/spark/pull/43867#issuecomment-1820050010

   Does this fix now mean that the `PruneHiveTablePartitions` rule will no longer be applied to the `ReplaceData.groupFilterCondition`? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Add a tag to mark resolved subquery plan for ResolveSubquery [spark]

Posted by "wForget (via GitHub)" <gi...@apache.org>.
wForget commented on PR #43867:
URL: https://github.com/apache/spark/pull/43867#issuecomment-1825329646

   > Sorry for the back and forth, but I have a new finding today: `DetermineTableStats` will always set a table stats to the hive table. That said, I think we can mark the `HiveTableRelation` as unresolved if its stats is None?
   
   This may also affect other Rules. Since HiveTableRelation is not resolved, the Project.projectList of the parent plan will not be resolved. I think it may have affected the `ResolveReferences` rule.
   
   ![image](https://github.com/apache/spark/assets/17894939/81f8ba9d-7c13-47e1-846f-d01735908aef)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Add a tag to mark resolved subquery plan for ResolveSubquery [spark]

Posted by "wForget (via GitHub)" <gi...@apache.org>.
wForget commented on PR #43867:
URL: https://github.com/apache/spark/pull/43867#issuecomment-1826721672

   > Good point. Why is `DetermineTableStats` a post-hoc resolution rule then...
   
   It was introduced in #17015, I don't see more discussion about it. I willl try moving it into ResolutionRules.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Determine table stats for `ReplaceData.groupFilterCondition` [spark]

Posted by "wForget (via GitHub)" <gi...@apache.org>.
wForget commented on code in PR #43867:
URL: https://github.com/apache/spark/pull/43867#discussion_r1400539713


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -2529,19 +2530,19 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
      */
     private def resolveSubQueries(plan: LogicalPlan, outer: LogicalPlan): LogicalPlan = {
       plan.transformAllExpressionsWithPruning(_.containsPattern(PLAN_EXPRESSION), ruleId) {
-        case s @ ScalarSubquery(sub, _, exprId, _, _, _) if !sub.resolved =>
+        case s @ ScalarSubquery(sub, _, exprId, _, _, _) if !sub.analyzed =>

Review Comment:
   Thanks for your help, I will keep trying.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Determine table stats for `ReplaceData.groupFilterCondition` [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #43867:
URL: https://github.com/apache/spark/pull/43867#issuecomment-1817311611

   Thank you for checking, @parthchandra .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Add a tag to mark resolved subquery plan for ResolveSubquery [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #43867:
URL: https://github.com/apache/spark/pull/43867#issuecomment-1825224019

   Sorry for the back and forth, but I have a new finding today: `DetermineTableStats` will always set a table stats to the hive table. That said, I think we can mark the `HiveTableRelation` as unresolved if its stats is None?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Add a tag to mark resolved subquery plan for ResolveSubquery [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #43867:
URL: https://github.com/apache/spark/pull/43867#issuecomment-1826809747

   I think we should only put a rule in posthoc resolution if it needs the entire plan to be resolved first. `DetermineTableStats` is not the case.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Determine table stats for `ReplaceData.groupFilterCondition` [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #43867:
URL: https://github.com/apache/spark/pull/43867#discussion_r1398788632


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -2529,19 +2529,19 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
      */
     private def resolveSubQueries(plan: LogicalPlan, outer: LogicalPlan): LogicalPlan = {
       plan.transformAllExpressionsWithPruning(_.containsPattern(PLAN_EXPRESSION), ruleId) {
-        case s @ ScalarSubquery(sub, _, exprId, _, _, _) if !sub.resolved =>
+        case s @ ScalarSubquery(sub, _, exprId, _, _, _) if !sub.analyzed =>
           resolveSubQuery(s, outer)(ScalarSubquery(_, _, exprId))

Review Comment:
   can we make sure we set the `analyzed` flag in `resolveSubQuery`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Determine table stats for `ReplaceData.groupFilterCondition` [spark]

Posted by "wForget (via GitHub)" <gi...@apache.org>.
wForget commented on PR #43867:
URL: https://github.com/apache/spark/pull/43867#issuecomment-1820124616

   Will marking subqueries as analyzed prematurely cause some rules to not take effect?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Add a tag to mark resolved subquery plan for ResolveSubquery [spark]

Posted by "wForget (via GitHub)" <gi...@apache.org>.
wForget commented on PR #43867:
URL: https://github.com/apache/spark/pull/43867#issuecomment-1826808784

   > > This may also affect other Rules. Since HiveTableRelation is not resolved, the Project.projectList of the parent plan will not be resolved.
   > 
   > Good point. Why is `DetermineTableStats` a post-hoc resolution rule then...
   
   Moving `DetermineTableStats` to resolution rules also seems to work well, do we need to do that? Do other post-hoc resolution rules also need to be applied to subqueries?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Add a tag to mark resolved subquery plan for ResolveSubquery [spark]

Posted by "wForget (via GitHub)" <gi...@apache.org>.
wForget commented on PR #43867:
URL: https://github.com/apache/spark/pull/43867#issuecomment-1825235036

   > Sorry for the back and forth, but I have a new finding today: `DetermineTableStats` will always set a table stats to the hive table. That said, I think we can mark the `HiveTableRelation` as unresolved if its stats is None?
   
   Thanks for your reply, I think it might work and I will try it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Move DetermineTableStats to resolution rules [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on PR #43867:
URL: https://github.com/apache/spark/pull/43867#issuecomment-1837128189

   After merging this pr, there are test failures in branch-3.5. Can you take a look at this issue? @wForget 
   
   
   https://github.com/apache/spark/actions/runs/7041075997/job/19163039661
   
   ```
   [info] HiveSourceRowLevelOperationSuite:
   03:30:31.518 WARN org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead.
   
   03:30:31.535 WARN org.apache.hadoop.hive.metastore.HiveMetaStore: Location: file:/home/runner/work/spark/spark/target/tmp/warehouse-b721773d-f779-4db4-9ebc-ca8315b15f61/hive_table specified for non-external table:hive_table
   
   [info] - SPARK-45943: merge into using hive table without stats *** FAILED *** (391 milliseconds)
   [info]   org.apache.spark.sql.AnalysisException: LEGACY store assignment policy is disallowed in Spark data source V2. Please set the configuration spark.sql.storeAssignmentPolicy to other values.
   [info]   at org.apache.spark.sql.errors.QueryCompilationErrors$.legacyStoreAssignmentPolicyError(QueryCompilationErrors.scala:270)
   [info]   at org.apache.spark.sql.catalyst.analysis.ResolveRowLevelCommandAssignments$.org$apache$spark$sql$catalyst$analysis$ResolveRowLevelCommandAssignments$$validateStoreAssignmentPolicy(ResolveRowLevelCommandAssignments.scala:66)
   [info]   at org.apache.spark.sql.catalyst.analysis.ResolveRowLevelCommandAssignments$$anonfun$apply$2.applyOrElse(ResolveRowLevelCommandAssignments.scala:52)
   [info]   at org.apache.spark.sql.catalyst.analysis.ResolveRowLevelCommandAssignments$$anonfun$apply$2.applyOrElse(ResolveRowLevelCommandAssignments.scala:41)
   [info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDownWithPruning$2(AnalysisHelper.scala:170)
   [info]   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76)
   [info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDownWithPruning$1(AnalysisHelper.scala:170)
   [info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:323)
   [info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDownWithPruning(AnalysisHelper.scala:168)
   [info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDownWithPruning$(AnalysisHelper.scala:164)
   [info]   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDownWithPruning(LogicalPlan.scala:32)
   [info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsWithPruning(AnalysisHelper.scala:99)
   [info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsWithPruning$(AnalysisHelper.scala:96)
   [info]   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsWithPruning(LogicalPlan.scala:32)
   [info]   at org.apache.spark.sql.catalyst.analysis.ResolveRowLevelCommandAssignments$.apply(ResolveRowLevelCommandAssignments.scala:41)
   [info]   at org.apache.spark.sql.catalyst.analysis.ResolveRowLevelCommandAssignments$.apply(ResolveRowLevelCommandAssignments.scala:38)
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Move DetermineTableStats to resolution rules [spark]

Posted by "ahshahid (via GitHub)" <gi...@apache.org>.
ahshahid commented on PR #43867:
URL: https://github.com/apache/spark/pull/43867#issuecomment-1832845934

   @wForget I had a non functional email ID set in my settings , so was not receiving any notifcations. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Determine table stats for `ReplaceData.groupFilterCondition` [spark]

Posted by "wForget (via GitHub)" <gi...@apache.org>.
wForget commented on code in PR #43867:
URL: https://github.com/apache/spark/pull/43867#discussion_r1398818267


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -2529,19 +2529,19 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
      */
     private def resolveSubQueries(plan: LogicalPlan, outer: LogicalPlan): LogicalPlan = {
       plan.transformAllExpressionsWithPruning(_.containsPattern(PLAN_EXPRESSION), ruleId) {
-        case s @ ScalarSubquery(sub, _, exprId, _, _, _) if !sub.resolved =>
+        case s @ ScalarSubquery(sub, _, exprId, _, _, _) if !sub.analyzed =>
           resolveSubQuery(s, outer)(ScalarSubquery(_, _, exprId))

Review Comment:
   I added `checkAnalysis` in `resolveSubQuery`, is it ok?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Determine table stats for `ReplaceData.groupFilterCondition` [spark]

Posted by "wForget (via GitHub)" <gi...@apache.org>.
wForget commented on code in PR #43867:
URL: https://github.com/apache/spark/pull/43867#discussion_r1399210714


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -2505,12 +2505,14 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
         outer: LogicalPlan)(
         f: (LogicalPlan, Seq[Expression]) => SubqueryExpression): SubqueryExpression = {
       val newSubqueryPlan = AnalysisContext.withOuterPlan(outer) {
-        executeSameContext(e.plan)
+        val analyzed = executeSameContext(e.plan)
+        checkAnalysis(analyzed)

Review Comment:
   Thanks for your suggestions, I will try the first, for the second I still need to spend some time understanding it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Determine table stats for `ReplaceData.groupFilterCondition` [spark]

Posted by "wForget (via GitHub)" <gi...@apache.org>.
wForget commented on PR #43867:
URL: https://github.com/apache/spark/pull/43867#issuecomment-1816621875

   Failed check is related to #43870


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Determine table stats for `ReplaceData.groupFilterCondition` [spark]

Posted by "parthchandra (via GitHub)" <gi...@apache.org>.
parthchandra commented on PR #43867:
URL: https://github.com/apache/spark/pull/43867#issuecomment-1817310355

   Thanks @dongjoon-hyun, this is exactly the issue I found and the fix is the same !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Determine table stats for `ReplaceData.groupFilterCondition` [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #43867:
URL: https://github.com/apache/spark/pull/43867#discussion_r1398708809


##########
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala:
##########
@@ -138,16 +138,26 @@ class DetermineTableStats(session: SparkSession) extends Rule[LogicalPlan] {
     relation.copy(tableStats = stats)
   }
 
-  override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {

Review Comment:
   We can use `plan.transformWithSubqueries`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Determine table stats for `ReplaceData.groupFilterCondition` [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #43867:
URL: https://github.com/apache/spark/pull/43867#discussion_r1399147590


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -2505,12 +2505,14 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
         outer: LogicalPlan)(
         f: (LogicalPlan, Seq[Expression]) => SubqueryExpression): SubqueryExpression = {
       val newSubqueryPlan = AnalysisContext.withOuterPlan(outer) {
-        executeSameContext(e.plan)
+        val analyzed = executeSameContext(e.plan)
+        checkAnalysis(analyzed)

Review Comment:
   I see the problem now. Because of CTE, we want to check the analysis for the main query and subqueries at the same time. But now we have to check it earlier for subqueries.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Move DetermineTableStats to resolution rules [spark]

Posted by "ahshahid (via GitHub)" <gi...@apache.org>.
ahshahid commented on PR #43867:
URL: https://github.com/apache/spark/pull/43867#issuecomment-1832840458

   > @aokolnychyi @cloud-fan This issue may be caused by `RewriteMergeIntoTable` rule, could you please take a look? also cc @ahshahid
   
   @cloud-fan  Sorry I noticed this just now.. I do not get any email notfications.. I relatively inexperienced in OS Github work . Do you all receive emails or you have to keep checking the GitHub UI for any notifications?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Determine table stats for `ReplaceData.groupFilterCondition` [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #43867:
URL: https://github.com/apache/spark/pull/43867#discussion_r1399149698


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -2505,12 +2505,14 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
         outer: LogicalPlan)(
         f: (LogicalPlan, Seq[Expression]) => SubqueryExpression): SubqueryExpression = {
       val newSubqueryPlan = AnalysisContext.withOuterPlan(outer) {
-        executeSameContext(e.plan)
+        val analyzed = executeSameContext(e.plan)
+        checkAnalysis(analyzed)

Review Comment:
   My suggestion:
   1. do not check analysis here, but only set the analyzed flag.
   2. in `CheckAnalysis`, set the subquery plan as not `analyzed` before checking it, or add a flag to `checkAnalysis0` to not skip the checking for subqueries.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Determine table stats for `ReplaceData.groupFilterCondition` [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #43867:
URL: https://github.com/apache/spark/pull/43867#discussion_r1400638381


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InlineCTE.scala:
##########
@@ -96,7 +96,7 @@ case class InlineCTE(alwaysInline: Boolean = false) extends Rule[LogicalPlan] {
         }
         buildCTEMap(child, cteMap, outerCTEId)
 
-      case ref: CTERelationRef =>
+      case ref: CTERelationRef if cteMap.contains(ref.cteId) =>

Review Comment:
   do we still need to change this file?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Determine table stats for `ReplaceData.groupFilterCondition` [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #43867:
URL: https://github.com/apache/spark/pull/43867#issuecomment-1817101948

   cc @parthchandra FYI


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Determine table stats for `ReplaceData.groupFilterCondition` [spark]

Posted by "wForget (via GitHub)" <gi...@apache.org>.
wForget commented on PR #43867:
URL: https://github.com/apache/spark/pull/43867#issuecomment-1818133148

   @aokolnychyi @cloud-fan This issue may be caused by `RewriteMergeIntoTable` rule, could you please take a look?  also cc @ahshahid


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Add a tag to mark resolved subquery plan for ResolveSubquery [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #43867:
URL: https://github.com/apache/spark/pull/43867#issuecomment-1826351872

   > This may also affect other Rules. Since HiveTableRelation is not resolved, the Project.projectList of the parent plan will not be resolved.
   
   Good point. Why is `DetermineTableStats` a post-hoc resolution rule then...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Move DetermineTableStats to resolution rules [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on PR #43867:
URL: https://github.com/apache/spark/pull/43867#issuecomment-1837142397

   seems we should backport SPARK-45975 to https://github.com/apache/spark/pull/43870 @wForget ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Move DetermineTableStats to resolution rules [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #43867:
URL: https://github.com/apache/spark/pull/43867#issuecomment-1837234483

   Thank you, @LuciferYang .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Determine table stats for `ReplaceData.groupFilterCondition` [spark]

Posted by "wForget (via GitHub)" <gi...@apache.org>.
wForget commented on code in PR #43867:
URL: https://github.com/apache/spark/pull/43867#discussion_r1400653315


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -2507,6 +2510,7 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
       val newSubqueryPlan = AnalysisContext.withOuterPlan(outer) {
         executeSameContext(e.plan)
       }
+      newSubqueryPlan.setTagValue(RESOLVED_SUBQUERY_TAG, ())

Review Comment:
   fixed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Determine table stats for `ReplaceData.groupFilterCondition` [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #43867:
URL: https://github.com/apache/spark/pull/43867#discussion_r1400522747


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -2529,19 +2530,19 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
      */
     private def resolveSubQueries(plan: LogicalPlan, outer: LogicalPlan): LogicalPlan = {
       plan.transformAllExpressionsWithPruning(_.containsPattern(PLAN_EXPRESSION), ruleId) {
-        case s @ ScalarSubquery(sub, _, exprId, _, _, _) if !sub.resolved =>
+        case s @ ScalarSubquery(sub, _, exprId, _, _, _) if !sub.analyzed =>

Review Comment:
   OK we can't rely on the `analyzed` flag here. Another idea: let's use `TreeNodeTag`. In `resolveSubQuery`, we attach a `TreeNodeTag` to the resolved subquery, then here we can skip resolving it again if the `TreeNodeTag` is present.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Determine table stats for `ReplaceData.groupFilterCondition` [spark]

Posted by "wForget (via GitHub)" <gi...@apache.org>.
wForget commented on code in PR #43867:
URL: https://github.com/apache/spark/pull/43867#discussion_r1400652600


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/InlineCTE.scala:
##########
@@ -96,7 +96,7 @@ case class InlineCTE(alwaysInline: Boolean = false) extends Rule[LogicalPlan] {
         }
         buildCTEMap(child, cteMap, outerCTEId)
 
-      case ref: CTERelationRef =>
+      case ref: CTERelationRef if cteMap.contains(ref.cteId) =>

Review Comment:
   no need, reverted



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Determine table stats for `ReplaceData.groupFilterCondition` [spark]

Posted by "wForget (via GitHub)" <gi...@apache.org>.
wForget commented on PR #43867:
URL: https://github.com/apache/spark/pull/43867#issuecomment-1820797743

   > Since we run all analyzer rules before marking it as analyzed, we should be fine. But `CheckAnalysis` is the only missing piece I think.
   
   It seems to affect some subsequent rules, like the failed test case `org.apache.spark.sql.SubquerySuite."having with function in subquery"`:
   
   ```
     test("having with function in subquery") {
       checkAnswer(
         sql("select a from l group by 1 having exists (select 1 from r where d < min(b))"),
         Row(null) :: Row(1) :: Row(3) :: Nil)
     }
   ```
   
   `UnresolvedHaving.havingCondition` is a subquery, marked as `resloved` after applying `Analyzer.ResolveSubquery`, if we mark it as `analyzed`, `UpdateOuterReferences#updateOuterReferenceInSubquery` will not take effect because it uses `plan resolveExpressions`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Move DetermineTableStats to resolution rules [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #43867:
URL: https://github.com/apache/spark/pull/43867#issuecomment-1832991273

   Merged to master and branch-3.5.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Determine table stats for `ReplaceData.groupFilterCondition` [spark]

Posted by "wForget (via GitHub)" <gi...@apache.org>.
wForget commented on code in PR #43867:
URL: https://github.com/apache/spark/pull/43867#discussion_r1398718286


##########
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala:
##########
@@ -138,16 +138,26 @@ class DetermineTableStats(session: SparkSession) extends Rule[LogicalPlan] {
     relation.copy(tableStats = stats)
   }
 
-  override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {

Review Comment:
   Thanks @cloud-fan , I will try it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Add a tag to mark resolved subquery plan for ResolveSubquery [spark]

Posted by "wForget (via GitHub)" <gi...@apache.org>.
wForget commented on PR #43867:
URL: https://github.com/apache/spark/pull/43867#issuecomment-1826812617

   > I think we should only put a rule in posthoc resolution if it needs the entire plan to be resolved first. `DetermineTableStats` is not the case.
   
   Got it, I'll revert the changes in ResolveSubquery later.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Move DetermineTableStats to resolution rules [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon closed pull request #43867: [SPARK-45943][SQL] Move DetermineTableStats to resolution rules
URL: https://github.com/apache/spark/pull/43867


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Determine table stats for `ReplaceData.groupFilterCondition` [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #43867:
URL: https://github.com/apache/spark/pull/43867#issuecomment-1820343341

   > Will marking subqueries as analyzed prematurely cause some rules to not take effect?
   
   Since we run all analyzer rules before marking it as analyzed, we should be fine. But `CheckAnalysis` is the only missing piece I think.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Determine table stats for `ReplaceData.groupFilterCondition` [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #43867:
URL: https://github.com/apache/spark/pull/43867#discussion_r1400637877


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -2507,6 +2510,7 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
       val newSubqueryPlan = AnalysisContext.withOuterPlan(outer) {
         executeSameContext(e.plan)
       }
+      newSubqueryPlan.setTagValue(RESOLVED_SUBQUERY_TAG, ())

Review Comment:
   can we set it only inside the `if (newSubqueryPlan.resolved)` branch?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Determine table stats for `ReplaceData.groupFilterCondition` [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #43867:
URL: https://github.com/apache/spark/pull/43867#discussion_r1398682303


##########
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala:
##########
@@ -138,16 +138,26 @@ class DetermineTableStats(session: SparkSession) extends Rule[LogicalPlan] {
     relation.copy(tableStats = stats)
   }
 
-  override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {

Review Comment:
   shall we simply make this rule explicitly handle subqueries? I feel we may hit the same issue if dataframe API also supports subqueries and we may have an already resolved subquery in the plan.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Determine table stats for `ReplaceData.groupFilterCondition` [spark]

Posted by "wForget (via GitHub)" <gi...@apache.org>.
wForget commented on code in PR #43867:
URL: https://github.com/apache/spark/pull/43867#discussion_r1398689493


##########
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala:
##########
@@ -138,16 +138,26 @@ class DetermineTableStats(session: SparkSession) extends Rule[LogicalPlan] {
     relation.copy(tableStats = stats)
   }
 
-  override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {

Review Comment:
   Yes, I like your idea better too, but I haven't figured out how to implement it. If handling this in `ResolveSubquery#resolveSubQueries` would it cause analyzer rules to be applied multiple times.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Determine table stats for `ReplaceData.groupFilterCondition` [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #43867:
URL: https://github.com/apache/spark/pull/43867#discussion_r1398710985


##########
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala:
##########
@@ -138,16 +138,26 @@ class DetermineTableStats(session: SparkSession) extends Rule[LogicalPlan] {
     relation.copy(tableStats = stats)
   }
 
-  override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {

Review Comment:
   Oh, actually we can update `ResolveSubquery#resolveSubQueries` to check `subquery.analyzed`, instead of `subquery.resolved`. The rationale is, we should always run the post-hoc resolution rules, even if the plan is already resolved.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Determine table stats for `ReplaceData.groupFilterCondition` [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #43867:
URL: https://github.com/apache/spark/pull/43867#discussion_r1399149698


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -2505,12 +2505,14 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
         outer: LogicalPlan)(
         f: (LogicalPlan, Seq[Expression]) => SubqueryExpression): SubqueryExpression = {
       val newSubqueryPlan = AnalysisContext.withOuterPlan(outer) {
-        executeSameContext(e.plan)
+        val analyzed = executeSameContext(e.plan)
+        checkAnalysis(analyzed)

Review Comment:
   My suggestion:
   1. do not check analysis here, but only set the analyzed flag.
   2. in `CheckAnalysis`, set the subquery plan as not `analyzed` before checking it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Add a tag to mark resolved subquery plan for ResolveSubquery [spark]

Posted by "wForget (via GitHub)" <gi...@apache.org>.
wForget commented on PR #43867:
URL: https://github.com/apache/spark/pull/43867#issuecomment-1825193742

   @cloud-fan Could you please continue the review if you have time?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45943][SQL] Add a tag to mark resolved subquery plan for ResolveSubquery [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #43867:
URL: https://github.com/apache/spark/pull/43867#discussion_r1403999630


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -2493,6 +2493,9 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
    * Note: CTEs are handled in CTESubstitution.
    */
   object ResolveSubquery extends Rule[LogicalPlan] {
+    // This tag is used to mark resolved subquery to avoid resolving it repeatedly.

Review Comment:
   Can we add more comments? It's used instead of `plan.resolved` because the post-hoc resolution rules should be run even if the plan is already resolved.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org