You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "zhengruifeng (via GitHub)" <gi...@apache.org> on 2023/07/17 10:13:29 UTC

[GitHub] [spark] zhengruifeng opened a new pull request, #42040: [WIP][SPARK-43611][SQL][PS][CONNCECT] Fix unexpected `AnalysisException` from Spark Connect client

zhengruifeng opened a new pull request, #42040:
URL: https://github.com/apache/spark/pull/42040

   ### What changes were proposed in this pull request?
   Fix unexpected `AnalysisException` from Spark Connect client
   
   
   ### Why are the changes needed?
   to make PS work on Connect
   
   
   ### Does this PR introduce _any_ user-facing change?
   yes
   
   
   ### How was this patch tested?
   manually check


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #42040: [WIP][SPARK-43611][SQL][PS][CONNCECT] Fix unexpected `AnalysisException` from Spark Connect client

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #42040:
URL: https://github.com/apache/spark/pull/42040#issuecomment-1637866212

   not sure whether this works:
   
   - for each analysis rule, if the plan is changed, check whether the output plan (as well as its descendents) contains all the planIDs from the input plan (as well as its descendents), if some planID are missing, attach the missing ones to the output plan;
   - we can still modify some rule if need;
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] itholic commented on pull request #42040: [WIP][SPARK-43611][SQL][PS][CONNCECT] Fix unexpected `AnalysisException` from Spark Connect client

Posted by "itholic (via GitHub)" <gi...@apache.org>.
itholic commented on PR #42040:
URL: https://github.com/apache/spark/pull/42040#issuecomment-1637901206

   Thanks for your time on investigating this issue! 👍 
   
   Since this issue is also blocking numerous test cases including the two examples provided in the ticket, I have created a separate temporary test PR here: https://github.com/apache/spark/pull/42041 that cherry-picks this fix and enables all tests related to this issue to verify that the fix is applied correctly to all test cases as well.
   
   If the cause of failure for every test is different and difficult to fix in one PR, we may need to break this issue into multiple sub-tasks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #42040: [WIP][SPARK-43611][SQL][PS][CONNCECT] Fix unexpected `AnalysisException` from Spark Connect client

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #42040:
URL: https://github.com/apache/spark/pull/42040#discussion_r1265157852


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##########
@@ -3151,9 +3151,16 @@ class Analyzer(override val catalogManager: CatalogManager) extends RuleExecutor
         // Add Window operators.
         val withWindow = addWindow(windowExpressions, withProject)
 
+        val planId = p.getTagValue(LogicalPlan.PLAN_ID_TAG)
+
         // Finally, generate output columns according to the original projectList.
         val finalProjectList = projectList.map(_.toAttribute)
-        Project(finalProjectList, withWindow)
+        val newProject = Project(finalProjectList, withWindow)
+
+        // retain the plan id used in Spark Connect
+        planId.foreach(newProject.setTagValue(LogicalPlan.PLAN_ID_TAG, _))

Review Comment:
   this is to fix example 2.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala:
##########
@@ -290,6 +292,11 @@ abstract class TypeCoercionBase {
             s.copy(children = newChildren) -> attrMapping
           }
       }
+
+      // retain the plan id used in Spark Connect
+      planId.foreach(newPlan.setTagValue(LogicalPlan.PLAN_ID_TAG, _))

Review Comment:
   this is to fix example 1



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #42040: [WIP][SPARK-43611][SQL][PS][CONNCECT] Fix unexpected `AnalysisException` from Spark Connect client

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #42040:
URL: https://github.com/apache/spark/pull/42040#issuecomment-1643204561

   close this one in favor of https://github.com/apache/spark/pull/42086


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng closed pull request #42040: [WIP][SPARK-43611][SQL][PS][CONNCECT] Fix unexpected `AnalysisException` from Spark Connect client

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng closed pull request #42040: [WIP][SPARK-43611][SQL][PS][CONNCECT] Fix unexpected `AnalysisException` from Spark Connect client
URL: https://github.com/apache/spark/pull/42040


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #42040: [WIP][SPARK-43611][SQL][PS][CONNCECT] Fix unexpected `AnalysisException` from Spark Connect client

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #42040:
URL: https://github.com/apache/spark/pull/42040#issuecomment-1637817969

   In https://github.com/apache/spark/pull/39925, we introduced a new mechanism to resolve expression with specified plan.
   
   However, sometimes the plan ID might be eliminated by the analyzer, and then some expressions can not be correctly resolved, this issue is the No.1 blocker of PS on Connect.
   
   Currently, I investigate the two examples [in the ticket](https://issues.apache.org/jira/browse/SPARK-43611) and check each rule applied to them.
   
   example 1:
   ```
   >>> import pyspark.pandas as ps
   >>> psdf1 = ps.DataFrame({"A": [1, 2, 3]})
   >>> psdf2 = ps.DataFrame({"B": [1, 2, 3]})
   >>> psdf1.append(psdf2)
   ```
   
   example 2:
   ```
   import pyspark.pandas as ps
   import pandas as pd
   
   pdf = pd.DataFrame({"A": [None, 3, None, None], "B": [2, 4, None, 3], "C": [None, None, None, 1], "D": [0, 1, 5, 4],}, columns=["A", "B", "C", "D"],)
   psdf = ps.from_pandas(pdf)
   psdf.backfill()
   ```
   
   In the draft, I modify two rules to retain the plan id. (actually, I modified [ResolveNaturalAndUsingJoin](https://github.com/apache/spark/blob/6161bf44f40f8146ea4c115c788fd4eaeb128769/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L3302-L3316) in https://github.com/apache/spark/commit/167bbca49c1c12ccd349d4330862c136b38d4522)
   
   I am wondering whether is there some graceful approach to fix this issue? Otherwise, I'm afraid I will touch more rules.
   
   cc @cloud-fan @HyukjinKwon @itholic 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org