You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by "ahshahid (via GitHub)" <gi...@apache.org> on 2024/03/08 22:51:19 UTC

[PR] [SPARK-47217][SQL] : The behaviour of Datasets involving self joins is inconsistent, unintuitive, with contradictions [spark]

ahshahid opened a new pull request, #45446:
URL: https://github.com/apache/spark/pull/45446

   ### What changes were proposed in this pull request?
   The basis of the change is to distinguish and resolve the ambiguity based on the Dataset from which column is extracted  by the user, instead of ExprIds.
   That will result in a consistent and intuitive behaviour and also logically correct. 
   Current code is mixing the resolution basis as sometimes using ExprId and sometimes indirectly using DataSet Id.
   This PR used DataSet Id present in AttributeReference's metadata to see if ambiguity can be resolved logically / sensibly by checking with the DataSet ID's of the joining DataSets.
   
   ### Why are the changes needed?
   While fixing a bug where Ambiguous Column Exception was raised ( which worked fine in earlier versions of spark), came across multiple situations where a particular nested joined DataSet involving self joins, works, but fails when join order is changed or a column extract from dataset involved in join, is treated as unambiguous when used in join condition but same causes ambiguity exception when used in projection ( select)
   There is also an existing test I believe which is falsely passing where resolution of attribute is not happening to the expected Dataset.
   For eg:
   `
   val df1 = Seq((1, 2)).toDF("a", "b")
   val df2 = Seq((1, 2)).toDF("aa", "bb")
   val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"),
   df2("aa"), df1("b"))
   
   df1Joindf2.join(df1, df1Joindf2("a") === df1("a"))
   `
   The above works fine, but below throws Exception. The only difference between the two is  that the latter has `select(df1("a")`. But then `df1("a")` works fine as a condition
   `
   val df1 = Seq((1, 2)).toDF("a", "b")
   val df2 = Seq((1, 2)).toDF("aa", "bb")
   val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"),
   df2("aa"), df1("b"))
   
   df1Joindf2.join(df1, df1Joindf2("a") === df1("a")).select(df1("a"))
   `
   
   ### Does this PR introduce _any_ user-facing change?
   Yes.
   It is possible that any Dataset involving self joins which may have  previously been throwing Ambiguity related exceptions are now expected to work , assuming the columns being extracted to be used in APIs are from DataSets being joined at the top most level.
   
   
   ### How was this patch tested?
   Added new tests. Making stricter assertions. Modifying the existing tests in DataFrameSelfJoinTest which are logically having unambiguity based on datasets from which columns are extracted.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47320][SQL] : The behaviour of Datasets involving self joins is inconsistent, unintuitive, with contradictions [spark]

Posted by "ahshahid (via GitHub)" <gi...@apache.org>.

ahshahid commented on PR #45446:
URL: https://github.com/apache/spark/pull/45446#issuecomment-2004673765

   @peter-toth does your PR for [https://github.com/apache/spark/pull/45552](https://github.com/apache/spark/pull/45552) imply that this PR can be closed ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47320][SQL] : The behaviour of Datasets involving self joins is inconsistent, unintuitive, with contradictions [spark]

Posted by "ahshahid (via GitHub)" <gi...@apache.org>.

ahshahid commented on code in PR #45446:
URL: https://github.com/apache/spark/pull/45446#discussion_r1527235068


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala:
##########
@@ -477,6 +482,57 @@ trait ColumnResolutionHelper extends Logging with DataTypeErrorsBase {
         assert(q.children.length == 1)
         q.children.head.output
       },
+
+      resolveOnDatasetId = (datasetid: Long, name: String) => {

Review Comment:
   @peter-toth I agree with your analysis. In the current PR,  the approach I had in mind, was to allow columns from only top Joining dataframes for simplicity purposes . The reason for this thinking was :
   1) It allows predictable behaviour and easier for user to comprehend the outcome.
   2) while resolving does not need to do deep traversals. In case of repeat dataframes , if we reach the leaves, then dataset IDs most likely will clash, so to resolve ambiguity we would have to resort to something like shortest depth etc ( I still use depth but only 1 level deep). 
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47320][SQL] : The behaviour of Datasets involving self joins is inconsistent, unintuitive, with contradictions [spark]

Posted by "ahshahid (via GitHub)" <gi...@apache.org>.

ahshahid commented on PR #45446:
URL: https://github.com/apache/spark/pull/45446#issuecomment-2002310326

   @peter-toth @cloud-fan ,
   IMHO the current idea of spark resolving the attribute to dataframe lower than the top level dataframe(s) , which in process adds missing attribute to various projections in between , can be detrimental to the performance without user being aware of the cause. The scenario which I have in mind is that say user had cached the lower dataframes. Now with the plan implicitly adding missing projects may make those cached plans unusable, without user being aware of the situation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47320][SQL] : The behaviour of Datasets involving self joins is inconsistent, unintuitive, with contradictions [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45446:
URL: https://github.com/apache/spark/pull/45446#discussion_r1522868178


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala:
##########
@@ -477,6 +482,57 @@ trait ColumnResolutionHelper extends Logging with DataTypeErrorsBase {
         assert(q.children.length == 1)
         q.children.head.output
       },
+
+      resolveOnDatasetId = (datasetid: Long, name: String) => {

Review Comment:
   the purpose of https://github.com/apache/spark/pull/43115 is to use plan id for both Spark Connect and classic Spark. That said, no more already-resolved `AttributeReference`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47320][SQL] : The behaviour of Datasets involving self joins is inconsistent, unintuitive, with contradictions [spark]

Posted by "ahshahid (via GitHub)" <gi...@apache.org>.

ahshahid commented on code in PR #45446:
URL: https://github.com/apache/spark/pull/45446#discussion_r1524029007


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala:
##########
@@ -477,6 +482,57 @@ trait ColumnResolutionHelper extends Logging with DataTypeErrorsBase {
         assert(q.children.length == 1)
         q.children.head.output
       },
+
+      resolveOnDatasetId = (datasetid: Long, name: String) => {

Review Comment:
   > the purpose of #43115 is to use plan id for both Spark Connect and classic Spark. That said, no more already-resolved `AttributeReference`
   
   Further checking the PR #43115  I see that it is substituting  DataSet ID for plan ID for the attributes. Which is I suppose fine, though redundant, because metadata already contains the datasetID . However main issue with the current PR #43115  which makes it untesttable I think is, that for resolution using LogicalPlan ID , the LogicalPlan should contain the Tag PLAN_ID, which is not being set as it is not SQLConnect plan. 
   And I am not that well versed to set the Plan ID  for usual Spark LogicalPlans, for testing purposes.
   But I will see if I can reconcile my code with the logic used in tryResolveDataFrameColumns



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47320][SQL] : The behaviour of Datasets involving self joins is inconsistent, unintuitive, with contradictions [spark]

Posted by "ahshahid (via GitHub)" <gi...@apache.org>.

ahshahid commented on code in PR #45446:
URL: https://github.com/apache/spark/pull/45446#discussion_r1527239057


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala:
##########
@@ -477,6 +482,57 @@ trait ColumnResolutionHelper extends Logging with DataTypeErrorsBase {
         assert(q.children.length == 1)
         q.children.head.output
       },
+
+      resolveOnDatasetId = (datasetid: Long, name: String) => {

Review Comment:
   yes. PlanId being a single value is a sort of limitation.  Though I would hold back my thoughts because planId usage code , I have not fully grasped, except that SQLConnect code is very sensitive to it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47320][SQL] : The behaviour of Datasets involving self joins is inconsistent, unintuitive, with contradictions [spark]

Posted by "ahshahid (via GitHub)" <gi...@apache.org>.

ahshahid commented on code in PR #45446:
URL: https://github.com/apache/spark/pull/45446#discussion_r1527072175


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala:
##########
@@ -477,6 +482,57 @@ trait ColumnResolutionHelper extends Logging with DataTypeErrorsBase {
         assert(q.children.length == 1)
         q.children.head.output
       },
+
+      resolveOnDatasetId = (datasetid: Long, name: String) => {

Review Comment:
   @peter-toth @cloud-fan,
   Attached is the patch which describes the attempt to use planID resolution logic. But that does not work because the planID resolution logic goes till the full depth and even attempt to early end the recursion does not work, as the comparison of the return values of the recursion results, does not contain sufficient data.  And further attempt to somehow reuse that code, is proving too cumbersome for me.
   [patch.txt](https://github.com/apache/spark/files/14622366/patch.txt)
   
   Though I have removed the previously added new class UnresolvedAttributeWithTag.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47320][SQL] : The behaviour of Datasets involving self joins is inconsistent, unintuitive, with contradictions [spark]

Posted by "ahshahid (via GitHub)" <gi...@apache.org>.

ahshahid commented on code in PR #45446:
URL: https://github.com/apache/spark/pull/45446#discussion_r1525476910


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala:
##########
@@ -477,6 +482,57 @@ trait ColumnResolutionHelper extends Logging with DataTypeErrorsBase {
         assert(q.children.length == 1)
         q.children.head.output
       },
+
+      resolveOnDatasetId = (datasetid: Long, name: String) => {

Review Comment:
   @peter-toth @cloud-fan I tried to use the idea of #43115 to return UnresolvedAttributeWithTag ( though implications would be same even if I had returned  UnresolvedAttribute) , but it affects many tests as there are assertions on AttributeReference ( which are now UnresolvedAttr)..
   I will try to see if planIDResolution code can be used ..
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47320][SQL] : The behaviour of Datasets involving self joins is inconsistent, unintuitive, with contradictions [spark]

Posted by "ahshahid (via GitHub)" <gi...@apache.org>.

ahshahid commented on code in PR #45446:
URL: https://github.com/apache/spark/pull/45446#discussion_r1520961009


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala:
##########
@@ -477,6 +482,57 @@ trait ColumnResolutionHelper extends Logging with DataTypeErrorsBase {
         assert(q.children.length == 1)
         q.children.head.output
       },
+
+      resolveOnDatasetId = (datasetid: Long, name: String) => {

Review Comment:
   > can you check out function `tryResolveDataFrameColumns` in this file and see if it matches your expectations?
   
   I will check it out but isn't plan Id tag set only when using  Client ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47320][SQL] : The behaviour of Datasets involving self joins is inconsistent, unintuitive, with contradictions [spark]

Posted by "ahshahid (via GitHub)" <gi...@apache.org>.

ahshahid commented on code in PR #45446:
URL: https://github.com/apache/spark/pull/45446#discussion_r1523844585


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala:
##########
@@ -477,6 +482,57 @@ trait ColumnResolutionHelper extends Logging with DataTypeErrorsBase {
         assert(q.children.length == 1)
         q.children.head.output
       },
+
+      resolveOnDatasetId = (datasetid: Long, name: String) => {

Review Comment:
   @cloud-fan @peter-toth  I applied the patch #43115 and verified that in df.select , the projection is added with Alias having UnresolvedAttribute.   There are many test failures in DataFrameSelfJoinSuite with that change. So atleast with the current code base of spark with patch #43115 is not sufficient to fix new bugs which are encountered  nor the existing tests pass. If planId has to be used to fix the issue so that the current code of tryResolveDataFrameColumns can fix the problem, then atleast it will need modification & debugging.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47320][SQL] : The behaviour of Datasets involving self joins is inconsistent, unintuitive, with contradictions [spark]

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.

peter-toth commented on code in PR #45446:
URL: https://github.com/apache/spark/pull/45446#discussion_r1527145045


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala:
##########
@@ -477,6 +482,57 @@ trait ColumnResolutionHelper extends Logging with DataTypeErrorsBase {
         assert(q.children.length == 1)
         q.children.head.output
       },
+
+      resolveOnDatasetId = (datasetid: Long, name: String) => {

Review Comment:
   I have 2 notes to the above:
   - @ahshahid, the following worked in Spark 3.5 but failes in 4.0 after https://github.com/apache/spark/pull/41347 for the same reason as described in the old https://github.com/apache/spark/pull/45343:
     ```
     test("SPARK-47217: DeduplicateRelations issue 4") {
       Seq(true, false).foreach(fail =>
         withSQLConf(SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> fail.toString) {
           val df = Seq((1, 2)).toDF("a", "b")
           val df2 = df.select(df("a").as("aa"), df("b").as("bb"))
           val df3 = df.select(df("a"), df("b"))
           val df4 = df2.join(df3, df2("bb") === df("b")).select(df2("aa"), df("a")) // `df("a")` doesn't come from the the join's direct children, but from it's descendants 
           checkAnswer(df4, Row(1, 1) :: Nil)
         }
       )
     }
     ```
     In this test `df("a")`'s expression id gets deduplicated (in the right side of the join) and so the original expression id doesn't work in the final select. But I think this test case proves that we need `tryResolveDataFrameColumns()` like deep recursion when we try resolving by plan ids.
   - @cloud-fan, I think there is different problem with `tryResolveDataFrameColumns()`.
     I did try to use it for "re-resolving" attribute references that became invalid in a quick test: https://github.com/peter-toth/spark/commit/a873c24372b1d87184149bc5e65c96da1b0db879, but a few test cases failed due to some logicalplans can belong to multiple datasets. E.g. if we have:
     ```
     val df = Seq((1, 2)).toDF("a", "b")
     val df2 = df.toDF()
     ``` 
     then the `df` and `df2` shares the same logicalplan instance and we can't store multiple ids in the current `LogicalPlan.PLAN_ID_TAG`.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala:
##########
@@ -477,6 +482,57 @@ trait ColumnResolutionHelper extends Logging with DataTypeErrorsBase {
         assert(q.children.length == 1)
         q.children.head.output
       },
+
+      resolveOnDatasetId = (datasetid: Long, name: String) => {

Review Comment:
   I have 2 notes to the above:
   - @ahshahid, the following worked in Spark 3.5 but failes in 4.0 after https://github.com/apache/spark/pull/41347 for the same reason as described in the old https://github.com/apache/spark/pull/45343:
     ```
     test("SPARK-47217: DeduplicateRelations issue 4") {
       Seq(true, false).foreach(fail =>
         withSQLConf(SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> fail.toString) {
           val df = Seq((1, 2)).toDF("a", "b")
           val df2 = df.select(df("a").as("aa"), df("b").as("bb"))
           val df3 = df.select(df("a"), df("b"))
           val df4 = df2.join(df3, df2("bb") === df("b")).select(df2("aa"), df("a")) // `df("a")` doesn't come from the the join's direct children, but from its descendants 
           checkAnswer(df4, Row(1, 1) :: Nil)
         }
       )
     }
     ```
     In this test `df("a")`'s expression id gets deduplicated (in the right side of the join) and so the original expression id doesn't work in the final select. But I think this test case proves that we need `tryResolveDataFrameColumns()` like deep recursion when we try resolving by plan ids.
   - @cloud-fan, I think there is different problem with `tryResolveDataFrameColumns()`.
     I did try to use it for "re-resolving" attribute references that became invalid in a quick test: https://github.com/peter-toth/spark/commit/a873c24372b1d87184149bc5e65c96da1b0db879, but a few test cases failed due to some logicalplans can belong to multiple datasets. E.g. if we have:
     ```
     val df = Seq((1, 2)).toDF("a", "b")
     val df2 = df.toDF()
     ``` 
     then the `df` and `df2` shares the same logicalplan instance and we can't store multiple ids in the current `LogicalPlan.PLAN_ID_TAG`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47320][SQL] : The behaviour of Datasets involving self joins is inconsistent, unintuitive, with contradictions [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45446:
URL: https://github.com/apache/spark/pull/45446#discussion_r1520923407


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala:
##########
@@ -477,6 +482,57 @@ trait ColumnResolutionHelper extends Logging with DataTypeErrorsBase {
         assert(q.children.length == 1)
         q.children.head.output
       },
+
+      resolveOnDatasetId = (datasetid: Long, name: String) => {

Review Comment:
   Can you also check https://github.com/apache/spark/pull/43115 and see if it fixes your problem?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47320][SQL] : The behaviour of Datasets involving self joins is inconsistent, unintuitive, with contradictions [spark]

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.

peter-toth commented on code in PR #45446:
URL: https://github.com/apache/spark/pull/45446#discussion_r1527145045


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala:
##########
@@ -477,6 +482,57 @@ trait ColumnResolutionHelper extends Logging with DataTypeErrorsBase {
         assert(q.children.length == 1)
         q.children.head.output
       },
+
+      resolveOnDatasetId = (datasetid: Long, name: String) => {

Review Comment:
   I have 2 notes to the above:
   - @ahshahid, the following worked in Spark 3.5 but failes in 4.0 after https://github.com/apache/spark/pull/41347 for the same reason as described in the old https://github.com/apache/spark/pull/45343:
     ```
     test("SPARK-47217: DeduplicateRelations issue 4") {
       Seq(true, false).foreach(fail =>
         withSQLConf(SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> fail.toString) {
           val df = Seq((1, 2)).toDF("a", "b")
           val df2 = df.select(df("a").as("aa"), df("b").as("bb"))
           val df3 = df.select(df("a"), df("b"))
           val df4 = df2.join(df3, df2("bb") === df("b")).select(df2("aa"), df("a")) // `df("a")` doesn't come the the join's direct children, but from it's descendants 
           checkAnswer(df4, Row(1, 1) :: Nil)
         }
       )
     }
     ```
     In this test `df("a")`'s expression id gets deduplicated (in the right side of the join) and so the original expression id doesn't work in the final select. But I think this test case proves that we need `tryResolveDataFrameColumns()` like deep recursion when we try resolving by plan ids.
   - @cloud-fan, I think there is different problem with `tryResolveDataFrameColumns()`.
     I did try to use it for "re-resolving" attribute references that became invalid in a quick test: https://github.com/peter-toth/spark/commit/a873c24372b1d87184149bc5e65c96da1b0db879, but a few test cases failed due to some logicalplans can belong to multiple datasets. E.g. if we have:
     ```
     val df = Seq((1, 2)).toDF("a", "b")
     val df2 = df.toDF()
     ``` 
     then the `df` and `df2` shares the same logicalplan instance and we can't store multiple ids in the current `LogicalPlan.PLAN_ID_TAG`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47320][SQL] : The behaviour of Datasets involving self joins is inconsistent, unintuitive, with contradictions [spark]

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.

peter-toth commented on code in PR #45446:
URL: https://github.com/apache/spark/pull/45446#discussion_r1527145045


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala:
##########
@@ -477,6 +482,57 @@ trait ColumnResolutionHelper extends Logging with DataTypeErrorsBase {
         assert(q.children.length == 1)
         q.children.head.output
       },
+
+      resolveOnDatasetId = (datasetid: Long, name: String) => {

Review Comment:
   I have 2 notes to the above:
   - @ahshahid, the following worked in Spark 3.5 but failes in 4.0 after https://github.com/apache/spark/pull/41347 for the same reason as described in the old https://github.com/apache/spark/pull/45343:
     ```
     test("SPARK-47217: DeduplicateRelations issue 4") {
       Seq(true, false).foreach(fail =>
         withSQLConf(SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> fail.toString) {
           val df = Seq((1, 2)).toDF("a", "b")
           val df2 = df.select(df("a").as("aa"), df("b").as("bb"))
           val df3 = df.select(df("a"), df("b"))
           val df4 = df2.join(df3, df2("bb") === df("b")).select(df2("aa"), df("a")) // `df("a")` doesn't come from the the join's direct children, but from its descendants 
           checkAnswer(df4, Row(1, 1) :: Nil)
         }
       )
     }
     ```
     In this test `df("a")`'s expression id gets deduplicated (in the right side of the join) and so the original expression id doesn't work in the final select. But I think this test case proves that we need `tryResolveDataFrameColumns()` like deep recursion when we try resolving by plan ids.
   - @cloud-fan, I think there is different problem with `tryResolveDataFrameColumns()`.
     I did try to use it for "re-resolving" attribute references that became invalid in a quick test: https://github.com/peter-toth/spark/commit/a873c24372b1d87184149bc5e65c96da1b0db879, but a few test cases failed due to some logicalplans can belong to multiple datasets. E.g. if we have:
     ```
     val df = Seq((1, 2)).toDF("a", "b")
     val df2 = df.toDF()
     ``` 
     then the `df` and `df2` shares the same logicalplan instance but we can't store multiple ids in the current `LogicalPlan.PLAN_ID_TAG`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47320][SQL] : The behaviour of Datasets involving self joins is inconsistent, unintuitive, with contradictions [spark]

Posted by "ahshahid (via GitHub)" <gi...@apache.org>.

ahshahid commented on code in PR #45446:
URL: https://github.com/apache/spark/pull/45446#discussion_r1524263718


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala:
##########
@@ -477,6 +482,57 @@ trait ColumnResolutionHelper extends Logging with DataTypeErrorsBase {
         assert(q.children.length == 1)
         q.children.head.output
       },
+
+      resolveOnDatasetId = (datasetid: Long, name: String) => {

Review Comment:
   I have been trying to see how I can integrate #43115 with this PR. One part is deifinitely easily integratable and simplifies the code is to use is to use the DataSet change of adding UnResolvedAttribute. This will simplify the changes in this PR for select and join, etc.
   But I am not  comfortable as of now to have that function create UnresolvedAttribute . Instead prefer UnresolvedAttributeWithTag which wraps the original attribute.
   This PRs logic  of resolution involves identifying first Join ( the top level Join) and then traversing the two legs till either a leaf or a binary node is reached. It does not recurse down further. I am not sure we need resolution to happen below the top level join as the user is expected to extract column attributes from only the Dataset involved in top level Join.
   Secondly the current code of resolution of DataFrame does not use TagId from the UnresolvedAttribute. It uses the Logical Plan's ID. 
   3rd The DataSetId being a Set[Long] instead of single PlanId , is more versatile as it avoids the need of recursion, if at all.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47320][SQL] : The behaviour of Datasets involving self joins is inconsistent, unintuitive, with contradictions [spark]

Posted by "ahshahid (via GitHub)" <gi...@apache.org>.

ahshahid commented on code in PR #45446:
URL: https://github.com/apache/spark/pull/45446#discussion_r1524263718


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala:
##########
@@ -477,6 +482,57 @@ trait ColumnResolutionHelper extends Logging with DataTypeErrorsBase {
         assert(q.children.length == 1)
         q.children.head.output
       },
+
+      resolveOnDatasetId = (datasetid: Long, name: String) => {

Review Comment:
   I have been trying to see how I can integrate #43115 with this PR. One part is deifinitely easily integratable and simplifies the code is to use is to use the DataSet change of adding UnResolvedAttribute. This will simplify the changes in this PR for select and join, etc.
   But I am not  comfortable as of now to have that function create UnresolvedAttribute . Instead prefer UnresolvedAttributeWithTag which wraps the original attribute.
   This PRs logic  of resolution involves identifying first Join ( the top level Join) and then traversing the two legs till either a leaf or a binary node is reached. It does not recurse down further. I am not sure we need resolution to happen below the top level join as the user is expected to extract column attributes from only the Dataset involved in top level Join.
   Secondly the current code of resolution of DataFrame  Column does not use Tag from the UnresolvedAttribute. It uses the Logical Plan's ID. 
   3rd The DataSetId being a Set[Long] instead of single PlanId , is more versatile as it avoids the need of recursion, if at all.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47320][SQL] : The behaviour of Datasets involving self joins is inconsistent, unintuitive, with contradictions [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45446:
URL: https://github.com/apache/spark/pull/45446#discussion_r1522868178


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala:
##########
@@ -477,6 +482,57 @@ trait ColumnResolutionHelper extends Logging with DataTypeErrorsBase {
         assert(q.children.length == 1)
         q.children.head.output
       },
+
+      resolveOnDatasetId = (datasetid: Long, name: String) => {

Review Comment:
   the purpose of https://github.com/apache/spark/pull/43115 is to use plan for both Spark Connect and classic Spark.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala:
##########
@@ -477,6 +482,57 @@ trait ColumnResolutionHelper extends Logging with DataTypeErrorsBase {
         assert(q.children.length == 1)
         q.children.head.output
       },
+
+      resolveOnDatasetId = (datasetid: Long, name: String) => {

Review Comment:
   the purpose of https://github.com/apache/spark/pull/43115 is to use plan id for both Spark Connect and classic Spark.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47320][SQL] : The behaviour of Datasets involving self joins is inconsistent, unintuitive, with contradictions [spark]

Posted by "ahshahid (via GitHub)" <gi...@apache.org>.

ahshahid commented on PR #45446:
URL: https://github.com/apache/spark/pull/45446#issuecomment-2004841887

   > @peter-toth does your PR for #45552 imply that this PR can be closed ?
   
   I am wondering if a combo of your PR [https://github.com/apache/spark/pull/45552](https://github.com/apache/spark/pull/45552) and this PR would solve the issue satisfactorily.
   i.e
   1) The first preference to resolution if ( DATASET_ID_TAG is present in UnResolvedAttribute) would be the code of this PR. This will resolve if there is apparent unambiguity based on user's choice of dataframe to extract columns ( given that the dataframe used is leg of the top level join.
   2) If the resolution is not possible or ambiguous , delegate it to PlanID resolution.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47320][SQL] : The behaviour of Datasets involving self joins is inconsistent, unintuitive, with contradictions [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45446:
URL: https://github.com/apache/spark/pull/45446#discussion_r1520922028


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala:
##########
@@ -477,6 +482,57 @@ trait ColumnResolutionHelper extends Logging with DataTypeErrorsBase {
         assert(q.children.length == 1)
         q.children.head.output
       },
+
+      resolveOnDatasetId = (datasetid: Long, name: String) => {

Review Comment:
   can you check out function `tryResolveDataFrameColumns` in this file and see if it matches your expectations?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47320][SQL] : The behaviour of Datasets involving self joins is inconsistent, unintuitive, with contradictions [spark]

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.

peter-toth commented on code in PR #45446:
URL: https://github.com/apache/spark/pull/45446#discussion_r1527574690


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala:
##########
@@ -477,6 +482,57 @@ trait ColumnResolutionHelper extends Logging with DataTypeErrorsBase {
         assert(q.children.length == 1)
         q.children.head.output
       },
+
+      resolveOnDatasetId = (datasetid: Long, name: String) => {

Review Comment:
   Maybe we could fix this issue with a small change like this: https://github.com/apache/spark/pull/45552



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47320][SQL] : The behaviour of Datasets involving self joins is inconsistent, unintuitive, with contradictions [spark]

Posted by "peter-toth (via GitHub)" <gi...@apache.org>.

peter-toth commented on PR #45446:
URL: https://github.com/apache/spark/pull/45446#issuecomment-2007281672

   Let's keep this PR open till we fully figure out how to deal with these issues: https://github.com/apache/spark/pull/45552#issuecomment-2007037219


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47320][SQL] : The behaviour of Datasets involving self joins is inconsistent, unintuitive, with contradictions [spark]

Posted by "ahshahid (via GitHub)" <gi...@apache.org>.

ahshahid commented on code in PR #45446:
URL: https://github.com/apache/spark/pull/45446#discussion_r1520957284


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala:
##########
@@ -477,6 +482,57 @@ trait ColumnResolutionHelper extends Logging with DataTypeErrorsBase {
         assert(q.children.length == 1)
         q.children.head.output
       },
+
+      resolveOnDatasetId = (datasetid: Long, name: String) => {

Review Comment:
   thanks @cloud-fan will check.. The thing is that when Column is extracted from a DataFrame , then it comes as resolved ( and in cases of bugs, it is incorrect ). But still will check 43115



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47320][SQL] : The behaviour of Datasets involving self joins is inconsistent, unintuitive, with contradictions [spark]

Posted by "ahshahid (via GitHub)" <gi...@apache.org>.

ahshahid commented on code in PR #45446:
URL: https://github.com/apache/spark/pull/45446#discussion_r1521997698


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala:
##########
@@ -477,6 +482,57 @@ trait ColumnResolutionHelper extends Logging with DataTypeErrorsBase {
         assert(q.children.length == 1)
         q.children.head.output
       },
+
+      resolveOnDatasetId = (datasetid: Long, name: String) => {

Review Comment:
   > > can you check out function `tryResolveDataFrameColumns` in this file and see if it matches your expectations?
   > 
   > I will check it out but isn't plan Id tag set only when using Client ?
   
   The tryResolveDataFrameColumns will not help for 2 reasons.. 
   1) Plan ID is added only when code path uses SQLConnect
   2) There are cases, where AttributeReference is already added as resolved. That resolution may have been incorrect as it is not based on the deduplication plan.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47320][SQL] : The behaviour of Datasets involving self joins is inconsistent, unintuitive, with contradictions [spark]

Posted by "ahshahid (via GitHub)" <gi...@apache.org>.

ahshahid commented on code in PR #45446:
URL: https://github.com/apache/spark/pull/45446#discussion_r1521993187


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala:
##########
@@ -477,6 +482,57 @@ trait ColumnResolutionHelper extends Logging with DataTypeErrorsBase {
         assert(q.children.length == 1)
         q.children.head.output
       },
+
+      resolveOnDatasetId = (datasetid: Long, name: String) => {

Review Comment:
   > Can you also check #43115 and see if it fixes your problem?
   
   @cloud-fan I applied the patch to nearly latest master with the tests for Spark-47320 . The new tests are failing along with existing ones, so I suppose the PR #43115  needs to be enhanced 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47320][SQL] : The behaviour of Datasets involving self joins is inconsistent, unintuitive, with contradictions [spark]

Posted by "ahshahid (via GitHub)" <gi...@apache.org>.

ahshahid commented on code in PR #45446:
URL: https://github.com/apache/spark/pull/45446#discussion_r1523727676


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ColumnResolutionHelper.scala:
##########
@@ -477,6 +482,57 @@ trait ColumnResolutionHelper extends Logging with DataTypeErrorsBase {
         assert(q.children.length == 1)
         q.children.head.output
       },
+
+      resolveOnDatasetId = (datasetid: Long, name: String) => {

Review Comment:
   I see. Let me modify the code to use #43115 and some change of mine to see if tryResolveDataFrameColumns can be used.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org