You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/10/08 19:55:41 UTC

[GitHub] [spark] amaliujia opened a new pull request, #38166: [SPARK-40713] Improve SET operation support in the proto and the server

amaliujia opened a new pull request, #38166:
URL: https://github.com/apache/spark/pull/38166

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   1. Rename `Union` to `SetOperation` and make it cover `except`, `intersect`, `union`.
   2. Improve server side support for `SetOperation`.
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     3. If you fix a bug, you can clarify why it is a bug.
   -->
   Improve SET operation support in the proto and the server.
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   No
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   If benchmark tests were added, please run the benchmarks in GitHub Actions for the consistent environment, and the instructions could accord to: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
   -->
   UT
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38166:
URL: https://github.com/apache/spark/pull/38166#discussion_r1009009746


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -127,15 +127,19 @@ message Join {
   }
 }
 
-// Relation of type [[Union]], at least one input must be set.
-message Union {
-  repeated Relation inputs = 1;
-  UnionType union_type = 2;
-
-  enum UnionType {
-    UNION_TYPE_UNSPECIFIED = 0;
-    UNION_TYPE_DISTINCT = 1;
-    UNION_TYPE_ALL = 2;
+// Relation of type [[SetOperation]]
+message SetOperation {
+  Relation left_input = 1;
+  Relation right_input = 2;
+  SetOpType set_op_type = 3;
+  bool is_all = 4;
+  bool union_by_name = 5;

Review Comment:
   just `by_name`, as it applies to all set operations in the future.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38166:
URL: https://github.com/apache/spark/pull/38166#discussion_r991782526


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -108,15 +108,17 @@ message Join {
   }
 }
 
-// Relation of type [[Union]], at least one input must be set.
-message Union {
+// Relation of type [[SetOperation]]
+message SetOperation {
   repeated Relation inputs = 1;

Review Comment:
   Done. Now the definition is easier to follow: just check if there is left or right. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38166:
URL: https://github.com/apache/spark/pull/38166#discussion_r991787838


##########
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala:
##########
@@ -81,6 +81,31 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
     }
   }
 
+  test("Test union, except, intersect") {
+    for (isAll <- Seq(true, false)) {
+      val connectPlan = {
+        import org.apache.spark.sql.connect.dsl.plans._
+        transform(connectTestRelation.except(connectTestRelation, isAll))
+      }
+      val sparkPlan = sparkTestRelation.except(sparkTestRelation, isAll)
+      comparePlans(connectPlan.analyze, sparkPlan.analyze, false)
+
+      val connectPlan2 = {
+        import org.apache.spark.sql.connect.dsl.plans._
+        transform(connectTestRelation.intersect(connectTestRelation, isAll))
+      }
+      val sparkPlan2 = sparkTestRelation.intersect(sparkTestRelation, isAll)
+      comparePlans(connectPlan2.analyze, sparkPlan2.analyze, false)
+    }
+
+    val connectPlan3 = {
+      import org.apache.spark.sql.connect.dsl.plans._
+      transform(connectTestRelation.union(connectTestRelation))

Review Comment:
   My understanding is no:
   
   there are 3 unions: `UnionAll`, `UnionByName` and `Union Distinct`. For the third case DataFrame API says to use `DISTINCT(UNION(....))`. For the `UnionAll` itself there is no flag for `isAll` so by default `Union` is `UnionAll`. `Union Distinct` is split to `DISTINCT` + `UNION` which are two nodes.
   
   There should be some context for this difference but I cannot tell why now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38166:
URL: https://github.com/apache/spark/pull/38166#discussion_r991606206


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -107,15 +107,17 @@ message Join {
   }
 }
 
-// Relation of type [[Union]], at least one input must be set.
-message Union {
+// Relation of type [[SetOperation]]
+message SetOperation {

Review Comment:
   Ok maybe we keep the same message as `SetOperation` for now. When we face the need for further optimization on specific case, we can propose ways to help that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38166:
URL: https://github.com/apache/spark/pull/38166#discussion_r992654735


##########
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala:
##########
@@ -81,6 +81,31 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
     }
   }
 
+  test("Test union, except, intersect") {
+    for (isAll <- Seq(true, false)) {
+      val connectPlan = {
+        import org.apache.spark.sql.connect.dsl.plans._
+        transform(connectTestRelation.except(connectTestRelation, isAll))
+      }
+      val sparkPlan = sparkTestRelation.except(sparkTestRelation, isAll)
+      comparePlans(connectPlan.analyze, sparkPlan.analyze, false)
+
+      val connectPlan2 = {
+        import org.apache.spark.sql.connect.dsl.plans._
+        transform(connectTestRelation.intersect(connectTestRelation, isAll))
+      }
+      val sparkPlan2 = sparkTestRelation.intersect(sparkTestRelation, isAll)
+      comparePlans(connectPlan2.analyze, sparkPlan2.analyze, false)
+    }
+
+    val connectPlan3 = {
+      import org.apache.spark.sql.connect.dsl.plans._
+      transform(connectTestRelation.union(connectTestRelation))

Review Comment:
   I am fine on the proposal as long as it does not prevent the clients being compatible with existing data frame API. Any extra functionality is fine.
   
   Thanks @cloud-fan for you summary. Let me change this PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38166:
URL: https://github.com/apache/spark/pull/38166#discussion_r1009012506


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -127,15 +127,19 @@ message Join {
   }
 }
 
-// Relation of type [[Union]], at least one input must be set.
-message Union {
-  repeated Relation inputs = 1;
-  UnionType union_type = 2;
-
-  enum UnionType {
-    UNION_TYPE_UNSPECIFIED = 0;
-    UNION_TYPE_DISTINCT = 1;
-    UNION_TYPE_ALL = 2;
+// Relation of type [[SetOperation]]
+message SetOperation {
+  Relation left_input = 1;
+  Relation right_input = 2;
+  SetOpType set_op_type = 3;
+  bool is_all = 4;
+  bool union_by_name = 5;

Review Comment:
   makes sense. Done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan closed pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
cloud-fan closed pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server
URL: https://github.com/apache/spark/pull/38166


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38166:
URL: https://github.com/apache/spark/pull/38166#discussion_r991769760


##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala:
##########
@@ -54,9 +55,9 @@ package object dsl {
       }
 
       def join(
-          otherPlan: proto.Relation,
-          joinType: JoinType = JoinType.JOIN_TYPE_INNER,
-          condition: Option[proto.Expression] = None): proto.Relation = {
+        otherPlan: proto.Relation,

Review Comment:
   nit: 4 spaces indentation for parameter list.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38166:
URL: https://github.com/apache/spark/pull/38166#discussion_r991874904


##########
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala:
##########
@@ -81,6 +81,31 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
     }
   }
 
+  test("Test union, except, intersect") {
+    for (isAll <- Seq(true, false)) {
+      val connectPlan = {
+        import org.apache.spark.sql.connect.dsl.plans._
+        transform(connectTestRelation.except(connectTestRelation, isAll))
+      }
+      val sparkPlan = sparkTestRelation.except(sparkTestRelation, isAll)
+      comparePlans(connectPlan.analyze, sparkPlan.analyze, false)
+
+      val connectPlan2 = {
+        import org.apache.spark.sql.connect.dsl.plans._
+        transform(connectTestRelation.intersect(connectTestRelation, isAll))
+      }
+      val sparkPlan2 = sparkTestRelation.intersect(sparkTestRelation, isAll)
+      comparePlans(connectPlan2.analyze, sparkPlan2.analyze, false)
+    }
+
+    val connectPlan3 = {
+      import org.apache.spark.sql.connect.dsl.plans._
+      transform(connectTestRelation.union(connectTestRelation))

Review Comment:
   Spark Connect may have many clients, built-in or third-party. I think it's better to make the query plan proto definition general/standard.
   
   It was kind of a mistake that we forgot to add DataFrame APIs for users to specify `ALL | DISTINCT` for set operations. We added the `isAll` flag to the plan later, because the SQL parser started to support it.
   
   `byName` flag is only supported by Union now, but I think other two set operators should support it as well.
   
   To summarize by proposal:
   1. add a `byName` flag to `SetOperation` proto definition
   2. The server translate `Union(..., isAll = false)` to `Distinct(Union)`
   3. The server fails if `byName = true` and the set type is intercept or except.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38166:
URL: https://github.com/apache/spark/pull/38166#discussion_r990998944


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -107,15 +107,17 @@ message Join {
   }
 }
 
-// Relation of type [[Union]], at least one input must be set.
-message Union {
+// Relation of type [[SetOperation]]
+message SetOperation {

Review Comment:
   Because catalyst has 3 plans for set operations, I think it's better to follow it in proto so that they can evolve separately.
   
   To @grundprinzip : we do support `UNION ALL`, but the logical plan for `UNION` is `Distinct(Union(...))`. The `Union` operator does not have a `isAll` flag.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] grundprinzip commented on a diff in pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #38166:
URL: https://github.com/apache/spark/pull/38166#discussion_r990759375


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -107,15 +107,17 @@ message Join {
   }
 }
 
-// Relation of type [[Union]], at least one input must be set.

Review Comment:
   TBH i think the comment was not correct in the first place. I mean while a set can have one element, the question is does it make sense to have a set plan with only one?



##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -35,7 +35,7 @@ message Relation {
     Project project = 3;
     Filter filter = 4;
     Join join = 5;
-    Union union = 6;

Review Comment:
   Why is Set better than Union? Just from a SQL naming persepective, Union might be what users are used to?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38166:
URL: https://github.com/apache/spark/pull/38166#discussion_r991787838


##########
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala:
##########
@@ -81,6 +81,31 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
     }
   }
 
+  test("Test union, except, intersect") {
+    for (isAll <- Seq(true, false)) {
+      val connectPlan = {
+        import org.apache.spark.sql.connect.dsl.plans._
+        transform(connectTestRelation.except(connectTestRelation, isAll))
+      }
+      val sparkPlan = sparkTestRelation.except(sparkTestRelation, isAll)
+      comparePlans(connectPlan.analyze, sparkPlan.analyze, false)
+
+      val connectPlan2 = {
+        import org.apache.spark.sql.connect.dsl.plans._
+        transform(connectTestRelation.intersect(connectTestRelation, isAll))
+      }
+      val sparkPlan2 = sparkTestRelation.intersect(sparkTestRelation, isAll)
+      comparePlans(connectPlan2.analyze, sparkPlan2.analyze, false)
+    }
+
+    val connectPlan3 = {
+      import org.apache.spark.sql.connect.dsl.plans._
+      transform(connectTestRelation.union(connectTestRelation))

Review Comment:
   My understanding is no:
   
   there are 3 unions: `UnionAll`, `UnionByName` and `Union Distinct`. For the third case DataFrame API says to use `DISTINCT(UNION(....))`. For the `UnionAll` itself there is no flag for `isAll` so by default `Union` is `UnionAll`
   
   There should be some context for this difference but I cannot tell why now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38166:
URL: https://github.com/apache/spark/pull/38166#discussion_r991772175


##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala:
##########
@@ -54,9 +55,9 @@ package object dsl {
       }
 
       def join(
-          otherPlan: proto.Relation,
-          joinType: JoinType = JoinType.JOIN_TYPE_INNER,
-          condition: Option[proto.Expression] = None): proto.Relation = {
+        otherPlan: proto.Relation,

Review Comment:
   thanks.
   
   My IDE always breaks it.... let me figure out a way to configure IDE to avoid this happening again.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38166:
URL: https://github.com/apache/spark/pull/38166#discussion_r991764630


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -107,15 +107,17 @@ message Join {
   }
 }
 
-// Relation of type [[Union]], at least one input must be set.

Review Comment:
   I am thinking it should be always two inputs so the implementation should be correct. Though I don't know enough about dataframe API usage so not sure if there could be only one, or more than two, etc.
   
   Given proto is also a API layer, before 3.4 release I think we need to add a lot more comments on proto to clarify each one's semantic. We can finalize how many inputs are possible at that time.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38166:
URL: https://github.com/apache/spark/pull/38166#discussion_r991828578


##########
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala:
##########
@@ -81,6 +81,31 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
     }
   }
 
+  test("Test union, except, intersect") {
+    for (isAll <- Seq(true, false)) {
+      val connectPlan = {
+        import org.apache.spark.sql.connect.dsl.plans._
+        transform(connectTestRelation.except(connectTestRelation, isAll))
+      }
+      val sparkPlan = sparkTestRelation.except(sparkTestRelation, isAll)
+      comparePlans(connectPlan.analyze, sparkPlan.analyze, false)
+
+      val connectPlan2 = {
+        import org.apache.spark.sql.connect.dsl.plans._
+        transform(connectTestRelation.intersect(connectTestRelation, isAll))
+      }
+      val sparkPlan2 = sparkTestRelation.intersect(sparkTestRelation, isAll)
+      comparePlans(connectPlan2.analyze, sparkPlan2.analyze, false)
+    }
+
+    val connectPlan3 = {
+      import org.apache.spark.sql.connect.dsl.plans._
+      transform(connectTestRelation.union(connectTestRelation))

Review Comment:
   We are doing Union with isAll because it is in the same message with intersect and except and they share the same fields.
   
   However I don't agree with it though. Generally speaking if there is something DataFrame decides to do, I think figuring out the context will be useful. People working on Spark are smart. They can of course just does union with isAll but they decided to not do so. There must be a valid reason and that reason might also apply to connect.
   
   So we can choose not to follow what Spark DataFrame does, but I guess we should at leas understand the context and then understand our decision.



##########
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala:
##########
@@ -81,6 +81,31 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
     }
   }
 
+  test("Test union, except, intersect") {
+    for (isAll <- Seq(true, false)) {
+      val connectPlan = {
+        import org.apache.spark.sql.connect.dsl.plans._
+        transform(connectTestRelation.except(connectTestRelation, isAll))
+      }
+      val sparkPlan = sparkTestRelation.except(sparkTestRelation, isAll)
+      comparePlans(connectPlan.analyze, sparkPlan.analyze, false)
+
+      val connectPlan2 = {
+        import org.apache.spark.sql.connect.dsl.plans._
+        transform(connectTestRelation.intersect(connectTestRelation, isAll))
+      }
+      val sparkPlan2 = sparkTestRelation.intersect(sparkTestRelation, isAll)
+      comparePlans(connectPlan2.analyze, sparkPlan2.analyze, false)
+    }
+
+    val connectPlan3 = {
+      import org.apache.spark.sql.connect.dsl.plans._
+      transform(connectTestRelation.union(connectTestRelation))

Review Comment:
   We are doing Union with isAll because it is in the same message with intersect and except and they share the same fields.
   
   However I don't agree with it though. Generally speaking if there is something DataFrame decides to do, I think figuring out the context will be useful. People working on Spark are smart. They can of course just does union with isAll but they decided to not do so. There must be a valid reason and that reason might also apply to connect.
   
   So we can choose not to follow what Spark DataFrame does, but I guess we should at least understand the context and then understand our decision.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38166:
URL: https://github.com/apache/spark/pull/38166#discussion_r1019060189


##########
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala:
##########
@@ -81,6 +81,31 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
     }
   }
 
+  test("Test union, except, intersect") {
+    for (isAll <- Seq(true, false)) {
+      val connectPlan = {
+        import org.apache.spark.sql.connect.dsl.plans._
+        transform(connectTestRelation.except(connectTestRelation, isAll))
+      }
+      val sparkPlan = sparkTestRelation.except(sparkTestRelation, isAll)
+      comparePlans(connectPlan.analyze, sparkPlan.analyze, false)
+
+      val connectPlan2 = {
+        import org.apache.spark.sql.connect.dsl.plans._
+        transform(connectTestRelation.intersect(connectTestRelation, isAll))
+      }
+      val sparkPlan2 = sparkTestRelation.intersect(sparkTestRelation, isAll)
+      comparePlans(connectPlan2.analyze, sparkPlan2.analyze, false)
+    }
+
+    val connectPlan3 = {
+      import org.apache.spark.sql.connect.dsl.plans._
+      transform(connectTestRelation.union(connectTestRelation))

Review Comment:
   The issue is, we can't change proto message freely. It's API, while catalyst plans are internal implementations. That's why we need more considerations when adding proto messages, to make them future-proof.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38166:
URL: https://github.com/apache/spark/pull/38166#discussion_r992654735


##########
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala:
##########
@@ -81,6 +81,31 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
     }
   }
 
+  test("Test union, except, intersect") {
+    for (isAll <- Seq(true, false)) {
+      val connectPlan = {
+        import org.apache.spark.sql.connect.dsl.plans._
+        transform(connectTestRelation.except(connectTestRelation, isAll))
+      }
+      val sparkPlan = sparkTestRelation.except(sparkTestRelation, isAll)
+      comparePlans(connectPlan.analyze, sparkPlan.analyze, false)
+
+      val connectPlan2 = {
+        import org.apache.spark.sql.connect.dsl.plans._
+        transform(connectTestRelation.intersect(connectTestRelation, isAll))
+      }
+      val sparkPlan2 = sparkTestRelation.intersect(sparkTestRelation, isAll)
+      comparePlans(connectPlan2.analyze, sparkPlan2.analyze, false)
+    }
+
+    val connectPlan3 = {
+      import org.apache.spark.sql.connect.dsl.plans._
+      transform(connectTestRelation.union(connectTestRelation))

Review Comment:
   I am fine on the proposal as long as it does not prevent the clients being compatible with existing data frame API. Any extra functionality is fine.
   
   For example we still support `df.union().distinct()` meaning `DISTINCT(UNION)` to be, at least one of the options for `UNION(isAll=false)` then I am good.
   
   Thanks @cloud-fan for you summary. Let me change this PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38166:
URL: https://github.com/apache/spark/pull/38166#discussion_r990933552


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -35,7 +35,7 @@ message Relation {
     Project project = 3;
     Filter filter = 4;
     Join join = 5;
-    Union union = 6;

Review Comment:
   I am pretty sure`Set operation` is the SQL word to describe `union` `intersect` `except`. But Wenchen suggests to have individual plans for each, in that case this won't matter any more.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38166:
URL: https://github.com/apache/spark/pull/38166#discussion_r991787838


##########
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala:
##########
@@ -81,6 +81,31 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
     }
   }
 
+  test("Test union, except, intersect") {
+    for (isAll <- Seq(true, false)) {
+      val connectPlan = {
+        import org.apache.spark.sql.connect.dsl.plans._
+        transform(connectTestRelation.except(connectTestRelation, isAll))
+      }
+      val sparkPlan = sparkTestRelation.except(sparkTestRelation, isAll)
+      comparePlans(connectPlan.analyze, sparkPlan.analyze, false)
+
+      val connectPlan2 = {
+        import org.apache.spark.sql.connect.dsl.plans._
+        transform(connectTestRelation.intersect(connectTestRelation, isAll))
+      }
+      val sparkPlan2 = sparkTestRelation.intersect(sparkTestRelation, isAll)
+      comparePlans(connectPlan2.analyze, sparkPlan2.analyze, false)
+    }
+
+    val connectPlan3 = {
+      import org.apache.spark.sql.connect.dsl.plans._
+      transform(connectTestRelation.union(connectTestRelation))

Review Comment:
   My understanding is no:
   
   there are 3 unions: `unionAll`, `unionByName` and `union distinct`. For the third case DataFrame API says to use `DISTINCT(UNION(....))`. For the `UNIONAll` itself there is no flag for `isAll` so by default `Union` is `UnionAll`
   
   There should be some context for this difference but I cannot tell why now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38166:
URL: https://github.com/apache/spark/pull/38166#discussion_r991855071


##########
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala:
##########
@@ -81,6 +81,31 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
     }
   }
 
+  test("Test union, except, intersect") {
+    for (isAll <- Seq(true, false)) {
+      val connectPlan = {
+        import org.apache.spark.sql.connect.dsl.plans._
+        transform(connectTestRelation.except(connectTestRelation, isAll))
+      }
+      val sparkPlan = sparkTestRelation.except(sparkTestRelation, isAll)
+      comparePlans(connectPlan.analyze, sparkPlan.analyze, false)
+
+      val connectPlan2 = {
+        import org.apache.spark.sql.connect.dsl.plans._
+        transform(connectTestRelation.intersect(connectTestRelation, isAll))
+      }
+      val sparkPlan2 = sparkTestRelation.intersect(sparkTestRelation, isAll)
+      comparePlans(connectPlan2.analyze, sparkPlan2.analyze, false)
+    }
+
+    val connectPlan3 = {
+      import org.apache.spark.sql.connect.dsl.plans._
+      transform(connectTestRelation.union(connectTestRelation))

Review Comment:
   Another reason actually to follow Catalyst: our clients, by best effort, should follow existing data frame API to achieve "changing imports and rest of code work". 
   
   So it's expected that users will use union().distinct() to construct the case above. If we ask clients to smartly identify these two API can be coalesce together, that will be too much for clients. Instead clients just convert each API to a plan node and connect proto follow Catalyst then it will work.   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38166:
URL: https://github.com/apache/spark/pull/38166#discussion_r990933157


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -107,15 +107,17 @@ message Join {
   }
 }
 
-// Relation of type [[Union]], at least one input must be set.
-message Union {
+// Relation of type [[SetOperation]]
+message SetOperation {

Review Comment:
   Yeah IIRC `intersect` and `except` supports `all` or `not all` while `union` does not care. In this case separate plans might be better to different their fields. cc @grundprinzip  



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38166:
URL: https://github.com/apache/spark/pull/38166#discussion_r990931201


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -107,15 +107,17 @@ message Join {
   }
 }
 
-// Relation of type [[Union]], at least one input must be set.
-message Union {
+// Relation of type [[SetOperation]]
+message SetOperation {

Review Comment:
   can we have 3 individual plans?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38166:
URL: https://github.com/apache/spark/pull/38166#discussion_r990933552


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -35,7 +35,7 @@ message Relation {
     Project project = 3;
     Filter filter = 4;
     Join join = 5;
-    Union union = 6;

Review Comment:
   I am pretty sure`Set operation` is the SQL word to describe `union` `intersect` `except`. But Wenchen suggests to have individual plans for each, in that it won't matter.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38166: [SPARK-40713] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38166:
URL: https://github.com/apache/spark/pull/38166#discussion_r990689327


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -107,15 +107,17 @@ message Join {
   }
 }
 
-// Relation of type [[Union]], at least one input must be set.

Review Comment:
   This documentation (`at least one input must be set`) does not match the implementation `assert(u.getInputsCount == 2, "Union must have 2 inputs")`. To avoid confusion I choose to remove this for now.
   
   Once our implementation is stable, we can go back to re-document the semantic.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] grundprinzip commented on a diff in pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #38166:
URL: https://github.com/apache/spark/pull/38166#discussion_r990983152


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -107,15 +107,17 @@ message Join {
   }
 }
 
-// Relation of type [[Union]], at least one input must be set.
-message Union {
+// Relation of type [[SetOperation]]
+message SetOperation {

Review Comment:
   Of course union supports `ALL`. Not sure why we would need three plans for this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38166:
URL: https://github.com/apache/spark/pull/38166#discussion_r991783272


##########
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala:
##########
@@ -81,6 +81,31 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
     }
   }
 
+  test("Test union, except, intersect") {
+    for (isAll <- Seq(true, false)) {
+      val connectPlan = {
+        import org.apache.spark.sql.connect.dsl.plans._
+        transform(connectTestRelation.except(connectTestRelation, isAll))
+      }
+      val sparkPlan = sparkTestRelation.except(sparkTestRelation, isAll)
+      comparePlans(connectPlan.analyze, sparkPlan.analyze, false)
+
+      val connectPlan2 = {
+        import org.apache.spark.sql.connect.dsl.plans._
+        transform(connectTestRelation.intersect(connectTestRelation, isAll))
+      }
+      val sparkPlan2 = sparkTestRelation.intersect(sparkTestRelation, isAll)
+      comparePlans(connectPlan2.analyze, sparkPlan2.analyze, false)
+    }
+
+    val connectPlan3 = {
+      import org.apache.spark.sql.connect.dsl.plans._
+      transform(connectTestRelation.union(connectTestRelation))

Review Comment:
   last question: does the current spark connect client API support specifying `isAll` for Union?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #38166:
URL: https://github.com/apache/spark/pull/38166#issuecomment-1284097477

   please fix conflicts


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #38166:
URL: https://github.com/apache/spark/pull/38166#discussion_r1018835131


##########
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala:
##########
@@ -81,6 +81,31 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
     }
   }
 
+  test("Test union, except, intersect") {
+    for (isAll <- Seq(true, false)) {
+      val connectPlan = {
+        import org.apache.spark.sql.connect.dsl.plans._
+        transform(connectTestRelation.except(connectTestRelation, isAll))
+      }
+      val sparkPlan = sparkTestRelation.except(sparkTestRelation, isAll)
+      comparePlans(connectPlan.analyze, sparkPlan.analyze, false)
+
+      val connectPlan2 = {
+        import org.apache.spark.sql.connect.dsl.plans._
+        transform(connectTestRelation.intersect(connectTestRelation, isAll))
+      }
+      val sparkPlan2 = sparkTestRelation.intersect(sparkTestRelation, isAll)
+      comparePlans(connectPlan2.analyze, sparkPlan2.analyze, false)
+    }
+
+    val connectPlan3 = {
+      import org.apache.spark.sql.connect.dsl.plans._
+      transform(connectTestRelation.union(connectTestRelation))

Review Comment:
   Hm, sorry for joining this conversation late. Why don't we just match it to SQL plan implementation? If the design is the issue, we can refactor them together?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38166:
URL: https://github.com/apache/spark/pull/38166#discussion_r991828578


##########
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala:
##########
@@ -81,6 +81,31 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
     }
   }
 
+  test("Test union, except, intersect") {
+    for (isAll <- Seq(true, false)) {
+      val connectPlan = {
+        import org.apache.spark.sql.connect.dsl.plans._
+        transform(connectTestRelation.except(connectTestRelation, isAll))
+      }
+      val sparkPlan = sparkTestRelation.except(sparkTestRelation, isAll)
+      comparePlans(connectPlan.analyze, sparkPlan.analyze, false)
+
+      val connectPlan2 = {
+        import org.apache.spark.sql.connect.dsl.plans._
+        transform(connectTestRelation.intersect(connectTestRelation, isAll))
+      }
+      val sparkPlan2 = sparkTestRelation.intersect(sparkTestRelation, isAll)
+      comparePlans(connectPlan2.analyze, sparkPlan2.analyze, false)
+    }
+
+    val connectPlan3 = {
+      import org.apache.spark.sql.connect.dsl.plans._
+      transform(connectTestRelation.union(connectTestRelation))

Review Comment:
   We are doing Union with isAll because along with intersect and except in the same message.
   
   I don't agree with it though. Generally speaking if there is something DataFrame decides to do, I think figuring out the context will be useful. People working on Spark are smart. They can of course just does union with isAll but they decided to not do so. There must be a valid reason and that reason might also apply to connect.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] grundprinzip commented on a diff in pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #38166:
URL: https://github.com/apache/spark/pull/38166#discussion_r991934444


##########
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala:
##########
@@ -81,6 +81,31 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
     }
   }
 
+  test("Test union, except, intersect") {
+    for (isAll <- Seq(true, false)) {
+      val connectPlan = {
+        import org.apache.spark.sql.connect.dsl.plans._
+        transform(connectTestRelation.except(connectTestRelation, isAll))
+      }
+      val sparkPlan = sparkTestRelation.except(sparkTestRelation, isAll)
+      comparePlans(connectPlan.analyze, sparkPlan.analyze, false)
+
+      val connectPlan2 = {
+        import org.apache.spark.sql.connect.dsl.plans._
+        transform(connectTestRelation.intersect(connectTestRelation, isAll))
+      }
+      val sparkPlan2 = sparkTestRelation.intersect(sparkTestRelation, isAll)
+      comparePlans(connectPlan2.analyze, sparkPlan2.analyze, false)
+    }
+
+    val connectPlan3 = {
+      import org.apache.spark.sql.connect.dsl.plans._
+      transform(connectTestRelation.union(connectTestRelation))

Review Comment:
   Generally, the proto definitions should follow as much as possible the basics for the relational algebra / SQL language. As @cloud-fan mentioned, since there maybe other clients. It's important to not just follow the quirks of the DF API because its the way it's currently implemented.
   
   Having a proper `SetOperation` definition is not stopping us from continuing to have customers call `df.union().distinct()`. 
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #38166:
URL: https://github.com/apache/spark/pull/38166#issuecomment-1272698464

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38166:
URL: https://github.com/apache/spark/pull/38166#discussion_r991741052


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -107,15 +107,17 @@ message Join {
   }
 }
 
-// Relation of type [[Union]], at least one input must be set.
-message Union {
+// Relation of type [[SetOperation]]
+message SetOperation {

Review Comment:
   nvm, Spark Connect DataFrame operations combine parsed plan, not analyzed plan, so we don't need this early Union optimization. A single `SetOperation` should work.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38166:
URL: https://github.com/apache/spark/pull/38166#discussion_r991758545


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -107,15 +107,17 @@ message Join {
   }
 }
 
-// Relation of type [[Union]], at least one input must be set.
-message Union {
+// Relation of type [[SetOperation]]
+message SetOperation {

Review Comment:
   That would be great then!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38166:
URL: https://github.com/apache/spark/pull/38166#discussion_r991769532


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -108,15 +108,17 @@ message Join {
   }
 }
 
-// Relation of type [[Union]], at least one input must be set.
-message Union {
+// Relation of type [[SetOperation]]
+message SetOperation {
   repeated Relation inputs = 1;

Review Comment:
   shall we require 2 inputs `left` and `right`? Our SQL parser only generates set operation with 2 children.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38166:
URL: https://github.com/apache/spark/pull/38166#discussion_r991779886


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -108,15 +108,17 @@ message Join {
   }
 }
 
-// Relation of type [[Union]], at least one input must be set.
-message Union {
+// Relation of type [[SetOperation]]
+message SetOperation {
   repeated Relation inputs = 1;

Review Comment:
   I see
   ```
   queryTerm
       : queryPrimary                                                                       #queryTermDefault
       | left=queryTerm {legacy_setops_precedence_enabled}?
           operator=(INTERSECT | UNION | EXCEPT | SETMINUS) setQuantifier? right=queryTerm  #setOperation
       | left=queryTerm {!legacy_setops_precedence_enabled}?
           operator=INTERSECT setQuantifier? right=queryTerm                                #setOperation
       | left=queryTerm {!legacy_setops_precedence_enabled}?
           operator=(UNION | EXCEPT | SETMINUS) setQuantifier? right=queryTerm              #setOperation
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] grundprinzip commented on a diff in pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #38166:
URL: https://github.com/apache/spark/pull/38166#discussion_r991814980


##########
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala:
##########
@@ -81,6 +81,31 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
     }
   }
 
+  test("Test union, except, intersect") {
+    for (isAll <- Seq(true, false)) {
+      val connectPlan = {
+        import org.apache.spark.sql.connect.dsl.plans._
+        transform(connectTestRelation.except(connectTestRelation, isAll))
+      }
+      val sparkPlan = sparkTestRelation.except(sparkTestRelation, isAll)
+      comparePlans(connectPlan.analyze, sparkPlan.analyze, false)
+
+      val connectPlan2 = {
+        import org.apache.spark.sql.connect.dsl.plans._
+        transform(connectTestRelation.intersect(connectTestRelation, isAll))
+      }
+      val sparkPlan2 = sparkTestRelation.intersect(sparkTestRelation, isAll)
+      comparePlans(connectPlan2.analyze, sparkPlan2.analyze, false)
+    }
+
+    val connectPlan3 = {
+      import org.apache.spark.sql.connect.dsl.plans._
+      transform(connectTestRelation.union(connectTestRelation))

Review Comment:
   This is one of the cases where we shouldn't just blindly take what the DataFrame API does but we should think in terms of generalized parse plans. 
   
   
   In this case I much rather have the option on the union for a All rather than doing something implicit. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38166:
URL: https://github.com/apache/spark/pull/38166#discussion_r991855071


##########
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala:
##########
@@ -81,6 +81,31 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
     }
   }
 
+  test("Test union, except, intersect") {
+    for (isAll <- Seq(true, false)) {
+      val connectPlan = {
+        import org.apache.spark.sql.connect.dsl.plans._
+        transform(connectTestRelation.except(connectTestRelation, isAll))
+      }
+      val sparkPlan = sparkTestRelation.except(sparkTestRelation, isAll)
+      comparePlans(connectPlan.analyze, sparkPlan.analyze, false)
+
+      val connectPlan2 = {
+        import org.apache.spark.sql.connect.dsl.plans._
+        transform(connectTestRelation.intersect(connectTestRelation, isAll))
+      }
+      val sparkPlan2 = sparkTestRelation.intersect(sparkTestRelation, isAll)
+      comparePlans(connectPlan2.analyze, sparkPlan2.analyze, false)
+    }
+
+    val connectPlan3 = {
+      import org.apache.spark.sql.connect.dsl.plans._
+      transform(connectTestRelation.union(connectTestRelation))

Review Comment:
   Another reason actually to follow Catalyst: our clients, by best effort, should follow existing data frame API to achieve changing imports and rest of code work. 
   
   So it's expected that users will use union().distinct() to construct the case above. If we ask clients to smartly identify these two API can be coalesce together, that will be too much for clients. Instead clients just convert each API to a plan node and connect proto follow Catalyst then it will work.   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38166:
URL: https://github.com/apache/spark/pull/38166#discussion_r991000364


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -107,15 +107,17 @@ message Join {
   }
 }
 
-// Relation of type [[Union]], at least one input must be set.
-message Union {
+// Relation of type [[SetOperation]]
+message SetOperation {

Review Comment:
   After a second thought, if we treat the proto layer as a standard for general SQL query plans, it doesn't have to be consistent with catalyst. Then a single `SetOperation` LGTM.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38166:
URL: https://github.com/apache/spark/pull/38166#discussion_r991003957


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -107,15 +107,17 @@ message Join {
   }
 }
 
-// Relation of type [[Union]], at least one input must be set.
-message Union {
+// Relation of type [[SetOperation]]
+message SetOperation {

Review Comment:
   Note: Spark has a tricky optimization for a long chain of `Union`. e.g. `df1.union(df2).union(df3).union...`, Spark does not create a left-deep `Union` tree (which makes analyzer/optimizer run very slow), but just a single `Union` with many children. I'm not sure if we want to do this optimization in Spark Connect clients or not. But if we do, we need a special `Union` proto definition which takes a Seq of children, not just left and right.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38166:
URL: https://github.com/apache/spark/pull/38166#discussion_r991769641


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -108,15 +108,17 @@ message Join {
   }
 }
 
-// Relation of type [[Union]], at least one input must be set.
-message Union {
+// Relation of type [[SetOperation]]
+message SetOperation {
   repeated Relation inputs = 1;
-  UnionType union_type = 2;
-
-  enum UnionType {
-    UNION_TYPE_UNSPECIFIED = 0;
-    UNION_TYPE_DISTINCT = 1;
-    UNION_TYPE_ALL = 2;
+  SetOpType set_op_type = 2;
+  bool is_all = 3;
+
+  enum SetOpType {
+    SET_OP_TYPE_UNSPECIFIED = 0;
+    SET_OP_TYPE_INTERSECT = 1;
+    SET_OP_TYPE_ALL = 2;

Review Comment:
   ```suggestion
       SET_OP_TYPE_UNION = 2;
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38166:
URL: https://github.com/apache/spark/pull/38166#discussion_r991855071


##########
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala:
##########
@@ -81,6 +81,31 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
     }
   }
 
+  test("Test union, except, intersect") {
+    for (isAll <- Seq(true, false)) {
+      val connectPlan = {
+        import org.apache.spark.sql.connect.dsl.plans._
+        transform(connectTestRelation.except(connectTestRelation, isAll))
+      }
+      val sparkPlan = sparkTestRelation.except(sparkTestRelation, isAll)
+      comparePlans(connectPlan.analyze, sparkPlan.analyze, false)
+
+      val connectPlan2 = {
+        import org.apache.spark.sql.connect.dsl.plans._
+        transform(connectTestRelation.intersect(connectTestRelation, isAll))
+      }
+      val sparkPlan2 = sparkTestRelation.intersect(sparkTestRelation, isAll)
+      comparePlans(connectPlan2.analyze, sparkPlan2.analyze, false)
+    }
+
+    val connectPlan3 = {
+      import org.apache.spark.sql.connect.dsl.plans._
+      transform(connectTestRelation.union(connectTestRelation))

Review Comment:
   Another reason actually to follow Catalyst: our clients, by best effort, should follow existing data frame API to achieve "changing imports and rest of code will work". 
   
   So it's expected that users will use `union().distinct()` to construct `Union(isAll=false)`. If we ask clients to smartly identify these two API can be coalesced together, that will be too much for clients. Instead clients just convert each API to a plan node and connect proto follow Catalyst as `Distinct(Union)` then it will work.  



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38166:
URL: https://github.com/apache/spark/pull/38166#discussion_r991855071


##########
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala:
##########
@@ -81,6 +81,31 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
     }
   }
 
+  test("Test union, except, intersect") {
+    for (isAll <- Seq(true, false)) {
+      val connectPlan = {
+        import org.apache.spark.sql.connect.dsl.plans._
+        transform(connectTestRelation.except(connectTestRelation, isAll))
+      }
+      val sparkPlan = sparkTestRelation.except(sparkTestRelation, isAll)
+      comparePlans(connectPlan.analyze, sparkPlan.analyze, false)
+
+      val connectPlan2 = {
+        import org.apache.spark.sql.connect.dsl.plans._
+        transform(connectTestRelation.intersect(connectTestRelation, isAll))
+      }
+      val sparkPlan2 = sparkTestRelation.intersect(sparkTestRelation, isAll)
+      comparePlans(connectPlan2.analyze, sparkPlan2.analyze, false)
+    }
+
+    val connectPlan3 = {
+      import org.apache.spark.sql.connect.dsl.plans._
+      transform(connectTestRelation.union(connectTestRelation))

Review Comment:
   Another reason actually to follow Catalyst: our clients, by best effort, should follow existing data frame API to achieve "changing imports and rest of code work". 
   
   So it's expected that users will use union().distinct() to construct the case above. If we ask clients to smartly identify these two API can be coalesced together, that will be too much for clients. Instead clients just convert each API to a plan node and connect proto follow Catalyst then it will work.   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38166:
URL: https://github.com/apache/spark/pull/38166#discussion_r991828578


##########
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala:
##########
@@ -81,6 +81,31 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
     }
   }
 
+  test("Test union, except, intersect") {
+    for (isAll <- Seq(true, false)) {
+      val connectPlan = {
+        import org.apache.spark.sql.connect.dsl.plans._
+        transform(connectTestRelation.except(connectTestRelation, isAll))
+      }
+      val sparkPlan = sparkTestRelation.except(sparkTestRelation, isAll)
+      comparePlans(connectPlan.analyze, sparkPlan.analyze, false)
+
+      val connectPlan2 = {
+        import org.apache.spark.sql.connect.dsl.plans._
+        transform(connectTestRelation.intersect(connectTestRelation, isAll))
+      }
+      val sparkPlan2 = sparkTestRelation.intersect(sparkTestRelation, isAll)
+      comparePlans(connectPlan2.analyze, sparkPlan2.analyze, false)
+    }
+
+    val connectPlan3 = {
+      import org.apache.spark.sql.connect.dsl.plans._
+      transform(connectTestRelation.union(connectTestRelation))

Review Comment:
   We are doing Union with isAll because it is in the same message with intersect and except and they share the same fields.
   
   However I don't agree with it though. Generally speaking if there is something DataFrame decides to do, I think figuring out the context will be useful. People working on Spark are smart. They can of course just does union with isAll but they decided to not do so. There must be a valid reason and that reason might also apply to connect.
   
   So we can choose not to follow what Spark DataFrame does, but I guess we should at least should understand the context and understand then our decision.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #38166:
URL: https://github.com/apache/spark/pull/38166#issuecomment-1296595438

   thanks, merging to master!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38166: [SPARK-40713][CONNECT] Improve SET operation support in the proto and the server

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #38166:
URL: https://github.com/apache/spark/pull/38166#discussion_r1018835131


##########
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala:
##########
@@ -81,6 +81,31 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
     }
   }
 
+  test("Test union, except, intersect") {
+    for (isAll <- Seq(true, false)) {
+      val connectPlan = {
+        import org.apache.spark.sql.connect.dsl.plans._
+        transform(connectTestRelation.except(connectTestRelation, isAll))
+      }
+      val sparkPlan = sparkTestRelation.except(sparkTestRelation, isAll)
+      comparePlans(connectPlan.analyze, sparkPlan.analyze, false)
+
+      val connectPlan2 = {
+        import org.apache.spark.sql.connect.dsl.plans._
+        transform(connectTestRelation.intersect(connectTestRelation, isAll))
+      }
+      val sparkPlan2 = sparkTestRelation.intersect(sparkTestRelation, isAll)
+      comparePlans(connectPlan2.analyze, sparkPlan2.analyze, false)
+    }
+
+    val connectPlan3 = {
+      import org.apache.spark.sql.connect.dsl.plans._
+      transform(connectTestRelation.union(connectTestRelation))

Review Comment:
   Hm, sorry for joining this conversation late. Why don't we just match it to SQL plan implementation? If the design is the issue, we can refactor them together? i don't mind either having `SetOperation` or separate `Union`, etc.` but would be great to have them consistent for readability.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org