You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/10/22 08:15:53 UTC

[GitHub] [spark] amaliujia opened a new pull request, #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

amaliujia opened a new pull request, #38347:
URL: https://github.com/apache/spark/pull/38347

   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
     8. If you want to add or modify an error type or message, please read the guideline first in
        'core/src/main/resources/error/README.md'.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   1. Support `Range` in Connect proto.
   2. Refactor `SparkConnectDeduplicateSuite`  to `SparkConnectSessionBasedSuite`
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     4. If you fix a bug, you can clarify why it is a bug.
   -->
   Improve API coverage.
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   No
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   If benchmark tests were added, please run the benchmarks in GitHub Actions for the consistent environment, and the instructions could accord to: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
   -->
   UT


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38347:
URL: https://github.com/apache/spark/pull/38347#discussion_r1002850714


##########
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectSessionBasedSuite.scala:
##########
@@ -32,7 +35,9 @@ trait SparkConnectPlanTestWithSparkSession extends SharedSparkSession with Spark
   override def getSession(): SparkSession = spark
 }
 
-class SparkConnectDeduplicateSuite extends SparkConnectPlanTestWithSparkSession {
+class SparkConnectSessionBasedSuite extends SparkConnectPlanTestWithSparkSession {

Review Comment:
   I am working on a refactoring PR but that is blocked. Let's have a discussion on how to unblock it.
   
   With the refactoring I expect we only have one suite that test Connect proto with DataFrame API.
   
   Before that is done, right now we have two suites, one is with DataFrame API and another one is with Catalyst.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38347:
URL: https://github.com/apache/spark/pull/38347#discussion_r1009085869


##########
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala:
##########
@@ -173,6 +175,15 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
     comparePlans(connectPlan2, sparkPlan2)
   }
 
+  test("Test Range") {
+    comparePlansDatasetLong(connect.range(None, 10, None, None), spark.range(10))

Review Comment:
   Oh the `.toDF()` just convert things into DataFrame. 
   
   It has removed the `comparePlansDatasetLong`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38347:
URL: https://github.com/apache/spark/pull/38347#discussion_r1002849365


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -207,3 +208,23 @@ message Sample {
     int64 seed = 1;
   }
 }
+
+// Relation of type [[Range]] that generates a sequence of integers.
+message Range {
+  // Optional. Default value = 0
+  int32 start = 1;
+  int32 end = 2;

Review Comment:
   end is not optional, but how do we know if the client forgets to set it? 0 is a valid end value as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38347:
URL: https://github.com/apache/spark/pull/38347#discussion_r1002851496


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -207,3 +208,23 @@ message Sample {
     int64 seed = 1;
   }
 }
+
+// Relation of type [[Range]] that generates a sequence of integers.
+message Range {
+  // Optional. Default value = 0
+  int32 start = 1;
+  int32 end = 2;

Review Comment:
   Yeah this becomes tricky. Ultimately we can wrap every such field into a message so we always know if that field is set or not set. However that might complicate entire proto too much.. Let's have a discussion on that.  



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38347:
URL: https://github.com/apache/spark/pull/38347#discussion_r1002850206


##########
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectSessionBasedSuite.scala:
##########
@@ -32,7 +35,9 @@ trait SparkConnectPlanTestWithSparkSession extends SharedSparkSession with Spark
   override def getSession(): SparkSession = spark
 }
 
-class SparkConnectDeduplicateSuite extends SparkConnectPlanTestWithSparkSession {
+class SparkConnectSessionBasedSuite extends SparkConnectPlanTestWithSparkSession {

Review Comment:
   so we still have 2 different dsl test suites?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38347:
URL: https://github.com/apache/spark/pull/38347#discussion_r1009835612


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -207,3 +208,23 @@ message Sample {
     int64 seed = 1;
   }
 }
+
+// Relation of type [[Range]] that generates a sequence of integers.
+message Range {
+  // Optional. Default value = 0
+  int32 start = 1;
+  int32 end = 2;
+  // Optional. Default value = 1
+  Step step = 3;
+  // Optional. Default value is assigned by 1) SQL conf "spark.sql.leafNodeDefaultParallelism" if
+  // it is set, or 2) spark default parallelism.
+  NumPartitions num_partitions = 4;

Review Comment:
   There are two dimensions of things in this area:
   1. Required versus Optional.
   A field is required, meaning it must be set. A field can be optional. Meaning it could be set or not.
   
   2. Field has default value or not.
   A field can have a default value if not set. 
   
   
   The second point is an addition for the first point.  If there is a field which is not set, there could be a default value to be used.
   
   There are special cases that the default value for proto, is the same as the default value that Spark uses. In that case we don't need to differentiate the optionality. Otherwise we need this way to differentiate `set versus not set`, to adopt default values of Spark (unless we don't care the default values in Spark). 
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

Posted by GitBox <gi...@apache.org>.
amaliujia commented on PR #38347:
URL: https://github.com/apache/spark/pull/38347#issuecomment-1287884286

   R: @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38347:
URL: https://github.com/apache/spark/pull/38347#discussion_r1009835612


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -207,3 +208,23 @@ message Sample {
     int64 seed = 1;
   }
 }
+
+// Relation of type [[Range]] that generates a sequence of integers.
+message Range {
+  // Optional. Default value = 0
+  int32 start = 1;
+  int32 end = 2;
+  // Optional. Default value = 1
+  Step step = 3;
+  // Optional. Default value is assigned by 1) SQL conf "spark.sql.leafNodeDefaultParallelism" if
+  // it is set, or 2) spark default parallelism.
+  NumPartitions num_partitions = 4;

Review Comment:
   There are two dimensions of things in this area:
   1. Required versus Optional.
   A field is required, meaning it must be set. A field can be optional. Meaning it could be set or not.
   
   2. Field has default value or not.
   A field can have a default value if not set. 
   
   
   2. is an addition for 1.  If there is a field which is not set, there could be a default value to be used.
   
   There are special cases that the default value for proto, is the same as the default value that Spark uses. In that case we don't need to differentiate the optionally. Otherwise we need this way to differentiate `set versus not set`, to adopt default values of Spark (unless we don't care the default values in Spark). 
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38347:
URL: https://github.com/apache/spark/pull/38347#discussion_r1010090254


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -217,3 +218,23 @@ message Sample {
     int64 seed = 1;
   }
 }
+
+// Relation of type [[Range]] that generates a sequence of integers.
+message Range {
+  // Optional. Default value = 0
+  int32 start = 1;
+  int32 end = 2;
+  // Optional. Default value = 1
+  Step step = 3;

Review Comment:
   Yes let me follow up. I guess I was looking at python side API somehow thus confused myself on the types.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38347:
URL: https://github.com/apache/spark/pull/38347#discussion_r1009792973


##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala:
##########
@@ -175,6 +178,28 @@ package object dsl {
   }
 
   object plans { // scalastyle:ignore
+    implicit class DslMockRemoteSession(val session: MockRemoteSession) {
+      def range(
+          start: Option[Int],
+          end: Int,
+          step: Option[Int],
+          numPartitions: Option[Int]): Relation = {
+        val range = proto.Range.newBuilder()

Review Comment:
   Note that I need to keep `proto.Range` as `Range` itself is a built-in scala class so we need `proto.` to differentiate for this special case.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38347:
URL: https://github.com/apache/spark/pull/38347#discussion_r1009836536


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -207,3 +208,23 @@ message Sample {
     int64 seed = 1;
   }
 }
+
+// Relation of type [[Range]] that generates a sequence of integers.
+message Range {
+  // Optional. Default value = 0
+  int32 start = 1;
+  int32 end = 2;
+  // Optional. Default value = 1
+  Step step = 3;
+  // Optional. Default value is assigned by 1) SQL conf "spark.sql.leafNodeDefaultParallelism" if
+  // it is set, or 2) spark default parallelism.
+  NumPartitions num_partitions = 4;

Review Comment:
   To really answer your question: if we plan to respect default values for Spark for those optionally fields whose default proto values are different from Spark default values, this is the only way to respect default values for Spark. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #38347:
URL: https://github.com/apache/spark/pull/38347#issuecomment-1287931807

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] grundprinzip commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #38347:
URL: https://github.com/apache/spark/pull/38347#discussion_r1003252350


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -207,3 +208,23 @@ message Sample {
     int64 seed = 1;
   }
 }
+
+// Relation of type [[Range]] that generates a sequence of integers.
+message Range {
+  // Optional. Default value = 0
+  int32 start = 1;
+  int32 end = 2;
+  // Optional. Default value = 1
+  Step step = 3;
+  // Optional. Default value is assigned by 1) SQL conf "spark.sql.leafNodeDefaultParallelism" if
+  // it is set, or 2) spark default parallelism.
+  NumPartitions num_partitions = 4;

Review Comment:
   Is this really the best way to express the optionality?



##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala:
##########
@@ -175,6 +178,28 @@ package object dsl {
   }
 
   object plans { // scalastyle:ignore
+    implicit class DslMockRemoteSession(val session: MockRemoteSession) {
+      def range(
+          start: Option[Int],
+          end: Int,
+          step: Option[Int],
+          numPartitions: Option[Int]): Relation = {
+        val range = proto.Range.newBuilder()

Review Comment:
   I've been explicitly requesting this a couple of times already, as a coding style to always prefix the proto generated classes with their `proto.` prefix. I know it uses a little bit more horizontal space, but at the same time it makes always clear where this particular element comes from which is tremendously useful because we're consistently using the different types between the catalyst API and Spark Connect in the same code paths.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38347:
URL: https://github.com/apache/spark/pull/38347#discussion_r1010097145


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -217,3 +218,23 @@ message Sample {
     int64 seed = 1;
   }
 }
+
+// Relation of type [[Range]] that generates a sequence of integers.
+message Range {
+  // Optional. Default value = 0
+  int32 start = 1;
+  int32 end = 2;
+  // Optional. Default value = 1
+  Step step = 3;

Review Comment:
   Updating in https://github.com/apache/spark/pull/38460.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38347:
URL: https://github.com/apache/spark/pull/38347#discussion_r1009075910


##########
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala:
##########
@@ -173,6 +175,15 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
     comparePlans(connectPlan2, sparkPlan2)
   }
 
+  test("Test Range") {
+    comparePlansDatasetLong(connect.range(None, 10, None, None), spark.range(10))

Review Comment:
   how about we call `spark.range(10).toDF` then we don't need to add `comparePlansDatasetLong`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38347:
URL: https://github.com/apache/spark/pull/38347#discussion_r1009828791


##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala:
##########
@@ -175,6 +178,28 @@ package object dsl {
   }
 
   object plans { // scalastyle:ignore
+    implicit class DslMockRemoteSession(val session: MockRemoteSession) {
+      def range(
+          start: Option[Int],
+          end: Int,
+          step: Option[Int],
+          numPartitions: Option[Int]): Relation = {
+        val range = proto.Range.newBuilder()

Review Comment:
   It makes sense for `SparkConnectPlanner` where Catalyst and Proto are both mixed together, and we are keeping the approach you are asking there.
   
   However this is the Connect DSL that only deal with protos. No Catalyst included in this package: https://github.com/apache/spark/blob/9fc3aa0b1c092ab1f13b26582e3ece7440fbfc3b/connector/connect/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala#L17



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38347:
URL: https://github.com/apache/spark/pull/38347#discussion_r1009077783


##########
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala:
##########
@@ -173,6 +175,15 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
     comparePlans(connectPlan2, sparkPlan2)
   }
 
+  test("Test Range") {
+    comparePlansDatasetLong(connect.range(None, 10, None, None), spark.range(10))

Review Comment:
   Let me try to see if it gives an exact plan.
   
   
   Another idea might be we just compare the result through `collect()` so we do not compare the plan on this case.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38347:
URL: https://github.com/apache/spark/pull/38347#discussion_r1009835612


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -207,3 +208,23 @@ message Sample {
     int64 seed = 1;
   }
 }
+
+// Relation of type [[Range]] that generates a sequence of integers.
+message Range {
+  // Optional. Default value = 0
+  int32 start = 1;
+  int32 end = 2;
+  // Optional. Default value = 1
+  Step step = 3;
+  // Optional. Default value is assigned by 1) SQL conf "spark.sql.leafNodeDefaultParallelism" if
+  // it is set, or 2) spark default parallelism.
+  NumPartitions num_partitions = 4;

Review Comment:
   There are two dimensions of things in this area:
   1. Required versus Optional.
   A field is required, meaning it must be set. A field can be optional. Meaning it could be set or not.
   
   2. Field has default value or not.
   A field can have a default value if not set. 
   
   
   The second point is an addition for the first point.  If there is a field which is not set, there could be a default value to be used.
   
   There are special cases that the default value for proto, is the same as the default value that Spark uses. In that case we don't need to differentiate the optionally. Otherwise we need this way to differentiate `set versus not set`, to adopt default values of Spark (unless we don't care the default values in Spark). 
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] grundprinzip commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #38347:
URL: https://github.com/apache/spark/pull/38347#discussion_r1009839486


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -207,3 +208,23 @@ message Sample {
     int64 seed = 1;
   }
 }
+
+// Relation of type [[Range]] that generates a sequence of integers.
+message Range {
+  // Optional. Default value = 0
+  int32 start = 1;
+  int32 end = 2;
+  // Optional. Default value = 1
+  Step step = 3;
+  // Optional. Default value is assigned by 1) SQL conf "spark.sql.leafNodeDefaultParallelism" if
+  // it is set, or 2) spark default parallelism.
+  NumPartitions num_partitions = 4;

Review Comment:
   So in fewer words :) when `num_partitions` is an integer the default value is `0` even if it's not and for scalar types we can't differentiate between present or not. Understanding if `0` is a valid or invalid value defeats the purpose.
   
   Thanks for the additional color!
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #38347:
URL: https://github.com/apache/spark/pull/38347#issuecomment-1297927909

   thanks, merging to master!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

Posted by GitBox <gi...@apache.org>.
amaliujia commented on PR #38347:
URL: https://github.com/apache/spark/pull/38347#issuecomment-1295764450

   This PR is ready for review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

Posted by GitBox <gi...@apache.org>.
amaliujia commented on PR #38347:
URL: https://github.com/apache/spark/pull/38347#issuecomment-1297578456

   Conflict resolved


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38347:
URL: https://github.com/apache/spark/pull/38347#discussion_r1002849988


##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -94,6 +96,25 @@ class SparkConnectPlanner(plan: proto.Relation, session: SparkSession) {
       transformRelation(rel.getInput))
   }
 
+  private def transformRange(rel: proto.Range): LogicalPlan = {
+    val start = rel.getStart
+    val end = rel.getEnd
+    val step = if (rel.hasStep) {
+      rel.getStep.getStep
+    } else {
+      1
+    }
+    val numPartitions = if (rel.hasNumPartitions) {
+      rel.getNumPartitions.getNumPartitions
+    } else {
+      val conf = new RuntimeConfig(session.sessionState.conf)

Review Comment:
   we can call `session.leafNodeDefaultParallelism`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38347:
URL: https://github.com/apache/spark/pull/38347#discussion_r1002850714


##########
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectSessionBasedSuite.scala:
##########
@@ -32,7 +35,9 @@ trait SparkConnectPlanTestWithSparkSession extends SharedSparkSession with Spark
   override def getSession(): SparkSession = spark
 }
 
-class SparkConnectDeduplicateSuite extends SparkConnectPlanTestWithSparkSession {
+class SparkConnectSessionBasedSuite extends SparkConnectPlanTestWithSparkSession {

Review Comment:
   I am working on a refactoring PR but that is blocked by some gaps between current Connect proto and DataFrame API. Let's have a discussion on how to unblock it.
   
   With the refactoring I expect we only have one suite that test Connect proto with DataFrame API.
   
   Before that is done, right now we have two suites, one is with DataFrame API and another one is with Catalyst.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #38347:
URL: https://github.com/apache/spark/pull/38347#discussion_r1010084761


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -217,3 +218,23 @@ message Sample {
     int64 seed = 1;
   }
 }
+
+// Relation of type [[Range]] that generates a sequence of integers.
+message Range {
+  // Optional. Default value = 0
+  int32 start = 1;
+  int32 end = 2;
+  // Optional. Default value = 1
+  Step step = 3;

Review Comment:
   `start`, `end`, `step` should use `int64` @amaliujia 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] grundprinzip commented on a diff in pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #38347:
URL: https://github.com/apache/spark/pull/38347#discussion_r1009832423


##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala:
##########
@@ -175,6 +178,28 @@ package object dsl {
   }
 
   object plans { // scalastyle:ignore
+    implicit class DslMockRemoteSession(val session: MockRemoteSession) {
+      def range(
+          start: Option[Int],
+          end: Int,
+          step: Option[Int],
+          numPartitions: Option[Int]): Relation = {
+        val range = proto.Range.newBuilder()

Review Comment:
   As long as no catalyst is in this package this is good with me. Thanks for clarifying.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan closed pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

Posted by GitBox <gi...@apache.org>.
cloud-fan closed pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto
URL: https://github.com/apache/spark/pull/38347


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on pull request #38347: [SPARK-40883][CONNECT] Support Range in Connect proto

Posted by GitBox <gi...@apache.org>.
amaliujia commented on PR #38347:
URL: https://github.com/apache/spark/pull/38347#issuecomment-1296126263

   PR should be ready for review again.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org