You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/12/21 14:23:34 UTC

[GitHub] [spark] dengziming opened a new pull request, #39158: [SPARK-41354][CONNECT] Implement `DataFrame.repartitionByRange`

dengziming opened a new pull request, #39158:
URL: https://github.com/apache/spark/pull/39158

   ### What changes were proposed in this pull request?
   1. Implement `DataFrame.repartitionByRange` for scala API
   2. Add a new relation `RepartitionByExpression`.
   
   
   ### Why are the changes needed?
   API coverage
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, added new API function.
   
   ### How was this patch tested?
   Unit tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dengziming commented on pull request #39158: [SPARK-41354][CONNECT] Add `RepartitionByExpression` to proto

Posted by GitBox <gi...@apache.org>.
dengziming commented on PR #39158:
URL: https://github.com/apache/spark/pull/39158#issuecomment-1367036120

   > adding the repartitionBy* APIs in Client ?
   
   Do you mean adding them to python client? yes, I'm working on it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #39158: [SPARK-41354][CONNECT] Implement `DataFrame.repartitionByRange`

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #39158:
URL: https://github.com/apache/spark/pull/39158#discussion_r1055172775


##########
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala:
##########
@@ -656,6 +668,56 @@ package object dsl {
             Repartition.newBuilder().setInput(logicalPlan).setNumPartitions(num).setShuffle(true))
           .build()
 
+      def repartition(partitionExprs: Expression*): Relation = {

Review Comment:
   is this one related to `repartitionByRange` ?



##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -57,6 +57,7 @@ message Relation {
     Hint hint = 24;
     Unpivot unpivot = 25;
     ToSchema to_schema = 26;
+    RepartitionByExpression repartition_by_expression = 27;

Review Comment:
   shoule be `RepartitionByRange`?  it's not an experssion



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #39158: [SPARK-41354][CONNECT] Add `RepartitionByExpression` to proto

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on PR #39158:
URL: https://github.com/apache/spark/pull/39158#issuecomment-1367192633

   @dengziming thank you so much


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #39158: [SPARK-41354][CONNECT] Add `RepartitionByExpression` to proto

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on PR #39158:
URL: https://github.com/apache/spark/pull/39158#issuecomment-1367006064

   Merged into master, thank you @dengziming 
   
   BTW, would you minding also adding the repartitionBy* APIs in Client ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng closed pull request #39158: [SPARK-41354][CONNECT] Add `RepartitionByExpression` to proto

Posted by GitBox <gi...@apache.org>.
zhengruifeng closed pull request #39158: [SPARK-41354][CONNECT] Add `RepartitionByExpression` to proto
URL: https://github.com/apache/spark/pull/39158


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] grundprinzip commented on a diff in pull request #39158: [SPARK-41354][CONNECT] Implement `DataFrame.repartitionByRange`

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #39158:
URL: https://github.com/apache/spark/pull/39158#discussion_r1054812719


##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -598,3 +599,14 @@ message ToSchema {
   // The Sever side will update the dataframe with this schema.
   DataType schema = 2;
 }
+
+  message RepartitionByExpression {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Required) The partitioning expressions
+  repeated Expression partition_exprs = 2;
+
+  // (Optional) number of partitions, must be positive.
+  int32 num_partitions = 3;

Review Comment:
   ```suggestion
     optional int32 num_partitions = 3;
   ```



##########
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -372,6 +374,22 @@ class SparkConnectPlanner(session: SparkSession) {
     }
   }
 
+  private def transformRepartitionByExpression(
+      rel: proto.RepartitionByExpression): LogicalPlan = {
+    val numPartitionsOpt = if (rel.getNumPartitions == 0) {
+      None
+    } else {
+      Some(rel.getNumPartitions)
+    }
+    val partitionExpressions = rel.getPartitionExprsList.asScala.map { expr =>
+      transformExpression(expr)
+    }.toSeq

Review Comment:
   ```suggestion
       val partitionExpressions = rel.getPartitionExprsList.asScala.map(transformExpression).toSeq
   ```



##########
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -372,6 +374,22 @@ class SparkConnectPlanner(session: SparkSession) {
     }
   }
 
+  private def transformRepartitionByExpression(
+      rel: proto.RepartitionByExpression): LogicalPlan = {
+    val numPartitionsOpt = if (rel.getNumPartitions == 0) {
+      None

Review Comment:
   please use the actual value in the if branch and `None` in the else



##########
connector/connect/server/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectPlannerSuite.scala:
##########
@@ -705,4 +705,86 @@ class SparkConnectPlannerSuite extends SparkFunSuite with SparkConnectPlanTest {
       }
     }
   }
+
+  test("RepartitionByExpression") {

Review Comment:
   please add a test that would throw an exception during the execution



##########
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -372,6 +374,22 @@ class SparkConnectPlanner(session: SparkSession) {
     }
   }
 
+  private def transformRepartitionByExpression(
+      rel: proto.RepartitionByExpression): LogicalPlan = {
+    val numPartitionsOpt = if (rel.getNumPartitions == 0) {

Review Comment:
   with the optional change you can actually use `if (rel.hasNumPartitions) {`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #39158: [SPARK-41354][CONNECT] Implement `DataFrame.repartitionByRange` and `DataFrame.repartitionByRange(Expression*)`

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on PR #39158:
URL: https://github.com/apache/spark/pull/39158#issuecomment-1365666964

   would you mind resolving the conflicts and update the PR description and title (it aims to adds the proto, not for scala API)?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dengziming commented on a diff in pull request #39158: [SPARK-41354][CONNECT] Implement `DataFrame.repartitionByRange`

Posted by GitBox <gi...@apache.org>.
dengziming commented on code in PR #39158:
URL: https://github.com/apache/spark/pull/39158#discussion_r1055284521


##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -57,6 +57,7 @@ message Relation {
     Hint hint = 24;
     Unpivot unpivot = 25;
     ToSchema to_schema = 26;
+    RepartitionByExpression repartition_by_expression = 27;

Review Comment:
   This is to be consistent with the DataSet API, in DataSet, both `repartition(columns)` an `repartitionByRange(*)` will be converted to a `RepartitionByExpression`, we don't have a `RepartitionByRange` logical plan.



##########
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala:
##########
@@ -656,6 +668,56 @@ package object dsl {
             Repartition.newBuilder().setInput(logicalPlan).setNumPartitions(num).setShuffle(true))
           .build()
 
+      def repartition(partitionExprs: Expression*): Relation = {

Review Comment:
   I added both `repartition` and `repartitionByRange` since they will both be transformed to a `RepartitionByExpression`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #39158: [SPARK-41354][CONNECT] Implement `DataFrame.repartitionByRange`

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #39158:
URL: https://github.com/apache/spark/pull/39158#issuecomment-1363277054

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dengziming commented on pull request #39158: [SPARK-41354][CONNECT] Implement RepartitionByExpression

Posted by GitBox <gi...@apache.org>.
dengziming commented on PR #39158:
URL: https://github.com/apache/spark/pull/39158#issuecomment-1365670576

   > would you mind resolving the conflicts and update the PR description and title (it aims to adds the proto, not for scala API)?
   
   @zhengruifeng Thank you for you comments, Do you mean change title to `RepartitionByExpression`, I have change it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #39158: [SPARK-41354][CONNECT] Implement RepartitionByExpression

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on PR #39158:
URL: https://github.com/apache/spark/pull/39158#issuecomment-1366479584

   sorry, would you minding resolving the conflicts? @dengziming 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org