Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/11/17 09:43:57 UTC

[GitHub] [spark] zhengruifeng opened a new pull request, #38686: [SPARK-41169][CONNECT][PYTHON] Implement `DataFrame.drop`

zhengruifeng opened a new pull request, #38686:
URL: https://github.com/apache/spark/pull/38686

   ### What changes were proposed in this pull request?
   Implement `DataFrame.drop` with a proto message
   
   ### Why are the changes needed?
   For API coverage.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes
   
   ### How was this patch tested?
   Added unit tests.
   




[GitHub] [spark] HyukjinKwon commented on pull request #38686: [SPARK-41169][CONNECT][PYTHON] Implement `DataFrame.drop`

HyukjinKwon commented on PR #38686:
URL: https://github.com/apache/spark/pull/38686#issuecomment-1323195361

   Merged to master.




[GitHub] [spark] amaliujia commented on pull request #38686: [SPARK-41169][CONNECT][PYTHON] Implement `DataFrame.drop`

amaliujia commented on PR #38686:
URL: https://github.com/apache/spark/pull/38686#issuecomment-1324436991

   LGTM




[GitHub] [spark] zhengruifeng commented on a diff in pull request #38686: [SPARK-41169][CONNECT][PYTHON] Implement `DataFrame.drop`

zhengruifeng commented on code in PR #38686:
URL: https://github.com/apache/spark/pull/38686#discussion_r1027506855


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -255,10 +255,21 @@ def distinct(self) -> "DataFrame":
         )
 
     def drop(self, *cols: "ColumnOrString") -> "DataFrame":

Review Comment:
   IIUC, there will be two RPCs if we implement it on the client side:
   1. `all_cols = self.columns` to fetch the schema;
   2. build the plan.
   
   With a dedicated proto message, we only need one RPC.
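   For illustration, a minimal sketch of the client-side alternative using only the public DataFrame API (`drop_client_side` is a made-up helper, not part of this PR); on Connect, reading `df.columns` is itself a schema round trip:
   
       from typing import List
       
       from pyspark.sql import DataFrame
       
       
       def drop_client_side(df: DataFrame, *cols: str) -> DataFrame:
           """Emulate drop() on the client by projecting away the given columns."""
           # Round trip 1: fetch the schema to learn the full column list.
           all_cols: List[str] = df.columns
           # Build the projection locally; the server plans it on the next request.
           to_drop = set(cols)
           remaining = [c for c in all_cols if c not in to_drop]
           return df.select(*remaining)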





[GitHub] [spark] zhengruifeng commented on a diff in pull request #38686: [SPARK-41169][CONNECT][PYTHON] Implement `DataFrame.drop`

zhengruifeng commented on code in PR #38686:
URL: https://github.com/apache/spark/pull/38686#discussion_r1027508681


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -203,6 +204,19 @@ message Sort {
   }
 }
 
+
+// Drop specified columns.
+message Drop {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Required) columns to drop.
+  //
+  // Should contain at least 1 item.
+  repeated Expression cols = 2;

Review Comment:
   Here it follows the naming in `def drop(col: Column, cols: Column*): DataFrame`.
   





[GitHub] [spark] zhengruifeng commented on a diff in pull request #38686: [SPARK-41169][CONNECT][PYTHON] Implement `DataFrame.drop`

zhengruifeng commented on code in PR #38686:
URL: https://github.com/apache/spark/pull/38686#discussion_r1027505120


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -203,6 +204,19 @@ message Sort {
   }
 }
 
+
+// Drop specified columns.
+message Drop {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Required) columns to drop.
+  //
+  // Should contain at least 1 item.
+  repeated Expression cols = 2;

Review Comment:
   `Dataset.drop` takes arbitrary expressions into account: https://github.com/apache/spark/blob/3b4faaf89ecd1b70d3d9b3c6096e1b70275670fb/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2952-L2957





[GitHub] [spark] zhengruifeng commented on a diff in pull request #38686: [SPARK-41169][CONNECT][PYTHON] Implement `DataFrame.drop`

zhengruifeng commented on code in PR #38686:
URL: https://github.com/apache/spark/pull/38686#discussion_r1028811655


##########
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala:
##########
@@ -148,6 +148,23 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
     comparePlans(connectPlan2, sparkPlan2)
   }
 
+  test("SPARK-41169: Test drop") {
+    // single column
+    val connectPlan = connectTestRelation.drop("id")
+    val sparkPlan = sparkTestRelation.drop("id")
+    comparePlans(connectPlan, sparkPlan)
+
+    // all columns
+    val connectPlan2 = connectTestRelation.drop("id", "name")
+    val sparkPlan2 = sparkTestRelation.drop("id", "name")
+    comparePlans(connectPlan2, sparkPlan2)
+
+    // non-existing column

Review Comment:
   I just checked the implementation of `Dataset.drop`: it supports all kinds of expressions, and an expression is simply ignored if it doesn't `semanticEquals` any column in the current DataFrame.
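   For illustration, the behaviour described above in a minimal, self-contained PySpark session (the data and column names are made up):
   
       from pyspark.sql import SparkSession
       from pyspark.sql import functions as F
       
       spark = SparkSession.builder.getOrCreate()
       df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "name"])
       
       print(df.drop("id").columns)             # ['name']
       print(df.drop(df["name"]).columns)       # ['id'] -- a Column argument also works
       print(df.drop("no_such_col").columns)    # ['id', 'name'] -- no match, silently ignored
       print(df.drop(F.col("id") * 2).columns)  # ['id', 'name'] -- expression matches nothing, ignored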





[GitHub] [spark] zhengruifeng commented on a diff in pull request #38686: [SPARK-41169][CONNECT][PYTHON] Implement `DataFrame.drop`

zhengruifeng commented on code in PR #38686:
URL: https://github.com/apache/spark/pull/38686#discussion_r1027507821


##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -523,6 +524,19 @@ class SparkConnectPlanner(session: SparkSession) {
       sameOrderExpressions = Seq.empty)
   }
 
+  private def transformDrop(rel: proto.Drop): LogicalPlan = {
+    assert(rel.getColsCount > 0, s"cols must contain at least 1 item!")
+
+    val cols = rel.getColsList.asScala.toArray.map { expr =>
+      Column(transformExpression(expr))

Review Comment:
   Do you mean verifying for the arrow-based collect?
   Since we will remove the JSON code path, it will always fail if there are unsupported types.





[GitHub] [spark] amaliujia commented on a diff in pull request #38686: [SPARK-41169][CONNECT][PYTHON] Implement `DataFrame.drop`

amaliujia commented on code in PR #38686:
URL: https://github.com/apache/spark/pull/38686#discussion_r1029946709


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -255,10 +255,21 @@ def distinct(self) -> "DataFrame":
         )
 
     def drop(self, *cols: "ColumnOrString") -> "DataFrame":

Review Comment:
   I think we have sort of built the consensus that we prefer re-using the proto rather than asking clients to do duplicate work.





[GitHub] [spark] HyukjinKwon closed pull request #38686: [SPARK-41169][CONNECT][PYTHON] Implement `DataFrame.drop`

HyukjinKwon closed pull request #38686: [SPARK-41169][CONNECT][PYTHON] Implement `DataFrame.drop`
URL: https://github.com/apache/spark/pull/38686




[GitHub] [spark] amaliujia commented on a diff in pull request #38686: [SPARK-41169][CONNECT][PYTHON] Implement `DataFrame.drop`

amaliujia commented on code in PR #38686:
URL: https://github.com/apache/spark/pull/38686#discussion_r1029946709


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -255,10 +255,21 @@ def distinct(self) -> "DataFrame":
         )
 
     def drop(self, *cols: "ColumnOrString") -> "DataFrame":

Review Comment:
   I think we have built the consensus that we prefer re-using the proto rather than asking clients to do duplicate work.





[GitHub] [spark] zhengruifeng commented on a diff in pull request #38686: [SPARK-41169][CONNECT][PYTHON] Implement `DataFrame.drop`

zhengruifeng commented on code in PR #38686:
URL: https://github.com/apache/spark/pull/38686#discussion_r1027505888


##########
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala:
##########
@@ -148,6 +148,23 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
     comparePlans(connectPlan2, sparkPlan2)
   }
 
+  test("SPARK-41169: Test drop") {
+    // single column
+    val connectPlan = connectTestRelation.drop("id")
+    val sparkPlan = sparkTestRelation.drop("id")
+    comparePlans(connectPlan, sparkPlan)
+
+    // all columns
+    val connectPlan2 = connectTestRelation.drop("id", "name")
+    val sparkPlan2 = sparkTestRelation.drop("id", "name")
+    comparePlans(connectPlan2, sparkPlan2)
+
+    // non-existing column

Review Comment:
   will add





[GitHub] [spark] zhengruifeng commented on pull request #38686: [SPARK-41169][CONNECT][PYTHON] Implement `DataFrame.drop`

zhengruifeng commented on PR #38686:
URL: https://github.com/apache/spark/pull/38686#issuecomment-1321403447

   I will update this PR after the arrow-based collect is fixed (https://github.com/apache/spark/pull/38706); otherwise, some e2e tests will fail.




[GitHub] [spark] grundprinzip commented on a diff in pull request #38686: [SPARK-41169][CONNECT][PYTHON] Implement `DataFrame.drop`

grundprinzip commented on code in PR #38686:
URL: https://github.com/apache/spark/pull/38686#discussion_r1027094335


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -255,10 +255,21 @@ def distinct(self) -> "DataFrame":
         )
 
     def drop(self, *cols: "ColumnOrString") -> "DataFrame":

Review Comment:
   This is an interesting case where one could argue for implementing the behavior on the client side instead of the server. 



##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -523,6 +524,19 @@ class SparkConnectPlanner(session: SparkSession) {
       sameOrderExpressions = Seq.empty)
   }
 
+  private def transformDrop(rel: proto.Drop): LogicalPlan = {
+    assert(rel.getColsCount > 0, s"cols must contain at least 1 item!")
+
+    val cols = rel.getColsList.asScala.toArray.map { expr =>
+      Column(transformExpression(expr))

Review Comment:
   This should verify supported types. 



##########
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala:
##########
@@ -148,6 +148,23 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
     comparePlans(connectPlan2, sparkPlan2)
   }
 
+  test("SPARK-41169: Test drop") {
+    // single column
+    val connectPlan = connectTestRelation.drop("id")
+    val sparkPlan = sparkTestRelation.drop("id")
+    comparePlans(connectPlan, sparkPlan)
+
+    // all columns
+    val connectPlan2 = connectTestRelation.drop("id", "name")
+    val sparkPlan2 = sparkTestRelation.drop("id", "name")
+    comparePlans(connectPlan2, sparkPlan2)
+
+    // non-existing column

Review Comment:
   If you treat the dropped columns as expressions, we need to add a negative test for unsupported expressions.



##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -203,6 +204,19 @@ message Sort {
   }
 }
 
+
+// Drop specified columns.
+message Drop {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Required) columns to drop.
+  //
+  // Should contain at least 1 item.
+  repeated Expression cols = 2;

Review Comment:
   Wondering if the name should be more explicit like "dropped"?



##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -203,6 +204,19 @@ message Sort {
   }
 }
 
+
+// Drop specified columns.
+message Drop {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Required) columns to drop.
+  //
+  // Should contain at least 1 item.
+  repeated Expression cols = 2;

Review Comment:
   Does drop actually support arbitrary expressions? Shouldn't this be a repeated unresolved attribute?


