You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/12/08 02:31:23 UTC

[GitHub] [spark] beliefer opened a new pull request, #38973: [WIP][SPARK-41439][CONNECT][PYTHON] Implement `DataFrame.melt`

beliefer opened a new pull request, #38973:
URL: https://github.com/apache/spark/pull/38973

   ### What changes were proposed in this pull request?
   Implement `DataFrame.melt` with a proto message
   
   1. Implement `DataFrame.melt` for scala API
   2. Implement `DataFrame.melt`for python API
   
   ### Why are the changes needed?
   for Connect API coverage
   
   
   ### Does this PR introduce _any_ user-facing change?
   'No'. New API
   
   
   ### How was this patch tested?
   New test cases.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] beliefer commented on a diff in pull request #38973: [WIP][SPARK-41439][CONNECT][PYTHON] Implement `DataFrame.melt`

Posted by GitBox <gi...@apache.org>.

beliefer commented on code in PR #38973:
URL: https://github.com/apache/spark/pull/38973#discussion_r1043118120


##########
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -309,6 +310,28 @@ class SparkConnectPlanner(session: SparkSession) {
     UnresolvedHint(rel.getName, params, transformRelation(rel.getInput))
   }
 
+  private def transformMelt(rel: proto.Melt): LogicalPlan = {
+    val ids = rel.getIdsList.asScala.toArray.map { expr =>
+      Column(transformExpression(expr))
+    }
+
+    if (rel.getValuesList.isEmpty) {
+      Dataset

Review Comment:
   Good idea.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] beliefer commented on a diff in pull request #38973: [WIP][SPARK-41439][CONNECT][PYTHON] Implement `DataFrame.melt`

Posted by GitBox <gi...@apache.org>.

beliefer commented on code in PR #38973:
URL: https://github.com/apache/spark/pull/38973#discussion_r1043045096


##########
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -309,6 +310,28 @@ class SparkConnectPlanner(session: SparkSession) {
     UnresolvedHint(rel.getName, params, transformRelation(rel.getInput))
   }
 
+  private def transformMelt(rel: proto.Melt): LogicalPlan = {
+    val ids = rel.getIdsList.asScala.toArray.map { expr =>
+      Column(transformExpression(expr))
+    }
+
+    if (rel.getValuesList.isEmpty) {
+      Dataset
+        .ofRows(session, transformRelation(rel.getInput))
+        .melt(ids, rel.getVariableColumnName, rel.getValueColumnName)

Review Comment:
   It will throws
   ```
   org.apache.spark.sql.AnalysisException: [UNPIVOT_REQUIRES_VALUE_COLUMNS] At least one value column needs to be specified for UNPIVOT, all columns specified as ids;
   'Unpivot ArraySeq(id#25), ArraySeq(), variable, [value]
   +- LocalRelation <empty>, [id#25, name#26]
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] amaliujia commented on pull request #38973: [WIP][SPARK-41439][CONNECT][PYTHON] Implement `DataFrame.melt`

Posted by GitBox <gi...@apache.org>.

amaliujia commented on PR #38973:
URL: https://github.com/apache/spark/pull/38973#issuecomment-1341969449

   Looks pretty good!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] beliefer commented on pull request #38973: [SPARK-41439][CONNECT][PYTHON] Implement `DataFrame.melt` and `DataFrame.unpivot`

Posted by GitBox <gi...@apache.org>.

beliefer commented on PR #38973:
URL: https://github.com/apache/spark/pull/38973#issuecomment-1344005611

   @zhengruifeng @amaliujia Thank you !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] amaliujia commented on a diff in pull request #38973: [WIP][SPARK-41439][CONNECT][PYTHON] Implement `DataFrame.melt`

Posted by GitBox <gi...@apache.org>.

amaliujia commented on code in PR #38973:
URL: https://github.com/apache/spark/pull/38973#discussion_r1042905152


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -824,6 +824,39 @@ def withColumn(self, colName: str, col: Column) -> "DataFrame":
             session=self._session,
         )
 
+    def melt(
+        self,
+        ids: List["ColumnOrName"],
+        values: List["ColumnOrName"],
+        variableColumnName: str,
+        valueColumnName: str,
+    ) -> "DataFrame":

Review Comment:
   We only need to match PySpark now so I think there is no need. PySpark only has the first API.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] beliefer commented on a diff in pull request #38973: [WIP][SPARK-41439][CONNECT][PYTHON] Implement `DataFrame.melt`

Posted by GitBox <gi...@apache.org>.

beliefer commented on code in PR #38973:
URL: https://github.com/apache/spark/pull/38973#discussion_r1042866959


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -824,6 +824,39 @@ def withColumn(self, colName: str, col: Column) -> "DataFrame":
             session=self._session,
         )
 
+    def melt(
+        self,
+        ids: List["ColumnOrName"],
+        values: List["ColumnOrName"],
+        variableColumnName: str,
+        valueColumnName: str,
+    ) -> "DataFrame":

Review Comment:
   This method is corresponding to the one of `Dataset` show below.
   ```
     def melt(
         ids: Array[Column],
         values: Array[Column],
         variableColumnName: String,
         valueColumnName: String): DataFrame =
       unpivot(ids, values, variableColumnName, valueColumnName)
   ```
   
   Should we add another method is corresponding to the another one of `Dataset` show below.
   ```
     def melt(
         ids: Array[Column],
         variableColumnName: String,
         valueColumnName: String): DataFrame =
       unpivot(ids, variableColumnName, valueColumnName)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] beliefer commented on a diff in pull request #38973: [WIP][SPARK-41439][CONNECT][PYTHON] Implement `DataFrame.melt`

Posted by GitBox <gi...@apache.org>.

beliefer commented on code in PR #38973:
URL: https://github.com/apache/spark/pull/38973#discussion_r1042865399


##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -570,3 +571,21 @@ message Hint {
   // (Optional) Hint parameters.
   repeated Expression.Literal parameters = 3;
 }
+
+// Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set.
+message Melt {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Required) Id columns.
+  repeated Expression ids = 2;

Review Comment:
   The ids in Dataset.melt is `ids: Array[Column]`, I'm not sure it's appropriate.
   @amaliujia @zhengruifeng @HyukjinKwon 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] beliefer commented on a diff in pull request #38973: [WIP][SPARK-41439][CONNECT][PYTHON] Implement `DataFrame.melt`

Posted by GitBox <gi...@apache.org>.

beliefer commented on code in PR #38973:
URL: https://github.com/apache/spark/pull/38973#discussion_r1042865399


##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -570,3 +571,21 @@ message Hint {
   // (Optional) Hint parameters.
   repeated Expression.Literal parameters = 3;
 }
+
+// Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set.
+message Melt {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Required) Id columns.
+  repeated Expression ids = 2;

Review Comment:
   The ids in `Dataset.melt` is `ids: Array[Column]`, I'm not sure it's appropriate.
   @amaliujia @zhengruifeng @HyukjinKwon 



##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -570,3 +571,21 @@ message Hint {
   // (Optional) Hint parameters.
   repeated Expression.Literal parameters = 3;
 }
+
+// Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set.
+message Melt {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Required) Id columns.
+  repeated Expression ids = 2;

Review Comment:
   The `ids` in `Dataset.melt` is `ids: Array[Column]`, I'm not sure it's appropriate.
   @amaliujia @zhengruifeng @HyukjinKwon 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on pull request #38973: [SPARK-41439][CONNECT][PYTHON] Implement `DataFrame.melt` and `DataFrame.unpivot`

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on PR #38973:
URL: https://github.com/apache/spark/pull/38973#issuecomment-1344010658

   @beliefer  BTW, would you mind sending a followup to add some end-to-end tests in `test_connect_basic.py` ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a diff in pull request #38973: [WIP][SPARK-41439][CONNECT][PYTHON] Implement `DataFrame.melt`

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on code in PR #38973:
URL: https://github.com/apache/spark/pull/38973#discussion_r1042865704


##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -570,3 +571,21 @@ message Hint {
   // (Optional) Hint parameters.
   repeated Expression.Literal parameters = 3;
 }
+
+// Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set.
+message Melt {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Required) Id columns.
+  repeated Expression ids = 2;

Review Comment:
   this is correct! we use expression instead of columns



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] amaliujia commented on a diff in pull request #38973: [WIP][SPARK-41439][CONNECT][PYTHON] Implement `DataFrame.melt`

Posted by GitBox <gi...@apache.org>.

amaliujia commented on code in PR #38973:
URL: https://github.com/apache/spark/pull/38973#discussion_r1042904415


##########
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -309,6 +310,28 @@ class SparkConnectPlanner(session: SparkSession) {
     UnresolvedHint(rel.getName, params, transformRelation(rel.getInput))
   }
 
+  private def transformMelt(rel: proto.Melt): LogicalPlan = {
+    val ids = rel.getIdsList.asScala.toArray.map { expr =>
+      Column(transformExpression(expr))
+    }
+
+    if (rel.getValuesList.isEmpty) {
+      Dataset
+        .ofRows(session, transformRelation(rel.getInput))
+        .melt(ids, rel.getVariableColumnName, rel.getValueColumnName)

Review Comment:
   What would happen if you pass a empty values? If that works then we can remove this if else?



##########
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -309,6 +310,28 @@ class SparkConnectPlanner(session: SparkSession) {
     UnresolvedHint(rel.getName, params, transformRelation(rel.getInput))
   }
 
+  private def transformMelt(rel: proto.Melt): LogicalPlan = {
+    val ids = rel.getIdsList.asScala.toArray.map { expr =>
+      Column(transformExpression(expr))
+    }
+
+    if (rel.getValuesList.isEmpty) {
+      Dataset
+        .ofRows(session, transformRelation(rel.getInput))
+        .melt(ids, rel.getVariableColumnName, rel.getValueColumnName)

Review Comment:
   What would happen if you pass a empty values? If that works then we can remove this `if else`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on pull request #38973: [WIP][SPARK-41439][CONNECT][PYTHON] Implement `DataFrame.melt`

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on PR #38973:
URL: https://github.com/apache/spark/pull/38973#issuecomment-1341900821

   awesome! @beliefer thank you so much


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a diff in pull request #38973: [WIP][SPARK-41439][CONNECT][PYTHON] Implement `DataFrame.melt`

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on code in PR #38973:
URL: https://github.com/apache/spark/pull/38973#discussion_r1043098110


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -824,6 +824,39 @@ def withColumn(self, colName: str, col: Column) -> "DataFrame":
             session=self._session,
         )
 
+    def melt(

Review Comment:
   ditto: rename as `unpivot`
   
   and add one more line `melt = unpivot` somewhere



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] beliefer commented on pull request #38973: [SPARK-41439][CONNECT][PYTHON] Implement `DataFrame.melt` and `DataFrame.unpivot`

Posted by GitBox <gi...@apache.org>.

beliefer commented on PR #38973:
URL: https://github.com/apache/spark/pull/38973#issuecomment-1344012544

   > @beliefer BTW, would you mind sending a followup to add some end-to-end tests in `test_connect_basic.py` ?
   
   OK


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a diff in pull request #38973: [WIP][SPARK-41439][CONNECT][PYTHON] Implement `DataFrame.melt`

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on code in PR #38973:
URL: https://github.com/apache/spark/pull/38973#discussion_r1043076491


##########
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -309,6 +310,28 @@ class SparkConnectPlanner(session: SparkSession) {
     UnresolvedHint(rel.getName, params, transformRelation(rel.getInput))
   }
 
+  private def transformMelt(rel: proto.Melt): LogicalPlan = {
+    val ids = rel.getIdsList.asScala.toArray.map { expr =>
+      Column(transformExpression(expr))
+    }
+
+    if (rel.getValuesList.isEmpty) {
+      Dataset

Review Comment:
   since `melt` is just a alias for `unpivot`, and `unpivot` happens to have a [dedicated logical plan](https://github.com/apache/spark/blob/a80899f8bef84bb471606f0bbef889aa5e81628b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2072-L2085),  so please just use the plan `Unpivot` here instead



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a diff in pull request #38973: [WIP][SPARK-41439][CONNECT][PYTHON] Implement `DataFrame.melt`

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on code in PR #38973:
URL: https://github.com/apache/spark/pull/38973#discussion_r1043095938


##########
python/pyspark/sql/connect/plan.py:
##########
@@ -984,6 +984,67 @@ def _repr_html_(self) -> str:
         """
 
 
+class Melt(LogicalPlan):

Review Comment:
   ditto, please rename it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] beliefer commented on a diff in pull request #38973: [WIP][SPARK-41439][CONNECT][PYTHON] Implement `DataFrame.melt`

Posted by GitBox <gi...@apache.org>.

beliefer commented on code in PR #38973:
URL: https://github.com/apache/spark/pull/38973#discussion_r1043230936


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -824,6 +824,39 @@ def withColumn(self, colName: str, col: Column) -> "DataFrame":
             session=self._session,
         )
 
+    def melt(
+        self,
+        ids: List["ColumnOrName"],
+        values: List["ColumnOrName"],
+        variableColumnName: str,
+        valueColumnName: str,
+    ) -> "DataFrame":

Review Comment:
   I got it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on pull request #38973: [SPARK-41439][CONNECT][PYTHON] Implement `DataFrame.melt` and `DataFrame.unpivot`

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on PR #38973:
URL: https://github.com/apache/spark/pull/38973#issuecomment-1343999967

   merged into master, thank you @beliefer 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a diff in pull request #38973: [WIP][SPARK-41439][CONNECT][PYTHON] Implement `DataFrame.melt`

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on code in PR #38973:
URL: https://github.com/apache/spark/pull/38973#discussion_r1043076491


##########
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -309,6 +310,28 @@ class SparkConnectPlanner(session: SparkSession) {
     UnresolvedHint(rel.getName, params, transformRelation(rel.getInput))
   }
 
+  private def transformMelt(rel: proto.Melt): LogicalPlan = {
+    val ids = rel.getIdsList.asScala.toArray.map { expr =>
+      Column(transformExpression(expr))
+    }
+
+    if (rel.getValuesList.isEmpty) {
+      Dataset

Review Comment:
   since `melt` is just a alias for `unpivot`, and `unpivot` has a [dedicated logical plan](https://github.com/apache/spark/blob/a80899f8bef84bb471606f0bbef889aa5e81628b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2072-L2085),  so please just use the plan `Unpivot` here instead



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a diff in pull request #38973: [WIP][SPARK-41439][CONNECT][PYTHON] Implement `DataFrame.melt`

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on code in PR #38973:
URL: https://github.com/apache/spark/pull/38973#discussion_r1043095105


##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -570,3 +571,21 @@ message Hint {
   // (Optional) Hint parameters.
   repeated Expression.Literal parameters = 3;
 }
+
+// Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set.
+message Melt {

Review Comment:
   since `melt` == `unpivot`, and `Unpivot` is just a logical plan, so I suggest to rename it `Unpivot`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] amaliujia commented on a diff in pull request #38973: [WIP][SPARK-41439][CONNECT][PYTHON] Implement `DataFrame.melt`

Posted by GitBox <gi...@apache.org>.

amaliujia commented on code in PR #38973:
URL: https://github.com/apache/spark/pull/38973#discussion_r1042904146


##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -570,3 +571,21 @@ message Hint {
   // (Optional) Hint parameters.
   repeated Expression.Literal parameters = 3;
 }
+
+// Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set.
+message Melt {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Required) Id columns.
+  repeated Expression ids = 2;

Review Comment:
   Yes Expression is right. Honestly I think Expression is used more often than `Column` for relational world



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a diff in pull request #38973: [WIP][SPARK-41439][CONNECT][PYTHON] Implement `DataFrame.melt`

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on code in PR #38973:
URL: https://github.com/apache/spark/pull/38973#discussion_r1043095105


##########
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -570,3 +571,21 @@ message Hint {
   // (Optional) Hint parameters.
   repeated Expression.Literal parameters = 3;
 }
+
+// Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set.
+message Melt {

Review Comment:
   since `melt` == `unpivot`, and `Unpivot` is just a logical plan, so I suggest renaming it as `Unpivot`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a diff in pull request #38973: [SPARK-41439][CONNECT][PYTHON] Implement `DataFrame.melt` and `DataFrame.unpivot`

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on code in PR #38973:
URL: https://github.com/apache/spark/pull/38973#discussion_r1043999011


##########
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -309,6 +310,24 @@ class SparkConnectPlanner(session: SparkSession) {
     UnresolvedHint(rel.getName, params, transformRelation(rel.getInput))
   }
 
+  private def transformUnpivot(rel: proto.Unpivot): LogicalPlan = {
+    val ids = rel.getIdsList.asScala.toArray.map { expr =>
+      Column(transformExpression(expr))
+    }
+
+    val df = Dataset.ofRows(session, transformRelation(rel.getInput))
+
+    if (rel.getValuesList.isEmpty) {
+      df.unpivot(ids, rel.getVariableColumnName, rel.getValueColumnName).logicalPlan
+    } else {
+      val values = rel.getValuesList.asScala.toArray.map { expr =>
+        Column(transformExpression(expr))
+      }
+
+      df.unpivot(ids, values, rel.getVariableColumnName, rel.getValueColumnName).logicalPlan

Review Comment:
   one more thing:
   please directly use the Unpivot Expression instead of invoking the `Dataframe.unpivot`
   ```
       Unpivot(
         Some(ids.map(_.named)),
         Some(values.map(v => Seq(v.named))),
         None,
         variableColumnName,
         Seq(valueColumnName),
         logicalPlan
       )
   ```
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng closed pull request #38973: [SPARK-41439][CONNECT][PYTHON] Implement `DataFrame.melt` and `DataFrame.unpivot`

Posted by GitBox <gi...@apache.org>.

zhengruifeng closed pull request #38973: [SPARK-41439][CONNECT][PYTHON] Implement `DataFrame.melt` and `DataFrame.unpivot`
URL: https://github.com/apache/spark/pull/38973


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org