You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/12/27 07:47:03 UTC

[GitHub] [spark] beliefer opened a new pull request, #39236: [SPARK-41068][CONNECT][PYTHON] Implement `DataFrame.stat.corr`

beliefer opened a new pull request, #39236:
URL: https://github.com/apache/spark/pull/39236

   ### What changes were proposed in this pull request?
   Implement `DataFrame.stat.corr` with a proto message
   
   Implement `DataFrame.stat.corr` for scala API
   Implement `DataFrame.stat.corr` for python API
   
   
   ### Why are the changes needed?
   for Connect API coverage
   
   
   ### Does this PR introduce _any_ user-facing change?
   'No'. New API
   
   
   ### How was this patch tested?
   New test cases.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] beliefer commented on a diff in pull request #39236: [SPARK-41068][CONNECT][PYTHON] Implement `DataFrame.stat.corr`

Posted by GitBox <gi...@apache.org>.
beliefer commented on code in PR #39236:
URL: https://github.com/apache/spark/pull/39236#discussion_r1058721777


##########
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -352,6 +353,16 @@ class SparkConnectPlanner(session: SparkSession) {
       data = Tuple1.apply(cov) :: Nil)
   }
 
+  private def transformStatCorr(rel: proto.StatCorr): LogicalPlan = {
+    val corr = Dataset
+      .ofRows(session, transformRelation(rel.getInput))
+      .stat
+      .corr(rel.getCol1, rel.getCol2, rel.getMethod)

Review Comment:
   The default value is `pearson`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] beliefer commented on a diff in pull request #39236: [SPARK-41068][CONNECT][PYTHON] Implement `DataFrame.stat.corr`

Posted by GitBox <gi...@apache.org>.
beliefer commented on code in PR #39236:
URL: https://github.com/apache/spark/pull/39236#discussion_r1057591817


##########
python/pyspark/sql/tests/connect/test_connect_basic.py:
##########
@@ -970,6 +972,11 @@ def test_show(self):
         expected = "+---+---+\n|  X|  Y|\n+---+---+\n|  1|  2|\n+---+---+\n"
         self.assertEqual(show_str, expected)
 
+    def test_stat_corr(self):

Review Comment:
   OK



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng closed pull request #39236: [SPARK-41068][CONNECT][PYTHON] Implement `DataFrame.stat.corr`

Posted by GitBox <gi...@apache.org>.
zhengruifeng closed pull request #39236: [SPARK-41068][CONNECT][PYTHON] Implement `DataFrame.stat.corr`
URL: https://github.com/apache/spark/pull/39236


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #39236: [SPARK-41068][CONNECT][PYTHON] Implement `DataFrame.stat.corr`

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on PR #39236:
URL: https://github.com/apache/spark/pull/39236#issuecomment-1367732111

   merge into master, thank you @beliefer for working on this


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #39236: [SPARK-41068][CONNECT][PYTHON] Implement `DataFrame.stat.corr`

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #39236:
URL: https://github.com/apache/spark/pull/39236#discussion_r1058719947


##########
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -352,6 +353,16 @@ class SparkConnectPlanner(session: SparkSession) {
       data = Tuple1.apply(cov) :: Nil)
   }
 
+  private def transformStatCorr(rel: proto.StatCorr): LogicalPlan = {
+    val corr = Dataset
+      .ofRows(session, transformRelation(rel.getInput))
+      .stat
+      .corr(rel.getCol1, rel.getCol2, rel.getMethod)

Review Comment:
   what if the optional `method` is not set ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] beliefer commented on pull request #39236: [SPARK-41068][CONNECT][PYTHON] Implement `DataFrame.stat.corr`

Posted by GitBox <gi...@apache.org>.
beliefer commented on PR #39236:
URL: https://github.com/apache/spark/pull/39236#issuecomment-1367052882

   ping @HyukjinKwon @zhengruifeng @grundprinzip @amaliujia


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #39236: [SPARK-41068][CONNECT][PYTHON] Implement `DataFrame.stat.corr`

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #39236:
URL: https://github.com/apache/spark/pull/39236#discussion_r1058734162


##########
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -352,6 +353,16 @@ class SparkConnectPlanner(session: SparkSession) {
       data = Tuple1.apply(cov) :: Nil)
   }
 
+  private def transformStatCorr(rel: proto.StatCorr): LogicalPlan = {
+    val corr = Dataset
+      .ofRows(session, transformRelation(rel.getInput))
+      .stat
+      .corr(rel.getCol1, rel.getCol2, rel.getMethod)

Review Comment:
   in this server side we need to check `rel.hasMethod` since it is optional.
   
   other clients can not set `pearson` according to the definition of this proto message



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] beliefer commented on pull request #39236: [SPARK-41068][CONNECT][PYTHON] Implement `DataFrame.stat.corr`

Posted by GitBox <gi...@apache.org>.
beliefer commented on PR #39236:
URL: https://github.com/apache/spark/pull/39236#issuecomment-1367742111

   @zhengruifeng @HyukjinKwon Thank you.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #39236: [SPARK-41068][CONNECT][PYTHON] Implement `DataFrame.stat.corr`

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #39236:
URL: https://github.com/apache/spark/pull/39236#discussion_r1057527990


##########
python/pyspark/sql/tests/connect/test_connect_basic.py:
##########
@@ -970,6 +972,11 @@ def test_show(self):
         expected = "+---+---+\n|  X|  Y|\n+---+---+\n|  1|  2|\n+---+---+\n"
         self.assertEqual(show_str, expected)
 
+    def test_stat_corr(self):

Review Comment:
   what about create a df with `spark.sql(...)` and compare the results between PySpark and Connnect?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #39236: [SPARK-41068][CONNECT][PYTHON] Implement `DataFrame.stat.corr`

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #39236:
URL: https://github.com/apache/spark/pull/39236#discussion_r1058720184


##########
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -352,6 +353,16 @@ class SparkConnectPlanner(session: SparkSession) {
       data = Tuple1.apply(cov) :: Nil)
   }
 
+  private def transformStatCorr(rel: proto.StatCorr): LogicalPlan = {
+    val corr = Dataset
+      .ofRows(session, transformRelation(rel.getInput))
+      .stat
+      .corr(rel.getCol1, rel.getCol2, rel.getMethod)

Review Comment:
   I think we need to check `rel.hasMethod`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] beliefer commented on a diff in pull request #39236: [SPARK-41068][CONNECT][PYTHON] Implement `DataFrame.stat.corr`

Posted by GitBox <gi...@apache.org>.
beliefer commented on code in PR #39236:
URL: https://github.com/apache/spark/pull/39236#discussion_r1058876958


##########
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -352,6 +353,16 @@ class SparkConnectPlanner(session: SparkSession) {
       data = Tuple1.apply(cov) :: Nil)
   }
 
+  private def transformStatCorr(rel: proto.StatCorr): LogicalPlan = {
+    val corr = Dataset
+      .ofRows(session, transformRelation(rel.getInput))
+      .stat
+      .corr(rel.getCol1, rel.getCol2, rel.getMethod)

Review Comment:
   I got it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org