You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/10/27 03:05:46 UTC

[GitHub] [spark] zhengruifeng opened a new pull request, #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `DataFrameFunction` in proto and implement `DataFrame.summary`

zhengruifeng opened a new pull request, #38318:
URL: https://github.com/apache/spark/pull/38318

   ### What changes were proposed in this pull request?
    Implement `DataFrame.summary`
   
   there is a set of DataFrame APIs implemented in [`StatFunctions`](https://github.com/apache/spark/blob/9cae423075145d3dd81d53f4b82d4f2af6fe7c15/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala),  [`DataFrameStatFunctions`](https://github.com/apache/spark/blob/b69c26833c99337bb17922f21dd72ee3a12e0c0a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala) and [`DataFrameNaFunctions`](https://github.com/apache/spark/blob/5d74ace648422e7a9bff7774ac266372934023b9/sql/core/src/main/scala/org/apache/spark/sql/DataFrameNaFunctions.scala), which I think can not be implemented in connect client due to:
   
   1. depend on Catalyst's analysis (most of them);
   2. implemented in RDD operations (like `summary`,`approxQuantile`);
   3. internally trigger jobs (like `summary`);
   
   This PR introduced a new proto `DataFrameFunction`  to support arbitary `DataFrame => DataFrame` method
   
   ### Why are the changes needed?
   for Connect API coverage
   
   
   ### Does this PR introduce _any_ user-facing change?
   yes, new API
   
   
   ### How was this patch tested?
   added UT


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan closed pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `StatFunction` in proto and implement `DataFrame.summary`

Posted by GitBox <gi...@apache.org>.

cloud-fan closed pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `StatFunction` in proto and implement `DataFrame.summary`
URL: https://github.com/apache/spark/pull/38318


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a diff in pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `DataFrameFunction` in proto and implement `DataFrame.summary`

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on code in PR #38318:
URL: https://github.com/apache/spark/pull/38318#discussion_r1006439147


##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -93,6 +96,19 @@ class SparkConnectPlanner(plan: proto.Relation, session: SparkSession) {
       transformRelation(rel.getInput))
   }
 
+  private def transformDataFrameFunction(rel: proto.DataFrameFunction): LogicalPlan = {
+    val child = transformRelation(rel.getInput)
+
+    rel.getFunctionCase match {
+      case proto.DataFrameFunction.FunctionCase.SUMMARY =>
+        StatFunctions
+          .summary(Dataset.ofRows(session, child), rel.getSummary.getStatisticsList.asScala.toSeq)

Review Comment:
   @cloud-fan the old `df.summary` eagerly compute the statistics and always return a `LocalRelation`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a diff in pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `StatFunction` in proto and implement `DataFrame.summary`

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on code in PR #38318:
URL: https://github.com/apache/spark/pull/38318#discussion_r1016143967


##########
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectStatFunctionSuite.scala:
##########
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connect.planner
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.connect.proto
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.connect.command.SparkConnectCommandPlanner
+import org.apache.spark.sql.connect.dsl.commands._
+import org.apache.spark.sql.test.{SharedSparkSession, SQLTestUtils}
+
+class SparkConnectStatFunctionSuite

Review Comment:
   i encounter some problems in adding test in existing suites since summary requires a session and need to analyze the plan;
   this suite will also cover some eargly computed stat functions (`cov` `corr`) in the future



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `DataFrameFunction` in proto and implement `DataFrame.summary`

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on code in PR #38318:
URL: https://github.com/apache/spark/pull/38318#discussion_r1006408546


##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -93,6 +96,19 @@ class SparkConnectPlanner(plan: proto.Relation, session: SparkSession) {
       transformRelation(rel.getInput))
   }
 
+  private def transformDataFrameFunction(rel: proto.DataFrameFunction): LogicalPlan = {
+    val child = transformRelation(rel.getInput)
+
+    rel.getFunctionCase match {
+      case proto.DataFrameFunction.FunctionCase.SUMMARY =>
+        StatFunctions
+          .summary(Dataset.ofRows(session, child), rel.getSummary.getStatisticsList.asScala.toSeq)

Review Comment:
   what do you mean by `truncate the SQL plans`? DataFrame transformations just accumulate the logical plan.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `DataFrameFunction` in proto and implement `DataFrame.summary`

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on code in PR #38318:
URL: https://github.com/apache/spark/pull/38318#discussion_r1001496337


##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -93,6 +96,19 @@ class SparkConnectPlanner(plan: proto.Relation, session: SparkSession) {
       transformRelation(rel.getInput))
   }
 
+  private def transformDataFrameFunction(rel: proto.DataFrameFunction): LogicalPlan = {
+    val child = transformRelation(rel.getInput)
+
+    rel.getFunctionCase match {
+      case proto.DataFrameFunction.FunctionCase.SUMMARY =>
+        StatFunctions
+          .summary(Dataset.ofRows(session, child), rel.getSummary.getStatisticsList.asScala.toSeq)

Review Comment:
   This is fine for now but it's going to truncate the SQL plans that disable further optimization. We should probably add dedicated plans for `def summary` in `Dataset` itself.
   
   For now, LGTM



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] amaliujia commented on a diff in pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `StatFunction` in proto and implement `DataFrame.summary`

Posted by GitBox <gi...@apache.org>.

amaliujia commented on code in PR #38318:
URL: https://github.com/apache/spark/pull/38318#discussion_r1015762416


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -323,6 +323,14 @@ def unionByName(self, other: "DataFrame", allowMissingColumns: bool = False) ->
     def where(self, condition: Expression) -> "DataFrame":
         return self.filter(condition)
 
+    def summary(self, *statistics: str) -> "DataFrame":
+        _statistics: List[str] = list(statistics)
+        assert all(isinstance(s, str) for s in _statistics)

Review Comment:
   In contrast, I guess the assert in `plan.py` is ok because that is internal API and we can implement it right so assert on unexpected things (and developer can fix when it really happens).
   ```
       def __init__(self, child: Optional["LogicalPlan"], function: str, **kwargs: Any) -> None:
           super().__init__(child)
           assert function in ["summary"]
           self.function = function
     ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on pull request #38318: [SPARK-40852][CONNECT][PYTHON][WIP] Implement `DataFrame.summary`

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on PR #38318:
URL: https://github.com/apache/spark/pull/38318#issuecomment-1285263225

   there is a set of DataFrame APIs implemented in [`StatFunctions`](https://github.com/apache/spark/blob/9cae423075145d3dd81d53f4b82d4f2af6fe7c15/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala) and  [`DataFrameStatFunctions`](https://github.com/apache/spark/blob/b69c26833c99337bb17922f21dd72ee3a12e0c0a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala), which I think can not be implemented in connect client due to:
   
   1. depend on Catalyst's analysis;
   2. implemented in RDD operations;
   3. internally trigger jobs (like `summary`);
   
   this PR is to find a way to support arbitary `DataFrame => DataFrame` method


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a diff in pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `StatFunction` in proto and implement `DataFrame.summary`

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on code in PR #38318:
URL: https://github.com/apache/spark/pull/38318#discussion_r1017258828


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -246,3 +248,20 @@ message SubqueryAlias {
   // Optional. Qualifier of the alias.
   repeated string qualifier = 3;
 }
+
+// StatFunction
+message StatFunction {
+  // Required. The input relation.
+  Relation input = 1;
+  // Required. The function and its parameters.
+  oneof function {
+    Summary summary = 2;
+
+    Unknown unknown = 999;

Review Comment:
   yes, such as `crosstab` `cov` `corr` etc



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `StatFunction` in proto and implement `DataFrame.summary`

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on PR #38318:
URL: https://github.com/apache/spark/pull/38318#issuecomment-1308066292

   thanks @cloud-fan @HyukjinKwon @amaliujia for reviews!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a diff in pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `StatFunction` in proto and implement `DataFrame.summary`

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on code in PR #38318:
URL: https://github.com/apache/spark/pull/38318#discussion_r1015041619


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -323,6 +323,14 @@ def unionByName(self, other: "DataFrame", allowMissingColumns: bool = False) ->
     def where(self, condition: Expression) -> "DataFrame":
         return self.filter(condition)
 
+    def summary(self, *statistics: str) -> "DataFrame":
+        _statistics: List[str] = list(statistics)

Review Comment:
   different from https://github.com/apache/spark/blob/29e45528315c43336ba60ee1717ef5ec0c001c00/python/pyspark/sql/dataframe.py#L2575-L2576 since i think that preprocessing weird



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a diff in pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `StatFunction` in proto and implement `DataFrame.summary`

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on code in PR #38318:
URL: https://github.com/apache/spark/pull/38318#discussion_r1016204669


##########
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectStatFunctionSuite.scala:
##########
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connect.planner
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.connect.proto
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.connect.command.SparkConnectCommandPlanner
+import org.apache.spark.sql.connect.dsl.commands._
+import org.apache.spark.sql.test.{SharedSparkSession, SQLTestUtils}
+
+class SparkConnectStatFunctionSuite

Review Comment:
   removed, add another test in existing suites



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a diff in pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `DataFrameFunction` in proto and implement `DataFrame.summary`

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on code in PR #38318:
URL: https://github.com/apache/spark/pull/38318#discussion_r1006680046


##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -93,6 +96,19 @@ class SparkConnectPlanner(plan: proto.Relation, session: SparkSession) {
       transformRelation(rel.getInput))
   }
 
+  private def transformDataFrameFunction(rel: proto.DataFrameFunction): LogicalPlan = {
+    val child = transformRelation(rel.getInput)
+
+    rel.getFunctionCase match {
+      case proto.DataFrameFunction.FunctionCase.SUMMARY =>
+        StatFunctions
+          .summary(Dataset.ofRows(session, child), rel.getSummary.getStatisticsList.asScala.toSeq)

Review Comment:
   yes, it has been resolved 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a diff in pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `StatFunction` in proto and implement `DataFrame.summary`

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on code in PR #38318:
URL: https://github.com/apache/spark/pull/38318#discussion_r1016144747


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -246,3 +248,20 @@ message SubqueryAlias {
   // Optional. Qualifier of the alias.
   repeated string qualifier = 3;
 }
+
+// StatFunction
+message StatFunction {
+  // Required. The input relation.
+  Relation input = 1;
+  // Required. The function and its parameters.
+  oneof function {
+    Summary summary = 2;
+
+    Unknown unknown = 999;

Review Comment:
   here follows https://github.com/apache/spark/blob/master/connector/connect/src/main/protobuf/spark/connect/relations.proto#L51 to catch unexpected input



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a diff in pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `StatFunction` in proto and implement `DataFrame.summary`

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on code in PR #38318:
URL: https://github.com/apache/spark/pull/38318#discussion_r1016146224


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -323,6 +323,14 @@ def unionByName(self, other: "DataFrame", allowMissingColumns: bool = False) ->
     def where(self, condition: Expression) -> "DataFrame":
         return self.filter(condition)
 
+    def summary(self, *statistics: str) -> "DataFrame":
+        _statistics: List[str] = list(statistics)
+        assert all(isinstance(s, str) for s in _statistics)

Review Comment:
   agree, should give the error message, will update



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `DataFrameFunction` in proto and implement `DataFrame.summary`

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on code in PR #38318:
URL: https://github.com/apache/spark/pull/38318#discussion_r1006522847


##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -93,6 +96,19 @@ class SparkConnectPlanner(plan: proto.Relation, session: SparkSession) {
       transformRelation(rel.getInput))
   }
 
+  private def transformDataFrameFunction(rel: proto.DataFrameFunction): LogicalPlan = {
+    val child = transformRelation(rel.getInput)
+
+    rel.getFunctionCase match {
+      case proto.DataFrameFunction.FunctionCase.SUMMARY =>
+        StatFunctions
+          .summary(Dataset.ofRows(session, child), rel.getSummary.getStatisticsList.asScala.toSeq)

Review Comment:
   Oh that's an issue. Can it be solved by updating `df.summary` implementation?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a diff in pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `DataFrameFunction` in proto and implement `DataFrame.summary`

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on code in PR #38318:
URL: https://github.com/apache/spark/pull/38318#discussion_r1001680211


##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -93,6 +96,19 @@ class SparkConnectPlanner(plan: proto.Relation, session: SparkSession) {
       transformRelation(rel.getInput))
   }
 
+  private def transformDataFrameFunction(rel: proto.DataFrameFunction): LogicalPlan = {
+    val child = transformRelation(rel.getInput)
+
+    rel.getFunctionCase match {
+      case proto.DataFrameFunction.FunctionCase.SUMMARY =>
+        StatFunctions
+          .summary(Dataset.ofRows(session, child), rel.getSummary.getStatisticsList.asScala.toSeq)

Review Comment:
   yes, then it will have more optimization space. let us add new plan for it. thanks



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a diff in pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `StatFunction` in proto and implement `DataFrame.summary`

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on code in PR #38318:
URL: https://github.com/apache/spark/pull/38318#discussion_r1016152368


##########
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectStatFunctionSuite.scala:
##########
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connect.planner
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.connect.proto
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.connect.command.SparkConnectCommandPlanner
+import org.apache.spark.sql.connect.dsl.commands._
+import org.apache.spark.sql.test.{SharedSparkSession, SQLTestUtils}
+
+class SparkConnectStatFunctionSuite

Review Comment:
   let me take another look



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `StatFunction` in proto and implement `DataFrame.summary`

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on code in PR #38318:
URL: https://github.com/apache/spark/pull/38318#discussion_r1016063920


##########
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectStatFunctionSuite.scala:
##########
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connect.planner
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.connect.proto
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.connect.command.SparkConnectCommandPlanner
+import org.apache.spark.sql.connect.dsl.commands._
+import org.apache.spark.sql.test.{SharedSparkSession, SQLTestUtils}
+
+class SparkConnectStatFunctionSuite

Review Comment:
   why do we need a new test suite?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a diff in pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `DataFrameFunction` in proto and implement `DataFrame.summary`

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on code in PR #38318:
URL: https://github.com/apache/spark/pull/38318#discussion_r1006680372


##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -93,6 +96,19 @@ class SparkConnectPlanner(plan: proto.Relation, session: SparkSession) {
       transformRelation(rel.getInput))
   }
 
+  private def transformDataFrameFunction(rel: proto.DataFrameFunction): LogicalPlan = {
+    val child = transformRelation(rel.getInput)
+
+    rel.getFunctionCase match {
+      case proto.DataFrameFunction.FunctionCase.SUMMARY =>
+        StatFunctions
+          .summary(Dataset.ofRows(session, child), rel.getSummary.getStatisticsList.asScala.toSeq)

Review Comment:
   In the new impl



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a diff in pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `DataFrameFunction` in proto and implement `DataFrame.summary`

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on code in PR #38318:
URL: https://github.com/apache/spark/pull/38318#discussion_r1006379858


##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -93,6 +96,19 @@ class SparkConnectPlanner(plan: proto.Relation, session: SparkSession) {
       transformRelation(rel.getInput))
   }
 
+  private def transformDataFrameFunction(rel: proto.DataFrameFunction): LogicalPlan = {
+    val child = transformRelation(rel.getInput)
+
+    rel.getFunctionCase match {
+      case proto.DataFrameFunction.FunctionCase.SUMMARY =>
+        StatFunctions
+          .summary(Dataset.ofRows(session, child), rel.getSummary.getStatisticsList.asScala.toSeq)

Review Comment:
   cc @cloud-fan @HyukjinKwon 
   since we had reimplemented the `df.summary` https://github.com/apache/spark/commit/6a0713a141fa98d83029d8388508cbbc40fd554e, are there some differences in sql optimization between this method (directly invoke `df.summary`) and adding a dedicated plan?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] amaliujia commented on a diff in pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `StatFunction` in proto and implement `DataFrame.summary`

Posted by GitBox <gi...@apache.org>.

amaliujia commented on code in PR #38318:
URL: https://github.com/apache/spark/pull/38318#discussion_r1015762416


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -323,6 +323,14 @@ def unionByName(self, other: "DataFrame", allowMissingColumns: bool = False) ->
     def where(self, condition: Expression) -> "DataFrame":
         return self.filter(condition)
 
+    def summary(self, *statistics: str) -> "DataFrame":
+        _statistics: List[str] = list(statistics)
+        assert all(isinstance(s, str) for s in _statistics)

Review Comment:
   In contrast, I guess the assert in `plan.py` is ok because that is internal API and we can implement it right so assert on unexpected things.
   ```
       def __init__(self, child: Optional["LogicalPlan"], function: str, **kwargs: Any) -> None:
           super().__init__(child)
           assert function in ["summary"]
           self.function = function
     ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `StatFunction` in proto and implement `DataFrame.summary`

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on code in PR #38318:
URL: https://github.com/apache/spark/pull/38318#discussion_r1016062984


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -246,3 +248,20 @@ message SubqueryAlias {
   // Optional. Qualifier of the alias.
   repeated string qualifier = 3;
 }
+
+// StatFunction
+message StatFunction {
+  // Required. The input relation.
+  Relation input = 1;
+  // Required. The function and its parameters.
+  oneof function {
+    Summary summary = 2;
+
+    Unknown unknown = 999;

Review Comment:
   why do we need this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] amaliujia commented on a diff in pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `StatFunction` in proto and implement `DataFrame.summary`

Posted by GitBox <gi...@apache.org>.

amaliujia commented on code in PR #38318:
URL: https://github.com/apache/spark/pull/38318#discussion_r1017139635


##########
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectStatFunctionSuite.scala:
##########
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connect.planner
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.connect.proto
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.connect.command.SparkConnectCommandPlanner
+import org.apache.spark.sql.connect.dsl.commands._
+import org.apache.spark.sql.test.{SharedSparkSession, SQLTestUtils}
+
+class SparkConnectStatFunctionSuite

Review Comment:
   Initially I thought a separate suite was ok because that suite actually just executed the proto plan to check expected result. 
   
   Now it's changed to plan comparison based test, which is ok. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `DataFrameFunction` in proto and implement `DataFrame.summary`

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on PR #38318:
URL: https://github.com/apache/spark/pull/38318#issuecomment-1286538498

   cc @HyukjinKwon @cloud-fan @amaliujia @grundprinzip  PTAL


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `DataFrameFunction` in proto and implement `DataFrame.summary`

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on code in PR #38318:
URL: https://github.com/apache/spark/pull/38318#discussion_r1006401431


##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -93,6 +96,19 @@ class SparkConnectPlanner(plan: proto.Relation, session: SparkSession) {
       transformRelation(rel.getInput))
   }
 
+  private def transformDataFrameFunction(rel: proto.DataFrameFunction): LogicalPlan = {
+    val child = transformRelation(rel.getInput)
+
+    rel.getFunctionCase match {
+      case proto.DataFrameFunction.FunctionCase.SUMMARY =>
+        StatFunctions
+          .summary(Dataset.ofRows(session, child), rel.getSummary.getStatisticsList.asScala.toSeq)

Review Comment:
   Some rules may not work as they don't recognize the new plan.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `StatFunction` in proto and implement `DataFrame.summary`

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on code in PR #38318:
URL: https://github.com/apache/spark/pull/38318#discussion_r1016674697


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -246,3 +248,20 @@ message SubqueryAlias {
   // Optional. Qualifier of the alias.
   repeated string qualifier = 3;
 }
+
+// StatFunction
+message StatFunction {
+  // Required. The input relation.
+  Relation input = 1;
+  // Required. The function and its parameters.
+  oneof function {
+    Summary summary = 2;
+
+    Unknown unknown = 999;

Review Comment:
   I think that's for enum but here is an optional field... cc @amaliujia 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `StatFunction` in proto and implement `DataFrame.summary`

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on PR #38318:
URL: https://github.com/apache/spark/pull/38318#issuecomment-1305136349

   cc @cloud-fan @HyukjinKwon @amaliujia 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on a diff in pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `StatFunction` in proto and implement `DataFrame.summary`

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on code in PR #38318:
URL: https://github.com/apache/spark/pull/38318#discussion_r1017296514


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -246,3 +248,20 @@ message SubqueryAlias {
   // Optional. Qualifier of the alias.
   repeated string qualifier = 3;
 }
+
+// StatFunction
+message StatFunction {
+  // Required. The input relation.
+  Relation input = 1;
+  // Required. The function and its parameters.
+  oneof function {
+    Summary summary = 2;
+
+    Unknown unknown = 999;

Review Comment:
   ok then this makes sense



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] amaliujia commented on a diff in pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `DataFrameFunction` in proto and implement `DataFrame.summary`

Posted by GitBox <gi...@apache.org>.

amaliujia commented on code in PR #38318:
URL: https://github.com/apache/spark/pull/38318#discussion_r1002094592


##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -93,6 +96,19 @@ class SparkConnectPlanner(plan: proto.Relation, session: SparkSession) {
       transformRelation(rel.getInput))
   }
 
+  private def transformDataFrameFunction(rel: proto.DataFrameFunction): LogicalPlan = {
+    val child = transformRelation(rel.getInput)
+
+    rel.getFunctionCase match {
+      case proto.DataFrameFunction.FunctionCase.SUMMARY =>
+        StatFunctions
+          .summary(Dataset.ofRows(session, child), rel.getSummary.getStatisticsList.asScala.toSeq)

Review Comment:
   +1! 
   
   I don't know how to add a new plan. It would be very useful to have a PR as an example.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng closed pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `DataFrameFunction` in proto and implement `DataFrame.summary`

Posted by GitBox <gi...@apache.org>.

zhengruifeng closed pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `DataFrameFunction` in proto and implement `DataFrame.summary`
URL: https://github.com/apache/spark/pull/38318


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] amaliujia commented on a diff in pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `StatFunction` in proto and implement `DataFrame.summary`

Posted by GitBox <gi...@apache.org>.

amaliujia commented on code in PR #38318:
URL: https://github.com/apache/spark/pull/38318#discussion_r1015761457


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -323,6 +323,14 @@ def unionByName(self, other: "DataFrame", allowMissingColumns: bool = False) ->
     def where(self, condition: Expression) -> "DataFrame":
         return self.filter(condition)
 
+    def summary(self, *statistics: str) -> "DataFrame":
+        _statistics: List[str] = list(statistics)
+        assert all(isinstance(s, str) for s in _statistics)

Review Comment:
   Given `def summary(self, *statistics: str) -> "DataFrame":` is public API so there could be some mis-use passing into non-str parameters for `statistics`, is this common practice to just assert without giving a message?



##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -323,6 +323,14 @@ def unionByName(self, other: "DataFrame", allowMissingColumns: bool = False) ->
     def where(self, condition: Expression) -> "DataFrame":
         return self.filter(condition)
 
+    def summary(self, *statistics: str) -> "DataFrame":
+        _statistics: List[str] = list(statistics)
+        assert all(isinstance(s, str) for s in _statistics)

Review Comment:
   In contrast, I guess the assert in `plan.py` is ok because that is internal API and we can implement it right
   ```
       def __init__(self, child: Optional["LogicalPlan"], function: str, **kwargs: Any) -> None:
           super().__init__(child)
           assert function in ["summary"]
           self.function = function
     ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] zhengruifeng commented on a diff in pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `StatFunction` in proto and implement `DataFrame.summary`

Posted by GitBox <gi...@apache.org>.

zhengruifeng commented on code in PR #38318:
URL: https://github.com/apache/spark/pull/38318#discussion_r1017287961


##########
connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectStatFunctionSuite.scala:
##########
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.connect.planner
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.connect.proto
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.connect.command.SparkConnectCommandPlanner
+import org.apache.spark.sql.connect.dsl.commands._
+import org.apache.spark.sql.test.{SharedSparkSession, SQLTestUtils}
+
+class SparkConnectStatFunctionSuite

Review Comment:
   tests were improved a lot since the time i sent this pr :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] amaliujia commented on a diff in pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `StatFunction` in proto and implement `DataFrame.summary`

Posted by GitBox <gi...@apache.org>.

amaliujia commented on code in PR #38318:
URL: https://github.com/apache/spark/pull/38318#discussion_r1017138563


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -246,3 +248,20 @@ message SubqueryAlias {
   // Optional. Qualifier of the alias.
   repeated string qualifier = 3;
 }
+
+// StatFunction
+message StatFunction {
+  // Required. The input relation.
+  Relation input = 1;
+  // Required. The function and its parameters.
+  oneof function {
+    Summary summary = 2;
+
+    Unknown unknown = 999;

Review Comment:
   Question: will we add new functions under this `oneof`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on pull request #38318: [SPARK-40852][CONNECT][PYTHON] Introduce `StatFunction` in proto and implement `DataFrame.summary`

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on PR #38318:
URL: https://github.com/apache/spark/pull/38318#issuecomment-1308063610

   thanks, merging to master!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org