Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/11/14 08:45:07 UTC

[GitHub] [spark] zhengruifeng opened a new pull request, #38653: [SPARK-41128][CONNECT][PYTHON] Implement `DataFrame.fillna ` and `DataFrame.na.fill `

zhengruifeng opened a new pull request, #38653:
URL: https://github.com/apache/spark/pull/38653

   ### What changes were proposed in this pull request?
   Implement `DataFrame.fillna` and `DataFrame.na.fill`.
   
   
   ### Why are the changes needed?
   For API coverage.
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, this PR adds new methods.
   
   
   ### How was this patch tested?
   Added unit tests.
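For context, the API accepts three argument shapes (a single value, a single value with a subset of columns, or a per-column dict), which the client folds into the parallel `cols`/`values` lists carried by the plan. A rough sketch of that normalization (hypothetical helper name, not the PR's actual implementation):

```python
from typing import Dict, List, Optional, Sequence, Tuple, Union

FillValue = Union[bool, int, float, str]


def normalize_fillna_args(
    value: Union[FillValue, Dict[str, FillValue]],
    subset: Optional[Sequence[str]] = None,
) -> Tuple[List[str], List[FillValue]]:
    """Fold fillna arguments into parallel (cols, values) lists.

    Mirrors the three combinations described in the NAFill proto comment:
    one value with no cols, one value with explicit cols, or a dict that
    yields equal-length cols and values.
    """
    if isinstance(value, dict):
        if subset is not None:
            raise ValueError("subset must not be given when value is a dict")
        cols = list(value.keys())
        values = list(value.values())
    else:
        cols = list(subset) if subset is not None else []
        values = [value]
    return cols, values
```

This matches the shapes exercised by the plan-only tests below, e.g. `fillna(value=1)` yields `([], [1])` and `fillna({"col_a": 1.5, "col_b": "abc"})` yields `(["col_a", "col_b"], [1.5, "abc"])`.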
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38653: [SPARK-41128][CONNECT][PYTHON] Implement `DataFrame.fillna ` and `DataFrame.na.fill `

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38653:
URL: https://github.com/apache/spark/pull/38653#discussion_r1022273256


##########
python/pyspark/sql/tests/connect/test_connect_plan_only.py:
##########
@@ -70,6 +70,37 @@ def test_filter(self):
         self.assertEqual(plan.root.filter.condition.unresolved_function.parts, [">"])
         self.assertEqual(len(plan.root.filter.condition.unresolved_function.arguments), 2)
 
+    def test_fill_na(self):
+        df = self.connect.readTable(table_name=self.tbl_name)
+
+        plan = df.fillna(value=1)._plan.to_proto(self.connect)
+        self.assertEqual(len(plan.root.fill_na.values), 1)
+        self.assertEqual(plan.root.fill_na.values[0].long_value, 1)
+        self.assertEqual(plan.root.fill_na.cols, [])
+
+        plan = df.na.fill(value="xyz")._plan.to_proto(self.connect)
+        self.assertEqual(len(plan.root.fill_na.values), 1)
+        self.assertEqual(plan.root.fill_na.values[0].string_value, "xyz")
+        self.assertEqual(plan.root.fill_na.cols, [])
+
+        plan = df.na.fill(value="xyz", subset=["col_a", "col_b"])._plan.to_proto(self.connect)
+        self.assertEqual(len(plan.root.fill_na.values), 1)
+        self.assertEqual(plan.root.fill_na.values[0].string_value, "xyz")
+        self.assertEqual(plan.root.fill_na.cols, ["col_a", "col_b"])
+
+        plan = df.na.fill(value=True, subset=("col_a", "col_b", "col_c"))._plan.to_proto(
+            self.connect
+        )
+        self.assertEqual(len(plan.root.fill_na.values), 1)
+        self.assertEqual(plan.root.fill_na.values[0].bool_value, True)
+        self.assertEqual(plan.root.fill_na.cols, ["col_a", "col_b", "col_c"])
+
+        plan = df.fillna({"col_a": 1.5, "col_b": "abc"})._plan.to_proto(self.connect)
+        self.assertEqual(len(plan.root.fill_na.values), 2)
+        self.assertEqual(plan.root.fill_na.values[0].double_value, 1.5)
+        self.assertEqual(plan.root.fill_na.values[1].string_value, "abc")
+        self.assertEqual(plan.root.fill_na.cols, ["col_a", "col_b"])

Review Comment:
   Yeah, I am actually thinking we maybe need at least one for each API to make sure they pass through. Then we can use such plan-only test cases to improve the coverage for different usages/cases.





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38653: [SPARK-41128][CONNECT][PYTHON] Implement `DataFrame.fillna ` and `DataFrame.na.fill `

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #38653:
URL: https://github.com/apache/spark/pull/38653#discussion_r1022295525


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -316,6 +319,36 @@ message StatCrosstab {
   string col2 = 3;
 }
 
+// Replaces null values.
+// It will invoke 'Dataset.na.fill' (same as 'DataFrameNaFunctions.fill') to compute the results.
+// The following 3 parameter combinations are supported:
+//  1, 'values' only contains 1 item, 'cols' is empty:
+//    replaces null values in all type-matched columns.
+//  2, 'values' only contains 1 item, 'cols' is not empty:
+//    replaces null values in the specified columns.
+//  3, 'values' contains more than 1 item, then 'cols' is required to have the same length:
+//    replaces each specified column with the corresponding value.
+message NAFill {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Optional) List of column names to consider.
+  repeated string cols = 2;
+
+  // (Required) Values to replace null values with. Should contain at least 1 item.
+  repeated Type values = 3;
+
+  // Available data types.
+  message Type {

Review Comment:
   Could we probably use `Literal` to ship these values? We can call `eval` to take this out on the JVM side.
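To illustrate this suggestion (with hypothetical Python stand-ins, not the actual protobuf classes), each fill value could be shipped as a tagged literal that the server unpacks:

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Literal:
    # Hypothetical stand-in for a proto 'Literal' with a oneof over value kinds.
    bool_value: Optional[bool] = None
    long_value: Optional[int] = None
    double_value: Optional[float] = None
    string_value: Optional[str] = None

    def eval(self) -> Union[bool, int, float, str]:
        # Return whichever field of the oneof is set.
        for v in (self.bool_value, self.long_value,
                  self.double_value, self.string_value):
            if v is not None:
                return v
        raise ValueError("empty literal")


def pack(v) -> Literal:
    # bool must be checked before int: bool is a subclass of int in Python.
    if isinstance(v, bool):
        return Literal(bool_value=v)
    if isinstance(v, int):
        return Literal(long_value=v)
    if isinstance(v, float):
        return Literal(double_value=v)
    if isinstance(v, str):
        return Literal(string_value=v)
    raise TypeError(f"unsupported fill value type: {type(v).__name__}")
```

The upside, as discussed below in the thread, is that the wire format stays generic: which value kinds are acceptable for `fillna` is then a server-side decision rather than a proto-level one.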





[GitHub] [spark] zhengruifeng commented on a diff in pull request #38653: [SPARK-41128][CONNECT][PYTHON] Implement `DataFrame.fillna ` and `DataFrame.na.fill `

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #38653:
URL: https://github.com/apache/spark/pull/38653#discussion_r1022272524


##########
python/pyspark/sql/tests/connect/test_connect_plan_only.py:
##########
@@ -70,6 +70,37 @@ def test_filter(self):
         self.assertEqual(plan.root.filter.condition.unresolved_function.parts, [">"])
         self.assertEqual(len(plan.root.filter.condition.unresolved_function.arguments), 2)
 
+    def test_fill_na(self):
+        df = self.connect.readTable(table_name=self.tbl_name)
+
+        plan = df.fillna(value=1)._plan.to_proto(self.connect)
+        self.assertEqual(len(plan.root.fill_na.values), 1)
+        self.assertEqual(plan.root.fill_na.values[0].long_value, 1)
+        self.assertEqual(plan.root.fill_na.cols, [])
+
+        plan = df.na.fill(value="xyz")._plan.to_proto(self.connect)
+        self.assertEqual(len(plan.root.fill_na.values), 1)
+        self.assertEqual(plan.root.fill_na.values[0].string_value, "xyz")
+        self.assertEqual(plan.root.fill_na.cols, [])
+
+        plan = df.na.fill(value="xyz", subset=["col_a", "col_b"])._plan.to_proto(self.connect)
+        self.assertEqual(len(plan.root.fill_na.values), 1)
+        self.assertEqual(plan.root.fill_na.values[0].string_value, "xyz")
+        self.assertEqual(plan.root.fill_na.cols, ["col_a", "col_b"])
+
+        plan = df.na.fill(value=True, subset=("col_a", "col_b", "col_c"))._plan.to_proto(
+            self.connect
+        )
+        self.assertEqual(len(plan.root.fill_na.values), 1)
+        self.assertEqual(plan.root.fill_na.values[0].bool_value, True)
+        self.assertEqual(plan.root.fill_na.cols, ["col_a", "col_b", "col_c"])
+
+        plan = df.fillna({"col_a": 1.5, "col_b": "abc"})._plan.to_proto(self.connect)
+        self.assertEqual(len(plan.root.fill_na.values), 2)
+        self.assertEqual(plan.root.fill_na.values[0].double_value, 1.5)
+        self.assertEqual(plan.root.fill_na.values[1].string_value, "abc")
+        self.assertEqual(plan.root.fill_na.cols, ["col_a", "col_b"])

Review Comment:
   we don't have e2e tests for most methods, but let me add one for this method





[GitHub] [spark] zhengruifeng commented on a diff in pull request #38653: [SPARK-41128][CONNECT][PYTHON] Implement `DataFrame.fillna ` and `DataFrame.na.fill `

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #38653:
URL: https://github.com/apache/spark/pull/38653#discussion_r1022344761


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -316,6 +319,36 @@ message StatCrosstab {
   string col2 = 3;
 }
 
+// Replaces null values.
+// It will invoke 'Dataset.na.fill' (same as 'DataFrameNaFunctions.fill') to compute the results.
+// The following 3 parameter combinations are supported:
+//  1, 'values' only contains 1 item, 'cols' is empty:
+//    replaces null values in all type-matched columns.
+//  2, 'values' only contains 1 item, 'cols' is not empty:
+//    replaces null values in the specified columns.
+//  3, 'values' contains more than 1 item, then 'cols' is required to have the same length:
+//    replaces each specified column with the corresponding value.
+message NAFill {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Optional) List of column names to consider.
+  repeated string cols = 2;
+
+  // (Required) Values to replace null values with. Should contain at least 1 item.
+  repeated Type values = 3;
+
+  // Available data types.
+  message Type {

Review Comment:
   updated





[GitHub] [spark] zhengruifeng commented on a diff in pull request #38653: [SPARK-41128][CONNECT][PYTHON] Implement `DataFrame.fillna ` and `DataFrame.na.fill `

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #38653:
URL: https://github.com/apache/spark/pull/38653#discussion_r1022538668


##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala:
##########
@@ -226,6 +226,58 @@ package object dsl {
       }
     }
 
+    implicit class DslNAFunctions(val logicalPlan: Relation) {

Review Comment:
   no, not needed, but `implicit` is already widely used in both catalyst and connect...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on pull request #38653: [SPARK-41128][CONNECT][PYTHON] Implement `DataFrame.fillna ` and `DataFrame.na.fill `

Posted by GitBox <gi...@apache.org>.
amaliujia commented on PR #38653:
URL: https://github.com/apache/spark/pull/38653#issuecomment-1314771256

   LGTM




[GitHub] [spark] amaliujia commented on a diff in pull request #38653: [SPARK-41128][CONNECT][PYTHON] Implement `DataFrame.fillna ` and `DataFrame.na.fill `

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38653:
URL: https://github.com/apache/spark/pull/38653#discussion_r1022249117


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -316,6 +319,36 @@ message StatCrosstab {
   string col2 = 3;
 }
 
+// Replaces null values.
+// It will invoke 'Dataset.na.fill' (same as 'DataFrameNaFunctions.fill') to compute the results.
+// The following 3 parameter combinations are supported:
+//  1, 'values' only contains 1 item, 'cols' is empty:
+//    replaces null values in all type-matched columns.
+//  2, 'values' only contains 1 item, 'cols' is not empty:
+//    replaces null values in the specified columns.
+//  3, 'values' contains more than 1 item, then 'cols' is required to have the same length:
+//    replaces each specified column with the corresponding value.
+message NAFill {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Optional) List of column names to consider.
+  repeated string cols = 2;
+
+  // (Required) Values to replace null values with. Should contain at least 1 item.
+  repeated ValueType values = 3;
+
+  // Available data types.
+  message ValueType {

Review Comment:
   Is this a `value` or a `type`? Those are probably different things. If it is just a value, then just call it `value` (same for `type`).



##########
python/pyspark/sql/tests/connect/test_connect_plan_only.py:
##########
@@ -70,6 +70,37 @@ def test_filter(self):
         self.assertEqual(plan.root.filter.condition.unresolved_function.parts, [">"])
         self.assertEqual(len(plan.root.filter.condition.unresolved_function.arguments), 2)
 
+    def test_fill_na(self):
+        df = self.connect.readTable(table_name=self.tbl_name)
+
+        plan = df.fillna(value=1)._plan.to_proto(self.connect)
+        self.assertEqual(len(plan.root.fill_na.values), 1)
+        self.assertEqual(plan.root.fill_na.values[0].long_value, 1)
+        self.assertEqual(plan.root.fill_na.cols, [])
+
+        plan = df.na.fill(value="xyz")._plan.to_proto(self.connect)
+        self.assertEqual(len(plan.root.fill_na.values), 1)
+        self.assertEqual(plan.root.fill_na.values[0].string_value, "xyz")
+        self.assertEqual(plan.root.fill_na.cols, [])
+
+        plan = df.na.fill(value="xyz", subset=["col_a", "col_b"])._plan.to_proto(self.connect)
+        self.assertEqual(len(plan.root.fill_na.values), 1)
+        self.assertEqual(plan.root.fill_na.values[0].string_value, "xyz")
+        self.assertEqual(plan.root.fill_na.cols, ["col_a", "col_b"])
+
+        plan = df.na.fill(value=True, subset=("col_a", "col_b", "col_c"))._plan.to_proto(
+            self.connect
+        )
+        self.assertEqual(len(plan.root.fill_na.values), 1)
+        self.assertEqual(plan.root.fill_na.values[0].bool_value, True)
+        self.assertEqual(plan.root.fill_na.cols, ["col_a", "col_b", "col_c"])
+
+        plan = df.fillna({"col_a": 1.5, "col_b": "abc"})._plan.to_proto(self.connect)
+        self.assertEqual(len(plan.root.fill_na.values), 2)
+        self.assertEqual(plan.root.fill_na.values[0].double_value, 1.5)
+        self.assertEqual(plan.root.fill_na.values[1].string_value, "abc")
+        self.assertEqual(plan.root.fill_na.cols, ["col_a", "col_b"])

Review Comment:
   Do we need at least one e2e test, or is such a test suite enough?





[GitHub] [spark] zhengruifeng commented on a diff in pull request #38653: [SPARK-41128][CONNECT][PYTHON] Implement `DataFrame.fillna ` and `DataFrame.na.fill `

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #38653:
URL: https://github.com/apache/spark/pull/38653#discussion_r1022271292


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -316,6 +319,36 @@ message StatCrosstab {
   string col2 = 3;
 }
 
+// Replaces null values.
+// It will invoke 'Dataset.na.fill' (same as 'DataFrameNaFunctions.fill') to compute the results.
+// The following 3 parameter combinations are supported:
+//  1, 'values' only contains 1 item, 'cols' is empty:
+//    replaces null values in all type-matched columns.
+//  2, 'values' only contains 1 item, 'cols' is not empty:
+//    replaces null values in the specified columns.
+//  3, 'values' contains more than 1 item, then 'cols' is required to have the same length:
+//    replaces each specified column with the corresponding value.
+message NAFill {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Optional) List of column names to consider.
+  repeated string cols = 2;
+
+  // (Required) Values to replace null values with. Should contain at least 1 item.
+  repeated ValueType values = 3;
+
+  // Available data types.
+  message ValueType {

Review Comment:
   yes, should be `type`s here





[GitHub] [spark] cloud-fan commented on a diff in pull request #38653: [SPARK-41128][CONNECT][PYTHON] Implement `DataFrame.fillna ` and `DataFrame.na.fill `

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38653:
URL: https://github.com/apache/spark/pull/38653#discussion_r1022527827


##########
connector/connect/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala:
##########
@@ -226,6 +226,58 @@ package object dsl {
       }
     }
 
+    implicit class DslNAFunctions(val logicalPlan: Relation) {

Review Comment:
   does it need to be implicit?





[GitHub] [spark] zhengruifeng commented on a diff in pull request #38653: [SPARK-41128][CONNECT][PYTHON] Implement `DataFrame.fillna ` and `DataFrame.na.fill `

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #38653:
URL: https://github.com/apache/spark/pull/38653#discussion_r1022320039


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -316,6 +319,36 @@ message StatCrosstab {
   string col2 = 3;
 }
 
+// Replaces null values.
+// It will invoke 'Dataset.na.fill' (same as 'DataFrameNaFunctions.fill') to compute the results.
+// The following 3 parameter combinations are supported:
+//  1, 'values' only contains 1 item, 'cols' is empty:
+//    replaces null values in all type-matched columns.
+//  2, 'values' only contains 1 item, 'cols' is not empty:
+//    replaces null values in the specified columns.
+//  3, 'values' contains more than 1 item, then 'cols' is required to have the same length:
+//    replaces each specified column with the corresponding value.
+message NAFill {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Optional) List of column names to consider.
+  repeated string cols = 2;
+
+  // (Required) Values to replace null values with. Should contain at least 1 item.
+  repeated Type values = 3;
+
+  // Available data types.
+  message Type {

Review Comment:
   @cloud-fan @HyukjinKwon make sense, let me update it 





[GitHub] [spark] zhengruifeng commented on a diff in pull request #38653: [SPARK-41128][CONNECT][PYTHON] Implement `DataFrame.fillna ` and `DataFrame.na.fill `

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #38653:
URL: https://github.com/apache/spark/pull/38653#discussion_r1022297166


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -316,6 +319,36 @@ message StatCrosstab {
   string col2 = 3;
 }
 
+// Replaces null values.
+// It will invoke 'Dataset.na.fill' (same as 'DataFrameNaFunctions.fill') to compute the results.
+// The following 3 parameter combinations are supported:
+//  1, 'values' only contains 1 item, 'cols' is empty:
+//    replaces null values in all type-matched columns.
+//  2, 'values' only contains 1 item, 'cols' is not empty:
+//    replaces null values in the specified columns.
+//  3, 'values' contains more than 1 item, then 'cols' is required to have the same length:
+//    replaces each specified column with the corresponding value.
+message NAFill {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Optional) List of column names to consider.
+  repeated string cols = 2;
+
+  // (Required) Values to replace null values with. Should contain at least 1 item.
+  repeated Type values = 3;
+
+  // Available data types.
+  message Type {

Review Comment:
   I think this is the first method that only supports several datatypes.
   
   > Could we probably use Literal to ship these values?
   
   The question is: do we need to limit the datatypes in the proto message?





[GitHub] [spark] zhengruifeng commented on a diff in pull request #38653: [SPARK-41128][CONNECT][PYTHON] Implement `DataFrame.fillna ` and `DataFrame.na.fill `

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #38653:
URL: https://github.com/apache/spark/pull/38653#discussion_r1022318227


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -316,6 +319,36 @@ message StatCrosstab {
   string col2 = 3;
 }
 
+// Replaces null values.
+// It will invoke 'Dataset.na.fill' (same as 'DataFrameNaFunctions.fill') to compute the results.
+// The following 3 parameter combinations are supported:
+//  1, 'values' only contains 1 item, 'cols' is empty:
+//    replaces null values in all type-matched columns.
+//  2, 'values' only contains 1 item, 'cols' is not empty:
+//    replaces null values in the specified columns.
+//  3, 'values' contains more than 1 item, then 'cols' is required to have the same length:
+//    replaces each specified column with the corresponding value.
+message NAFill {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Optional) List of column names to consider.
+  repeated string cols = 2;
+
+  // (Required) Values to replace null values with. Should contain at least 1 item.
+  repeated Type values = 3;
+
+  // Available data types.
+  message Type {

Review Comment:
   @HyukjinKwon thank you. There are also some other methods that only support a subset of datatypes. If we prefer to use `Literal`, then I can make such a change in the future.





[GitHub] [spark] zhengruifeng closed pull request #38653: [SPARK-41128][CONNECT][PYTHON] Implement `DataFrame.fillna ` and `DataFrame.na.fill `

Posted by GitBox <gi...@apache.org>.
zhengruifeng closed pull request #38653: [SPARK-41128][CONNECT][PYTHON] Implement `DataFrame.fillna ` and `DataFrame.na.fill `
URL: https://github.com/apache/spark/pull/38653




[GitHub] [spark] zhengruifeng commented on pull request #38653: [SPARK-41128][CONNECT][PYTHON] Implement `DataFrame.fillna ` and `DataFrame.na.fill `

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on PR #38653:
URL: https://github.com/apache/spark/pull/38653#issuecomment-1315103593

   merged into master, thank you guys





[GitHub] [spark] zhengruifeng commented on pull request #38653: [SPARK-41128][CONNECT][PYTHON] Implement `DataFrame.fillna ` and `DataFrame.na.fill `

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on PR #38653:
URL: https://github.com/apache/spark/pull/38653#issuecomment-1314630040

   cc @HyukjinKwon @cloud-fan @amaliujia @grundprinzip 




[GitHub] [spark] cloud-fan commented on a diff in pull request #38653: [SPARK-41128][CONNECT][PYTHON] Implement `DataFrame.fillna ` and `DataFrame.na.fill `

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #38653:
URL: https://github.com/apache/spark/pull/38653#discussion_r1022318950


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -316,6 +319,36 @@ message StatCrosstab {
   string col2 = 3;
 }
 
+// Replaces null values.
+// It will invoke 'Dataset.na.fill' (same as 'DataFrameNaFunctions.fill') to compute the results.
+// The following 3 parameter combinations are supported:
+//  1, 'values' only contains 1 item, 'cols' is empty:
+//    replaces null values in all type-matched columns.
+//  2, 'values' only contains 1 item, 'cols' is not empty:
+//    replaces null values in the specified columns.
+//  3, 'values' contains more than 1 item, then 'cols' is required to have the same length:
+//    replaces each specified column with its corresponding value.
+message NAFill {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Optional) List of column names to consider.
+  repeated string cols = 2;
+
+  // (Required) Values to replace null values with. Should contain at least 1 item.
+  repeated Type values = 3;
+
+  // Available data types.
+  message Type {

Review Comment:
   It seems more flexible to restrict the types on the server side. Once we relax the restriction in the future, people won't need to update the clients.
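
   To illustrate the point about server-side restriction, here is a hedged Python sketch (the real check would live in the Scala server): the set of accepted fill-value types lives only on the server, so widening it later requires no client update. The type set, function name, and error message are assumptions, not the actual implementation.

```python
# Assumed current restriction: fill values must be bool, int, float or str.
SUPPORTED_FILL_TYPES = (bool, int, float, str)

def check_fill_value(v):
    # Rejecting unsupported types here, on the server side, means relaxing
    # the restriction later needs no client-side change.
    if not isinstance(v, SUPPORTED_FILL_TYPES):
        raise TypeError(
            "value should be bool, int, float or str, but got %s" % type(v).__name__
        )
    return v
```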



##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -316,6 +319,36 @@ message StatCrosstab {
   string col2 = 3;
 }
 
+// Replaces null values.
+// It will invoke 'Dataset.na.fill' (same as 'DataFrameNaFunctions.fill') to compute the results.
+// The following 3 parameter combinations are supported:
+//  1, 'values' only contains 1 item, 'cols' is empty:
+//    replaces null values in all type-matched columns.
+//  2, 'values' only contains 1 item, 'cols' is not empty:
+//    replaces null values in the specified columns.
+//  3, 'values' contains more than 1 item, then 'cols' is required to have the same length:
+//    replaces each specified column with its corresponding value.
+message NAFill {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Optional) List of column names to consider.
+  repeated string cols = 2;
+
+  // (Required) Values to replace null values with. Should contain at least 1 item.
+  repeated Type values = 3;
+
+  // Available data types.
+  message Type {

Review Comment:
   It seems more flexible to restrict the types on the server side. Once we relax the restriction in the future, people won't need to upgrade the clients.





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38653: [SPARK-41128][CONNECT][PYTHON] Implement `DataFrame.fillna ` and `DataFrame.na.fill `

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #38653:
URL: https://github.com/apache/spark/pull/38653#discussion_r1022317075


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -316,6 +319,36 @@ message StatCrosstab {
   string col2 = 3;
 }
 
+// Replaces null values.
+// It will invoke 'Dataset.na.fill' (same as 'DataFrameNaFunctions.fill') to compute the results.
+// The following 3 parameter combinations are supported:
+//  1, 'values' only contains 1 item, 'cols' is empty:
+//    replaces null values in all type-matched columns.
+//  2, 'values' only contains 1 item, 'cols' is not empty:
+//    replaces null values in the specified columns.
+//  3, 'values' contains more than 1 item, then 'cols' is required to have the same length:
+//    replaces each specified column with its corresponding value.
+message NAFill {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Optional) List of column names to consider.
+  repeated string cols = 2;
+
+  // (Required) Values to replace null values with. Should contain at least 1 item.
+  repeated Type values = 3;
+
+  // Available data types.
+  message Type {

Review Comment:
   Okay, I am fine with the current way for now.





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38653: [SPARK-41128][CONNECT][PYTHON] Implement `DataFrame.fillna ` and `DataFrame.na.fill `

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #38653:
URL: https://github.com/apache/spark/pull/38653#discussion_r1022295741


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -316,6 +319,36 @@ message StatCrosstab {
   string col2 = 3;
 }
 
+// Replaces null values.
+// It will invoke 'Dataset.na.fill' (same as 'DataFrameNaFunctions.fill') to compute the results.
+// The following 3 parameter combinations are supported:
+//  1, 'values' only contains 1 item, 'cols' is empty:
+//    replaces null values in all type-matched columns.
+//  2, 'values' only contains 1 item, 'cols' is not empty:
+//    replaces null values in the specified columns.
+//  3, 'values' contains more than 1 item, then 'cols' is required to have the same length:
+//    replaces each specified column with its corresponding value.
+message NAFill {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Optional) List of column names to consider.
+  repeated string cols = 2;
+
+  // (Required) Values to replace null values with. Should contain at least 1 item.
+  repeated Type values = 3;
+
+  // Available data types.
+  message Type {

Review Comment:
   Or is this consistent with the other APIs we have added so far? If so, I am good.


