Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/11/15 02:21:38 UTC

[GitHub] [spark] amaliujia commented on a diff in pull request #38653: [SPARK-41128][CONNECT][PYTHON] Implement `DataFrame.fillna ` and `DataFrame.na.fill `

amaliujia commented on code in PR #38653:
URL: https://github.com/apache/spark/pull/38653#discussion_r1022249117


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -316,6 +319,36 @@ message StatCrosstab {
   string col2 = 3;
 }
 
+// Replaces null values.
+// It will invoke 'Dataset.na.fill' (same as 'DataFrameNaFunctions.fill') to compute the results.
+// The following 3 parameter combinations are supported:
+//  1. 'values' contains exactly 1 item and 'cols' is empty:
+//     replaces null values in all type-matched columns.
+//  2. 'values' contains exactly 1 item and 'cols' is not empty:
+//     replaces null values in the specified columns.
+//  3. 'values' contains more than 1 item; 'cols' is then required to have the same length:
+//     replaces null values in each specified column with the corresponding value.
+message NAFill {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Optional) List of column names to consider.
+  repeated string cols = 2;
+
+  // (Required) Values to replace null values with. Should contain at least 1 item.
+  repeated ValueType values = 3;
+
+  // Available data types.
+  message ValueType {

Review Comment:
   Is this a `value` or a `type`? Those are probably different things. If it is just a value, then call it `value` (and likewise for `type`).


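To make the three parameter combinations in the proto comment concrete, here is a minimal pure-Python sketch of their semantics on a list-of-dicts "table". This is not Spark or Connect code, and the helper name `fill_na` is made up for illustration; it only models the behavior the comment describes.

```python
def fill_na(rows, values, cols=None):
    """Sketch of the three NAFill parameter combinations on plain dicts."""
    if len(values) == 1 and not cols:
        # Combination 1: one value, no columns -> fill every column
        # whose existing non-null values match the value's type.
        value = values[0]
        target = {
            c for row in rows for c, v in row.items()
            if v is not None and type(v) is type(value)
        }
        mapping = {c: value for c in target}
    elif len(values) == 1:
        # Combination 2: one value, explicit column list.
        mapping = {c: values[0] for c in cols}
    else:
        # Combination 3: one value per column; lengths must match.
        assert cols is not None and len(cols) == len(values)
        mapping = dict(zip(cols, values))
    return [
        {c: (mapping[c] if v is None and c in mapping else v)
         for c, v in row.items()}
        for row in rows
    ]
```

For example, with `rows = [{"a": None, "b": "x"}, {"a": 2, "b": None}]`, calling `fill_na(rows, [0])` fills only column `a` (its existing value 2 matches the int fill value), leaving the string column's null untouched.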

##########
python/pyspark/sql/tests/connect/test_connect_plan_only.py:
##########
@@ -70,6 +70,37 @@ def test_filter(self):
         self.assertEqual(plan.root.filter.condition.unresolved_function.parts, [">"])
         self.assertEqual(len(plan.root.filter.condition.unresolved_function.arguments), 2)
 
+    def test_fill_na(self):
+        df = self.connect.readTable(table_name=self.tbl_name)
+
+        plan = df.fillna(value=1)._plan.to_proto(self.connect)
+        self.assertEqual(len(plan.root.fill_na.values), 1)
+        self.assertEqual(plan.root.fill_na.values[0].long_value, 1)
+        self.assertEqual(plan.root.fill_na.cols, [])
+
+        plan = df.na.fill(value="xyz")._plan.to_proto(self.connect)
+        self.assertEqual(len(plan.root.fill_na.values), 1)
+        self.assertEqual(plan.root.fill_na.values[0].string_value, "xyz")
+        self.assertEqual(plan.root.fill_na.cols, [])
+
+        plan = df.na.fill(value="xyz", subset=["col_a", "col_b"])._plan.to_proto(self.connect)
+        self.assertEqual(len(plan.root.fill_na.values), 1)
+        self.assertEqual(plan.root.fill_na.values[0].string_value, "xyz")
+        self.assertEqual(plan.root.fill_na.cols, ["col_a", "col_b"])
+
+        plan = df.na.fill(value=True, subset=("col_a", "col_b", "col_c"))._plan.to_proto(
+            self.connect
+        )
+        self.assertEqual(len(plan.root.fill_na.values), 1)
+        self.assertEqual(plan.root.fill_na.values[0].bool_value, True)
+        self.assertEqual(plan.root.fill_na.cols, ["col_a", "col_b", "col_c"])
+
+        plan = df.fillna({"col_a": 1.5, "col_b": "abc"})._plan.to_proto(self.connect)
+        self.assertEqual(len(plan.root.fill_na.values), 2)
+        self.assertEqual(plan.root.fill_na.values[0].double_value, 1.5)
+        self.assertEqual(plan.root.fill_na.values[1].string_value, "abc")
+        self.assertEqual(plan.root.fill_na.cols, ["col_a", "col_b"])

Review Comment:
   Do we need at least one e2e test, or is a plan-only test suite like this enough?

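The last test case above exercises the dict form, where each key becomes a column name and each value a typed literal in the parallel `cols`/`values` lists. A hedged sketch of how a client might perform that split; the helper name and the exact set of accepted value types are assumptions, not the actual Connect client code:

```python
def split_fill_dict(value_map):
    """Hypothetical helper: split a fillna() dict into parallel
    'cols' / 'values' lists, as the NAFill message expects.
    Accepted value types here are an assumption for illustration."""
    allowed = (bool, int, float, str)
    cols, values = [], []
    for col, val in value_map.items():
        if not isinstance(val, allowed):
            raise TypeError(f"unsupported fill value for {col!r}: {val!r}")
        cols.append(col)
        values.append(val)
    return cols, values
```

Since Python dicts preserve insertion order, `split_fill_dict({"col_a": 1.5, "col_b": "abc"})` yields `cols` and `values` in the same pairing the test asserts on.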


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


For additional commands, e-mail: reviews-help@spark.apache.org