You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/11/29 08:54:43 UTC

[GitHub] [spark] zhengruifeng opened a new pull request, #38836: [SPARK-41315][CONNECT][PYTHON] Implement `DataFrame.replace` and `DataFrame.na.replace`

zhengruifeng opened a new pull request, #38836:
URL: https://github.com/apache/spark/pull/38836

   ### What changes were proposed in this pull request?
   Implement `DataFrame.replace` and `DataFrame.na.replace`
   
   
   ### Why are the changes needed?
   for api coverage
   
   
   ### Does this PR introduce _any_ user-facing change?
   yes, new api
   
   
   ### How was this patch tested?
   added ut


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng closed pull request #38836: [SPARK-41315][CONNECT][PYTHON] Implement `DataFrame.replace` and `DataFrame.na.replace`

Posted by GitBox <gi...@apache.org>.
zhengruifeng closed pull request #38836: [SPARK-41315][CONNECT][PYTHON] Implement `DataFrame.replace` and `DataFrame.na.replace`
URL: https://github.com/apache/spark/pull/38836


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38836: [SPARK-41315][CONNECT][PYTHON] Implement `DataFrame.replace` and `DataFrame.na.replace`

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38836:
URL: https://github.com/apache/spark/pull/38836#discussion_r1035501049


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -466,6 +467,35 @@ message NADrop {
 }
 
 
+// Replaces old values with the corresponding values.
+// It will invoke 'Dataset.na.replace' (same as 'DataFrameNaFunctions.replace')
+// to compute the results.
+message NAReplace {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Optional) List of column names to consider.
+  //
+  // When it is empty, all the type-compatible columns in the input relation will be considered.
+  repeated string cols = 2;
+
+  // (Optional) The value replacement mapping.
+  repeated Replacement replacements = 3;
+
+  message Replacement {
+    // (Required) The old value.
+    //
+    // Only 4 data types are supported now: null, bool, double, string.
+    Expression.Literal old_value = 1;
+
+    // (Required) The new value.
+    //
+    // Should be of the same data type with the old value.

Review Comment:
   Nit: probably we don't document it now, but when reading this, my immediate next question is what if two data type are different, what is expected behavior? 
   
   We might need to update entire documentation for those exceptions when it is a right time.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38836: [SPARK-41315][CONNECT][PYTHON] Implement `DataFrame.replace` and `DataFrame.na.replace`

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38836:
URL: https://github.com/apache/spark/pull/38836#discussion_r1035501362


##########
python/pyspark/sql/connect/dataframe.py:
##########
@@ -1050,6 +1055,137 @@ def dropna(
             session=self._session,
         )
 
+    def replace(
+        self,
+        to_replace: Union[
+            "LiteralType", List["LiteralType"], Dict["LiteralType", "OptionalPrimitiveType"]
+        ],
+        value: Optional[
+            Union["OptionalPrimitiveType", List["OptionalPrimitiveType"], _NoValueType]
+        ] = _NoValue,
+        subset: Optional[List[str]] = None,
+    ) -> "DataFrame":
+        """Returns a new :class:`DataFrame` replacing a value with another value.
+        :func:`DataFrame.replace` and :func:`DataFrameNaFunctions.replace` are
+        aliases of each other.
+        Values to_replace and value must have the same type and can only be numerics, booleans,
+        or strings. Value can have None. When replacing, the new value will be cast
+        to the type of the existing column.
+        For numeric replacements all values to be replaced should have unique
+        floating point representation. In case of conflicts (for example with `{42: -1, 42.0: 1}`)
+        and arbitrary replacement will be used.
+
+        .. versionadded:: 3.4.0

Review Comment:
   @HyukjinKwon made a good point that given entire Connect is added since 3.4, we probably do not need repeat adding this on every API but just have a package level `versionadded`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38836: [SPARK-41315][CONNECT][PYTHON] Implement `DataFrame.replace` and `DataFrame.na.replace`

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38836:
URL: https://github.com/apache/spark/pull/38836#discussion_r1035558780


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -466,6 +467,35 @@ message NADrop {
 }
 
 
+// Replaces old values with the corresponding values.
+// It will invoke 'Dataset.na.replace' (same as 'DataFrameNaFunctions.replace')
+// to compute the results.
+message NAReplace {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Optional) List of column names to consider.
+  //
+  // When it is empty, all the type-compatible columns in the input relation will be considered.
+  repeated string cols = 2;
+
+  // (Optional) The value replacement mapping.
+  repeated Replacement replacements = 3;
+
+  message Replacement {
+    // (Required) The old value.
+    //
+    // Only 4 data types are supported now: null, bool, double, string.
+    Expression.Literal old_value = 1;

Review Comment:
   ah I see



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on a diff in pull request #38836: [SPARK-41315][CONNECT][PYTHON] Implement `DataFrame.replace` and `DataFrame.na.replace`

Posted by GitBox <gi...@apache.org>.
amaliujia commented on code in PR #38836:
URL: https://github.com/apache/spark/pull/38836#discussion_r1035500538


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -466,6 +467,35 @@ message NADrop {
 }
 
 
+// Replaces old values with the corresponding values.
+// It will invoke 'Dataset.na.replace' (same as 'DataFrameNaFunctions.replace')
+// to compute the results.
+message NAReplace {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Optional) List of column names to consider.
+  //
+  // When it is empty, all the type-compatible columns in the input relation will be considered.
+  repeated string cols = 2;
+
+  // (Optional) The value replacement mapping.
+  repeated Replacement replacements = 3;
+
+  message Replacement {
+    // (Required) The old value.
+    //
+    // Only 4 data types are supported now: null, bool, double, string.
+    Expression.Literal old_value = 1;

Review Comment:
   Question:
   
   How does proto now encode a `NULL`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #38836: [SPARK-41315][CONNECT][PYTHON] Implement `DataFrame.replace` and `DataFrame.na.replace`

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on PR #38836:
URL: https://github.com/apache/spark/pull/38836#issuecomment-1331755108

   merged into master, thank you guys for reivews


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #38836: [SPARK-41315][CONNECT][PYTHON] Implement `DataFrame.replace` and `DataFrame.na.replace`

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #38836:
URL: https://github.com/apache/spark/pull/38836#discussion_r1035477052


##########
python/pyspark/sql/tests/connect/test_connect_basic.py:
##########
@@ -553,6 +553,42 @@ def test_drop_na(self):
             self.spark.sql(query).na.drop(how="any", thresh=2, subset="a").toPandas(),
         )
 
+    def test_replace(self):

Review Comment:
   will add



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #38836: [SPARK-41315][CONNECT][PYTHON] Implement `DataFrame.replace` and `DataFrame.na.replace`

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on PR #38836:
URL: https://github.com/apache/spark/pull/38836#issuecomment-1331606243

   cc @HyukjinKwon @amaliujia @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #38836: [SPARK-41315][CONNECT][PYTHON] Implement `DataFrame.replace` and `DataFrame.na.replace`

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #38836:
URL: https://github.com/apache/spark/pull/38836#discussion_r1035563905


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -466,6 +467,35 @@ message NADrop {
 }
 
 
+// Replaces old values with the corresponding values.
+// It will invoke 'Dataset.na.replace' (same as 'DataFrameNaFunctions.replace')
+// to compute the results.
+message NAReplace {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Optional) List of column names to consider.
+  //
+  // When it is empty, all the type-compatible columns in the input relation will be considered.
+  repeated string cols = 2;
+
+  // (Optional) The value replacement mapping.
+  repeated Replacement replacements = 3;
+
+  message Replacement {
+    // (Required) The old value.
+    //
+    // Only 4 data types are supported now: null, bool, double, string.
+    Expression.Literal old_value = 1;
+
+    // (Required) The new value.
+    //
+    // Should be of the same data type with the old value.

Review Comment:
   agree to do the document work it a bit later, this doc is just copied from scala api.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] grundprinzip commented on a diff in pull request #38836: [SPARK-41315][CONNECT][PYTHON] Implement `DataFrame.replace` and `DataFrame.na.replace`

Posted by GitBox <gi...@apache.org>.
grundprinzip commented on code in PR #38836:
URL: https://github.com/apache/spark/pull/38836#discussion_r1034985163


##########
python/pyspark/sql/tests/connect/test_connect_basic.py:
##########
@@ -553,6 +553,42 @@ def test_drop_na(self):
             self.spark.sql(query).na.drop(how="any", thresh=2, subset="a").toPandas(),
         )
 
+    def test_replace(self):

Review Comment:
   * can you add some negative tests for assertions?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #38836: [SPARK-41315][CONNECT][PYTHON] Implement `DataFrame.replace` and `DataFrame.na.replace`

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on code in PR #38836:
URL: https://github.com/apache/spark/pull/38836#discussion_r1035515918


##########
connector/connect/src/main/protobuf/spark/connect/relations.proto:
##########
@@ -466,6 +467,35 @@ message NADrop {
 }
 
 
+// Replaces old values with the corresponding values.
+// It will invoke 'Dataset.na.replace' (same as 'DataFrameNaFunctions.replace')
+// to compute the results.
+message NAReplace {
+  // (Required) The input relation.
+  Relation input = 1;
+
+  // (Optional) List of column names to consider.
+  //
+  // When it is empty, all the type-compatible columns in the input relation will be considered.
+  repeated string cols = 2;
+
+  // (Optional) The value replacement mapping.
+  repeated Replacement replacements = 3;
+
+  message Replacement {
+    // (Required) The old value.
+    //
+    // Only 4 data types are supported now: null, bool, double, string.
+    Expression.Literal old_value = 1;

Review Comment:
   here https://github.com/apache/spark/blob/4c35c5bb5e545acb2f46a80218f68e69c868b388/connector/connect/src/main/protobuf/spark/connect/expressions.proto#L43
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] amaliujia commented on pull request #38836: [SPARK-41315][CONNECT][PYTHON] Implement `DataFrame.replace` and `DataFrame.na.replace`

Posted by GitBox <gi...@apache.org>.
amaliujia commented on PR #38836:
URL: https://github.com/apache/spark/pull/38836#issuecomment-1331759320

   Late LGTM!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org