You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/03/07 07:42:44 UTC

[GitHub] [spark] xinrong-databricks opened a new pull request #35747: Support string and bool `regex` in `Series.replace`

xinrong-databricks opened a new pull request #35747:
URL: https://github.com/apache/spark/pull/35747


   ### What changes were proposed in this pull request?
   Support string and bool `regex` in Series.replace
   
   
   ### Why are the changes needed?
   To reach parity with pandas API.
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. String and bool `regex` are supported for `Series.replace`.
   ```py
   # bool `regex`
           >>> psser = ps.Series(['bat', 'foo', 'bait', 'abc', 'bar', 'zoo'])
           >>> psser.replace(to_replace=r'^ba.$', value='new', regex=True)
           0     new
           1     foo
           2    bait
           3     abc
           4     new
           5     zoo
           dtype: object
   
   # string `regex`
           >>> psser.replace(value='new', regex=r'^.oo$')
           0     bat
           1     foo
           2    bait
           3     abc
           4     bar
           5     zoo
           dtype: object
   ```
   
   ### How was this patch tested?
   Unit test.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #35747: [SPARK-38441][PYTHON] Support string and bool `regex` in `Series.replace`

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #35747:
URL: https://github.com/apache/spark/pull/35747#issuecomment-1069795444


   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] bjornjorgensen commented on pull request #35747: [SPARK-38441][PYTHON] Support string and bool `regex` in `Series.replace`

Posted by GitBox <gi...@apache.org>.
bjornjorgensen commented on pull request #35747:
URL: https://github.com/apache/spark/pull/35747#issuecomment-1063185819


   I think the examples will be easier to understand if you remove the bait and replace that word with someone who does not start on the ba


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] bjornjorgensen commented on pull request #35747: [SPARK-38441][PYTHON] Support string and bool `regex` in `Series.replace`

Posted by GitBox <gi...@apache.org>.
bjornjorgensen commented on pull request #35747:
URL: https://github.com/apache/spark/pull/35747#issuecomment-1064479400


   Yes, I think it will be easier for users and others to understand what the function does if you change the regex to change `bait` as well as the regex to change `bat` and `bar`. After reading the PR, I was left with the question of why `bait` was also not changed.
   If you change the regex so that the result looks like this
   
   `
   >>> psser = ps.Series(['bat', 'foo', 'bait', 'abc', 'bar', 'zoo'])
   >>> psser.replace(to_replace=r'^ba.$', value='new', regex=True)
   0     new
   1     foo
   2     new
   3     abc
   4     new
   5     zoo
   `
   
   then I think it will be easier to understand.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] itholic commented on a change in pull request #35747: [SPARK-38441][PYTHON] Support string and bool `regex` in `Series.replace`

Posted by GitBox <gi...@apache.org>.
itholic commented on a change in pull request #35747:
URL: https://github.com/apache/spark/pull/35747#discussion_r822407993



##########
File path: python/pyspark/pandas/series.py
##########
@@ -4283,17 +4283,20 @@ def keys(self) -> "ps.Index":
         """
         return self.index
 
-    # TODO: 'regex', 'method' parameter
+    # TODO: introduce 'method', 'limit', 'in_place'; fully support 'regex'
     def replace(
         self,
         to_replace: Optional[Union[Any, List, Tuple, Dict]] = None,
         value: Optional[Union[List, Tuple]] = None,
-        regex: bool = False,
+        regex: Union[str, bool] = False,
     ) -> "Series":
         """
         Replace values given in to_replace with value.
         Values of the Series are replaced with other values dynamically.
 
+        .. note:: The API supports pattern matching (and replacement) on the whole string only,

Review comment:
       > ```python
   > ```python
   > >>> psser = ps.Series(['bat', 'foo', 'bait', 'abc', 'bar', 'zoo'])
   > >>> psser.replace('ba', 'xx', regex=True)
   > 0     xx
   > 1    foo
   > 2     xx
   > 3    abc
   > 4     xx
   > 5    zoo
   > dtype: object
   > ```
   > 
   > 
   >     
   >       
   >     
   > 
   >       
   >     
   > 
   >     
   >   
   > vs.
   > ```python
   > >>> pser = pd.Series(['bat', 'foo', 'bait', 'abc', 'bar', 'zoo'])
   > >>> pser.replace('ba', 'xx', regex=True)
   > 0     xxt
   > 1     foo
   > 2    xxit
   > 3     abc
   > 4     xxr
   > 5     zoo
   > dtype: object
   > ```
   > ```
   
   How about we also add this example to the note ?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xinrong-databricks commented on pull request #35747: [SPARK-38441][PYTHON] Support string and bool `regex` in `Series.replace`

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on pull request #35747:
URL: https://github.com/apache/spark/pull/35747#issuecomment-1064637020


   The result for the example you give should be 
   ```py
   >>> psser.replace(to_replace=r'^ba.$', value='new', regex=True)
   0     new
   1     foo
   2    bait  # bait is not replaced since it doesn't pattern match `'^ba.$'`
   3     abc
   4     new
   5     zoo
   dtype: object
   ```
   
   The example in the existing PR `psser.replace('ba', 'xx', regex=True)` is to show even `ba` partially match `bait`, the replacement `xx` is applied to the whole string `bait`, i.e. `bait` is replaced to `xx`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] itholic commented on a change in pull request #35747: [SPARK-38441][PYTHON] Support string and bool `regex` in `Series.replace`

Posted by GitBox <gi...@apache.org>.
itholic commented on a change in pull request #35747:
URL: https://github.com/apache/spark/pull/35747#discussion_r826429339



##########
File path: python/pyspark/pandas/series.py
##########
@@ -4307,17 +4307,20 @@ def keys(self) -> "ps.Index":
         """
         return self.index
 
-    # TODO: 'regex', 'method' parameter
+    # TODO: introduce 'method', 'limit', 'in_place'; fully support 'regex'
     def replace(
         self,
         to_replace: Optional[Union[Any, List, Tuple, Dict]] = None,
         value: Optional[Union[List, Tuple]] = None,
-        regex: bool = False,
+        regex: Union[str, bool] = False,
     ) -> "Series":
         """
         Replace values given in to_replace with value.
         Values of the Series are replaced with other values dynamically.
 
+        .. note:: For partial pattern matching, the replacement is against the whole string,
+            which is different from pandas'.

Review comment:
       Can we a bit more elaborate why our behavior is different from pandas'?

##########
File path: python/pyspark/pandas/tests/test_series.py
##########
@@ -1735,15 +1735,32 @@ def test_replace(self):
         self.assert_eq(psser.replace([10, 15], [45, 50]), pser.replace([10, 15], [45, 50]))
         self.assert_eq(psser.replace((10, 15), (45, 50)), pser.replace((10, 15), (45, 50)))
 
+        pser = pd.Series(["bat", "foo", "bait", "abc", "bar", "zoo"])
+        psser = ps.from_pandas(pser)
+        self.assert_eq(
+            psser.replace(to_replace=r"^ba.$", value="new", regex=True),
+            pser.replace(to_replace=r"^ba.$", value="new", regex=True),
+        )
+        self.assert_eq(
+            psser.replace(regex=r"^.oo$", value="new"), pser.replace(regex=r"^.oo$", value="new")
+        )
+
         msg = "'to_replace' should be one of str, list, tuple, dict, int, float"
         with self.assertRaisesRegex(TypeError, msg):
             psser.replace(ps.range(5))
         msg = "Replacement lists must match in length. Expecting 3 got 2"
         with self.assertRaisesRegex(ValueError, msg):
-            psser.replace([10, 20, 30], [1, 2])
-        msg = "replace currently not support for regex"
+            psser.replace(["bat", "foo", "bait"], ["a", "b"])
+        msg = "'to_replace' must be 'None' if 'regex' is not a bool"
+        with self.assertRaisesRegex(ValueError, msg):
+            psser.replace(to_replace="foo", regex=r"^.oo$")
+        msg = "If 'regex' is True then 'to_replace' must be a string"
+        with self.assertRaisesRegex(AssertionError, msg):
+            psser.replace(["bat", "foo", "bait"], regex=True)
+        unsupported_regex = [r"^.oo$", r"^ba.$"]
+        msg = "'regex' of %s type is not supported" % type(unsupported_regex).__name__
         with self.assertRaisesRegex(NotImplementedError, msg):
-            psser.replace(r"^1.$", regex=True)
+            psser.replace(regex=unsupported_regex, value="new")

Review comment:
       Can we add test case for chained operation?
   
   e.g.
   `(pser + "A").replace(to_replace=r"^ba.$", value="new", regex=True)`
   or
   `pser.replace(to_replace=r"^ba.$", value="new", regex=True).replace(to_replace="new", value="New", regex=True)`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xinrong-databricks commented on a change in pull request #35747: Support string and bool `regex` in `Series.replace`

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on a change in pull request #35747:
URL: https://github.com/apache/spark/pull/35747#discussion_r821350474



##########
File path: python/pyspark/pandas/series.py
##########
@@ -4283,17 +4283,20 @@ def keys(self) -> "ps.Index":
         """
         return self.index
 
-    # TODO: 'regex', 'method' parameter
+    # TODO: introduce 'method', 'limit', 'in_place'; fully support 'regex'
     def replace(
         self,
         to_replace: Optional[Union[Any, List, Tuple, Dict]] = None,
         value: Optional[Union[List, Tuple]] = None,
-        regex: bool = False,
+        regex: Union[str, bool] = False,
     ) -> "Series":
         """
         Replace values given in to_replace with value.
         Values of the Series are replaced with other values dynamically.
 
+        .. note:: The API supports pattern matching (and replacement) on the whole string only,

Review comment:
       ```py
   >>> psser = ps.Series(['bat', 'foo', 'bait', 'abc', 'bar', 'zoo'])
   >>> psser.replace('ba', 'xx', regex=True)
   0     xx
   1    foo
   2     xx
   3    abc
   4     xx
   5    zoo
   dtype: object
   ```
   vs.
   ```py
   >>> pser = pd.Series(['bat', 'foo', 'bait', 'abc', 'bar', 'zoo'])
   >>> pser.replace('ba', 'xx', regex=True)
   0     xxt
   1     foo
   2    xxit
   3     abc
   4     xxr
   5     zoo
   dtype: object
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xinrong-databricks commented on pull request #35747: [SPARK-38441][PYTHON] Support string and bool `regex` in `Series.replace`

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on pull request #35747:
URL: https://github.com/apache/spark/pull/35747#issuecomment-1063665449


   Thanks @bjornjorgensen! Sorry, I may not fully get your point, why is `bait` confusing?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xinrong-databricks commented on a change in pull request #35747: [SPARK-38441][PYTHON] Support string and bool `regex` in `Series.replace`

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on a change in pull request #35747:
URL: https://github.com/apache/spark/pull/35747#discussion_r826634170



##########
File path: python/pyspark/pandas/tests/test_series.py
##########
@@ -1735,15 +1735,32 @@ def test_replace(self):
         self.assert_eq(psser.replace([10, 15], [45, 50]), pser.replace([10, 15], [45, 50]))
         self.assert_eq(psser.replace((10, 15), (45, 50)), pser.replace((10, 15), (45, 50)))
 
+        pser = pd.Series(["bat", "foo", "bait", "abc", "bar", "zoo"])
+        psser = ps.from_pandas(pser)
+        self.assert_eq(
+            psser.replace(to_replace=r"^ba.$", value="new", regex=True),
+            pser.replace(to_replace=r"^ba.$", value="new", regex=True),
+        )
+        self.assert_eq(
+            psser.replace(regex=r"^.oo$", value="new"), pser.replace(regex=r"^.oo$", value="new")
+        )
+
         msg = "'to_replace' should be one of str, list, tuple, dict, int, float"
         with self.assertRaisesRegex(TypeError, msg):
             psser.replace(ps.range(5))
         msg = "Replacement lists must match in length. Expecting 3 got 2"
         with self.assertRaisesRegex(ValueError, msg):
-            psser.replace([10, 20, 30], [1, 2])
-        msg = "replace currently not support for regex"
+            psser.replace(["bat", "foo", "bait"], ["a", "b"])
+        msg = "'to_replace' must be 'None' if 'regex' is not a bool"
+        with self.assertRaisesRegex(ValueError, msg):
+            psser.replace(to_replace="foo", regex=r"^.oo$")
+        msg = "If 'regex' is True then 'to_replace' must be a string"
+        with self.assertRaisesRegex(AssertionError, msg):
+            psser.replace(["bat", "foo", "bait"], regex=True)
+        unsupported_regex = [r"^.oo$", r"^ba.$"]
+        msg = "'regex' of %s type is not supported" % type(unsupported_regex).__name__
         with self.assertRaisesRegex(NotImplementedError, msg):
-            psser.replace(r"^1.$", regex=True)
+            psser.replace(regex=unsupported_regex, value="new")

Review comment:
       Good idea! Added.

##########
File path: python/pyspark/pandas/series.py
##########
@@ -4307,17 +4307,20 @@ def keys(self) -> "ps.Index":
         """
         return self.index
 
-    # TODO: 'regex', 'method' parameter
+    # TODO: introduce 'method', 'limit', 'in_place'; fully support 'regex'
     def replace(
         self,
         to_replace: Optional[Union[Any, List, Tuple, Dict]] = None,
         value: Optional[Union[List, Tuple]] = None,
-        regex: bool = False,
+        regex: Union[str, bool] = False,
     ) -> "Series":
         """
         Replace values given in to_replace with value.
         Values of the Series are replaced with other values dynamically.
 
+        .. note:: For partial pattern matching, the replacement is against the whole string,
+            which is different from pandas'.

Review comment:
       Certainly. Added.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xinrong-databricks commented on a change in pull request #35747: [SPARK-38441][PYTHON] Support string and bool `regex` in `Series.replace`

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on a change in pull request #35747:
URL: https://github.com/apache/spark/pull/35747#discussion_r822528560



##########
File path: python/pyspark/pandas/series.py
##########
@@ -4440,13 +4449,45 @@ def replace(
                 weight      1.0
                 length      0.3
         dtype: float64
+
+        Regular expression `to_replace`
+
+        >>> psser = ps.Series(['bat', 'foo', 'bait', 'abc', 'bar', 'zoo'])
+        >>> psser.replace(to_replace=r'^ba.$', value='new', regex=True)
+        0     new
+        1     foo
+        2    bait
+        3     abc
+        4     new
+        5     zoo
+        dtype: object
+
+        >>> psser.replace(value='new', regex=r'^.oo$')
+        0     bat
+        1     new
+        2    bait
+        3     abc
+        4     bar
+        5     new
+        dtype: object
         """
+        if isinstance(regex, str):
+            if to_replace is not None:
+                raise ValueError("'to_replace' must be 'None' if 'regex' is not a bool")
+            to_replace = regex
+            regex = True
+        elif not isinstance(regex, bool):
+            raise NotImplementedError("regex of %s type is supported" % type(regex).__name__)

Review comment:
       Good catch! Thanks




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xinrong-databricks edited a comment on pull request #35747: [SPARK-38441][PYTHON] Support string and bool `regex` in `Series.replace`

Posted by GitBox <gi...@apache.org>.
xinrong-databricks edited a comment on pull request #35747:
URL: https://github.com/apache/spark/pull/35747#issuecomment-1064637020


   The result for the example you give should be 
   ```py
   >>> psser = ps.Series(['bat', 'foo', 'bait', 'abc', 'bar', 'zoo'])
   >>> psser.replace(to_replace=r'^ba.$', value='new', regex=True)
   0     new
   1     foo
   2    bait  # bait is not replaced since it doesn't pattern match `'^ba.$'`
   3     abc
   4     new
   5     zoo
   dtype: object
   ```
   
   The existing example in the PR `psser.replace('ba', 'xx', regex=True)` is to show even `ba` partially match `bait`, the replacement `xx` is applied to the whole string `bait`, i.e. `bait` is replaced to `xx`.
   
   CC @bjornjorgensen 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xinrong-databricks commented on pull request #35747: [SPARK-38441][PYTHON] Support string and bool `regex` in `Series.replace`

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on pull request #35747:
URL: https://github.com/apache/spark/pull/35747#issuecomment-1062453325


   May I get a review when you are free? @ueshin @HyukjinKwon @itholic Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xinrong-databricks commented on pull request #35747: [SPARK-38441][PYTHON] Support string and bool `regex` in `Series.replace`

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on pull request #35747:
URL: https://github.com/apache/spark/pull/35747#issuecomment-1066254972


   Rebased to get the doc fix.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xinrong-databricks edited a comment on pull request #35747: [SPARK-38441][PYTHON] Support string and bool `regex` in `Series.replace`

Posted by GitBox <gi...@apache.org>.
xinrong-databricks edited a comment on pull request #35747:
URL: https://github.com/apache/spark/pull/35747#issuecomment-1064637020


   The result for the example you give should be 
   ```py
   >>> psser = ps.Series(['bat', 'foo', 'bait', 'abc', 'bar', 'zoo'])
   >>> psser.replace(to_replace=r'^ba.$', value='new', regex=True)
   0     new
   1     foo
   2    bait  # bait is not replaced since it doesn't pattern match `'^ba.$'`
   3     abc
   4     new
   5     zoo
   dtype: object
   ```
   
   The existing example in the PR `psser.replace('ba', 'xx', regex=True)` is to show even `ba` partially match `bait`, the replacement `xx` is applied to the whole string `bait`, i.e. `bait` is replaced to `xx`.
   
   I will rephrase the docstring to make it clear. Let me know if you still have questions.
   
   CC @bjornjorgensen 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xinrong-databricks edited a comment on pull request #35747: [SPARK-38441][PYTHON] Support string and bool `regex` in `Series.replace`

Posted by GitBox <gi...@apache.org>.
xinrong-databricks edited a comment on pull request #35747:
URL: https://github.com/apache/spark/pull/35747#issuecomment-1064637020


   The result for the example you give should be 
   ```py
   >>> psser.replace(to_replace=r'^ba.$', value='new', regex=True)
   0     new
   1     foo
   2    bait  # bait is not replaced since it doesn't pattern match `'^ba.$'`
   3     abc
   4     new
   5     zoo
   dtype: object
   ```
   
   The existing example in the PR `psser.replace('ba', 'xx', regex=True)` is to show even `ba` partially match `bait`, the replacement `xx` is applied to the whole string `bait`, i.e. `bait` is replaced to `xx`.
   
   CC @bjornjorgensen 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon closed pull request #35747: [SPARK-38441][PYTHON] Support string and bool `regex` in `Series.replace`

Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #35747:
URL: https://github.com/apache/spark/pull/35747


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] bjornjorgensen edited a comment on pull request #35747: [SPARK-38441][PYTHON] Support string and bool `regex` in `Series.replace`

Posted by GitBox <gi...@apache.org>.
bjornjorgensen edited a comment on pull request #35747:
URL: https://github.com/apache/spark/pull/35747#issuecomment-1064479400


   Yes, I think it will be easier for users and others to understand what the function does if you change the regex to change `bait` as well as the regex to change `bat` and `bar`. After reading the PR, I was left with the question of why `bait` was also not changed.
   If you change the regex so that the result looks like this
   
   `
   >>> psser = ps.Series(['bat', 'foo', 'bait', 'abc', 'bar', 'zoo'])
   >>> psser.replace(to_replace=r'^ba.$', value='new', regex=True)
   0     new
   1     foo
   2     new
   3     abc
   4     new
   5     zoo
   `
   
   then I think it will be easier to understand.
   
   KISS - Keep It Simple, Stupid


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xinrong-databricks commented on a change in pull request #35747: [SPARK-38441][PYTHON] Support string and bool `regex` in `Series.replace`

Posted by GitBox <gi...@apache.org>.
xinrong-databricks commented on a change in pull request #35747:
URL: https://github.com/apache/spark/pull/35747#discussion_r822528761



##########
File path: python/pyspark/pandas/series.py
##########
@@ -4283,17 +4283,20 @@ def keys(self) -> "ps.Index":
         """
         return self.index
 
-    # TODO: 'regex', 'method' parameter
+    # TODO: introduce 'method', 'limit', 'in_place'; fully support 'regex'
     def replace(
         self,
         to_replace: Optional[Union[Any, List, Tuple, Dict]] = None,
         value: Optional[Union[List, Tuple]] = None,
-        regex: bool = False,
+        regex: Union[str, bool] = False,
     ) -> "Series":
         """
         Replace values given in to_replace with value.
         Values of the Series are replaced with other values dynamically.
 
+        .. note:: The API supports pattern matching (and replacement) on the whole string only,

Review comment:
       Good idea, added!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] itholic commented on a change in pull request #35747: [SPARK-38441][PYTHON] Support string and bool `regex` in `Series.replace`

Posted by GitBox <gi...@apache.org>.
itholic commented on a change in pull request #35747:
URL: https://github.com/apache/spark/pull/35747#discussion_r822410325



##########
File path: python/pyspark/pandas/series.py
##########
@@ -4440,13 +4449,45 @@ def replace(
                 weight      1.0
                 length      0.3
         dtype: float64
+
+        Regular expression `to_replace`
+
+        >>> psser = ps.Series(['bat', 'foo', 'bait', 'abc', 'bar', 'zoo'])
+        >>> psser.replace(to_replace=r'^ba.$', value='new', regex=True)
+        0     new
+        1     foo
+        2    bait
+        3     abc
+        4     new
+        5     zoo
+        dtype: object
+
+        >>> psser.replace(value='new', regex=r'^.oo$')
+        0     bat
+        1     new
+        2    bait
+        3     abc
+        4     bar
+        5     new
+        dtype: object
         """
+        if isinstance(regex, str):
+            if to_replace is not None:
+                raise ValueError("'to_replace' must be 'None' if 'regex' is not a bool")
+            to_replace = regex
+            regex = True
+        elif not isinstance(regex, bool):
+            raise NotImplementedError("regex of %s type is supported" % type(regex).__name__)

Review comment:
       Maybe `regex of %s type is "not" supported` ??

##########
File path: python/pyspark/pandas/series.py
##########
@@ -4283,17 +4283,20 @@ def keys(self) -> "ps.Index":
         """
         return self.index
 
-    # TODO: 'regex', 'method' parameter
+    # TODO: introduce 'method', 'limit', 'in_place'; fully support 'regex'
     def replace(
         self,
         to_replace: Optional[Union[Any, List, Tuple, Dict]] = None,
         value: Optional[Union[List, Tuple]] = None,
-        regex: bool = False,
+        regex: Union[str, bool] = False,
     ) -> "Series":
         """
         Replace values given in to_replace with value.
         Values of the Series are replaced with other values dynamically.
 
+        .. note:: The API supports pattern matching (and replacement) on the whole string only,

Review comment:
       How about we also add this example to the note ?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org