Posted to reviews@spark.apache.org by "panbingkun (via GitHub)" <gi...@apache.org> on 2023/05/24 11:41:28 UTC

[GitHub] [spark] panbingkun opened a new pull request, #41296: [SPARK-43773][PYTHON] Implement 'levenshtein(str1, str2[, threshold])' functions in python client

panbingkun opened a new pull request, #41296:
URL: https://github.com/apache/spark/pull/41296

   ### What changes were proposed in this pull request?
   This PR implements the `levenshtein(str1, str2[, threshold])` function in the Python client.
   
   ### Why are the changes needed?
   After the change "Add a max distance argument to the levenshtein() function", the optional threshold is already implemented on the Scala side, so `pyspark` needs to be aligned with it.
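
   For readers unfamiliar with the optional argument, here is a minimal plain-Python sketch of the behaviour being aligned. It assumes the threshold acts as a maximum distance and that `-1` is returned when the distance exceeds it. The helper name `levenshtein_with_threshold` is hypothetical and is not the code added by this PR; null handling is not modelled.

```python
from typing import Optional


def levenshtein_with_threshold(
    left: str, right: str, threshold: Optional[int] = None
) -> int:
    """Edit distance between two strings (hypothetical helper, not Spark code).

    Assumption: when `threshold` is given and the distance exceeds it,
    the result is -1 instead of the distance.
    """
    # Two-row dynamic programme: previous[j] holds the distance between the
    # prefix of `left` processed so far and the first j characters of `right`.
    previous = list(range(len(right) + 1))
    for i, left_char in enumerate(left, start=1):
        current = [i]
        for j, right_char in enumerate(right, start=1):
            substitution = previous[j - 1] + (left_char != right_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    distance = previous[-1]
    # Assumed threshold semantics: -1 signals "more than `threshold` edits".
    if threshold is not None and distance > threshold:
        return -1
    return distance


if __name__ == "__main__":
    print(levenshtein_with_threshold("kitten", "sitting"))
    print(levenshtein_with_threshold("kitten", "sitting", 2))
```

   Omitting the threshold keeps the original two-argument behaviour, which is why the new parameter can default to `None` without affecting existing callers.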
   
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   - Manual testing
   python/run-tests --testnames 'python.pyspark.sql.tests.test_functions FunctionsTests.test_levenshtein_function'
   - Pass GA


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on pull request #41296: [SPARK-43773][CONNECT][PYTHON] Implement 'levenshtein(str1, str2[, threshold])' functions in python client

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #41296:
URL: https://github.com/apache/spark/pull/41296#issuecomment-1565765182

   merged to master




[GitHub] [spark] panbingkun commented on a diff in pull request #41296: [SPARK-43773][PYTHON] Implement 'levenshtein(str1, str2[, threshold])' functions in python client

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on code in PR #41296:
URL: https://github.com/apache/spark/pull/41296#discussion_r1205017317


##########
python/pyspark/sql/functions.py:
##########
@@ -6594,20 +6594,28 @@ def substring_index(str: "ColumnOrName", delim: str, count: int) -> Column:
 
 
 @try_remote_functions
-def levenshtein(left: "ColumnOrName", right: "ColumnOrName") -> Column:
+def levenshtein(
+    left: "ColumnOrName", right: "ColumnOrName", threshold: Optional[int] = None
+) -> Column:
     """Computes the Levenshtein distance of the two given strings.
 
     .. versionadded:: 1.5.0
 
     .. versionchanged:: 3.4.0
         Supports Spark Connect.
 
+    .. versionchanged:: 3.5.0
+        Supports Spark Connect.

Review Comment:
   This is done.





[GitHub] [spark] panbingkun commented on pull request #41296: [SPARK-43773][PYTHON] Implement 'levenshtein(str1, str2[, threshold])' functions in python client

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on PR #41296:
URL: https://github.com/apache/spark/pull/41296#issuecomment-1563729956

   > @panbingkun you would need to run `dev/reformat-python` to fix the Python linter issue
   
   This is done.




[GitHub] [spark] panbingkun commented on pull request #41296: [SPARK-43773][PYTHON] Implement 'levenshtein(str1, str2[, threshold])' functions in python client

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on PR #41296:
URL: https://github.com/apache/spark/pull/41296#issuecomment-1563719262

   > @panbingkun you would need to run `dev/reformat-python` to fix the Python linter issue
   
   Ok, let me try. Thanks!




[GitHub] [spark] panbingkun commented on pull request #41296: [SPARK-43773][PYTHON] Implement 'levenshtein(str1, str2[, threshold])' functions in python client

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on PR #41296:
URL: https://github.com/apache/spark/pull/41296#issuecomment-1562131685

   Waiting for https://github.com/apache/spark/pull/41293




[GitHub] [spark] zhengruifeng commented on a diff in pull request #41296: [SPARK-43773][PYTHON] Implement 'levenshtein(str1, str2[, threshold])' functions in python client

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #41296:
URL: https://github.com/apache/spark/pull/41296#discussion_r1204909613


##########
python/pyspark/sql/functions.py:
##########
@@ -6594,20 +6594,28 @@ def substring_index(str: "ColumnOrName", delim: str, count: int) -> Column:
 
 
 @try_remote_functions
-def levenshtein(left: "ColumnOrName", right: "ColumnOrName") -> Column:
+def levenshtein(
+    left: "ColumnOrName", right: "ColumnOrName", threshold: Optional[int] = None
+) -> Column:
     """Computes the Levenshtein distance of the two given strings.
 
     .. versionadded:: 1.5.0
 
     .. versionchanged:: 3.4.0
         Supports Spark Connect.
 
+    .. versionchanged:: 3.5.0
+        Supports Spark Connect.

Review Comment:
   I would prefer a separate `versionadded` directive placed after the `threshold` parameter description. For an example, see https://github.com/apache/spark/blob/ab4693d979b4879ca07268e3719c20a5088e87ec/python/pyspark/sql/pandas/map_ops.py#L55-L67
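
   As a rough illustration of that layout, the sketch below attaches the version note to the parameter itself rather than to the function as a whole. The stub name `levenshtein_stub` and the parameter descriptions are hypothetical and only demonstrate the docstring shape; this is not the final docstring from the PR.

```python
from typing import Optional


def levenshtein_stub(left: str, right: str, threshold: Optional[int] = None) -> None:
    """Hypothetical stub that only demonstrates the docstring layout.

    .. versionadded:: 1.5.0

    Parameters
    ----------
    left : str
        first string (illustrative description).
    right : str
        second string (illustrative description).
    threshold : int, optional
        maximum distance of interest (illustrative description).

        .. versionadded:: 3.5.0
    """
    # No behaviour here: the point is where the directive sits in the docstring.
    return None
```

   Placing the directive under the parameter tells readers that only `threshold` is new in that release, while the function-level note keeps the original version.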



##########
python/pyspark/sql/tests/connect/test_connect_function.py:
##########
@@ -1920,10 +1920,16 @@ def test_string_functions_multi_args(self):
             cdf.select(CF.substring_index(cdf.e, ".", 2)).toPandas(),
             sdf.select(SF.substring_index(sdf.e, ".", 2)).toPandas(),
         )
+

Review Comment:
   
   ```suggestion
   ```





[GitHub] [spark] zhengruifeng commented on pull request #41296: [SPARK-43773][PYTHON] Implement 'levenshtein(str1, str2[, threshold])' functions in python client

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #41296:
URL: https://github.com/apache/spark/pull/41296#issuecomment-1563703633

   @panbingkun you would need to run `dev/reformat-python` to fix the Python linter issue




[GitHub] [spark] zhengruifeng closed pull request #41296: [SPARK-43773][CONNECT][PYTHON] Implement 'levenshtein(str1, str2[, threshold])' functions in python client

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng closed pull request #41296: [SPARK-43773][CONNECT][PYTHON] Implement 'levenshtein(str1, str2[, threshold])' functions in python client
URL: https://github.com/apache/spark/pull/41296

