You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "xinrong-meng (via GitHub)" <gi...@apache.org> on 2023/03/13 02:19:59 UTC

[GitHub] [spark] xinrong-meng opened a new pull request, #40388: [SPARK-42765][CONNECT][PYTHON] Regulate the import path of `pandas_udf`

xinrong-meng opened a new pull request, #40388:
URL: https://github.com/apache/spark/pull/40388

   ### What changes were proposed in this pull request?
   Regulate the import path of `pandas_udf`.
   
   ### Why are the changes needed?
   Usability.
   
   ### Does this PR introduce _any_ user-facing change?
   No. Only the error message is improved, as shown below.
   
   FROM
   ```sh
   >>> pyspark.sql.connect.functions.pandas_udf()
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/Users/xinrong.meng/spark/python/pyspark/sql/connect/functions.py", line 2470, in pandas_udf
       raise NotImplementedError("pandas_udf() is not implemented.")
   NotImplementedError: pandas_udf() is not implemented.
   ```
   
   TO
   ```sh
   >>> pyspark.sql.connect.functions.pandas_udf()
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/Users/xinrong.meng/spark/python/pyspark/sql/connect/functions.py", line 2470, in pandas_udf
       raise NotImplementedError("Please import pandas_udf from pyspark.sql.functions.")
   NotImplementedError: Please import pandas_udf from pyspark.sql.functions
   ```
   
   ### How was this patch tested?
   Unit test.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40388: [SPARK-42765][CONNECT][PYTHON] Enable importing `pandas_udf` from `pyspark.sql.connect.functions`

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40388:
URL: https://github.com/apache/spark/pull/40388#discussion_r1135137634


##########
python/pyspark/sql/connect/functions.py:
##########
@@ -2466,10 +2471,6 @@ def udf(
 udf.__doc__ = pysparkfuncs.udf.__doc__
 
 
-def pandas_udf(*args: Any, **kwargs: Any) -> None:

Review Comment:
   and to address Takuya's comment, I guess it'd pretty easy from a cursory look. We can just throw an exception here:
   
   ```
   def pandas_udf(f=None, returnType=None, functionType=None):
       from pyspark.sql.functions import pandas_udf
       if is_remote() and "PYSPARK_NO_NAMESPACE_SHARE" not in os.environ:
           return pandas_udf(f, returnType, functionType)
       else:
           # Should not reach here in production because `PYSPARK_NO_NAMESPACE_SHARE` is
           # for test only, and users should have invoked regular `pyspark.sql.functions.pandas_udf`
           # if Spark Connect is enabled.
           raise RuntimeError("blah blah")
   ```
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a diff in pull request #40388: [SPARK-42765][CONNECT][PYTHON] Enable importing `pandas_udf` from `pyspark.sql.connect.functions`

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #40388:
URL: https://github.com/apache/spark/pull/40388#discussion_r1135135774


##########
python/pyspark/sql/connect/functions.py:
##########
@@ -54,6 +54,11 @@
 from pyspark.sql import functions as pysparkfuncs
 from pyspark.sql.types import _from_numpy_type, DataType, StructType, ArrayType, StringType
 
+# The implementation of pandas_udf is embedded in pyspark.sql.function.pandas_udf
+# for code reuse.
+from pyspark.sql.functions import pandas_udf as pandas_udf

Review Comment:
   You won't need to alias it actually.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xinrong-meng commented on a diff in pull request #40388: [SPARK-42765][CONNECT][PYTHON] Enable importing `pandas_udf` from `pyspark.sql.connect.functions`

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.
xinrong-meng commented on code in PR #40388:
URL: https://github.com/apache/spark/pull/40388#discussion_r1136405407


##########
python/pyspark/sql/connect/functions.py:
##########
@@ -54,6 +54,11 @@
 from pyspark.sql import functions as pysparkfuncs
 from pyspark.sql.types import _from_numpy_type, DataType, StructType, ArrayType, StringType
 
+# The implementation of pandas_udf is embedded in pyspark.sql.function.pandas_udf
+# for code reuse.
+from pyspark.sql.functions import pandas_udf as pandas_udf

Review Comment:
   Thanks, removed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xinrong-meng commented on a diff in pull request #40388: [SPARK-42765][CONNECT][PYTHON] Regulate the import path of `pandas_udf`

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.
xinrong-meng commented on code in PR #40388:
URL: https://github.com/apache/spark/pull/40388#discussion_r1133640400


##########
python/pyspark/sql/connect/functions.py:
##########
@@ -2467,7 +2467,7 @@ def udf(
 
 
 def pandas_udf(*args: Any, **kwargs: Any) -> None:

Review Comment:
   Sounds good! Added a comment.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ueshin commented on pull request #40388: [SPARK-42765][CONNECT][PYTHON] Regulate the import path of `pandas_udf`

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on PR #40388:
URL: https://github.com/apache/spark/pull/40388#issuecomment-1467126454

   Btw, what happens if `"PYSPARK_NO_NAMESPACE_SHARE" in os.environ`?
   
   https://github.com/apache/spark/blob/761e0c0f6f0d00733177a869b9ecdf454e13fc9f/python/pyspark/sql/utils.py#L148-L161
   
   Usually the env var is set, we need to explicitly import `pyspark.sql.connect.function`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ueshin commented on pull request #40388: [SPARK-42765][CONNECT][PYTHON] Regulate the import path of `pandas_udf`

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on PR #40388:
URL: https://github.com/apache/spark/pull/40388#issuecomment-1467349620

   It's irrelevant means it's an issue, no?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xinrong-meng commented on pull request #40388: [SPARK-42765][CONNECT][PYTHON] Regulate the import path of `pandas_udf`

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.
xinrong-meng commented on PR #40388:
URL: https://github.com/apache/spark/pull/40388#issuecomment-1467514203

   Also CC @HyukjinKwon 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xinrong-meng commented on pull request #40388: [SPARK-42765][CONNECT][PYTHON] Regulate the import path of `pandas_udf`

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.
xinrong-meng commented on PR #40388:
URL: https://github.com/apache/spark/pull/40388#issuecomment-1467510024

   I see your concern on `PYSPARK_NO_NAMESPACE_SHARE`, it should have affected the imports of `pandas_udf`, that is, if it is set, `sql.functions.pandas_udf` points to `sql.connect.functions.pandas_udf`. 
   Unfortunately in our case, there is no `sql.connect.functions.pandas_udf` at all.
   
   Considering`PYSPARK_NO_NAMESPACE_SHARE` is not a heavily-used user-facing config, shall we merge this change first for clarity(otherwise users may think pandas_udf is not implemented yet)? I will work on a follow-up to support ``sql.connect.functions.pandas_udf` with `PYSPARK_NO_NAMESPACE_SHARE` support.
   
   CC @ueshin 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon closed pull request #40388: [SPARK-42765][CONNECT][PYTHON] Enable importing `pandas_udf` from `pyspark.sql.connect.functions`

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon closed pull request #40388: [SPARK-42765][CONNECT][PYTHON] Enable importing `pandas_udf` from `pyspark.sql.connect.functions`
URL: https://github.com/apache/spark/pull/40388


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ueshin commented on pull request #40388: [SPARK-42765][CONNECT][PYTHON] Regulate the import path of `pandas_udf`

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on PR #40388:
URL: https://github.com/apache/spark/pull/40388#issuecomment-1467528741

   > After double thoughts, https://github.com/apache/spark/pull/40388#issuecomment-1467120067 might not be what we want, as shown below
   
   I meant:
   
   ```diff
   - def pandas_udf(*args: Any, **kwargs: Any) -> None:
   -     raise NotImplementedError("pandas_udf() is not implemented.")
   + # The implementation of pandas_udf is embedded in pyspark.sql.function.pandas_udf
   + # for code reuse.
   + from pyspark.sql.functions import pandas_udf as pandas_udf
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #40388: [SPARK-42765][CONNECT][PYTHON] Enable importing `pandas_udf` from `pyspark.sql.connect.functions`

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #40388:
URL: https://github.com/apache/spark/pull/40388#issuecomment-1469264035

   Merged to master and branch-3.4.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ueshin commented on pull request #40388: [SPARK-42765][CONNECT][PYTHON] Regulate the import path of `pandas_udf`

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on PR #40388:
URL: https://github.com/apache/spark/pull/40388#issuecomment-1467120067

   I guess we can just with the comment:
   
   ```py
   # The implementation of pandas_udf is embedded in pyspark.sql.function.pandas_udf
   # for code reuse.
   from pyspark.sql.functions import pandas_udf as pandas_udf
   ```
   
   for those who want to import `pyspark.sql.connect.functions`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ueshin commented on pull request #40388: [SPARK-42765][CONNECT][PYTHON] Enable importing `pandas_udf` from `pyspark.sql.connect.functions`

Posted by "ueshin (via GitHub)" <gi...@apache.org>.
ueshin commented on PR #40388:
URL: https://github.com/apache/spark/pull/40388#issuecomment-1467557107

   > The unit tests for pandas_udf are in `python/pyspark/sql/tests/connect/test_parity_pandas_udf.py` if that's your concern.
   
   It's not my concern.
   `sql.functions.pandas_udf` and `sql.connect.functions.pandas_udf` should be usable separately if `PYSPARK_NO_NAMESPACE_SHARE` is set.
   
   > Considering`PYSPARK_NO_NAMESPACE_SHARE` is not a heavily-used user-facing env variable, shall we merge the change proposed in this PR first for clarity? Otherwise, I am afraid that users think pandas_udf is not supported in Connect. I will work on a follow-up to support `sql.connect.functions.pandas_udf` with `PYSPARK_NO_NAMESPACE_SHARE` support.
   
   I'd leave the decision to @HyukjinKwon or @zhengruifeng then.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40388: [SPARK-42765][CONNECT][PYTHON] Regulate the import path of `pandas_udf`

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #40388:
URL: https://github.com/apache/spark/pull/40388#discussion_r1133455875


##########
python/pyspark/sql/connect/functions.py:
##########
@@ -2467,7 +2467,7 @@ def udf(
 
 
 def pandas_udf(*args: Any, **kwargs: Any) -> None:

Review Comment:
   +1 for @zhengruifeng 's comment.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xinrong-meng commented on pull request #40388: [SPARK-42765][CONNECT][PYTHON] Enable importing `pandas_udf` from `pyspark.sql.connect.functions`

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.
xinrong-meng commented on PR #40388:
URL: https://github.com/apache/spark/pull/40388#issuecomment-1467558915

   Good idea! I adjusted the code and PR description accordingly. Thanks @ueshin 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #40388: [SPARK-42765][CONNECT][PYTHON] Enable importing `pandas_udf` from `pyspark.sql.connect.functions`

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #40388:
URL: https://github.com/apache/spark/pull/40388#issuecomment-1467582894

   and, `pyspark.sql.connect.*` isn't also supposed to be called directly by users. so this PR is just more about internal refactoring.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xinrong-meng commented on pull request #40388: [SPARK-42765][CONNECT][PYTHON] Regulate the import path of `pandas_udf`

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.
xinrong-meng commented on PR #40388:
URL: https://github.com/apache/spark/pull/40388#issuecomment-1467493588

   After double thoughts, https://github.com/apache/spark/pull/40388#issuecomment-1467120067 might not be what we want, as shown below
   
   ```
   >>> from pyspark.sql.connect.functions import pandas_udf
   >>> spark.range(2).select(pandas_udf(lambda x: x + 1, 'int')('id')).collect()
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   TypeError: 'NoneType' object is not callable
   
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zhengruifeng commented on a diff in pull request #40388: [SPARK-42765][CONNECT][PYTHON] Regulate the import path of `pandas_udf`

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #40388:
URL: https://github.com/apache/spark/pull/40388#discussion_r1133413599


##########
python/pyspark/sql/connect/functions.py:
##########
@@ -2467,7 +2467,7 @@ def udf(
 
 
 def pandas_udf(*args: Any, **kwargs: Any) -> None:

Review Comment:
   what about adding a few comments here? So that the contributors will be aware that this is already implemented



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] xinrong-meng commented on pull request #40388: [SPARK-42765][CONNECT][PYTHON] Regulate the import path of `pandas_udf`

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.
xinrong-meng commented on PR #40388:
URL: https://github.com/apache/spark/pull/40388#issuecomment-1467281968

   We didn't wrap `pyspark.sql.function.pandas_udf` with `try_remote_functions`, so  `"PYSPARK_NO_NAMESPACE_SHARE"` should be irrelevant.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org