Posted to issues@spark.apache.org by "Julien Peloton (Jira)" <ji...@apache.org> on 2022/03/07 12:10:00 UTC
[jira] [Created] (SPARK-38435) Pandas UDF with type hints crashes at import
Julien Peloton created SPARK-38435:
--------------------------------------
Summary: Pandas UDF with type hints crashes at import
Key: SPARK-38435
URL: https://issues.apache.org/jira/browse/SPARK-38435
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 3.1.0
Environment: Spark: 3.1
Python: 3.7
Reporter: Julien Peloton
## Old style pandas UDF
Let's consider a pandas UDF defined in the old style:
```python
# mymod.py
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StringType
@pandas_udf(StringType(), PandasUDFType.SCALAR)
def to_upper(s):
    return s.str.upper()
```
I can import it and use it as:
```python
# main.py
from pyspark.sql import SparkSession
from mymod import to_upper
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("John Doe",)], ("name",))
df.select(to_upper("name")).show()
```
and launch it via:
```bash
spark-submit main.py
spark/python/lib/pyspark.zip/pyspark/sql/pandas/functions.py:392:
UserWarning: In Python 3.6+ and Spark 3.0+, it is preferred to
specify type hints for pandas UDF instead of specifying pandas
UDF type which will be deprecated in the future releases. See
SPARK-28264 for more details.
+--------------+
|to_upper(name)|
+--------------+
| JOHN DOE|
+--------------+
```
Except for the `UserWarning`, the code works as expected.
## New style pandas UDF: using type hint
Let's now switch to the version using type hints:
```python
# mymod.py
import pandas as pd
from pyspark.sql.functions import pandas_udf
@pandas_udf("string")
def to_upper(s: pd.Series) -> pd.Series:
    return s.str.upper()
```
But this time, I obtain an `AttributeError`:
```
spark-submit main.py
Traceback (most recent call last):
File "/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 835, in _parse_datatype_string
File "/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 827, in from_ddl_datatype
AttributeError: 'NoneType' object has no attribute '_jvm'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 839, in _parse_datatype_string
File "/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 827, in from_ddl_datatype
AttributeError: 'NoneType' object has no attribute '_jvm'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/julien/Documents/workspace/myrepos/fink-filters/main.py", line 2, in <module>
from mymod import to_upper
File "/Users/julien/Documents/workspace/myrepos/fink-filters/mymod.py", line 5, in <module>
def to_upper(s: pd.Series) -> pd.Series:
File "/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/pandas/functions.py", line 432, in _create_pandas_udf
File "/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 43, in _create_udf
File "/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 206, in _wrapped
File "/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 96, in returnType
File "/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 841, in _parse_datatype_string
File "/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 831, in _parse_datatype_string
File "/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 823, in from_ddl_schema
AttributeError: 'NoneType' object has no attribute '_jvm'
log4j:WARN No appenders could be found for logger (org.apache.spark.util.ShutdownHookManager).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
```
The code crashes at import time. Looking at the source, the Spark context must already exist:
https://github.com/apache/spark/blob/8d70d5da3d74ebdd612b2cdc38201e121b88b24f/python/pyspark/sql/types.py#L819-L827
which is not the case at import time.
## Questions
First, am I doing something wrong? The documentation (https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.pandas_udf.html) does not mention this, and it seems it should affect many users migrating from old-style to new-style pandas UDFs.
Second, is this the expected behaviour? Compared to the old-style pandas UDF, whose module can be imported without problem, the new behaviour looks like a regression. Why would users need an active Spark context just to import a module that contains a pandas UDF?
Third, what could we do? I see that an assert was recently added to the master branch (https://issues.apache.org/jira/browse/SPARK-37620):
https://github.com/apache/spark/blob/e21cb62d02c85a66771822cdd49c49dbb3e44502/python/pyspark/sql/types.py#L1014-L1015
But the assert does not solve the problem; it only surfaces it.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)