Posted to issues@spark.apache.org by "Julien Peloton (Jira)" <ji...@apache.org> on 2022/03/07 12:10:00 UTC
[jira] [Created] (SPARK-38435) Pandas UDF with type hints crashes at import
Julien Peloton created SPARK-38435:
--------------------------------------
Summary: Pandas UDF with type hints crashes at import
Key: SPARK-38435
URL: https://issues.apache.org/jira/browse/SPARK-38435
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 3.1.0
Environment: Spark: 3.1
Python: 3.7
Reporter: Julien Peloton
## Old style pandas UDF
Let's consider a pandas UDF defined in the old style:
```python
# mymod.py
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StringType
@pandas_udf(StringType(), PandasUDFType.SCALAR)
def to_upper(s):
    return s.str.upper()
```
I can import it and use it as:
```python
# main.py
from pyspark.sql import SparkSession
from mymod import to_upper
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("John Doe",)], ("name",))
df.select(to_upper("name")).show()
```
and launch it via:
```bash
spark-submit main.py
spark/python/lib/pyspark.zip/pyspark/sql/pandas/functions.py:392:
UserWarning: In Python 3.6+ and Spark 3.0+, it is preferred to
specify type hints for pandas UDF instead of specifying pandas
UDF type which will be deprecated in the future releases. See
SPARK-28264 for more details.
+--------------+
|to_upper(name)|
+--------------+
| JOHN DOE|
+--------------+
```
Except for the `UserWarning`, the code works as expected.
## New style pandas UDF: using type hint
Let's now switch to the version using type hints:
```python
# mymod.py
import pandas as pd
from pyspark.sql.functions import pandas_udf
@pandas_udf("string")
def to_upper(s: pd.Series) -> pd.Series:
    return s.str.upper()
```
But this time, I obtain an `AttributeError`:
```
spark-submit main.py
Traceback (most recent call last):
File "/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 835, in _parse_datatype_string
File "/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 827, in from_ddl_datatype
AttributeError: 'NoneType' object has no attribute '_jvm'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 839, in _parse_datatype_string
File "/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 827, in from_ddl_datatype
AttributeError: 'NoneType' object has no attribute '_jvm'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/julien/Documents/workspace/myrepos/fink-filters/main.py", line 2, in <module>
from mymod import to_upper
File "/Users/julien/Documents/workspace/myrepos/fink-filters/mymod.py", line 5, in <module>
def to_upper(s: pd.Series) -> pd.Series:
File "/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/pandas/functions.py", line 432, in _create_pandas_udf
File "/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 43, in _create_udf
File "/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 206, in _wrapped
File "/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 96, in returnType
File "/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 841, in _parse_datatype_string
File "/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 831, in _parse_datatype_string
File "/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 823, in from_ddl_schema
AttributeError: 'NoneType' object has no attribute '_jvm'
log4j:WARN No appenders could be found for logger (org.apache.spark.util.ShutdownHookManager).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
```
The code crashes at import time. Looking at the source, the Spark context must already exist:
https://github.com/apache/spark/blob/8d70d5da3d74ebdd612b2cdc38201e121b88b24f/python/pyspark/sql/types.py#L819-L827
which is not the case at import time.
## Questions
First, am I doing something wrong? The documentation (https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.pandas_udf.html) does not mention this, and it seems it should affect many users migrating from old-style to new-style pandas UDFs.
Second, is this the expected behaviour? Compared to the old-style pandas UDF, whose module can be imported without problem, the new behaviour looks like a regression. Why would users need an active Spark context just to import a module that contains a pandas UDF?
Third, what could we do? I see that an assert was recently added to the master branch (https://issues.apache.org/jira/browse/SPARK-37620):
https://github.com/apache/spark/blob/e21cb62d02c85a66771822cdd49c49dbb3e44502/python/pyspark/sql/types.py#L1014-L1015
But the assert does not solve the problem; it only surfaces it.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)