
Posted to issues@spark.apache.org by "Carlos Gameiro (Jira)" <ji...@apache.org> on 2021/11/23 12:36:00 UTC

[jira] [Updated] (SPARK-37449) Side effects between PySpark Pandas UDF and Numpy indexing

     [ https://issues.apache.org/jira/browse/SPARK-37449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carlos Gameiro updated SPARK-37449:
-----------------------------------
    Summary: Side effects between PySpark Pandas UDF and Numpy indexing  (was: Side effects between PySpark and Numpy)

> Side effects between PySpark Pandas UDF and Numpy indexing
> ----------------------------------------------------------
>
>                 Key: SPARK-37449
>                 URL: https://issues.apache.org/jira/browse/SPARK-37449
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.1.2
>            Reporter: Carlos Gameiro
>            Priority: Critical
>              Labels: NumPy, Pandas, Pygeos, UDF, applyInPandas
>
> I'm using pygeos 0.11.1.
> Let's create a simple pandas DataFrame (with NumPy imported as np and pandas as pd) that has a single column named 'id' holding the range 0 to 999.
> {code:python}
> df = pd.DataFrame(np.arange(0,1000), columns=['id']){code}
> Consider this simple function, which selects the first 4 elements of the 'id' column using a NumPy index array.
> {code:python}
> def udf_example(df):
>     # Positions of the first four rows, used as a NumPy integer index.
>     some_index = np.array([0, 1, 2, 3])
>     values = df['id'].values[some_index]
>
>     df = pd.DataFrame(values, columns=['id'])
>     return df{code}
> If I apply this function in PySpark (here t refers to pyspark.sql.types), I get this result:
> {code:python}
> schema = t.StructType([t.StructField('id', t.LongType(), True)])
> df_spark = spark.createDataFrame(df).groupBy().applyInPandas(udf_example, schema)
> display(df_spark)
> # id
> # 125
> # 126
> # 127
> # 128
> {code}
> If I call it directly in Python on the pandas DataFrame, I get the correct and expected result:
> {code:python}
> udf_example(df)
> # id
> # 0
> # 1
> # 2
> # 3
> {code}
> Using NumPy indexing operations inside a Pandas UDF in Spark produces unexpected results: the values 125 to 128 are returned instead of the expected 0 to 3.
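
One possible explanation, which this report does not confirm: Spark generally makes no guarantee about the order of rows inside the group passed to applyInPandas, so positional indexing such as values[[0, 1, 2, 3]] may pick whichever rows happen to arrive first. The sketch below uses only pandas and NumPy (no Spark) to illustrate the idea. The rotation at row 125 and the helper udf_sorted are hypothetical, chosen only to mirror the output shown in the report.

```python
import numpy as np
import pandas as pd


def udf_example(df):
    # Same logic as the reported function: positional NumPy indexing.
    some_index = np.array([0, 1, 2, 3])
    values = df['id'].values[some_index]
    return pd.DataFrame(values, columns=['id'])


def udf_sorted(df):
    # Hypothetical variant: sort by 'id' first so positions are deterministic.
    return udf_example(df.sort_values('id'))


df = pd.DataFrame(np.arange(0, 1000), columns=['id'])

# Assumption for illustration: the rows starting at id 125 arrive first,
# as could happen if the data were split into chunks and reassembled.
rotated = pd.concat([df.iloc[125:], df.iloc[:125]], ignore_index=True)

local_ids = udf_example(df)['id'].tolist()          # original row order
rotated_ids = udf_example(rotated)['id'].tolist()   # rotated row order
sorted_ids = udf_sorted(rotated)['id'].tolist()     # rotated, then sorted

print(local_ids)    # [0, 1, 2, 3]
print(rotated_ids)  # [125, 126, 127, 128]
print(sorted_ids)   # [0, 1, 2, 3]
```

If row order is indeed the cause, sorting on a key column inside the function (as udf_sorted does) or selecting rows by value rather than by position would make the result independent of how the rows arrive.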



--
This message was sent by Atlassian Jira
(v8.20.1#820001)