Posted to issues@spark.apache.org by "Carlos Gameiro (Jira)" <ji...@apache.org> on 2021/11/23 12:13:00 UTC
[jira] [Created] (SPARK-37449) Side effects between PySpark, Numpy and Pygeos
Carlos Gameiro created SPARK-37449:
--------------------------------------
Summary: Side effects between PySpark, Numpy and Pygeos
Key: SPARK-37449
URL: https://issues.apache.org/jira/browse/SPARK-37449
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 3.1.2
Reporter: Carlos Gameiro
I'm using pygeos 0.11.1.
Let's create a simple pandas DataFrame with a single column named 'id' holding the range 0 to 999:
{code:java}
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(0, 1000), columns=['id'])
{code}
Consider this simple function, which selects the first 4 values of the 'id' column using an integer index array and, for the sake of the example, calls a PyGEOS operation at the beginning (its result is never used).
{code:java}
import pygeos

def udf_example(df):
    # PyGEOS call whose result is unused; its presence alone triggers the issue
    geo = pygeos.from_wkt(np.array(['POINT (20 30)', 'POINT (34 -2)', 'POINT (20 30)']))
    some_index = np.array([0, 1, 2, 3])
    values = df['id'].values[some_index]
    df = pd.DataFrame(values, columns=['id'])
    return df
{code}
If I apply this function in PySpark through applyInPandas, I get this result:
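{code:java}
from pyspark.sql import types as t

# `spark` is the already-running SparkSession
schema = t.StructType([t.StructField('id', t.LongType(), True)])
df_spark = spark.createDataFrame(df).groupBy().applyInPandas(udf_example, schema)
display(df_spark)  # notebook helper; df_spark.show() prints the same rows
# id
# 125
# 126
# 127
# 128
{code}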
If I call it directly in Python on the pandas DataFrame, I get the correct, expected result:
{code:java}
udf_example(df)
# id
# 0
# 1
# 2
# 3
{code}
Calling a PyGEOS function inside a function that Spark runs through applyInPandas appears to cause side effects on subsequent NumPy indexing operations.
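To help narrow down where the wrong values come from, a diagnostic along the following lines might be useful. It is only a sketch and not part of the original reproduction: the helper name compare_selections is hypothetical, and the guess that the array handed to the function might be a view onto memory NumPy does not own is an assumption, not something confirmed here. The helper selects the same four values by integer-array indexing, by basic slicing, and by integer-array indexing on a fresh copy, so the three results can be compared in plain Python and inside applyInPandas.

```python
import numpy as np
import pandas as pd


def compare_selections(df, n=4):
    """Select the first n 'id' values in three ways and return them for comparison.

    Hypothetical diagnostic helper, not part of the original report.
    """
    idx = np.arange(n)
    raw = df['id'].values                   # array as handed over by pandas
    fancy = raw[idx]                        # integer-array indexing, as in udf_example
    sliced = raw[:n]                        # basic slicing, no index array involved
    copied = np.array(raw, copy=True)[idx]  # same indexing on a freshly allocated copy
    return {
        'fancy': fancy.tolist(),
        'sliced': sliced.tolist(),
        'copied': copied.tolist(),
        # Memory flags, in case the buffer's ownership differs between environments
        'owns_data': bool(raw.flags['OWNDATA']),
        'writeable': bool(raw.flags['WRITEABLE']),
    }


df = pd.DataFrame(np.arange(0, 1000), columns=['id'])
result = compare_selections(df)
print(result)
```

In plain Python all three selections should agree. Calling the same helper right after the pygeos.from_wkt call inside udf_example, and logging or returning what it reports, would show whether only the integer-array path is affected there.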