Posted to issues@spark.apache.org by "Ganesh Sivalingam (JIRA)" <ji...@apache.org> on 2016/09/29 17:16:20 UTC
[jira] [Created] (SPARK-17727) PySpark SQL arrays are not immutable, .remove and .pop cause issues
Ganesh Sivalingam created SPARK-17727:
-----------------------------------------
Summary: PySpark SQL arrays are not immutable, .remove and .pop cause issues
Key: SPARK-17727
URL: https://issues.apache.org/jira/browse/SPARK-17727
Project: Spark
Issue Type: Bug
Components: PySpark, SQL
Affects Versions: 2.0.0
Environment: OS X and Linux (Amazon Linux AMI release 2016.03), Python 2.x
Reporter: Ganesh Sivalingam
When one column of a DataFrame is an array, for example:
```
+-------+---+---------+
|join_on|  a|        b|
+-------+---+---------+
|      1|  1|[1, 2, 3]|
|      1|  2|[1, 2, 3]|
|      1|  3|[1, 2, 3]|
+-------+---+---------+
```
If I try to remove the value in column a from the array in column b using Python's `list.remove(val)` function, it works at first; however, a second manipulation of the DataFrame fails with an error saying that the item (the value in column a) is not present.
It appears PySpark re-runs the `list.remove()` UDF, but on the already-mutated list/array.
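In plain Python terms (my own illustration, not part of the original report), this failure mode comes from `list.remove()` mutating its argument in place, so a second evaluation of the same function on the same list raises the "not in list" error:

```python
def rm_element(a, list_a):
    # list.remove() mutates list_a in place; returning it returns
    # the mutated original, not a fresh copy.
    list_a.remove(a)
    return list_a

b = [1, 2, 3]
first = rm_element(2, b)  # b itself is now [1, 3]

# A second evaluation sees the already-mutated list and raises
# ValueError, mirroring the "item is not present" error Spark reports.
try:
    rm_element(2, b)
except ValueError:
    print("second evaluation failed: 2 no longer in the list")
```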
Below is a minimal example which I think should work, but which exhibits this issue:
```
import pyspark.sql.functions as F
import pyspark.sql.types as T
import numpy as np

cols = ['join_on', 'a']
vals = [
    (1, 1),
    (1, 2),
    (1, 3)
]
df = sqlContext.createDataFrame(vals, cols)

# Collect column 'a' into an array per group, then join it back on.
df_of_arrays = df\
    .groupBy('join_on')\
    .agg(F.collect_list('a').alias('b'))
df = df\
    .join(df_of_arrays, on='join_on')
df.show()

# UDF that removes the value in column 'a' from the array in column 'b'.
# Note that list.remove() mutates the list in place.
def rm_element(a, list_a):
    list_a.remove(a)
    return list_a

rm_element_udf = F.udf(rm_element, T.ArrayType(T.LongType()))

df = df.withColumn('one_removed', rm_element_udf("a", "b"))
df.show()

# A second transformation re-evaluates rm_element_udf, which then fails
# because the value has already been removed from the array.
answer = df.withColumn('av', F.udf(lambda a: float(np.mean(a)))('one_removed'))
answer.show()
```
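A possible workaround (my suggestion, not part of the original report) is to make the UDF non-mutating, so repeated evaluations are idempotent. The sketch below copies the input list before removing the element; the same function could be registered with `F.udf(rm_element_pure, T.ArrayType(T.LongType()))` in place of the mutating version:

```python
def rm_element_pure(a, list_a):
    # Work on a copy so the input array is never mutated; repeated
    # evaluations of the UDF then always see the original values.
    out = list(list_a)
    out.remove(a)
    return out
```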
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)