Posted to issues@spark.apache.org by "Ganesh Sivalingam (JIRA)" <ji...@apache.org> on 2016/09/29 17:16:20 UTC

[jira] [Created] (SPARK-17727) PySpark SQL arrays are not immutable, .remove and .pop cause issues

Ganesh Sivalingam created SPARK-17727:
-----------------------------------------

             Summary: PySpark SQL arrays are not immutable, .remove and .pop cause issues
                 Key: SPARK-17727
                 URL: https://issues.apache.org/jira/browse/SPARK-17727
             Project: Spark
          Issue Type: Bug
          Components: PySpark, SQL
    Affects Versions: 2.0.0
         Environment: OS X and Linux (Amazon Linux AMI release 2016.03), Python 2.x
            Reporter: Ganesh Sivalingam


When one column of a DataFrame is an array, for example:

```
+-------+---+---------+
|join_on|  a|        b|
+-------+---+---------+
|      1|  1|[1, 2, 3]|
|      1|  2|[1, 2, 3]|
|      1|  3|[1, 2, 3]|
+-------+---+---------+
```

If I remove the value in column a from the array in column b using Python's `list.remove(val)`, it appears to work; however, a subsequent manipulation of the DataFrame fails with an error saying that the item (the value in column a) is not present.

So PySpark is re-running `list.remove()`, but on the already-altered list/array.
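The mechanism can be illustrated in plain Python, with no Spark involved (a sketch of the hypothesis above, not Spark's actual execution path): a function that calls `list.remove()` on its argument mutates the caller's list, so a second evaluation of the same function on the same row data fails.

```python
def rm_element(a, list_a):
    list_a.remove(a)   # mutates the caller's list in place
    return list_a

row_b = [1, 2, 3]
rm_element(2, row_b)       # first evaluation succeeds; row_b is now [1, 3]
try:
    rm_element(2, row_b)   # a re-evaluation sees the already-altered list
except ValueError as e:
    print(e)               # list.remove(x): x not in list
```

This is consistent with the error reported: any time Spark evaluates the UDF more than once per row (e.g. once per downstream action), every evaluation after the first operates on mutated data.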

Below is a minimal example that I would expect to work, but which exhibits the issue:

```
import pyspark.sql.functions as F
import pyspark.sql.types as T
import numpy as np

cols = ['join_on', 'a']
vals = [
    (1, 1),
    (1, 2),
    (1, 3)
]

df = sqlContext.createDataFrame(vals, cols)
df_of_arrays = df\
    .groupBy('join_on')\
    .agg(F.collect_list('a').alias('b'))

df = df\
    .join(df_of_arrays, on='join_on')
df.show()

# NB: this UDF mutates its input list in place; if Spark evaluates it
# more than once per row, later evaluations see the altered array.
def rm_element(a, list_a):
    list_a.remove(a)
    return list_a

rm_element_udf = F.udf(rm_element, T.ArrayType(T.LongType()))
df = df.withColumn('one_removed', rm_element_udf("a", "b"))
df.show()

answer = df.withColumn('av', F.udf(lambda a: float(np.mean(a)))('one_removed'))
answer.show()
```
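A workaround sketch (assuming the diagnosis above is correct; `rm_element_safe` is a hypothetical name, not part of the report): copy the list inside the UDF before removing, so every evaluation starts from the original array.

```python
def rm_element_safe(a, list_a):
    out = list(list_a)   # shallow copy; the row's original array is left untouched
    out.remove(a)        # remove the first occurrence of `a` from the copy only
    return out

# Wrapped the same way as in the report:
#   rm_element_safe_udf = F.udf(rm_element_safe, T.ArrayType(T.LongType()))
#   df = df.withColumn('one_removed', rm_element_safe_udf("a", "b"))
```

With the copy in place, repeated evaluations of the UDF are idempotent with respect to the input row, so re-computation of the DataFrame no longer fails.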



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
