Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2016/11/03 09:13:58 UTC

[jira] [Resolved] (SPARK-17727) PySpark SQL arrays are not immutable, .remove and .pop cause issues

     [ https://issues.apache.org/jira/browse/SPARK-17727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-17727.
-------------------------------
    Resolution: Not A Problem

> PySpark SQL arrays are not immutable, .remove and .pop cause issues
> -------------------------------------------------------------------
>
>                 Key: SPARK-17727
>                 URL: https://issues.apache.org/jira/browse/SPARK-17727
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 2.0.0
>         Environment: OS X and Linux (Amazon Linux AMI release 2016.03), Python 2.x
>            Reporter: Ganesh Sivalingam
>
> When having one column of a DataFrame as an array, for example:
> {code}
> +-------+---+---------+
> |join_on|  a|        b|
> +-------+---+---------+
> |      1|  1|[1, 2, 3]|
> |      1|  2|[1, 2, 3]|
> |      1|  3|[1, 2, 3]|
> +-------+---+---------+
> {code}
> If I remove the value in column a from the array in column b using Python's `list.remove(val)` inside a UDF, it appears to work at first; however, a subsequent manipulation of the DataFrame fails with an error saying that the item (the value in column a) is not present in the list.
> It seems PySpark re-runs `list.remove()`, but on the already altered list/array.
> Below is a minimal example, which I think should work but exhibits this issue:
> {code:python}
> import pyspark.sql.functions as F
> import pyspark.sql.types as T
> import numpy as np
> cols = ['join_on', 'a']
> vals = [
>     (1, 1),
>     (1, 2),
>     (1, 3)
> ]
> df = sqlContext.createDataFrame(vals, cols)
> df_of_arrays = df\
>     .groupBy('join_on')\
>     .agg(F.collect_list('a').alias('b'))
> df = df\
>     .join(df_of_arrays, on='join_on')
> df.show()
> # rm_element mutates its input list in place and returns the same (now altered) object
> def rm_element(a, list_a):
>     list_a.remove(a)
>     return list_a
> rm_element_udf = F.udf(rm_element, T.ArrayType(T.LongType()))
> df = df.withColumn('one_removed', rm_element_udf("a", "b"))
> df.show()
> answer = df.withColumn('av', F.udf(lambda a: float(np.mean(a)))('one_removed'))
> answer.show()
> {code}
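> It seems the UDF does not always get a fresh copy of the array: when the column expression is evaluated again for the later action, `rm_element` receives the list that was already altered in place, so the value it is asked to remove is no longer there. A plain-Python sketch of the same failure mode (no Spark involved, names are illustrative):
> {code:python}
> # Calling the same in-place removal twice on one list object fails the second
> # time, because the first call already mutated the shared list.
> shared_b = [1, 2, 3]
> def rm_element(a, list_a):
>     list_a.remove(a)   # mutates the caller's list in place
>     return list_a
> print(rm_element(1, shared_b))   # [2, 3]; shared_b is now altered
> try:
>     rm_element(1, shared_b)      # a "second evaluation" sees the altered list
> except ValueError as e:
>     print(e)                     # list.remove(x): x not in list
> {code}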
> This can then be fixed by changing the `rm_element` function so that it copies its input before mutating it (note the added `import copy`):
> {code:python}
> import copy
> def rm_element(a, list_a):
>     tmp_list_a = copy.deepcopy(list_a)
>     tmp_list_a.remove(a)
>     return tmp_list_a
> {code}
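> For a flat array of longs like this one, a shallow copy should also be enough to leave the UDF's input untouched (the deep copy is just the most conservative option), e.g.:
> {code:python}
> def rm_element(a, list_a):
>     tmp_list_a = list(list_a)  # shallow copy; elements are plain longs, so nothing nested is shared
>     tmp_list_a.remove(a)
>     return tmp_list_a
> {code}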


