Posted to issues@spark.apache.org by "Michael Souder (Jira)" <ji...@apache.org> on 2020/03/25 22:57:00 UTC

[jira] [Created] (SPARK-31256) Dropna doesn't work for struct columns

Michael Souder created SPARK-31256:
--------------------------------------

             Summary: Dropna doesn't work for struct columns
                 Key: SPARK-31256
                 URL: https://issues.apache.org/jira/browse/SPARK-31256
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.4.5
         Environment: Spark 2.4.5
                      Python 3.7.4
            Reporter: Michael Souder


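Calling dropna with a subset that names a field inside a struct column (for example 'struct_col.name') drops every row of the DataFrame instead of only the rows where that field is null.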
{code:python}
# assumes an active SparkSession bound to `spark`, as in the pyspark shell
import pyspark.sql.functions as F

df = spark.createDataFrame([(5, 80, 'Alice'), (10, None, 'Bob'), (15, 80, None)], schema=['age', 'height', 'name'])
df.show()
+---+------+-----+
|age|height| name|
+---+------+-----+
|  5|    80|Alice|
| 10|  null|  Bob|
| 15|    80| null|
+---+------+-----+

# this works just fine
df.dropna(subset=['name']).show()
+---+------+-----+
|age|height| name|
+---+------+-----+
|  5|    80|Alice|
| 10|  null|  Bob|
+---+------+-----+

# now add a struct column
df_with_struct = df.withColumn('struct_col', F.struct('age', 'height', 'name'))
df_with_struct.show(truncate=False)
+---+------+-----+--------------+
|age|height|name |struct_col    |
+---+------+-----+--------------+
|5  |80    |Alice|[5, 80, Alice]|
|10 |null  |Bob  |[10,, Bob]    |
|15 |80    |null |[15, 80,]     |
+---+------+-----+--------------+

# now dropna drops the whole dataframe when you use struct_col
df_with_struct.dropna(subset=['struct_col.name']).show(truncate=False)
+---+------+----+----------+
|age|height|name|struct_col|
+---+------+----+----------+
+---+------+----+----------+
{code}
I've tested the above code in Spark 2.4.4 with Python 3.7.4 and in Spark 2.3.1 with Python 3.6.8. In both, only the row with a null struct_col.name is dropped, and the result looks like:
{code:python}
df_with_struct.dropna(subset=['struct_col.name']).show(truncate=False)
+---+------+-----+--------------+
|age|height|name |struct_col    |
+---+------+-----+--------------+
|5  |80    |Alice|[5, 80, Alice]|
|10 |null  |Bob  |[10,, Bob]    |
+---+------+-----+--------------+
{code}


