Posted to issues@spark.apache.org by "Kumaresh AK (Jira)" <ji...@apache.org> on 2021/04/07 22:14:00 UTC
[jira] [Created] (SPARK-34982) Pyspark asDict() returns wrong fields for a nested dataframe
Kumaresh AK created SPARK-34982:
-----------------------------------
Summary: Pyspark asDict() returns wrong fields for a nested dataframe
Key: SPARK-34982
URL: https://issues.apache.org/jira/browse/SPARK-34982
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 3.0.2, 3.0.1
Environment: Tested with EMR 6.2.0, Python 3.8.5.
Also tested with local PySpark 3.0.1 on Windows, Python 3.8.5.
Reporter: Kumaresh AK
Hello! I upgraded a job to Spark 3.0.1 (from 2.4.4) and encountered this issue. The job uses asDict(True) in pyspark. I reproduced the issue with a concise schema and code. Consider this example schema:
{code:java}
root
|-- id: integer (nullable = false)
|-- struct_1: struct (nullable = true)
| |-- array_1_1: array (nullable = true)
| | |-- element: string (containsNull = false)
|-- struct_2: struct (nullable = true)
| |-- array_2_1: array (nullable = true)
| | |-- element: string (containsNull = false){code}
I created 100 rows with the above schema, filled them with some numbers, and checked row.asDict(True) against the input. For some rows
{code:java}
struct_1.array_1_1{code}
is missing. Instead I get
{code:java}
struct_1.array_2_1{code}
I also observe that this happens when array_1_1 is null. Example assertion failure:
{code:java}
AssertionError: {'id': 7, 'struct_1': {'array_2_1': None}, 'struct_2': {'array_2_1': None}} != {'id': 7, 'struct_1': {'array_1_1': None}, 'struct_2': {'array_2_1': None}}
{code}
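The check described above can be sketched roughly as follows. This is a minimal sketch, not the reporter's actual code: the helper names (`make_rows`, `expected_dict`, `field_paths`, `find_mismatches`) and the pattern of null values are assumptions made for illustration, and the report does not confirm that this particular data triggers the problem on a given Spark version. The PySpark import is deferred so the pure-Python helpers can be used without a Spark installation.

```python
from typing import Any, Dict, List, Optional, Set, Tuple

# Hypothetical input layout: (id, (array_1_1,), (array_2_1,))
RowTuple = Tuple[int, Tuple[Optional[List[str]]], Tuple[Optional[List[str]]]]


def make_rows(n: int = 100) -> List[RowTuple]:
    """Build n input rows. The null pattern is an arbitrary assumption."""
    rows = []
    for i in range(n):
        array_1_1 = None if i % 3 == 0 else [str(i)]
        array_2_1 = None if i % 2 == 0 else [str(i), str(i + 1)]
        rows.append((i, (array_1_1,), (array_2_1,)))
    return rows


def expected_dict(row: RowTuple) -> Dict[str, Any]:
    """The nested dict that row.asDict(True) should produce for this input."""
    row_id, (array_1_1,), (array_2_1,) = row
    return {
        "id": row_id,
        "struct_1": {"array_1_1": array_1_1},
        "struct_2": {"array_2_1": array_2_1},
    }


def field_paths(value: Dict[str, Any], prefix: str = "") -> Set[str]:
    """Return the dotted paths of all leaf fields in a nested dict."""
    paths: Set[str] = set()
    for key, child in value.items():
        path = prefix + key
        if isinstance(child, dict):
            paths |= field_paths(child, path + ".")
        else:
            paths.add(path)
    return paths


def find_mismatches(spark) -> List[Tuple[Dict[str, Any], Dict[str, Any]]]:
    """Round-trip the rows through a DataFrame and compare asDict(True) output.

    `spark` is an existing SparkSession supplied by the caller.
    """
    # Imported lazily so the helpers above work without PySpark installed.
    from pyspark.sql.types import (
        ArrayType, IntegerType, StringType, StructField, StructType,
    )

    strings = ArrayType(StringType(), containsNull=False)
    schema = StructType([
        StructField("id", IntegerType(), nullable=False),
        StructField("struct_1", StructType([StructField("array_1_1", strings, True)]), True),
        StructField("struct_2", StructType([StructField("array_2_1", strings, True)]), True),
    ])

    rows = make_rows()
    expected = {row[0]: expected_dict(row) for row in rows}
    df = spark.createDataFrame(rows, schema)

    mismatches = []
    for row in df.collect():
        actual = row.asDict(True)  # recursive conversion of nested Rows
        if actual != expected[actual["id"]]:
            mismatches.append((actual, expected[actual["id"]]))
    return mismatches
```

With a session available, `find_mismatches(SparkSession.builder.getOrCreate())` would return the (actual, expected) pairs that differ. Applying `field_paths` to each side of the assertion failure quoted above shows that `struct_1.array_1_1` is missing from the actual dict and `struct_1.array_2_1` appears in its place.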