Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2021/07/06 02:02:00 UTC

[jira] [Commented] (SPARK-34982) Pyspark asDict() returns wrong child field for nested dataframe

    [ https://issues.apache.org/jira/browse/SPARK-34982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17375108#comment-17375108 ] 

Hyukjin Kwon commented on SPARK-34982:
--------------------------------------

Hm, I can't reproduce it in the current master branch.

> Pyspark asDict() returns wrong child field for nested dataframe
> ---------------------------------------------------------------
>
>                 Key: SPARK-34982
>                 URL: https://issues.apache.org/jira/browse/SPARK-34982
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.0.1, 3.0.2
>         Environment: Tested with EMR 6.2.0, Spark v3.0.2, Python v3.8.5.
> Also tested with local PySpark on Windows 10, Spark v3.0.1, Python v3.8.5.
>            Reporter: Kumaresh AK
>            Priority: Minor
>         Attachments: SPARK-34982.py
>
>
> Hello! I upgraded a job to Spark 3.0.1 (from 2.4.4) and encountered this issue. The job uses asDict(True) in pyspark. I reproduced the issue with a concise schema and code. Consider this example schema:
> {code:java}
> root
>  |-- id: integer (nullable = false)
>  |-- struct_1: struct (nullable = true)
>  | |-- array_1_1: array (nullable = true)
>  | | |-- element: string (containsNull = false)
>  |-- struct_2: struct (nullable = true)
>  | |-- array_2_1: array (nullable = true)
>  | | |-- element: string (containsNull = false){code}
> I created 100 rows with the above schema, filled them with some numbers, and checked row.asDict(True) against the input. For some rows,
> {code:java}
> struct_1.array_1_1{code}
> is missing. Instead I get
> {code:java}
> struct_1.array_2_1{code}
> I also observe that this happens when array_1_1 is null. Example assert failure:
> {code:java}
> AssertionError: {'id': 7, 'struct_1': {'array_2_1': None}, 'struct_2': {'array_2_1': None}} != {'id': 7, 'struct_1': {'array_1_1': None}, 'struct_2': {'array_2_1': None}}
> {code}
> I have attached a minimal script that reproduces this issue.
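
For anyone trying to reproduce this, the key-level mismatch in the assertion failure quoted above can be located with a small comparison helper. The following is only a sketch in plain Python: diff_keys is a hypothetical name, it is not part of PySpark, and it is not taken from the attached SPARK-34982.py. It assumes the "actual" dict is whatever row.asDict(True) returned and the "expected" dict is the input used to build the row.

```python
# Hypothetical helper for illustration; not part of PySpark or the attached script.
def diff_keys(actual, expected, path=""):
    """Return (path, missing_keys, unexpected_keys) tuples for nested dicts."""
    mismatches = []
    missing = sorted(set(expected) - set(actual))
    unexpected = sorted(set(actual) - set(expected))
    if missing or unexpected:
        mismatches.append((path or "<root>", missing, unexpected))
    # Recurse only into keys present on both sides whose values are dicts.
    for key in sorted(set(actual) & set(expected)):
        if isinstance(actual[key], dict) and isinstance(expected[key], dict):
            child_path = f"{path}.{key}" if path else key
            mismatches.extend(diff_keys(actual[key], expected[key], child_path))
    return mismatches


# The two dicts from the assertion failure in the issue description.
actual = {'id': 7, 'struct_1': {'array_2_1': None}, 'struct_2': {'array_2_1': None}}
expected = {'id': 7, 'struct_1': {'array_1_1': None}, 'struct_2': {'array_2_1': None}}

print(diff_keys(actual, expected))
# -> [('struct_1', ['array_1_1'], ['array_2_1'])]
```

Run against each of the 100 rows, a helper like this would show whether the wrong child field name only ever appears under struct_1, as in the reported failure.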


