Posted to issues@spark.apache.org by "Maciej Szymkiewicz (Jira)" <ji...@apache.org> on 2022/02/01 12:00:00 UTC

[jira] [Assigned] (SPARK-38067) Inconsistent missing values handling in Pandas on Spark to_json

     [ https://issues.apache.org/jira/browse/SPARK-38067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maciej Szymkiewicz reassigned SPARK-38067:
------------------------------------------

    Assignee: Bjørn Jørgensen

> Inconsistent missing values handling in Pandas on Spark to_json
> ---------------------------------------------------------------
>
>                 Key: SPARK-38067
>                 URL: https://issues.apache.org/jira/browse/SPARK-38067
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.2.1
>            Reporter: Bjørn Jørgensen
>            Assignee: Bjørn Jørgensen
>            Priority: Major
>
> If {{ps.DataFrame.to_json}} is called without the {{path}} argument, missing values are written explicitly:
> {code:python}
> import pandas as pd
> import pyspark.pandas as ps
> pdf = pd.DataFrame({"id": [1, 2, 3], "value": [None, 3, None]})
> psf = ps.from_pandas(pdf)
> psf.to_json()
> ## '[{"id":1,"value":null},{"id":2,"value":3.0},{"id":3,"value":null}]'
> {code}
> This behavior is consistent with Pandas:
> {code:python}
> pdf.to_json()
> ## '{"id":{"0":1,"1":2,"2":3},"value":{"0":null,"1":3.0,"2":null}}'
> {code}
> However, if {{path}} is provided, missing values are omitted by default:
> {code:python}
> import tempfile
> path = tempfile.mktemp()
> psf.to_json(path)
> spark.read.text(path).show()
> ## +--------------------+
> ## |               value|
> ## +--------------------+
> ## |{"id":2,"value":3.0}|
> ## |            {"id":3}|
> ## |            {"id":1}|
> ## +--------------------+
> {code}
> We should make {{ignoreNullFields}} default to {{False}} in the pandas API on Spark, so that both cases handle missing values in the same way.
> {code:python}
> psf.to_json(path, ignoreNullFields=False)
> spark.read.text(path).show(truncate=False)
> ## +---------------------+
> ## |value                |
> ## +---------------------+
> ## |{"id":3,"value":null}|
> ## |{"id":1,"value":null}|
> ## |{"id":2,"value":3.0} |
> ## +---------------------+
> {code}
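
The difference between the two outputs quoted above can be imitated in plain Python without a Spark session. This is a minimal sketch, not Spark's implementation: the helper name `to_json_lines` and its `ignore_null_fields` parameter are hypothetical, chosen only to mirror the `ignoreNullFields` writer option from the report.

```python
import json


def to_json_lines(records, ignore_null_fields=True):
    """Serialize dicts to JSON lines, optionally dropping None-valued fields.

    Hypothetical helper for illustration only; it mimics the effect of the
    ignoreNullFields option, not how Spark implements it.
    """
    lines = []
    for record in records:
        if ignore_null_fields:
            # Drop keys whose value is missing, as in the output written with a path.
            record = {k: v for k, v in record.items() if v is not None}
        # Compact separators match the spacing of the JSON shown in the report.
        lines.append(json.dumps(record, separators=(",", ":")))
    return lines


rows = [
    {"id": 1, "value": None},
    {"id": 2, "value": 3.0},
    {"id": 3, "value": None},
]

omitted = to_json_lines(rows, ignore_null_fields=True)
explicit = to_json_lines(rows, ignore_null_fields=False)

for line in omitted:
    print(line)
for line in explicit:
    print(line)
```

With `ignore_null_fields=True` the rows with a missing `value` lose that key entirely. With `False` every row keeps both keys and the missing value is written as `null`, which is the behavior the report proposes as the default.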


