Posted to issues@spark.apache.org by "Mukul Murthy (JIRA)" <ji...@apache.org> on 2019/06/13 23:41:00 UTC

[jira] [Updated] (SPARK-28043) Reading json with duplicate columns drops the first column value

     [ https://issues.apache.org/jira/browse/SPARK-28043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mukul Murthy updated SPARK-28043:
---------------------------------
    Description: 
When reading a JSON blob with duplicate fields, Spark appears to ignore the value of the first one. The JSON spec recommends unique names but does not require them; since both JSON and Spark SQL allow duplicate field names, we should fix the bug where the first column's value is dropped.

 

I'm guessing that somewhere during JSON parsing we turn each object into a Map, which causes the first value to be overwritten by the later duplicate.
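For context, plain Python shows the same collapse when duplicate keys are parsed into a dict, which is the kind of Map behavior guessed at above. This is a minimal sketch using the standard json module, not Spark's actual parser, purely to illustrate the hypothesis:

```python
import json

blob = '{"a": "blah", "a": "blah2"}'

# Default parsing into a dict keeps only the last duplicate key,
# so the first value silently disappears.
print(json.loads(blob))  # {'a': 'blah2'}

# object_pairs_hook shows the raw stream really does contain both
# values; the loss happens only at the dict/Map construction step.
print(json.loads(blob, object_pairs_hook=lambda pairs: pairs))
# [('a', 'blah'), ('a', 'blah2')]
```

If Spark's parser similarly materializes each JSON object into a keyed map before matching values to schema columns, that would explain the null in the first column.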

 

Repro (Python, 2.4):

>>> jsonRDD = spark.sparkContext.parallelize(['{"a": "blah", "a": "blah2"}'])
>>> df = spark.read.json(jsonRDD)
>>> df.show()
+----+-----+
|   a|    a|
+----+-----+
|null|blah2|
+----+-----+

 

The expected response would be:

+----+-----+
|   a|    a|
+----+-----+
|blah|blah2|
+----+-----+

  was:
When reading a JSON blob with duplicate fields, Spark appears to ignore the value of the first one. The JSON spec recommends unique names but does not require them; since both JSON and Spark SQL allow duplicate field names, we should fix the bug where the first column's value is dropped.

 

Repro (Python, 2.4):

>>> jsonRDD = spark.sparkContext.parallelize(['{"a": "blah", "a": "blah2"}'])
>>> df = spark.read.json(jsonRDD)
>>> df.show()
+----+-----+
|   a|    a|
+----+-----+
|null|blah2|
+----+-----+

 

The expected response would be:

+----+-----+
|   a|    a|
+----+-----+
|blah|blah2|
+----+-----+


> Reading json with duplicate columns drops the first column value
> ----------------------------------------------------------------
>
>                 Key: SPARK-28043
>                 URL: https://issues.apache.org/jira/browse/SPARK-28043
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Mukul Murthy
>            Priority: Major
>
> When reading a JSON blob with duplicate fields, Spark appears to ignore the value of the first one. The JSON spec recommends unique names but does not require them; since both JSON and Spark SQL allow duplicate field names, we should fix the bug where the first column's value is dropped.
>  
> I'm guessing that somewhere during JSON parsing we turn each object into a Map, which causes the first value to be overwritten by the later duplicate.
>  
> Repro (Python, 2.4):
> >>> jsonRDD = spark.sparkContext.parallelize(['{"a": "blah", "a": "blah2"}'])
> >>> df = spark.read.json(jsonRDD)
> >>> df.show()
> +----+-----+
> |   a|    a|
> +----+-----+
> |null|blah2|
> +----+-----+
>  
> The expected response would be:
> +----+-----+
> |   a|    a|
> +----+-----+
> |blah|blah2|
> +----+-----+



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org