Posted to issues@spark.apache.org by "Mukul Murthy (JIRA)" <ji...@apache.org> on 2019/06/13 23:39:00 UTC

[jira] [Created] (SPARK-28043) Reading json with duplicate columns drops the first column value

Mukul Murthy created SPARK-28043:
------------------------------------

             Summary: Reading json with duplicate columns drops the first column value
                 Key: SPARK-28043
                 URL: https://issues.apache.org/jira/browse/SPARK-28043
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.4.0
            Reporter: Mukul Murthy


When reading a JSON blob with duplicate fields, Spark appears to ignore the value of the first one. The JSON specification recommends unique field names but does not require them; since both JSON and Spark SQL allow duplicate field names, we should fix the bug where the first column's value is dropped.

 

Repro (Python, 2.4):

>>> jsonRDD = spark.sparkContext.parallelize(['{"a": "blah", "a": "blah2"}'])
>>> df = spark.read.json(jsonRDD)
>>> df.show()
+----+-----+
|   a|    a|
+----+-----+
|null|blah2|
+----+-----+

 

The expected response would be:

+----+-----+
|   a|    a|
+----+-----+
|blah|blah2|
+----+-----+
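
For convenience, the same repro as a self-contained script rather than a REPL session (a minimal sketch; it assumes a local PySpark 2.4 installation, and the app name is chosen here purely for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-28043-repro").getOrCreate()

# A single JSON record with two fields named "a"; unique names are
# recommended by the JSON spec but not required.
json_rdd = spark.sparkContext.parallelize(['{"a": "blah", "a": "blah2"}'])

df = spark.read.json(json_rdd)
df.show()  # Observed on 2.4: the first "a" column is null instead of "blah".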


