Posted to issues@spark.apache.org by "Myles Baker (JIRA)" <ji...@apache.org> on 2016/04/27 16:55:13 UTC

[jira] [Commented] (SPARK-7301) Issue with duplicated fields in interpreted json schemas

    [ https://issues.apache.org/jira/browse/SPARK-7301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15260284#comment-15260284 ] 

Myles Baker commented on SPARK-7301:
------------------------------------

Be careful with schema inference if your data format allows optional elements: if two JSON documents differ, the inferred schema changes along with them.
To avoid this, construct the schema manually. (For example, I parse each string into a JSON object with json4s and then walk that arbitrary JSON into a case class.)
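A minimal sketch of that approach, assuming json4s is on the classpath; the Payment case class, its fields, and the key-lowercasing step are illustrative assumptions, not part of the original comment:

import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// Hypothetical target shape; optional fields tolerate documents that omit them.
case class Payment(currency: Option[String], amount: Option[Double])

implicit val formats: Formats = DefaultFormats

def toPayment(raw: String): Payment = {
  // Lower-case every key so variants such as "Currency" and "currency"
  // collapse onto a single field before extraction into the case class.
  val normalised = parse(raw).transformField {
    case JField(name, value) => JField(name.toLowerCase, value)
  }
  normalised.extract[Payment]
}

Applied over an RDD[String] of raw documents (for example sc.textFile(...).map(toPayment)), this yields records with one fixed shape regardless of how individual documents spell their keys.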


> Issue with duplicated fields in interpreted json schemas
> --------------------------------------------------------
>
>                 Key: SPARK-7301
>                 URL: https://issues.apache.org/jira/browse/SPARK-7301
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: David Crossland
>
> I have a large JSON dataset that has evolved over time; as a result, some fields appear to have been slightly renamed or capitalised differently. This means there are certain fields that Spark considers ambiguous when I attempt to access them.
> I get an
> org.apache.spark.sql.AnalysisException: Ambiguous reference to fields StructField(Currency,StringType,true), StructField(currency,StringType,true);
> error.
> There appears to be no way to resolve an ambiguous field after it has been inferred by Spark SQL, other than manually constructing the schema using StructType/StructField, which is rather heavy-handed as the schema is quite large. Is there some way to resolve an ambiguous reference, or to modify the schema after inference? It seems like something of a bug that I can't tell Spark to treat both fields as though they were the same. I've created a test where I manually defined a schema as
> val schema = StructType(Seq(StructField("A", StringType, true)))
> And it returns 2 rows when I perform a count on the following dataset:
> {"A":"test1"}
> {"a":"test2"}
> If I could modify the schema to remove the duplicate entries, then I could work around this issue.
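For reference, a sketch of that explicit-schema test as it might be run against the two-line dataset above; the SQLContext reader call and the input path are assumptions:

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Fixed single-column schema: inference never sees the "A"/"a" duplication.
val schema = StructType(Seq(StructField("A", StringType, true)))

val df = sqlContext.read.schema(schema).json("/path/to/dataset.json")
// count() returns 2: each input line becomes a row against the fixed schema,
// with no ambiguous duplicate columns inferred.
df.count()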


