Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2016/07/10 08:18:11 UTC
[jira] [Commented] (SPARK-16472) Inconsistent nullability in schema after being read in SQL API.
[ https://issues.apache.org/jira/browse/SPARK-16472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15369489#comment-15369489 ]
Apache Spark commented on SPARK-16472:
--------------------------------------
User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/14124
> Inconsistent nullability in schema after being read in SQL API.
> ---------------------------------------------------------------
>
> Key: SPARK-16472
> URL: https://issues.apache.org/jira/browse/SPARK-16472
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Hyukjin Kwon
> Priority: Minor
>
> It seems that data sources implementing {{FileFormat}} load data with every field forced to be nullable. This appears to have been officially documented in SPARK-11360 and was discussed here: https://www.mail-archive.com/user@spark.apache.org/msg39230.html
> However, I realised that several APIs do not follow this. For example,
> {code}
> DataFrameReader.json(jsonRDD: RDD[String])
> {code}
> So, the code below:
> {code}
> import org.apache.spark.sql.types._  // StructType, StructField, IntegerType
> val rdd = spark.sparkContext.makeRDD(Seq("{\"a\" : 1}", "{\"a\" : null}"))
> val schema = StructType(StructField("a", IntegerType, nullable = false) :: Nil)
> val df = spark.read.schema(schema).json(rdd)
> df.printSchema()
> {code}
> prints the following:
> {code}
> root
> |-- a: integer (nullable = false)
> {code}
> This API keeps the user-specified schema as it is. However, the schema comes out different (with nullable fields) when the data is loaded through the APIs below:
> {code}
> spark.read.format("json").schema(...).load(path).printSchema()
> {code}
> {code}
> spark.read.schema(...).load(path).printSchema()
> {code}
> Both of these produce the following:
> {code}
> root
> |-- a: integer (nullable = true)
> {code}
> In addition, this also happens with structured streaming (even when the data written by structured streaming is read back as a batch).
> While testing, I wrote some test code and patches. Please see the following PR for more cases.
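The comparison described in the report can be sketched as one small program that applies the same non-nullable schema through both read paths and prints the resulting nullability flags. This is only a minimal sketch under a few assumptions: the object name `NullabilityCheck`, the helper `nullability`, the application name, and the temporary path are hypothetical and not part of the original report. It also assumes a Spark version that still offers `json(RDD[String])` (deprecated in later releases). No expected output is shown, because the flags depend on the Spark version in use.

```scala
import java.nio.file.Files

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

object NullabilityCheck {
  // Schema from the report: a single non-nullable integer column "a".
  val schema: StructType =
    StructType(StructField("a", IntegerType, nullable = false) :: Nil)

  // Hypothetical helper: maps each field name to its nullable flag,
  // so that two schemas can be compared at a glance.
  def nullability(s: StructType): Map[String, Boolean] =
    s.fields.map(f => f.name -> f.nullable).toMap

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SPARK-16472-check") // illustrative name
      .getOrCreate()
    import spark.implicits._

    // Write one JSON record as text into a temporary directory (illustrative path).
    val path = Files.createTempDirectory("spark-16472").resolve("data").toString
    Seq("""{"a": 1}""").toDS().write.text(path)

    // Path 1: JSON strings that are already in memory as an RDD.
    val fromRdd = spark.read.schema(schema)
      .json(spark.sparkContext.makeRDD(Seq("""{"a": 1}""")))

    // Path 2: the same schema applied to files on disk.
    val fromFiles = spark.read.format("json").schema(schema).load(path)

    println(s"in-memory:  ${nullability(fromRdd.schema)}")
    println(s"file-based: ${nullability(fromFiles.schema)}")

    spark.stop()
  }
}
```

If the two printed maps differ, the inconsistency described in this issue is present in the Spark version being run.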