You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2016/07/10 08:18:11 UTC

[jira] [Commented] (SPARK-16472) Inconsistent nullability in schema after being read in SQL API.

    [ https://issues.apache.org/jira/browse/SPARK-16472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15369489#comment-15369489 ] 

Apache Spark commented on SPARK-16472:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/14124

> Inconsistent nullability in schema after being read in SQL API.
> ---------------------------------------------------------------
>
>                 Key: SPARK-16472
>                 URL: https://issues.apache.org/jira/browse/SPARK-16472
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Hyukjin Kwon
>            Priority: Minor
>
> It seems the data sources implementing {{FileFormat}} seems loading the data by forcing the fields as nullable fields. It seems this was official documented SPARK-11360 and was discussed here https://www.mail-archive.com/user@spark.apache.org/msg39230.html
> However, I realised that several APIs do not follow this. For example,
> {code}
> DataFrame.json(jsonRDD: RDD[String])
> {code}
> So, the codes below:
> {code}
> val rdd = spark.sparkContext.makeRDD(Seq("{\"a\" : 1}", "{\"a\" : null}"))
> val schema = StructType(StructField("a", IntegerType, nullable = false) :: Nil)
> val df = spark.read.schema(schema).json(rdd)
> df.printSchema()
> {code}
> prints below:
> {code}
> root
>  |-- a: integer (nullable = false)
> {code}
> This API loads the schema as it is after loading. However, the schema became different when loading it by the API below (nullable fields) :
> {code}
> spark.read.format("json").schema(...).load(path).printSchema()
> {code}
> {code}
> spark.read.schema(...).load(path).printSchema()
> {code}
> produce below:
> {code}
> root
>  |-- a: integer (nullable = true)
> {code}
> In addition, this is happening for structured streaming as well. (even when we read batch after writing it by structured streaming).
> While testing, I wrote some tests codes and patches. Please see the following PR for more cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org