Posted to issues@spark.apache.org by "Hasil Sharma (JIRA)" <ji...@apache.org> on 2016/08/07 14:00:24 UTC

[jira] [Commented] (SPARK-12436) If all values of a JSON field is null, JSON's inferSchema should return NullType instead of StringType

    [ https://issues.apache.org/jira/browse/SPARK-12436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410946#comment-15410946 ] 

Hasil Sharma commented on SPARK-12436:
--------------------------------------

Is this issue solved? If not, I would like to contribute.

> If all values of a JSON field is null, JSON's inferSchema should return NullType instead of StringType
> ------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-12436
>                 URL: https://issues.apache.org/jira/browse/SPARK-12436
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Reynold Xin
>              Labels: starter
>
> Right now, JSON's inferSchema returns {{StringType}} for a field that always has null values, and {{ArrayType(StringType)}} for a field that always has empty array values. Although this behavior makes writing JSON data to other data sources easy (i.e. when writing data, we do not need to remove {{NullType}} or {{ArrayType(NullType)}} columns), it makes it hard for downstream applications to reason about the actual schema of the data, and thus makes schema merging hard. We should allow JSON's inferSchema to return {{NullType}} and {{ArrayType(NullType)}}. We also need to make sure that when we write data out, we remove those {{NullType}} or {{ArrayType(NullType)}} columns first.
> Besides {{NullType}} and {{ArrayType(NullType)}}, we may need to do the same thing for empty {{StructType}}s (i.e. a {{StructType}} having 0 fields).
> To finish this work, we need to finish the following sub-tasks:
> * Allow JSON's inferSchema to return {{NullType}} and {{ArrayType(NullType)}}.
> * Determine whether we need to add the operation of removing {{NullType}} and {{ArrayType(NullType)}} columns from the data that will be written out for all data sources (i.e. data sources based on our data source API and Hive tables), or whether we should add this operation only for certain data sources (e.g. Parquet). For example, we may not need this operation for Hive because Hive has VoidObjectInspector.
> * Implement the change and get it merged to Spark master.
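
For illustration only, the behavior proposed in the description above can be sketched in plain Python. This is not Spark's implementation and does not use any Spark API: the type names are plain strings that merely mirror Spark's names, and the helpers infer_value, merge, infer_schema, is_droppable, and drop_null_columns are hypothetical names introduced for this sketch. It assumes line-delimited JSON objects, widens LongType/DoubleType conflicts to DoubleType, and falls back to StringType for any other conflict.

```python
import json

# Plain-string stand-ins that only mirror Spark's type names (assumption:
# this sketch does not import or call Spark).
NULL, STRING, LONG, DOUBLE, BOOLEAN = (
    "NullType", "StringType", "LongType", "DoubleType", "BooleanType")


def merge(a, b):
    """Merge two inferred types; NullType yields to any concrete type."""
    if a == b:
        return a
    if a == NULL:
        return b
    if b == NULL:
        return a
    if isinstance(a, tuple) and isinstance(b, tuple) and a[0] == b[0]:
        if a[0] == "ArrayType":
            return ("ArrayType", merge(a[1], b[1]))
        # Both are structs: merge field by field over the union of names.
        names = sorted(set(a[1]) | set(b[1]))
        return ("StructType",
                {n: merge(a[1].get(n, NULL), b[1].get(n, NULL)) for n in names})
    if {a, b} == {LONG, DOUBLE}:
        return DOUBLE
    return STRING  # assumed fallback for otherwise incompatible types


def infer_value(value):
    """Infer a type for one JSON value, keeping NullType for nulls."""
    if value is None:
        return NULL
    if isinstance(value, bool):  # bool must be checked before int
        return BOOLEAN
    if isinstance(value, int):
        return LONG
    if isinstance(value, float):
        return DOUBLE
    if isinstance(value, str):
        return STRING
    if isinstance(value, list):
        element = NULL  # an empty array stays ArrayType(NullType)
        for item in value:
            element = merge(element, infer_value(item))
        return ("ArrayType", element)
    if isinstance(value, dict):
        return ("StructType", {k: infer_value(v) for k, v in value.items()})
    raise TypeError("unsupported JSON value: %r" % (value,))


def infer_schema(lines):
    """Infer a field -> type mapping from line-delimited JSON objects."""
    schema = {}
    for line in lines:
        for name, value in json.loads(line).items():
            schema[name] = merge(schema.get(name, NULL), infer_value(value))
    return schema


def is_droppable(t):
    """True for NullType, ArrayType(NullType), or a StructType with 0 fields."""
    return t == NULL or t == ("ArrayType", NULL) or t == ("StructType", {})


def drop_null_columns(schema):
    """Remove the columns that the description says to strip before writing."""
    return {name: t for name, t in schema.items() if not is_droppable(t)}
```

Calling infer_schema on records whose field is always null keeps that field as NullType rather than StringType, and drop_null_columns then removes it (along with always-empty arrays and empty structs) before a write. Whether that removal should apply to every data source or only to some is the open question in the second sub-task.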


