Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2016/08/02 01:00:35 UTC

[jira] [Updated] (SPARK-16842) Concern about disallowing user-given schema for Parquet and ORC

     [ https://issues.apache.org/jira/browse/SPARK-16842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-16842:
---------------------------------
    Description: 
If my understanding is correct, when the user-given schema differs from the inferred schema, each data source handles the mismatch differently.

- For JSON and CSV
  they are generally permissive (for example, they allow compatibility among numeric types).

- For ORC and Parquet
  they are generally strict about types, so they do not allow such compatibility (except in very few cases, e.g. for Parquet, https://github.com/apache/spark/pull/14272 and https://github.com/apache/spark/pull/14278).

- For Text
  it only supports {{StringType}}.

- For JDBC
  it does not accept a user-given schema since it does not implement {{SchemaRelationProvider}}.
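
To make the comparison above concrete, here is a toy model in plain Python. It is only a sketch of the rules as I describe them in this list, not Spark's actual code: the function name {{accepts_user_type}}, the type names, and the exact rules are my assumptions, and the few Parquet exceptions mentioned above are deliberately ignored.

```python
# Toy model only: paraphrases the list above, NOT Spark's implementation.
# All names here (accepts_user_type, NUMERIC, the type strings) are hypothetical.
NUMERIC = {"int", "long", "float", "double"}


def accepts_user_type(source, inferred, user):
    """Return True if `source` would accept `user` in place of `inferred`."""
    if source == "jdbc":
        # JDBC does not take a user-given schema at all.
        return False
    if source == "text":
        # Text only supports the string type.
        return user == "string"
    if source in ("json", "csv"):
        # Permissive: numeric types are treated as compatible, and a string
        # column may be read as a date or timestamp (assumed rule).
        return (
            user == inferred
            or {inferred, user} <= NUMERIC
            or (inferred == "string" and user in ("date", "timestamp"))
        )
    if source in ("orc", "parquet"):
        # Strict: the type stored in the file must match exactly
        # (the rare exceptions are left out of this sketch).
        return user == inferred
    raise ValueError("unknown source: " + source)
```

For example, {{accepts_user_type("json", "long", "double")}} is true in this model while {{accepts_user_type("parquet", "long", "double")}} is false, which is the asymmetry this issue is about.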

By allowing a user-given schema, we can use types such as {{DateType}} and {{TimestampType}} for JSON and CSV. CSV and JSON arguably allow a permissive schema.
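
As a rough analogy for why this matters for JSON, here is a plain-Python sketch (not Spark code). The JSON text itself only carries strings and numbers, so a date is indistinguishable from any other string until a schema says otherwise. The names {{USER_SCHEMA}} and {{read_json_lines}} and the sample record are made up for illustration.

```python
import json
from datetime import date, datetime

# Hypothetical user-given schema: column name -> converter for that column.
USER_SCHEMA = {
    "id": int,
    "day": date.fromisoformat,      # plays the role of a date type
    "ts": datetime.fromisoformat,   # plays the role of a timestamp type
}


def read_json_lines(lines, schema):
    """Parse JSON lines and convert each field with the user-given schema."""
    rows = []
    for line in lines:
        raw = json.loads(line)
        rows.append({name: convert(raw[name]) for name, convert in schema.items()})
    return rows


# Made-up sample record: "day" and "ts" are plain strings in the JSON text.
lines = ['{"id": 1, "day": "2016-08-02", "ts": "2016-08-02T01:00:35"}']
raw_day = json.loads(lines[0])["day"]         # still a str, no type information
rows = read_json_lines(lines, USER_SCHEMA)    # "day" is now a date object
```

Without the schema, {{raw_day}} stays a string. With it, the same field is read as a date, which is the kind of benefit a user-given schema brings to JSON and CSV.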

In short, JSON and CSV do not have complete schema information written in the data, whereas ORC and Parquet do.

So, we might have to just disallow user-given schemas for ORC and Parquet. In fact, if my understanding is correct, we can almost never give ORC and Parquet a schema that differs from the inferred one.

> Concern about disallowing user-given schema for Parquet and ORC
> ---------------------------------------------------------------
>
>                 Key: SPARK-16842
>                 URL: https://issues.apache.org/jira/browse/SPARK-16842
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Hyukjin Kwon


