Posted to issues@spark.apache.org by "Sameer Agarwal (JIRA)" <ji...@apache.org> on 2018/01/08 20:53:00 UTC

[jira] [Updated] (SPARK-15693) Write schema definition out for file-based data sources to avoid schema inference

     [ https://issues.apache.org/jira/browse/SPARK-15693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sameer Agarwal updated SPARK-15693:
-----------------------------------
    Target Version/s: 2.4.0  (was: 2.3.0)

> Write schema definition out for file-based data sources to avoid schema inference
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-15693
>                 URL: https://issues.apache.org/jira/browse/SPARK-15693
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Reynold Xin
>
> Spark supports reading a variety of data formats, many of which do not have self-describing schemas. For these file formats, Spark can often infer the schema by scanning all of the data. However, schema inference is expensive and does not always infer the intended schema (for example, with JSON data Spark always infers integral values as long, rather than int).
> It would be great if Spark could write the schema definition out for file-based formats; when reading the data back in, the schema could then be "inferred" directly by reading the schema definition file without going through full schema inference. If the file does not exist, the good old schema inference should be performed.
> This ticket certainly merits a design doc that should discuss the spec for the schema definition, as well as all the corner cases this feature needs to handle (e.g. schema merging, schema evolution, partitioning). It would be great if the schema definition used a human-readable format (e.g. JSON).
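The mechanism the ticket proposes can be sketched in a few lines: persist a schema file alongside the data, prefer it on read, and fall back to full inference when it is absent. The sketch below is illustrative only, using the standard library rather than Spark's actual APIs; the file name "_schema.json" and the toy inference rule (integral values widen to "long", everything else to "string", mirroring the JSON behavior noted above) are assumptions, since the real spec is exactly what the requested design doc would decide.

```python
import json
from pathlib import Path

# Hypothetical name for the persisted schema definition file; the actual
# spec (name, format, merging rules) would come from the design doc.
SCHEMA_FILE = "_schema.json"


def write_schema(directory, schema):
    """Persist a schema definition (a list of {name, type} fields) as JSON."""
    Path(directory, SCHEMA_FILE).write_text(json.dumps(schema, indent=2))


def infer_schema(rows):
    """Stand-in for full schema inference: scan every row of the data.

    Mimics the JSON behavior described in the ticket: integral values
    are always inferred as "long", never "int".
    """
    fields = {}
    for row in rows:
        for name, value in row.items():
            inferred = "long" if isinstance(value, int) else "string"
            fields.setdefault(name, inferred)
    return [{"name": n, "type": t} for n, t in fields.items()]


def read_schema(directory, rows):
    """Prefer the schema definition file; fall back to inference if absent."""
    path = Path(directory, SCHEMA_FILE)
    if path.exists():
        return json.loads(path.read_text())
    return infer_schema(rows)
```

With a schema file present, a reader can recover the intended types (e.g. "int") without touching the data; without it, inference runs and yields the widened types.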



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
