Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:02:32 UTC
[jira] [Updated] (SPARK-23494) Expose InferSchema's functionalities to the outside

     [ https://issues.apache.org/jira/browse/SPARK-23494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-23494:
---------------------------------
    Labels: bulk-closed  (was: )

> Expose InferSchema's functionalities to the outside
> ---------------------------------------------------
>
>                 Key: SPARK-23494
>                 URL: https://issues.apache.org/jira/browse/SPARK-23494
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core, SQL
>    Affects Versions: 2.2.1
>            Reporter: David Courtinot
>            Priority: Major
>              Labels: bulk-closed
>
> I'm proposing that InferSchema's internals (infer the schema of each record, merge two schemata, and canonicalize the result) be exposed to the outside.
> *Use-case*
> My team continuously produces large amounts of JSON data. The schema is and must be very dynamic: fields can appear and disappear from one day to the next, most fields are nullable, some fields occur only rarely, etc.
> In another job, we download this data, sample it, and infer the schema using Dataset.schema(). From there, we convert the data to Parquet and upload it somewhere for later querying. This approach has proved problematic:
>  * Rare fields can be absent from a sample, and therefore absent from the schema. This results in exceptions when trying to query those fields. We have had to implement cumbersome fixes for this, involving a manually curated set of required fields.
>  * This is expensive. Going through a sample of the data to infer the schema is still a very costly operation for us. Caching the JSON RDD to disk (it doesn't fit in memory) proved to be at least as slow as traversing the sample first and the whole data next.
> *Proposition*
> InferSchema is essentially a fold operator. This means a Spark accumulator can easily be built on top of it in order to calculate a schema alongside an RDD calculation. In the above use-case, it has three main advantages:
>  * The schema is inferred on the entire data, and therefore contains all possible fields no matter how low their frequency.
>  * The computational overhead is negligible, since inference happens at the same time as writing the data to an external store rather than in a separate evaluation of the RDD for the sole purpose of schema inference.
>  * After writing the schema to an external store, we can load the JSON data in a Dataset without ever paying the inference cost again (just the conversion from JSON to Row). We keep the advantages and flexibility of JSON while also benefiting from the powerful features and optimizations available on Datasets or Parquet itself.
> With such a feature, users can decide to use their JSON (or any other) data as structured data whenever they want, even though the actual schema may vary every ten minutes, as long as they record the schema of each portion of data.
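
For illustration only, the fold described in the proposition can be sketched in plain Python, without Spark. Everything here is hypothetical and heavily simplified: the string type names, the dict-based schema model, the fallback to "string" on conflicting types, and the names infer_record, merge_types, canonicalize and SchemaAccumulator are assumptions of this sketch, not Spark's InferSchema or accumulator API. It shows the three steps named in the description (per-record inference, merging two schemata, canonicalizing the result) and how a schema could be accumulated per partition while the records are being written.

```python
from functools import reduce

# Simplified schema model (an assumption of this sketch, not Spark's StructType):
#   primitive -> "null" | "boolean" | "long" | "double" | "string"
#   struct    -> dict mapping field name to type
#   array     -> ("array", element_type)

def infer_value(value):
    """Infer the type of one parsed JSON value."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        return infer_record(value)
    if isinstance(value, list):
        return ("array", reduce(merge_types, map(infer_value, value), "null"))
    raise TypeError(f"unsupported JSON value: {value!r}")

def infer_record(record):
    """Infer the schema of a single record (a parsed JSON object)."""
    return {key: infer_value(value) for key, value in record.items()}

def merge_types(a, b):
    """Merge two schemata; "null" and the empty struct {} act as identities."""
    if a == b:
        return a
    if a == "null":
        return b
    if b == "null":
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        merged = dict(a)
        for key, t in b.items():
            merged[key] = merge_types(merged[key], t) if key in merged else t
        return merged
    if isinstance(a, tuple) and isinstance(b, tuple):
        return ("array", merge_types(a[1], b[1]))
    if (a, b) in (("long", "double"), ("double", "long")):
        return "double"
    return "string"  # assumed fallback for otherwise incompatible types

def canonicalize(t):
    """Sort struct fields and replace unresolved nulls with "string" (assumed)."""
    if t == "null":
        return "string"
    if isinstance(t, dict):
        return {key: canonicalize(t[key]) for key in sorted(t)}
    if isinstance(t, tuple):
        return ("array", canonicalize(t[1]))
    return t

class SchemaAccumulator:
    """Accumulator-style wrapper: add() per record, merge() across partitions."""
    def __init__(self):
        self.schema = {}
    def add(self, record):
        self.schema = merge_types(self.schema, infer_record(record))
    def merge(self, other):
        self.schema = merge_types(self.schema, other.schema)

records = [
    {"id": 1, "name": "a"},
    {"id": 2, "score": 1.5},
    {"id": 3, "score": 2, "tags": ["x"]},
    {"id": 4, "rare": {"flag": True}, "name": None},
]
partitions = [records[:2], records[2:]]

written = []        # stands in for the external store
total = SchemaAccumulator()
for partition in partitions:
    local = SchemaAccumulator()
    for record in partition:
        written.append(record)  # the "write" the job performs anyway
        local.add(record)       # schema inference rides along in the same pass
    total.merge(local)

full_schema = canonicalize(total.schema)

# Schema inferred from a sample only (here: the first partition).
sampled_schema = canonicalize(reduce(merge_types, map(infer_record, records[:2]), {}))

print(full_schema)
print("rare" in sampled_schema)
```

In this toy data set the "rare" field only occurs in the second partition, so the sampled schema misses it while the accumulated one contains it. The final schema could then be serialized next to the data and supplied explicitly when the JSON is loaded again, which is the third advantage listed in the proposition.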



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
