Posted to issues@spark.apache.org by "David Courtinot (JIRA)" <ji...@apache.org> on 2018/02/23 13:57:00 UTC

[jira] [Created] (SPARK-23494) Expose InferSchema's functionalities to the outside

David Courtinot created SPARK-23494:
---------------------------------------

             Summary: Expose InferSchema's functionalities to the outside
                 Key: SPARK-23494
                 URL: https://issues.apache.org/jira/browse/SPARK-23494
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core, SQL
    Affects Versions: 2.2.1
            Reporter: David Courtinot


I'm proposing that InferSchema's internals (inferring the schema of each record, merging two schemata, and canonicalizing the result) be exposed to the outside.

*Use-case*

We continuously produce large amounts of JSON data. The schema is, and must be, very dynamic: fields can appear and disappear from one day to the next, most fields are nullable, some fields occur only rarely, etc.

We then consume this data, sample it, and infer the schema using Dataset.schema(). From there, we output the data in Parquet for later querying. This approach has proved problematic:
 * rare fields can be absent from a sample, and therefore absent from the schema. This results in exceptions when trying to query those fields. We have had to implement cumbersome fixes for this involving a manually curated set of required fields.
 * this is expensive. Going through a sample of the data to infer the schema is still a very costly operation for us. Caching the JSON RDD to disk (it doesn't fit in memory) proved at least as slow as traversing the sample first and then the whole data set.

*Proposition*

InferSchema is essentially a fold operator. This means a Spark accumulator can easily be built on top of it in order to calculate a schema alongside an RDD calculation. In the above use-case, it has three main advantages:
 * the schema is inferred on the entire data set and therefore contains all possible fields
 * the computational overhead is negligible since it happens at the same time as writing the data to an external store rather than by evaluating the RDD for the sole purpose of schema inference.
 * after writing the manifest to an external store, we can load the JSON data in a Dataset without ever paying the inference cost again (just the conversion from JSON to Row).
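
To illustrate the fold structure described above, here is a minimal pure-Python sketch. It does not use Spark and is not InferSchema's actual code: the names infer_record, merge_schemas, canonicalize and SchemaAccumulator, the toy schema representation, and the widening rules (long with double gives double, any other conflict falls back to string, a never-resolved null becomes string) are all assumptions made for this example. It only shows that per-record inference plus an associative merge is enough to build an accumulator that computes a schema while the data is being processed.

```python
import json
from functools import reduce

# Hypothetical, simplified stand-ins for the three operations named in the
# proposal. This is NOT Spark's InferSchema, only an illustration of the fold.

def infer_record(value):
    """Infer a toy schema for one parsed JSON value."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # must be checked before int: bool is an int subclass
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        element = reduce(merge_schemas, (infer_record(v) for v in value), "null")
        return {"array": element}
    if isinstance(value, dict):
        return {"struct": {k: infer_record(v) for k, v in value.items()}}
    raise TypeError("unsupported JSON value: %r" % (value,))

def merge_schemas(a, b):
    """Merge two toy schemata; "null" acts as the identity element."""
    if a == b:
        return a
    if a == "null":
        return b
    if b == "null":
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        if "struct" in a and "struct" in b:
            fa, fb = a["struct"], b["struct"]
            keys = sorted(set(fa) | set(fb))
            return {"struct": {k: merge_schemas(fa.get(k, "null"), fb.get(k, "null"))
                               for k in keys}}
        if "array" in a and "array" in b:
            return {"array": merge_schemas(a["array"], b["array"])}
    if isinstance(a, str) and isinstance(b, str) and {a, b} == {"long", "double"}:
        return "double"  # assumed numeric widening rule
    return "string"      # assumed fallback for conflicting types

def canonicalize(schema):
    """Replace any type that stayed "null" with "string" (assumed convention)."""
    if schema == "null":
        return "string"
    if isinstance(schema, dict):
        if "struct" in schema:
            return {"struct": {k: canonicalize(v) for k, v in schema["struct"].items()}}
        return {"array": canonicalize(schema["array"])}
    return schema

class SchemaAccumulator:
    """Accumulator-like wrapper: add() per record, merge() across partitions."""
    def __init__(self):
        self.schema = "null"

    def add(self, record):
        self.schema = merge_schemas(self.schema, infer_record(record))

    def merge(self, other):
        self.schema = merge_schemas(self.schema, other.schema)

    @property
    def value(self):
        return canonicalize(self.schema)

# Usage: the schema is built in the same pass that "writes" the records.
lines = [
    '{"id": 1, "tags": ["a"]}',
    '{"id": 2.5, "rare": {"x": null}}',
    '{"id": 3, "rare": {"x": true}}',
]
acc = SchemaAccumulator()
written = []  # stands in for the external store
for line in lines:
    record = json.loads(line)
    acc.add(record)
    written.append(record)
print(acc.value)
```

In this sketch the rare field is picked up because every record is seen, and two accumulators filled on different partitions can be combined with merge() to obtain the same result as a single pass.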

With such a feature, users can treat their JSON (or any other) data as structured data whenever they want, even though the actual schema may vary every ten minutes, as long as they record the schema of each portion of the data.


