Posted to issues@drill.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2017/12/28 22:12:00 UTC

[jira] [Commented] (DRILL-6062) Simplify JSON input format

    [ https://issues.apache.org/jira/browse/DRILL-6062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305778#comment-16305778 ] 

Paul Rogers commented on DRILL-6062:
------------------------------------

h4. Deprecate Schema Changes

Drill (does | does not) support internal schema changes. (We actually don't know whether it does or does not.) On the one hand, much code attempts to handle schema changes, and schema change exceptions are filed as bugs and are occasionally fixed (or at least a fix is attempted). On the other hand, as described below, those attempts often fail.

If we take a step back, we can ask how much schema change Drill *should* support. Perhaps the following should be allowed (see the example after this list):

* Missing values for nullable columns (as long as the actual column is a nullable int, since otherwise Drill can't know the type.)
* Permuted column order.
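For illustration only (the field names are invented, not from the original report), here is the kind of variation that would remain legal. The records permute key order, and the last record omits a nullable column:

{code}
{"name": "alice", "age": 30, "city": "Oslo"}
{"age": 25, "name": "bob", "city": "Lima"}
{"name": "carol", "city": "Kiev"}
{code}

The last record omits {{age}}; per the first bullet, the gap simply becomes a null in a nullable column.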

Code attempts (and often fails) to handle other types of schema change, such as the following (see the example after this list):

* Conflicting data types.
* Missing required columns.
* Conflicting map/scalar, list/scalar or list/map types.
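A made-up example of these conflicts: across the two records below, {{a}} changes scalar type, {{b}} flips between scalar and map, and {{c}} flips between list and scalar. Under the proposal, Drill would fail such a query with a clear error rather than guess:

{code}
{"a": 1,     "b": 10,        "c": [1, 2]}
{"a": "one", "b": {"x": 10}, "c": 3}
{code}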

In theory, Drill is a relational engine. The relational model requires known, fixed column types. Drill is schema-free and so tries to defer knowledge of column types until as late as possible. Further, Drill is distributed, and there is no way for fragment A to know what types fragment B may be reading, so fragment A must "guess" and typically guesses nullable INT.

The proposal here is to deprecate attempts to handle schema change. Redefine support as:

* Permuted column order is supported.
* Missing columns are not supported. (Or are supported only when we provide the user a way to define the type of the missing column, and provide a default value.)
* Type negotiation is allowed only at the reader level.
* Within Drill operators, all batches must have the same set of columns and column types, or the query fails.

h4. Schema Change and JSON

The schema change issue is raised here because JSON, by its very nature, is the leading cause of schema change issues in Drill. The text above discussed how JSON causes schema changes. The problems are multiplied when the scan is distributed: each fragment may read a different version of the application-specific JSON format (with different fields or types), and each must guess the type of missing fields without knowing what other fragments have discovered.

By deprecating schema change handling, it becomes clear that all JSON files must declare their schema via their data layout, and so the layout must be the same in all files. (Use ETL to Parquet if the format changes.)

Or, it becomes clear that to get a consistent schema, the user must declare the schema somehow and the reader must honor that declaration.
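As a rough sketch of the first alternative (schema declared by the data layout itself, with invented field names): the first record of every file carries every column with a concrete, non-null value, so the reader can commit to the column set and types up front, and every later record repeats that layout:

{code}
{"id": 1, "name": "widget", "price": 9.99, "tags": ["new", "sale"]}
{"id": 2, "name": "gadget", "price": 4.50, "tags": ["sale"]}
{code}

Data that cannot be written this way is a candidate for the ETL-to-Parquet route mentioned above.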

> Simplify JSON input format
> --------------------------
>
>                 Key: DRILL-6062
>                 URL: https://issues.apache.org/jira/browse/DRILL-6062
>             Project: Apache Drill
>          Issue Type: Improvement
>            Reporter: Paul Rogers
>
> DRILL-6035 defines the limitations with Drill's 1.12 and 1.13 JSON readers. Many of these limitations are due to the difficulty of mapping arbitrary JSON documents into a relational model. Drill has many ad-hoc, partial solutions, but those do not provide complete, production-quality solutions.
> Solutions for full JSON schema mapping are likely beyond what Drill can (or should) achieve. This ticket suggests we take a different, more realistic approach and simply acknowledge that Parquet is the best format for Drill, while providing minimal (but solid) JSON support.
> h4. Redefine Drill's Target Data Model
> Change the Drill web site to explain that Parquet is Drill's target data model. Drill supports other formats to the degree that they mimic (a subset of) Parquet.
> More specifically:
> * Drill is a relational, columnar engine.
> * Each Drill column must have a single, known data type.
> * Drill arrays cannot contain null values.
> * Drill supports maps (Parquet structs) and repeated maps.
> * Drill assumes that the file schema is the same across all files in a data set.
> As it turns out, this is exactly the Parquet model.
> h4. Redefine Drill's JSON Support
> Given the above, redefine the JSON that Drill supports as JSON that follows the Parquet model. Drill provides no external schema. Instead, the JSON must be structured to provide a single, clear mapping from the JSON to Drill's internal, Parquet-like format, with no ambiguities (see the example after this list):
> * Every file consists of a fixed set of objects.
> * Lists of scalars (without nulls) or objects.
> * Single, consistent type for each name/value pair.
> * No null values. (For key/value pairs, omit the pair if the value is null.)
> * No empty files.
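> To make the rules concrete, here is an invented example of a compliant file: a fixed set of keys, lists of scalars with no nulls, a repeated map ({{orders}}), a single consistent type per key, and a null expressed by omitting the pair ({{middleName}} is simply absent in the second record):
> {code}
> {"name": "alice", "middleName": "ann", "scores": [90, 85], "orders": [{"id": 1, "total": 9.99}]}
> {"name": "bob", "scores": [70], "orders": [{"id": 2, "total": 4.50}, {"id": 3, "total": 1.25}]}
> {code}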
> Of particular concern are files with high "null density": many nulls without declaring a type. Drill cannot effectively support such files.
> h4. External ETL for Non-Compliant JSON
> Rather than either a) invest in JSON mapping, or b) allow queries to fail, Drill should encourage the use of external ETL tools to convert non-compliant JSON into Parquet files. Since most JSON is ad-hoc, created by and for specific applications, this means most JSON should pass through an ETL layer into Parquet before being used with Drill.
> h4. Simplify the JSON Reader
> The JSON reader today attempts to use many partial, ad-hoc fixes to work around some of JSON's ambiguity. These hacks are hard to test and maintain, requiring effort that would be better invested elsewhere. Once we adopt Parquet as the reference format and define the smaller, simpler form of JSON described above, we can remove the hacks (examples of the dropped constructs follow this list):
> * Drop support for unions. (Unions are poorly supported and very complex.)
> * Drop support for the {{ListVector}} (which is, essentially, a list of unions and does not even work.)
> * Drop support for multi-dimensional lists. (These do not have any well-defined mapping to relational tables.)
> * Drop support for leading nulls that span batches. (That is, the type of every value must be revealed within the first batch.)
> * Drop support for empty files. (Drill needs a schema internally. Drill invents a fake schema today, but that just causes a schema change later. If desired, simply ignore such files rather than failing the query.)
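> For contrast, invented examples of constructs the simplified reader would reject: a heterogeneous list that requires a union or {{ListVector}}, a multi-dimensional list, and a column whose first value is a null:
> {code}
> {"a": [1, "two", {"x": 3}]}
> {"b": [[1, 2], [3, 4]]}
> {"c": null}
> {"c": 10}
> {code}
> (The last case remains acceptable only if the non-null value for {{c}} arrives within the first batch, per the rule above.)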
> h4. Implications for Drill 1.13 "Result Set Loader"
> Much work was done to try to extend the result set loader to handle JSON ambiguities. The List Vector, Repeated List Vector and Union Vectors were all implemented, leading to a vast increase in complexity. If we adopt the above, this work can be backed out, resulting in a smaller, more efficient, streamlined core. In short, remove the poorly-supported components used only by JSON, keeping the types and mechanisms needed for Parquet (and Drill's internal operators.)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)