
Posted to dev@drill.apache.org by GitBox <gi...@apache.org> on 2020/03/12 23:13:16 UTC

[GitHub] [drill] paul-rogers opened a new pull request #2023: DRILL-7640: EVF-based JSON Loader

URL: https://github.com/apache/drill/pull/2023

# [DRILL-7640](https://issues.apache.org/jira/browse/DRILL-7640): EVF-based JSON Loader

## Description

Builds on the JSON structure parser and several other PRs to provide an enhanced, robust mechanism to read JSON data into value vectors via the EVF. This is not the JSON reader; rather, it is the "V2" version of the `JsonProcessor`, which does the actual JSON parsing and loading work.

This PR is a partial do-over of an earlier PR, DRILL-6953. This PR contains only the lower-level JSON loader. A new PR for DRILL-6953 will follow, which will add the JSON reader and fix any compatibility issues. Splitting the work into separate PRs reduces the size of each one, making them easier to review and manage.

The concept is that we already have the JSON structure parser (thanks to those who reviewed those PRs!). The structure parser emits *events* to *listeners*. This PR implements the listeners, which use EVF to write values to value vectors.
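The event/listener split can be illustrated with a minimal, self-contained sketch. Everything in it is hypothetical: `ValueListener`, `VarcharColumnListener`, and `parse` are stand-ins invented for illustration and are not the classes in this PR. The "parser" walks pre-tokenized values instead of real JSON, and the "column" is a plain list instead of a value vector.

```java
import java.util.ArrayList;
import java.util.List;

public class ListenerSketch {

  // Hypothetical listener interface: the parser calls one method per scalar event.
  interface ValueListener {
    void onString(String value);
    void onInt(long value);
    void onNull();
  }

  // Stand-in for a listener bound to a VARCHAR column. It collects values in a
  // list; the real listeners write to value vectors through EVF.
  static class VarcharColumnListener implements ValueListener {
    final List<String> values = new ArrayList<>();

    @Override public void onString(String value) { values.add(value); }
    @Override public void onInt(long value) { values.add(Long.toString(value)); }
    @Override public void onNull() { values.add(null); }
  }

  // Stand-in for the structure parser: emits one event per token and knows
  // nothing about how the listener stores the value.
  static void parse(Object[] tokens, ValueListener listener) {
    for (Object token : tokens) {
      if (token == null) {
        listener.onNull();
      } else if (token instanceof String) {
        listener.onString((String) token);
      } else if (token instanceof Number) {
        listener.onInt(((Number) token).longValue());
      } else {
        throw new IllegalArgumentException("Unsupported token: " + token);
      }
    }
  }

  public static void main(String[] args) {
    VarcharColumnListener listener = new VarcharColumnListener();
    parse(new Object[] {"foo", 10, null}, listener);
    System.out.println(listener.values); // prints [foo, 10, null]
  }
}
```

The point of the split is that the parser side stays unaware of column types, while each listener decides how to convert and store the values it receives.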

Some new features provided by the JSON loader include:

* Built-in conversion from (almost) any JSON type to (almost) any other type. If a field starts as a String and then shifts to a Number, Drill will continue to write the value as `VARCHAR`.
* Allow runs of nulls (`null`) and/or empty arrays (`[ ]`). Defer type selection until the first actual value appears. (Or, force selection of `Nullable VARCHAR` if the batch ends without seeing any type.)
* An array with null entries `[ null ]` forces type selection to `Nullable VARCHAR` since we must count the null entries.
* Support for a provided schema. If the input is ambiguous, or inconsistent, the provided schema states the desired column type; the JSON loader converts data to that type.
* The provided schema allows "text mode" selection per column. The JSON loader supports the "all text mode" setting from before, but now also allows "text mode" only for those columns that are a problem.
* The provided schema also allows a new "JSON mode": the field, and all its children, are read as JSON text. That is, if the JSON is `{a: {b: ["foo", 10, true]}}`, and column `a` is read as JSON, then the value of `a` is `"{b: ["foo", 10, true]}"`.
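
The deferred type selection described in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the loader's implementation: `DeferredColumn`, `ColType`, and the method names are hypothetical, values are kept as strings in a list rather than in value vectors, and the case of a string arriving in a column already resolved to `BIGINT` is deliberately left unmodeled.

```java
import java.util.ArrayList;
import java.util.List;

public class DeferredTypeSketch {

  // Hypothetical column types for the sketch; UNKNOWN means "not yet chosen".
  enum ColType { UNKNOWN, BIGINT, VARCHAR }

  static class DeferredColumn {
    ColType type = ColType.UNKNOWN;
    int pendingNulls = 0;                  // leading nulls seen before any type is known
    final List<String> values = new ArrayList<>();

    void writeNull() {
      if (type == ColType.UNKNOWN) {
        pendingNulls++;                    // defer: just count the null
      } else {
        values.add(null);
      }
    }

    void writeLong(long v) {
      if (type == ColType.UNKNOWN) {
        resolve(ColType.BIGINT);           // first real value picks the type
      }
      // In a VARCHAR column the number is stored as text, matching the
      // "continue to write the value as VARCHAR" behavior described above.
      values.add(Long.toString(v));
    }

    void writeString(String v) {
      if (type == ColType.UNKNOWN) {
        resolve(ColType.VARCHAR);
      } else if (type == ColType.BIGINT) {
        // Assumption: this sketch does not model string-to-number conversion.
        throw new IllegalStateException("string into BIGINT not modeled in this sketch");
      }
      values.add(v);
    }

    // At the end of the batch, a column that saw only nulls falls back to VARCHAR.
    void endBatch() {
      if (type == ColType.UNKNOWN) {
        resolve(ColType.VARCHAR);
      }
    }

    // Fix the type and back-fill the nulls that were counted while deferring.
    private void resolve(ColType chosen) {
      type = chosen;
      for (int i = 0; i < pendingNulls; i++) {
        values.add(null);
      }
      pendingNulls = 0;
    }
  }

  public static void main(String[] args) {
    DeferredColumn col = new DeferredColumn();
    col.writeNull();
    col.writeNull();
    col.writeString("a");
    col.writeLong(10);
    System.out.println(col.type + " " + col.values); // prints VARCHAR [null, null, a, 10]
  }
}
```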

We can expect a number of tweaks and adjustments to be needed in a later PR when we get the existing tests to pass with the new JSON format plugin. The goal here is to simply get the bulk of the work reviewed separately.

To be clear, in this PR, nothing other than unit tests uses this new code.

This PR contains a few changes which duplicate those in the PR for DRILL-7633. Once DRILL-7633 is merged, this PR will be rebased and the overlapping changes will disappear.

## Documentation

Once the JSON format plugin PR is submitted, users can use the provided column properties mentioned above. This PR does not yet expose them, so no documentation is needed yet.

## Testing

Added unit tests for all of the cases and features described above. The tests also cover failure cases, such as JSON with inconsistent types that the JSON loader cannot handle.
