Posted to issues@drill.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2019/11/30 23:38:00 UTC
[jira] [Commented] (DRILL-6953) Merge row set-based JSON reader

    [ https://issues.apache.org/jira/browse/DRILL-6953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16985452#comment-16985452 ] 

ASF GitHub Bot commented on DRILL-6953:
---------------------------------------

paul-rogers commented on pull request #1913: DRILL-6953: EVF-based version of the JSON reader
URL: https://github.com/apache/drill/pull/1913
 
 
   Reimplements the JSON reader on top of the EVF. Does not yet
   handle a provided schema. New JSON parser does not yet reflect
   any changes made to the "V1" JSON parser in the last year.
   Does not yet handle the Union and List-of-union types:
   enabling those exposed many issues elsewhere in Drill.
   
   Provides more robust (but still limited) handling of JSON
   type ambiguities. Handles runs of nulls before the first
   non-null value (within the first batch). Handles runs of
   empty arrays before the first non-empty array (again, within
   the first batch). Handles the case where a null value turns out
   to be an object or array. Handles reasonable conversions between
   types.
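
A minimal sketch of the "run of nulls" idea, not the code in this PR: the column leaves its type undecided while only nulls have been seen, and fixes it at the first non-null value. The class `DeferredTypeColumn` and its `Kind` enum are hypothetical names invented for illustration. Unlike the V2 parser, this sketch performs no type conversions and simply rejects a later type change.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical column buffer that defers its type until the first non-null value. */
final class DeferredTypeColumn {
    enum Kind { UNKNOWN, BOOLEAN, NUMBER, STRING }

    private Kind kind = Kind.UNKNOWN;
    private final List<Object> values = new ArrayList<>();

    /** Appends one value; a null never changes the column type. */
    void write(Object value) {
        if (value != null) {
            Kind seen = kindOf(value);
            if (kind == Kind.UNKNOWN) {
                // The first non-null value fixes the type; earlier nulls stay as nulls.
                kind = seen;
            } else if (kind != seen) {
                // Simplification: a real reader might convert instead of failing.
                throw new IllegalStateException("Type changed from " + kind + " to " + seen);
            }
        }
        values.add(value);
    }

    Kind kind() { return kind; }
    int size() { return values.size(); }
    Object get(int index) { return values.get(index); }

    private static Kind kindOf(Object value) {
        if (value instanceof Boolean) return Kind.BOOLEAN;
        if (value instanceof Number) return Kind.NUMBER;
        if (value instanceof String) return Kind.STRING;
        throw new IllegalArgumentException("Unsupported value: " + value.getClass());
    }
}
```

Writing `null`, `null`, then `10L` leaves the column typed as `NUMBER` with the two leading nulls preserved. If the batch ends while the kind is still `UNKNOWN`, the caller has to guess a type, which is why the handling is limited to the first batch.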
   
   Handling ambiguities makes the new parser more complex than
   the "V1" version. The new one uses explicit states for each
   kind of JSON object, whereas the old one used implicit states
   expressed via if-statements, which can be hard to follow
   as the states get more complex.
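
To illustrate what "explicit states" means in general, here is a small sketch in which each state owns its own transition rule, rather than one method branching on flags. The `Token` and `FieldState` enums are hypothetical and far simpler than the parser in this PR; they model only how a single field moves from "unknown" to a concrete kind.

```java
/** Hypothetical token kinds; a real parser would get these from its JSON tokenizer. */
enum Token { NULL, VALUE, OBJECT_START, ARRAY_START }

/** Hypothetical per-field states; each constant defines its own transitions. */
enum FieldState {
    UNKNOWN {
        @Override FieldState next(Token token) {
            switch (token) {
                case NULL:         return UNKNOWN;  // still ambiguous
                case VALUE:        return SCALAR;
                case OBJECT_START: return OBJECT;
                default:           return ARRAY;    // ARRAY_START
            }
        }
    },
    SCALAR {
        @Override FieldState next(Token token) {
            return (token == Token.NULL || token == Token.VALUE) ? SCALAR : reject(this, token);
        }
    },
    OBJECT {
        @Override FieldState next(Token token) {
            return (token == Token.NULL || token == Token.OBJECT_START) ? OBJECT : reject(this, token);
        }
    },
    ARRAY {
        @Override FieldState next(Token token) {
            return (token == Token.NULL || token == Token.ARRAY_START) ? ARRAY : reject(this, token);
        }
    };

    /** Returns the state of the field after seeing the given token. */
    abstract FieldState next(Token token);

    static FieldState reject(FieldState state, Token token) {
        throw new IllegalStateException("Unexpected " + token + " in state " + state);
    }
}
```

Adding a new case, such as an empty-array state, then means adding one constant with its own transitions instead of threading another condition through existing if-statements.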
   
   The new "V2" JSON scan is controlled by a new option:
   store.json.enable_v2_reader, which is false by default in this
   PR.
   
   Adds a "projection type" to the column writer so that the
   JSON parser can receive a "hint" as to the expected type.
   The hint is from the form of the projected column: `a[0]`,
   `a.b` or just `a`.
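
As a rough illustration of how such a hint could be derived from the projected column's form, the sketch below classifies a path string by whether its first qualifier is an array index or a member reference. `ProjectionHint` and `ProjectionHints.hintFor` are hypothetical names, not the API added by this PR, and the string matching is a deliberate simplification of real projection parsing.

```java
/** Hypothetical hint values; not Drill's actual projection types. */
enum ProjectionHint { ARRAY, TUPLE, UNSPECIFIED }

final class ProjectionHints {
    private ProjectionHints() {}

    /** Infers a hint for the root column from the form of one projected path. */
    static ProjectionHint hintFor(String projectedPath) {
        String path = projectedPath.trim();
        int dot = path.indexOf('.');
        int bracket = path.indexOf('[');
        if (dot < 0 && bracket < 0) {
            return ProjectionHint.UNSPECIFIED;   // just `a`: no hint available
        }
        if (bracket >= 0 && (dot < 0 || bracket < dot)) {
            return ProjectionHint.ARRAY;         // `a[0]`: root column is indexed
        }
        return ProjectionHint.TUPLE;             // `a.b`: root column has members
    }
}
```

With a hint like this, a parser that sees only nulls for `a` can still choose an array or object writer when the query projected `a[0]` or `a.b`.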
   
   Reimplements a number of JSON tests to test both the original
   "V1" and the new "V2" versions of the JSON reader. Adds many
   new tests for the new features of the "V2" parser.
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Merge row set-based JSON reader
> -------------------------------
>
>                 Key: DRILL-6953
>                 URL: https://issues.apache.org/jira/browse/DRILL-6953
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.15.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Major
>             Fix For: Future
>
>
> The final step in the ongoing "result set loader" saga is to merge the revised JSON reader into master. This reader does three key things:
> * Demonstrates the prototypical "late schema" style of data reading (discover schema while reading).
> * Implements many tricks and hacks to handle schema changes while loading.
> * Shows that, even with all these tricks, the only true solution is to actually have a schema.
> The new JSON reader:
> * Uses an expanded state machine when parsing rather than the complex set of if-statements in the current version.
> * Handles reading a run of nulls before seeing the first data value (as long as the data value shows up in the first record batch).
> * Uses the result-set loader to generate fixed-size batches regardless of the complexity, depth of structure, or width of variable-length fields.
> While the JSON reader itself is helpful, the key contribution is that it shows how to use the entire kit of parts: result set loader, projection framework, and so on. Since the projection framework can handle an external schema, it is also a handy foundation for the ongoing schema project.
> Key work to complete after this merge will be to reconcile actual data with the external schema. For example, if we know a column is supposed to be a VarChar, then read the column as a VarChar regardless of the type JSON itself picks. Or, if a column is supposed to be a Double, then convert Int and String JSON values into Doubles.
> The Row Set framework was designed to allow inserting custom column writers. This would be a great opportunity to do the work needed to create them. Then, use the new JSON framework to allow parsing a JSON field as a specified Drill type.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)