Posted to issues@drill.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2017/05/19 00:48:04 UTC

[jira] [Commented] (DRILL-4824) JSON with complex nested data produces incorrect output with missing fields

    [ https://issues.apache.org/jira/browse/DRILL-4824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016738#comment-16016738 ] 

Paul Rogers commented on DRILL-4824:
------------------------------------

Additional thoughts as we look at this bug again.

The problem is not in the reader itself; it is in how Drill represents JSON.

To fix this, we’d have to allow multiple null states. To do that, we’d have to adjust how we represent nulls, which has its own set of issues. See earlier comments.

Today, the “isSet” (bit) vector is 0 for null, 1 for set. To allow multiple null states, we need semantics that say 0 = set and non-zero = null. Then 0x01 is plain old null, and 0x03 could indicate null-and-unset.
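
For concreteness, one possible encoding of that state byte (a rough sketch; the names are made up, not existing Drill code):

{code:java}
// A sketch of the proposed two-flag null state. 0x00 = value present,
// 0x01 = explicit null, 0x03 = null because the field was never written.
public final class NullState {
  public static final byte SET            = 0x00;
  public static final byte NULL_VALUE     = 0x01;
  public static final byte NULL_AND_UNSET = 0x03;

  public static boolean isNull(byte state)  { return state != SET; }
  public static boolean isUnset(byte state) { return (state & 0x02) != 0; }

  private NullState() { }
}
{code}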

Then the reader (actually the mutator) would have to fill in the proper null value for missing fields. That is, when we write record 100 (say), we’d notice that we’ve not written a value for column x since record 95, so we’d fill in the “missing” values with the null-and-unset state.
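
A rough sketch of that back-fill, with plain arrays standing in for the real vectors (the class is illustrative, not the actual mutator API):

{code:java}
// Sketch of the back-fill idea, using plain arrays in place of Drill's value
// vectors; the class and field names are illustrative, not the real mutator.
class NullableColumnWriter {
  private final byte[] bits;    // per-record null state (see NullState above)
  private final byte[][] data;  // per-record value
  private int lastWritten = -1;

  NullableColumnWriter(int capacity) {
    bits = new byte[capacity];
    data = new byte[capacity][];
  }

  void write(int recordIndex, byte[] value) {
    // Any record skipped since the last write gets the null-and-unset marker.
    for (int i = lastWritten + 1; i < recordIndex; i++) {
      bits[i] = NullState.NULL_AND_UNSET;
    }
    bits[recordIndex] = NullState.SET;
    data[recordIndex] = value;
    lastWritten = recordIndex;
  }
}
{code}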

Today, we can just rely on the default value of 0 to indicate null. But for variable-width columns we already have to back-fill the offset vectors, so we could apply the same logic to all nullable types.
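
Roughly, the offset back-fill would look like this (again just a sketch, against today’s start-pointer layout):

{code:java}
// For a variable-width column, the offset vector must be back-filled too:
// each skipped record gets a zero-length entry by repeating the previous end
// offset. Illustrative only; "offsets" uses today's start-pointer layout in
// which value i occupies bytes offsets[i] .. offsets[i + 1].
final class OffsetBackFill {
  static void fill(int[] offsets, int lastWritten, int recordIndex) {
    int lastEnd = offsets[lastWritten + 1];
    for (int i = lastWritten + 1; i < recordIndex; i++) {
      offsets[i + 1] = lastEnd;   // zero-length entry for the missing record
    }
  }
}
{code}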

Once we have the two forms of null flags, the JSON writer can do the right thing. If the value is plain null, emit the field with a null value ("foo": null). If it is null-and-unset, skip emitting the field.
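
Sketching the writer side (Jackson’s JsonGenerator stands in here for whatever writer we actually use; the method is illustrative):

{code:java}
import java.io.IOException;
import com.fasterxml.jackson.core.JsonGenerator;

// Sketch of the writer-side decision once both flags exist.
class FieldEmitter {
  void writeField(JsonGenerator gen, String name, byte state, String value)
      throws IOException {
    if (state == NullState.NULL_AND_UNSET) {
      return;                     // field was never present: omit it
    }
    if (state == NullState.NULL_VALUE) {
      gen.writeNullField(name);   // emit "name": null
    } else {
      gen.writeStringField(name, value);
    }
  }
}
{code}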

The result is that we should be able to scan a JSON file, then CTAS it, and get semantically the same output as the input (without removing null fields and without inserting nulls for missing fields — our only two choices today).

The work to prevent memory fragmentation is creating a new "size-aware" mutator (vector writer). We can easily extend that work to handle the two null cases.

But, the big project is changing the “polarity” of null: doing so requires inspecting all code.

One other related improvement has to do with variable-width columns. Today we have an inefficiency: we need the data vector, the offset vector, and the null (bit) vector. As a result of my changes, no vector can be larger than 16 MB, which means no offset can be larger than 16 MB. This is 0xFD_8000. We store offsets as ints, with a maximum value of 0xFFFF_FFFF. This means we can play a very simple trick: use bits 29 and 30 (or 30 and 31 if we don’t mind negatives) to hold the bit flags and simply omit the bit vector. That immediately saves 64K of memory per nullable VarChar per batch.
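
Something along these lines (bit positions as suggested above; the masks are illustrative only):

{code:java}
// Sketch of folding the two flags into the upper bits of each 32-bit offset
// word, which works because offsets stay well below 16 MB.
final class PackedOffset {
  static final int OFFSET_MASK = 0x1FFF_FFFF;  // low 29 bits hold the offset
  static final int NULL_FLAG   = 1 << 29;      // value is null
  static final int UNSET_FLAG  = 1 << 30;      // field was never written

  static int pack(int offset, boolean isNull, boolean isUnset) {
    return (offset & OFFSET_MASK)
        | (isNull  ? NULL_FLAG  : 0)
        | (isUnset ? UNSET_FLAG : 0);
  }

  static int offset(int word)      { return word & OFFSET_MASK; }
  static boolean isNull(int word)  { return (word & NULL_FLAG) != 0; }
  static boolean isUnset(int word) { return (word & UNSET_FLAG) != 0; }
}
{code}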

And if we change the offset vectors, we should change the semantics from “store the start pointer” to “store the end pointer.” That is, instead of:

[0, 10, 20, 30]

To store three 10-byte strings, use:

[10, 20, 30]
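
With end pointers, start and length fall out like this (illustrative helper, not Drill code):

{code:java}
// With end-pointer semantics, the start of value i is the end of value i - 1
// (or 0 for the first value), so the leading 0 entry is no longer needed.
final class EndOffsets {
  static int start(int[] ends, int i)  { return i == 0 ? 0 : ends[i - 1]; }
  static int length(int[] ends, int i) { return ends[i] - start(ends, i); }
}
// For ends = {10, 20, 30}: value 0 spans bytes [0, 10), value 1 spans
// [10, 20), value 2 spans [20, 30) -- the same three 10-byte strings as above.
{code}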

So we save four bytes, no big deal, right? Actually, Boaz realized that we get hit by the power-of-two rounding. He has hash tables of 64K entries. Because we need 64K + 1 entries in the offset vector, we actually allocate 128K offsets, resulting in a waste of 256K to store those extra four bytes in a 64K batch.
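
Spelling out the arithmetic (a worked example, nothing more):

{code:java}
// Worked arithmetic for the rounding example above.
public class RoundingWaste {
  public static void main(String[] args) {
    int needed    = 64 * 1024 + 1;   // 65,537 offsets for 64K hash entries
    int allocated = 128 * 1024;      // rounded up to the next power of two
    System.out.println((allocated - needed) * 4 + " bytes wasted");  // ~256 KB
  }
}
{code}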

All this points out that the JSON fix is not trivial; that’s why the original PR didn’t make progress. We have to fix some fundamentals first to lay the groundwork.

> JSON with complex nested data produces incorrect output with missing fields
> ---------------------------------------------------------------------------
>
>                 Key: DRILL-4824
>                 URL: https://issues.apache.org/jira/browse/DRILL-4824
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - JSON
>    Affects Versions: 1.0.0
>            Reporter: Roman
>            Assignee: Volodymyr Vysotskyi
>
> There is incorrect output in case of JSON file with complex nested data.
> _JSON:_
> {code:none|title=example.json|borderStyle=solid}
> {
>         "Field1" : {
>         }
> }
> {
>         "Field1" : {
>                 "InnerField1": {"key1":"value1"},
>                 "InnerField2": {"key2":"value2"}
>         }
> }
> {
>         "Field1" : {
>                 "InnerField3" : ["value3", "value4"],
>                 "InnerField4" : ["value5", "value6"]
>         }
> }
> {code}
> _Query:_
> {code:sql}
> select Field1 from dfs.`/tmp/example.json`
> {code}
> _Incorrect result:_
> {code:none}
> +---------------------------+
> |          Field1           |
> +---------------------------+
> {"InnerField1":{},"InnerField2":{},"InnerField3":[],"InnerField4":[]}
> {"InnerField1":{"key1":"value1"},"InnerField2" {"key2":"value2"},"InnerField3":[],"InnerField4":[]}
> {"InnerField1":{},"InnerField2":{},"InnerField3":["value3","value4"],"InnerField4":["value5","value6"]}
> +---------------------------+
> {code}
> There is no need to output missing fields. In the case of a deeply nested structure, the result becomes unreadable for the user.
> _Correct result:_
> {code:none}
> +--------------------------+
> |         Field1           |
> +--------------------------+
> {}
> {"InnerField1":{"key1":"value1"},"InnerField2":{"key2":"value2"}}
> {"InnerField3":["value3","value4"],"InnerField4":["value5","value6"]}
> +--------------------------+
> {code}


