You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2017/02/08 00:49:42 UTC
[jira] [Comment Edited] (DRILL-4824) JSON with complex nested data produces incorrect output with missing fields

    [ https://issues.apache.org/jira/browse/DRILL-4824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15857131#comment-15857131 ] 

Paul Rogers edited comment on DRILL-4824 at 2/8/17 12:49 AM:
-------------------------------------------------------------

[~sharnyk], with your change, the expected result would be the following (ignore spacing):

{code}
{ }
{ "Field1" : { "InnerField2": {"key2":"value2"} } }
{code}

What the user probably actually expects is:

{code}
{ "Field1" : { } }
{ "Field1" : { "InnerField2": {"key2":"value2"} } }
{code}

To understand why, we need to review why Drill violates the JSON standard. The following four cases area all different in JSON:

{code}
{ "foo": null }
{ "foo": {} }
{ "foo": [] }
{ }
{code}

That is, a field can be:

* Null
* A map (possibly empty)
* An array (possibly empty)
* Not present

We do not support:

* Null map (not a JSON concept, in JSON anything null is still null.)
* Null array (as above)
* Not present

So we try to force fit these cases into what we do support. It used to be that we handle "not present" to mean "present but null or empty." Your change says that an empty array or empty map is the same as not present. And, of course, we still don't support null maps or arrays. We treat them as empty, and so, now, as not present.

This is quite a difficult situation!

So, we seem to be adding rules that says:

* If a map or array is empty, omit it from the output.
* If a map or array contains only omitted members, then omit the outer map or array.

This means that all the following inputs produce the same empty output:

{code}
{ }
{ "field1": { } }
{ "field1": { "field1a": { } }
{ "field1": { "field1a": { "field1aA": { } } }
{ "field1": { "field1b": [ ] }
{ "field1": { "field1c": null }
{ "field2": [ ] }
{ "field2": [ { } ] }
{ "field2": [ { }, { } ] }
{ "field3": [ null ] }
{ "field4": null }
{code}

Your fix might help some users, but it is not proper JSON, so it will end up breaking behavior for other users.

One way to resolve this is with a session option. This is not a proper solution, it just pushes the problem onto the user.

{code}
ALTER SESSION SET something.json.hide-empty = true
{code}

Provides your behavior, while:

{code}
ALTER SESSION SET something.json.hide-empty = false
{code}

Keeps the old behavior.

What we really need is:

{code}
ALTER SESSION SET something.json.follow-the-standard = true
{code}

But, sadly, following the standard requires work far beyond the scope of this bug fix.


was (Author: paul-rogers):
[~sharnyk], with your change, the expected result would be the following (ignore spacing):

{code}
{ "Field1" : { } }
{ "Field1" : { "InnerField2": {"key2":"value2"} } }
{code}

What the user probably actually expects is:

{code}
{ }
{ "Field1" : { "InnerField2": {"key2":"value2"} } }
{code}

To understand why, we need to review why Drill violates the JSON standard. The following four cases area all different in JSON:

{code}
{ "foo": null }
{ "foo": {} }
{ "foo": [] }
{ }
{code}

That is, a field can be:

* Null
* A map (possibly empty)
* An array (possibly empty)
* Not present

We do not support:

* Null map (not a JSON concept, in JSON anything null is still null.)
* Null array (as above)
* Not present

So we try to force fit these cases into what we do support. It used to be that we handle "not present" to mean "present but null or empty." Your change says that an empty array or empty map is the same as not present. And, of course, we still don't support null maps or arrays. We treat them as empty, and so, now, as not present.

This is quite a difficult situation!

So, we seem to be adding rules that says:

* If a map or array is empty, omit it from the output.
* If a map or array contains only omitted members, then omit the outer map or array.

This means that all the following inputs produce the same empty output:
{ }
{ "field1": { } }
{ "field1": { "field1a": { } }
{ "field1": { "field1a": { "field1aA": { } } }
{ "field1": { "field1b: [ ] }
{ "field1": { "field1c": null }
{ "field2": [ ] }
{ "field2": [ { } ] }
{ "field2": [ { }, { } ] }
{ "field3": [ null ] }
{ "field4": null }
{code}

Your fix might help some users, but it is not proper JSON, so it will end up breaking behavior for other users.

One way to resolve this is with a session option. This is not a proper solution, it just pushes the problem onto the user.

{code}
ALTER SESSION SET something.json.hide-empty = true
{code}

Provides your behavior, while:

{code}
ALTER SESSION SET something.json.hide-empty = false
{code}

Keeps the old behavior.

What we really need is:

{code}
ALTER SESSION SET something.json.follow-the-standard = true
{code}

But, sadly, following the standard requires work far beyond the scope of this bug fix.

> JSON with complex nested data produces incorrect output with missing fields
> ---------------------------------------------------------------------------
>
>                 Key: DRILL-4824
>                 URL: https://issues.apache.org/jira/browse/DRILL-4824
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - JSON
>    Affects Versions: 1.0.0
>            Reporter: Roman
>            Assignee: Serhii Harnyk
>
> There is incorrect output in case of JSON file with complex nested data.
> _JSON:_
> {code:none|title=example.json|borderStyle=solid}
> {
>         "Field1" : {
>         }
> }
> {
>         "Field1" : {
>                 "InnerField1": {"key1":"value1"},
>                 "InnerField2": {"key2":"value2"}
>         }
> }
> {
>         "Field1" : {
>                 "InnerField3" : ["value3", "value4"],
>                 "InnerField4" : ["value5", "value6"]
>         }
> }
> {code}
> _Query:_
> {code:sql}
> select Field1 from dfs.`/tmp/example.json`
> {code}
> _Incorrect result:_
> {code:none}
> +---------------------------+
> |          Field1           |
> +---------------------------+
> {"InnerField1":{},"InnerField2":{},"InnerField3":[],"InnerField4":[]}
> {"InnerField1":{"key1":"value1"},"InnerField2" {"key2":"value2"},"InnerField3":[],"InnerField4":[]}
> {"InnerField1":{},"InnerField2":{},"InnerField3":["value3","value4"],"InnerField4":["value5","value6"]}
> +--------------------------+
> {code}
> Theres is no need to output missing fields. In case of deeply nested structure we will get unreadable result for user.
> _Correct result:_
> {code:none}
> +--------------------------+
> |         Field1           |
> +--------------------------+
> |{}                                                                     
> {"InnerField1":{"key1":"value1"},"InnerField2":{"key2":"value2"}}
> {"InnerField3":["value3","value4"],"InnerField4":["value5","value6"]}
> +--------------------------+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)