Posted to issues@drill.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2017/07/09 00:21:01 UTC

[jira] [Commented] (DRILL-4824) Null maps / lists and non-provided state support for JSON fields. Numeric types promotion.

    [ https://issues.apache.org/jira/browse/DRILL-4824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16079386#comment-16079386 ] 

Paul Rogers commented on DRILL-4824:
------------------------------------

Turns out there is a simple (though inefficient) solution to the null/is-set issue: just add another "bits" vector.

The existing "bits" vector indicates if the value is set (really, is set and is not null). Add another vector which identifies whether a null value was explicitly set to null or is unset. This alternative can be backward compatible, but as a result, the semantics are rather convoluted.

The "bits" vector remains 1 if the (non-null) value is set, 0 if the value is null (which, in current Drill, is the same as not set.)

The new "bits2" vector is 1 if the value is JSON-unset, 0 if the value is explicitly null. It is meaningful only where "bits" is 0.

Here "JSON-unset" means that JSON-aware operators should consider the value to be unset. All JSON-unaware operators just look at the existing "bits" vector for the combined unset/null state.

In short:

|| State || bits value || bits2 value ||
| Set to non-NULL value | 1 | N/A |
| Explicitly null | 0 | 0 |
| Explicitly unset | 0 | 1 |
| Drill NULL | 0 | N/A |

Here, "Drill null" means the existing Drill meaning of NULL: unset or explicitly null.
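
As a rough illustration of the encoding, here is a minimal Python sketch (not Drill code; the names State, decode and is_drill_null are hypothetical). It follows the prose definitions above: "bits" is 0 for both null and unset, and "bits2" is consulted only when "bits" is 0.

```python
from enum import Enum


class State(Enum):
    VALUE = "set to non-null value"
    NULL = "explicitly null"
    UNSET = "explicitly unset"


def decode(bits, bits2):
    """JSON-aware view: map one (bits, bits2) pair to a three-way state."""
    if bits == 1:
        return State.VALUE  # bits2 is not consulted for non-null values
    return State.UNSET if bits2 == 1 else State.NULL


def is_drill_null(bits):
    """JSON-unaware view: only "bits" matters, so null and unset look alike."""
    return bits == 0
```

A JSON-unaware operator would call only is_drill_null; the JSON reader and writer would use decode.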

Given this definition, existing code (including the JDBC drivers) can just ignore "bits2" and work fine. Only the JSON reader, and the JSON writer, will know how to interpret "bits2." With the definition above, a missing "bits2" can be interpreted as if "bits2" were present, but filled with zeros.
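
The backward-compatibility rule can be sketched the same way. This is an illustrative Python fragment under the assumption that each vector is a simple list of 0/1 entries; json_states is a hypothetical helper, not an existing Drill method.

```python
def json_states(bits, bits2=None):
    """Return the JSON-aware state of each value.

    A missing bits2 (for example, data produced by older code) is treated
    as if it were present and filled with zeros, so every Drill NULL reads
    as an explicit null.
    """
    if bits2 is None:
        bits2 = [0] * len(bits)
    states = []
    for b, b2 in zip(bits, bits2):
        if b == 1:
            states.append("value")
        elif b2 == 1:
            states.append("unset")
        else:
            states.append("null")
    return states
```

For example, json_states([1, 0, 0]) treats both NULL entries as explicit nulls, while json_states([1, 0, 0], [0, 0, 1]) reports the last entry as unset.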

The above is, admittedly, a hack. (Recall that the "bits" vector isn't really bits: each entry is actually a byte, so we'd now be using 16 bits to encode three states, which is a huge waste.)

We'd still want to move to the full solution explained earlier. To do that, we'd want to ensure that all accesses to "bits" and "bits2" occur through methods on the vector classes. Once this is done, we can swap out implementations for the more compact, single-vector version. (We'd also need a solution for older JDBC drivers already deployed in the field: this gets us back to the client version number issue...)
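
To make the accessor idea concrete, here is a hedged Python sketch of two interchangeable layouts behind the same methods. Both classes and all method names are hypothetical, and the packed layout (2 bits per value) is only an assumption about what a compact single-vector version might look like; it is not taken from Drill or from the earlier proposal.

```python
class TwoVectorState:
    """Hack layout: one byte per value in each of "bits" and "bits2"."""

    def __init__(self, n):
        self.bits = bytearray(n)   # 1 = non-null value, 0 = Drill NULL
        self.bits2 = bytearray(n)  # 1 = JSON-unset, 0 = explicitly null

    def mark_value(self, i):
        self.bits[i], self.bits2[i] = 1, 0

    def mark_null(self, i):
        self.bits[i], self.bits2[i] = 0, 0

    def mark_unset(self, i):
        self.bits[i], self.bits2[i] = 0, 1

    def is_value(self, i):
        return self.bits[i] == 1

    def is_unset(self, i):
        return self.bits[i] == 0 and self.bits2[i] == 1


class PackedState:
    """Assumed compact layout: 2 bits per value, four values per byte."""

    NULL, VALUE, UNSET = 0, 1, 2

    def __init__(self, n):
        self.data = bytearray((n + 3) // 4)

    def _get(self, i):
        return (self.data[i // 4] >> (2 * (i % 4))) & 0b11

    def _put(self, i, code):
        shift = 2 * (i % 4)
        cleared = self.data[i // 4] & ~(0b11 << shift)
        self.data[i // 4] = cleared | (code << shift)

    def mark_value(self, i):
        self._put(i, self.VALUE)

    def mark_null(self, i):
        self._put(i, self.NULL)

    def mark_unset(self, i):
        self._put(i, self.UNSET)

    def is_value(self, i):
        return self._get(i) == self.VALUE

    def is_unset(self, i):
        return self._get(i) == self.UNSET
```

Callers that use only the mark_* and is_* methods never see the storage, so the two-vector hack could later be replaced by the packed form without touching them.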

> Null maps / lists and non-provided state support for JSON fields. Numeric types promotion.
> ------------------------------------------------------------------------------------------
>
>                 Key: DRILL-4824
>                 URL: https://issues.apache.org/jira/browse/DRILL-4824
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - JSON
>    Affects Versions: 1.0.0
>            Reporter: Roman
>            Assignee: Volodymyr Vysotskyi
>
> There is incorrect output in case of JSON file with complex nested data.
> _JSON:_
> {code:none|title=example.json|borderStyle=solid}
> {
>         "Field1" : {
>         }
> }
> {
>         "Field1" : {
>                 "InnerField1": {"key1":"value1"},
>                 "InnerField2": {"key2":"value2"}
>         }
> }
> {
>         "Field1" : {
>                 "InnerField3" : ["value3", "value4"],
>                 "InnerField4" : ["value5", "value6"]
>         }
> }
> {code}
> _Query:_
> {code:sql}
> select Field1 from dfs.`/tmp/example.json`
> {code}
> _Incorrect result:_
> {code:none}
> +---------------------------+
> |          Field1           |
> +---------------------------+
> {"InnerField1":{},"InnerField2":{},"InnerField3":[],"InnerField4":[]}
> {"InnerField1":{"key1":"value1"},"InnerField2":{"key2":"value2"},"InnerField3":[],"InnerField4":[]}
> {"InnerField1":{},"InnerField2":{},"InnerField3":["value3","value4"],"InnerField4":["value5","value6"]}
> +--------------------------+
> {code}
> There is no need to output missing fields. In the case of a deeply nested structure, the result becomes unreadable for the user.
> _Correct result:_
> {code:none}
> +--------------------------+
> |         Field1           |
> +--------------------------+
> {}
> {"InnerField1":{"key1":"value1"},"InnerField2":{"key2":"value2"}}
> {"InnerField3":["value3","value4"],"InnerField4":["value5","value6"]}
> +--------------------------+
> {code}


