You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by KulykRoman <gi...@git.apache.org> on 2016/09/06 08:58:39 UTC

[GitHub] drill pull request #580: DRILL-4824: JSON with complex nested data produces ...

GitHub user KulykRoman opened a pull request:

    https://github.com/apache/drill/pull/580

    DRILL-4824: JSON with complex nested data produces incorrect output w\u2026

    \u2026ith missing fields

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/KulykRoman/drill DRILL-4824

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/580.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #580
    
----
commit 43919c0f8e447a8d76d8726fe72cd3c10edf78b4
Author: Roman Kulyk <ro...@gmail.com>
Date:   2016-08-03T12:08:40Z

    DRILL-4824: JSON with complex nested data produces incorrect output with missing fields
    
    - Added changes to skip empty Lists or Maps.
    - Changed TestJsonReader.testSplitAndTransferFailure() and TestJsonReader.testFieldSelectionBug() according to new output logic.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #580: DRILL-4824: JSON with complex nested data produces ...

Posted by Serhii-Harnyk <gi...@git.apache.org>.
Github user Serhii-Harnyk commented on a diff in the pull request:

    https://github.com/apache/drill/pull/580#discussion_r98232198
  
    --- Diff: exec/vector/src/main/java/org/apache/drill/exec/vector/complex/MapVector.java ---
    @@ -317,6 +317,12 @@ public Object getObject(int index) {
             if (v != null && index < v.getAccessor().getValueCount()) {
               Object value = v.getAccessor().getObject(index);
               if (value != null) {
    +            if ((v.getAccessor().getObject(index) instanceof Map
    +                    && ((Map) v.getAccessor().getObject(index)).size() == 0)
    +                || (v.getAccessor().getObject(index) instanceof List
    +                    && ((List) v.getAccessor().getObject(index)).size() == 0)) {
    +              continue;
    +            }
    --- End diff --
    
    @paul-rogers, map fields have data mode required and they are the part of the schema, that's why there are no difference between missing field in some record, and the field that exists but empty. 
    This fix for your example will return result
    `{"a":{"b":10},"c":[1,2,3]}`
    `{"c":[4]}`
    `{"b":[5]}`
    `{"a":{"b":20}}`
    `{"a":{"b":20}}`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #580: DRILL-4824: JSON with complex nested data produces ...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/580#discussion_r100001448
  
    --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/vector/complex/writer/TestJsonReader.java ---
    @@ -91,9 +91,7 @@ public void testFieldSelectionBug() throws Exception {
               .baselineColumns("col_1", "col_2")
               .baselineValues(
                   mapOf(),
    -              mapOf(
    -                  "inner_1", listOf(),
    -                  "inner_3", mapOf()))
    +              mapOf())
    --- End diff --
    
    The test file used here is not sufficient to fully test the code change. Perhaps create an additional test, using a new test file, that tests all the variations suggested in the comment below and in the JIRA entry.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #580: DRILL-4824: JSON with complex nested data produces ...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/580#discussion_r78484822
  
    --- Diff: exec/vector/src/main/java/org/apache/drill/exec/vector/complex/MapVector.java ---
    @@ -317,6 +317,12 @@ public Object getObject(int index) {
             if (v != null && index < v.getAccessor().getValueCount()) {
               Object value = v.getAccessor().getObject(index);
               if (value != null) {
    +            if ((v.getAccessor().getObject(index) instanceof Map
    +                    && ((Map) v.getAccessor().getObject(index)).size() == 0)
    +                || (v.getAccessor().getObject(index) instanceof List
    +                    && ((List) v.getAccessor().getObject(index)).size() == 0)) {
    +              continue;
    +            }
    --- End diff --
    
    Does this handle the difference beteen a missing field, and a field that exists, but contains an empty map or list? Examples:
    
        { "a" : { "b" : 10 }, "c" : [ 1, 2, 3 ] }  // Baseline case
        { "a" : { }, "c" : [ 4 ] }                       // Keep "a"?
        { "b" : [ 5 ] }                                     // Remove a: OK
        { "a" : { "b": 20 }, "c" : [ ] }             // Keep "c"?
        { "a" : { "b": 20 } }                           // Remove c: OK
    
    That is, should we diffentiate between a empty map/list and a non-existent one?
    
    The code seems to discard all empty maps & lists. Should the check actually be done in the parser to not pass along empty items? Is this possible (I'm on thin ice here in my detailed knowledge of value vectors...)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #580: DRILL-4824: JSON with complex nested data produces ...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/580#discussion_r99970247
  
    --- Diff: exec/vector/src/main/java/org/apache/drill/exec/vector/complex/MapVector.java ---
    @@ -317,6 +317,12 @@ public Object getObject(int index) {
             if (v != null && index < v.getAccessor().getValueCount()) {
               Object value = v.getAccessor().getObject(index);
               if (value != null) {
    +            if ((v.getAccessor().getObject(index) instanceof Map
    +                    && ((Map) v.getAccessor().getObject(index)).size() == 0)
    +                || (v.getAccessor().getObject(index) instanceof List
    +                    && ((List) v.getAccessor().getObject(index)).size() == 0)) {
    +              continue;
    +            }
    --- End diff --
    
    See the JIRA entry for more comments. The key problem is that Drill does not support standard JSON rules. So, we are simply moving the problem around.
    
    Consider this case:
    
    {code}
    { }
    { "a": { } }
    { "a": { "b": {} } }
    {code}
    
    It appears that the code here would emit:
    { }
    { }
    { "a": { } }
    {code}
    
    If we are going to apply the empty-is-not-present rule we should do so recursively.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---