You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@drill.apache.org by Serhii-Harnyk <gi...@git.apache.org> on 2016/12/29 16:08:05 UTC

[GitHub] drill pull request #713: DRILL-3562: Query fails when using flatten on JSON ...

GitHub user Serhii-Harnyk opened a pull request:

    https://github.com/apache/drill/pull/713

    DRILL-3562: Query fails when using flatten on JSON data where some do\u2026

    \u2026cuments have an empty array
    1. Added set for ListWriters tracking to keep empty arrays for further initializing in ensureAtLeastOneField method. 
    2. Added check to avoid schema generating with field type "Late" and mode "Optional", replaced it to "Int" type in FlattenRecordBatch class.
    3. Added unit tests to cover cases querying Json with empty arrays with flatten.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Serhii-Harnyk/drill DRILL-3562

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/713.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #713
    
----
commit 815de0cc6f16b247b3a655007241a074e38394c7
Author: Serhii-Harnyk <se...@gmail.com>
Date:   2016-12-20T16:55:41Z

    DRILL-3562: Query fails when using flatten on JSON data where some documents have an empty array

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #713: DRILL-3562: Query fails when using flatten on JSON ...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/drill/pull/713


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #713: DRILL-3562: Query fails when using flatten on JSON ...

Posted by Serhii-Harnyk <gi...@git.apache.org>.

Github user Serhii-Harnyk commented on a diff in the pull request:

    https://github.com/apache/drill/pull/713#discussion_r97587774
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/vector/complex/fn/JsonReader.java ---
    @@ -59,6 +59,12 @@
       private final boolean readNumbersAsDouble;
     
       /**
    +   * Collection for tracking empty array writers during reading
    +   * and storing them for initializing empty arrays
    +   */
    +  private final Set<ListWriter> emptyArrayWritersSet = Sets.newHashSet();
    --- End diff --
    
    But it should be mentioned that this list may contain many duplicates, as every ListWriter from every empty array field will be added into it. 
    To avoid secondary call field.integer() on the same ListWriter object, which was added many times into the list, in the method JsonReader.ensureAtLeastOneField() we check valueCapacity of every ListWriter. 
    Also LinkedHashSet may be used there, but in the case of many empty array fields it may cause additional extra cost.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #713: DRILL-3562: Query fails when using flatten on JSON ...

Posted by Serhii-Harnyk <gi...@git.apache.org>.

Github user Serhii-Harnyk commented on a diff in the pull request:

    https://github.com/apache/drill/pull/713#discussion_r97576603
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/vector/complex/fn/JsonReader.java ---
    @@ -59,6 +59,12 @@
       private final boolean readNumbersAsDouble;
     
       /**
    +   * Collection for tracking empty array writers during reading
    +   * and storing them for initializing empty arrays
    +   */
    +  private final Set<ListWriter> emptyArrayWritersSet = Sets.newHashSet();
    --- End diff --
    
    @amansinha100 Yes, you are right, in this place should be used List. Fixed it. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #713: DRILL-3562: Query fails when using flatten on JSON ...

Posted by paul-rogers <gi...@git.apache.org>.

Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/713#discussion_r94814315
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/flatten/FlattenRecordBatch.java ---
    @@ -305,12 +306,23 @@ protected boolean setupNewSchema() throws SchemaChangeException {
     
         final NamedExpression flattenExpr = new NamedExpression(popConfig.getColumn(), new FieldReference(popConfig.getColumn()));
         final ValueVectorReadExpression vectorRead = (ValueVectorReadExpression)ExpressionTreeMaterializer.materialize(flattenExpr.getExpr(), incoming, collector, context.getFunctionRegistry(), true);
    -    final TransferPair tp = getFlattenFieldTransferPair(flattenExpr.getRef());
    -
    -    if (tp != null) {
    -      transfers.add(tp);
    -      container.add(tp.getTo());
    -      transferFieldIds.add(vectorRead.getFieldId().getFieldIds()[0]);
    +    final FieldReference fieldReference = flattenExpr.getRef();
    +    final TransferPair transferPair = getFlattenFieldTransferPair(fieldReference);
    +
    +    if (transferPair != null) {
    +      final ValueVector flattenVector = transferPair.getTo();
    +
    +      // checks that list has only default ValueVector and replaces resulting ValueVector to INT typed ValueVector
    +      if (exprs.size() == 0 && flattenVector.getField().getType().equals(Types.LATE_BIND_TYPE)) {
    +        final MaterializedField outputField = MaterializedField.create(fieldReference.getAsNamePart().getName(), Types.OPTIONAL_INT);
    +        final ValueVector vector = TypeHelper.getNewVector(outputField, oContext.getAllocator());
    --- End diff --
    
    The fix appears to be to transform an empty list into an empty list of integers. That is, Drill does not have the concept of "empty list", only "empty list of type X" and we are guessing the type to be integer.
    
    We've had issues elsewhere in the product where such guesses turn out to be wrong. Perhaps the next row/batch has a non-empty list, but of strings. Or worse, of objects (maps.) Downstream operators cannot handle this.
    
    The result is that a query fails for no better reason than we caused it to fail by guessing the wrong type.
    
    Clearly, fixing the broader problem is beyond the scope of this fix. I am pointing out, however, that a consequence of the assumptirnmade here is that some queries, somewhere later, will fail due to an artificial schema change.
    
    The correct solution is to introduce an "Unknown" type and mark this a vector of type "Unknown". All we know is that it is a list; the member types are unknown. Then, in downstream operators, when we encounter a schema change, we know that an empty list of "Unknown" type is compatible with a list of any other type (say maps.)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill issue #713: DRILL-3562: Query fails when using flatten on JSON data wh...

Posted by arina-ielchiieva <gi...@git.apache.org>.

Github user arina-ielchiieva commented on the issue:

    https://github.com/apache/drill/pull/713
  
    Looks good.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill issue #713: DRILL-3562: Query fails when using flatten on JSON data wh...

Posted by amansinha100 <gi...@git.apache.org>.

Github user amansinha100 commented on the issue:

    https://github.com/apache/drill/pull/713
  
    LGTM.  +1


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill issue #713: DRILL-3562: Query fails when using flatten on JSON data wh...

Posted by Serhii-Harnyk <gi...@git.apache.org>.

Github user Serhii-Harnyk commented on the issue:

    https://github.com/apache/drill/pull/713
  
    @amansinha100, could you please review new changes?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #713: DRILL-3562: Query fails when using flatten on JSON ...

Posted by amansinha100 <gi...@git.apache.org>.

Github user amansinha100 commented on a diff in the pull request:

    https://github.com/apache/drill/pull/713#discussion_r95100719
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/vector/complex/fn/JsonReader.java ---
    @@ -59,6 +59,12 @@
       private final boolean readNumbersAsDouble;
     
       /**
    +   * Collection for tracking empty array writers during reading
    +   * and storing them for initializing empty arrays
    +   */
    +  private final Set<ListWriter> emptyArrayWritersSet = Sets.newHashSet();
    --- End diff --
    
    Any reason why this needs to be HashSet rather than just a List ?  The HashSet may change the insertion order, so if you had 2 or more empty arrays in the same Json doc within one list,  the output of Flatten could end up changing the order.  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---