Posted to issues@drill.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2019/06/22 07:03:00 UTC

[jira] [Created] (DRILL-7304) Filter record batch misses schema changes within maps

Paul Rogers created DRILL-7304:
----------------------------------

             Summary: Filter record batch misses schema changes within maps
                 Key: DRILL-7304
                 URL: https://issues.apache.org/jira/browse/DRILL-7304
             Project: Apache Drill
          Issue Type: Bug
    Affects Versions: 1.16.0
            Reporter: Paul Rogers
            Assignee: Paul Rogers


While testing the new row-set based JSON reader, it was found that the Filter record batch does not properly handle a change of schema within a map. Consider this test:

{code:java}
  @Test
  public void adHoc() {
    try {
      client.alterSession(ExecConstants.JSON_ALL_TEXT_MODE, true);
      String sql = "select a from dfs.`jsoninput/drill_3353` where e = true";
      RowSet results = runTest(sql);
      results.print();
      results.clear();
    } finally {
      client.resetSession(ExecConstants.JSON_ALL_TEXT_MODE);
    }
  }
{code}

The "drill_3353" directory contains two files. {{a.json}}:

{code:json}
{ a : { b : 1, c : 1 }, e : false } 
{ a : { b : 1, c : 1 }, e : false } 
{ a : { b : 1, c : 1 }, e : true  } 
{code}

And {{b.json}}:

{code:json}
{ a : { b : 1, d : 1 }, e : false } 
{ a : { b : 1, d : 1 }, e : false } 
{ a : { b : 1, d : 1 }, e : true  } 
{code}

Notice that both files contain the field {{a.b}}, but the first contains {{a.c}} while the second contains {{a.d}}.

The test is configured to return the schema of each file without any "schema smoothing." That is, there is a hard schema change between files.

The filter record batch fails to notice the schema change, tries to use the {{b.json}} schema with {{a.json}}, and fails with an exception due to invalid offset vectors.

The problem is a symptom of this code:

{code:java}
  protected boolean setupNewSchema() throws SchemaChangeException {
    ...
    switch (incoming.getSchema().getSelectionVectorMode()) {
      case NONE:
        if (sv2 == null) {
          sv2 = new SelectionVector2(oContext.getAllocator());
        }
        filter = generateSV2Filterer();
        break;
    ...
    if (container.isSchemaChanged()) {
      container.buildSchema(SelectionVectorMode.TWO_BYTE);
      return true;
    }
{code}

That is, if, after calling {{generateSV2Filterer()}}, the schema of the outgoing container has changed, we rebuild the schema. Since the map structure changed between files, the container should report a schema change, but it does not.

Digging deeper, the following adds/gets each incoming field to the outgoing container:

{code:java}
  protected Filterer generateSV2Filterer() throws SchemaChangeException {
    ...
    for (final VectorWrapper<?> v : incoming) {
      final TransferPair pair = v.getValueVector().makeTransferPair(container.addOrGet(v.getField(), callBack));
      transfers.add(pair);
    }
{code}

Now, since the top-level field {{a}} already exists in the container, we'd have to do a bit of sleuthing to see if its contents changed, but we don't:

{code:java}
  public <T extends ValueVector> T addOrGet(final MaterializedField field, final SchemaChangeCallBack callBack) {
    final TypedFieldId id = getValueVectorId(SchemaPath.getSimplePath(field.getName()));
    final ValueVector vector;
    if (id != null) {
      vector = getValueAccessorById(id.getFieldIds()).getValueVector();
      if (id.getFieldIds().length == 1 && !vector.getField().getType().equals(field.getType())) {
        final ValueVector newVector = TypeHelper.getNewVector(field, this.getAllocator(), callBack);
        replace(vector, newVector);
        return (T) newVector;
      }
    ...
{code}

The logic is to check whether a vector with the top-level name already exists. If so, we check whether the types are equal. If they are not, we replace the existing vector (which has {{a\{b, d\}}}) with the new one (which has {{a\{b, c\}}}).

However, when running the code, the {{if}}-statement is not triggered, so the vector is not replaced, and the schema is not marked as changed.

The next question is why {{MajorType.equals()}} considers the two maps the same. It seems that the Protobuf-generated {{equals()}} does not compare child types, so differences within maps are ignored.
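The failure mode can be illustrated with a small standalone sketch. The classes and method names below are hypothetical stand-ins, not Drill's actual API: {{sameMajorType()}} mimics a shallow {{MajorType}} comparison that sees only "MAP == MAP", while {{isEquivalent()}} mimics the deeper child-aware check that {{MaterializedField.isEquivalent()}} performs:

{code:java}
import java.util.Arrays;
import java.util.List;

// Toy model of field metadata (hypothetical, not Drill's classes).
final class ToyField {
  final String name;
  final String type;              // e.g. "MAP", "INT"
  final List<ToyField> children;  // populated only for MAP fields

  ToyField(String name, String type, ToyField... children) {
    this.name = name;
    this.type = type;
    this.children = Arrays.asList(children);
  }

  // Shallow check, analogous to comparing MajorType alone:
  // for two MAP fields this compares only "MAP" to "MAP".
  boolean sameMajorType(ToyField other) {
    return type.equals(other.type);
  }

  // Deep check, analogous in spirit to MaterializedField.isEquivalent():
  // recurses into map children, so member changes are detected.
  boolean isEquivalent(ToyField other) {
    if (!name.equals(other.name) || !type.equals(other.type)) {
      return false;
    }
    if (children.size() != other.children.size()) {
      return false;
    }
    for (int i = 0; i < children.size(); i++) {
      if (!children.get(i).isEquivalent(other.children.get(i))) {
        return false;
      }
    }
    return true;
  }
}

public class MapSchemaChangeDemo {
  public static void main(String[] args) {
    // a.json schema: a { b, c }   b.json schema: a { b, d }
    ToyField fromA = new ToyField("a", "MAP",
        new ToyField("b", "INT"), new ToyField("c", "INT"));
    ToyField fromB = new ToyField("a", "MAP",
        new ToyField("b", "INT"), new ToyField("d", "INT"));

    System.out.println(fromA.sameMajorType(fromB)); // true: change missed
    System.out.println(fromA.isEquivalent(fromB));  // false: change caught
  }
}
{code}

The shallow check accepts the stale vector, which is exactly the path the filter batch takes; the deep check is the behavior we want from {{addOrGet()}}.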

As it turns out, prior work already ran into this issue in another context, and a solution is available: {{MaterializedField.isEquivalent()}} already does the proper checks, including proper type checking for types of Unions and member checking for maps.

Replacing {{MajorType.equals()}} with {{MaterializedField.isEquivalent()}} fixes the issue.

Digging deeper, the reason that {{equals()}} reports the two types as equal is that they are the same object. The input container evolved its map type as the JSON reader discovered new columns, but the filter record batch simply reused the same object. That is, since the two containers share the same {{MaterializedField}}, changes made by the incoming operator immediately change the schema of the filter operator's container. Said another way: there is no way to detect changes to the contents of maps because the two operators share metadata.
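The aliasing problem is easy to reproduce in miniature. This is a hypothetical sketch, not Drill code: a mutable list stands in for the shared {{MaterializedField}}, and both "operators" hold a reference to the same object instead of a copy, so any old-versus-new comparison degenerates into comparing an object with itself:

{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SharedMetadataDemo {
  public static void main(String[] args) {
    // One mutable metadata object standing in for MaterializedField:
    // the child names of map "a".
    List<String> mapChildren = new ArrayList<>(Arrays.asList("b", "c"));

    // Both operators reference the same object rather than copying it.
    List<String> incomingSchema = mapChildren;
    List<String> outgoingSchema = mapChildren;

    // The upstream reader discovers column "d" and drops "c".
    mapChildren.remove("c");
    mapChildren.add("d");

    // The downstream comparison can never detect the change:
    // it is comparing the object with itself.
    System.out.println(incomingSchema == outgoingSchema);      // true
    System.out.println(incomingSchema.equals(outgoingSchema)); // true
  }
}
{code}

A defensive copy at the operator boundary would make the comparison meaningful again, which is why sharing mutable metadata across operators defeats any equality-based change detection.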

This problem may be due to the original assumption that {{MaterializedField}} is immutable. That is true for "scalar" types (Nullable Int, say) but it is *not* true for complex types.

It appears that this is a design flaw. The best we can do now is hack around it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)