Posted to dev@drill.apache.org by paul-rogers <gi...@git.apache.org> on 2017/07/04 05:51:42 UTC

[GitHub] drill pull request #866: DRILL-5657: Implement size-aware result set loader

GitHub user paul-rogers opened a pull request:

    https://github.com/apache/drill/pull/866

    DRILL-5657: Implement size-aware result set loader

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/paul-rogers/drill DRILL-5657

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/866.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #866
    
----
commit 2c95326e1108e9299d6742c0cf5c35b86605bbe6
Author: Paul Rogers <pr...@maprtech.com>
Date:   2017-07-04T05:43:40Z

    DRILL-5657: Revised column accessors

commit bb4a4ed8eadbaa339570e8adabc8709e3d393719
Author: Paul Rogers <pr...@maprtech.com>
Date:   2017-07-04T05:45:43Z

    Revisions to row set test utilities

commit bec870fa849fdc39a12a1d1d48d18f3a0c6e8371
Author: Paul Rogers <pr...@maprtech.com>
Date:   2017-07-04T05:48:14Z

    Result set loader implementation and tests

commit 0c8727f5adfa1ffd2e4be584ba7f967e72265661
Author: Paul Rogers <pr...@maprtech.com>
Date:   2017-07-04T05:48:47Z

    Collateral damage: other affected files

----



[GitHub] drill pull request #866: DRILL-5657: Implement size-aware result set loader

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/866#discussion_r133618019
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/rowSet/impl/LogicalTupleLoader.java ---
    @@ -0,0 +1,204 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.drill.exec.physical.rowSet.impl;
    +
    +import java.util.ArrayList;
    +import java.util.Collection;
    +import java.util.HashSet;
    +import java.util.List;
    +import java.util.Set;
    +
    +import org.apache.drill.exec.physical.rowSet.ColumnLoader;
    +import org.apache.drill.exec.physical.rowSet.TupleLoader;
    +import org.apache.drill.exec.physical.rowSet.TupleSchema;
    +import org.apache.drill.exec.physical.rowSet.TupleSchema.TupleColumnSchema;
    +import org.apache.drill.exec.record.BatchSchema;
    +import org.apache.drill.exec.record.BatchSchema.SelectionVectorMode;
    +import org.apache.drill.exec.record.MaterializedField;
    +
    +/**
    + * Shim inserted between an actual tuple loader and the client to remove columns
    + * that are not projected from input to output. The underlying loader handles only
    + * the projected columns in order to improve efficiency. This class presents the
    + * full table schema, but returns null for the non-projected columns. This allows
    + * the reader to work with the table schema as defined by the data source, but
    + * skip those columns which are not projected. Skipping non-projected columns avoids
    + * creating value vectors which are immediately discarded. It may also save the reader
    + * from reading unwanted data.
    + */
    +public class LogicalTupleLoader implements TupleLoader {
    +
    +  public static final int UNMAPPED = -1;
    +
    +  private static class MappedColumn implements TupleColumnSchema {
    +
    +    private final MaterializedField schema;
    +    private final int mapping;
    +
    +    public MappedColumn(MaterializedField schema, int mapping) {
    +      this.schema = schema;
    +      this.mapping = mapping;
    +    }
    +
    +    @Override
    +    public MaterializedField schema() { return schema; }
    +
    +    @Override
    +    public boolean isSelected() { return mapping != UNMAPPED; }
    +
    +    @Override
    +    public int vectorIndex() { return mapping; }
    +  }
    +
    +  /**
    +   * Implementation of the tuple schema that describes the full data source
    +   * schema. The underlying loader schema is a subset of these columns. Note
    +   * that the columns appear in the same order in both schemas, but the loader
    +   * schema is a subset of the table schema.
    +   */
    +
    +  private class LogicalTupleSchema implements TupleSchema {
    +
    +    private final Set<String> selection = new HashSet<>();
    +    private final TupleSchema physicalSchema;
    +
    +    private LogicalTupleSchema(TupleSchema physicalSchema, Collection<String> selection) {
    +      this.physicalSchema = physicalSchema;
    +      this.selection.addAll(selection);
    +    }
    +
    +    @Override
    +    public int columnCount() { return logicalSchema.count(); }
    +
    +    @Override
    +    public int columnIndex(String colName) {
    +      return logicalSchema.indexOf(rsLoader.toKey(colName));
    +    }
    +
    +    @Override
    +    public TupleColumnSchema metadata(int colIndex) { return logicalSchema.get(colIndex); }
    +
    +    @Override
    +    public MaterializedField column(int colIndex) { return logicalSchema.get(colIndex).schema(); }
    +
    +    @Override
    +    public TupleColumnSchema metadata(String colName) { return logicalSchema.get(colName); }
    +
    +    @Override
    +    public MaterializedField column(String colName) { return logicalSchema.get(colName).schema(); }
    +
    +    @Override
    +    public int addColumn(MaterializedField columnSchema) {
    +      String key = rsLoader.toKey(columnSchema.getName());
    +      int pIndex;
    +      if (selection.contains(key)) {
    --- End diff --
    
    Removed this feature. Now use a case-insensitive map for the name space.
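    
    For illustration, here is a minimal sketch of the case-insensitive map idea. The `NameSpace` class and its methods are hypothetical stand-ins, not the actual Drill implementation:
    
    ```java
    import java.util.Map;
    import java.util.TreeMap;
    
    // Hypothetical name space: keys compare case-insensitively, so no
    // explicit toKey() normalization is needed when looking up columns.
    public class NameSpace<T> {
      private final Map<String, T> entries =
          new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
    
      public void add(String name, T value) { entries.put(name, value); }
    
      // get("myCol"), get("MYCOL") and get("mycol") hit the same entry.
      public T get(String name) { return entries.get(name); }
    }
    ```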



[GitHub] drill pull request #866: DRILL-5657: Implement size-aware result set loader

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/866#discussion_r133618560
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/rowSet/impl/TupleSetImpl.java ---
    @@ -0,0 +1,551 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.drill.exec.physical.rowSet.impl;
    +
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +import org.apache.drill.common.types.TypeProtos.DataMode;
    +import org.apache.drill.exec.expr.TypeHelper;
    +import org.apache.drill.exec.physical.rowSet.ColumnLoader;
    +import org.apache.drill.exec.physical.rowSet.TupleLoader;
    +import org.apache.drill.exec.physical.rowSet.TupleSchema;
    +import org.apache.drill.exec.physical.rowSet.impl.ResultSetLoaderImpl.VectorContainerBuilder;
    +import org.apache.drill.exec.record.BatchSchema;
    +import org.apache.drill.exec.record.BatchSchema.SelectionVectorMode;
    +import org.apache.drill.exec.record.MaterializedField;
    +import org.apache.drill.exec.vector.AllocationHelper;
    +import org.apache.drill.exec.vector.ValueVector;
    +import org.apache.drill.exec.vector.VectorOverflowException;
    +import org.apache.drill.exec.vector.accessor.impl.AbstractColumnWriter;
    +import org.apache.drill.exec.vector.accessor.impl.ColumnAccessorFactory;
    +
    +/**
    + * Implementation of a column when creating a row batch.
    + * Every column resides at an index, is defined by a schema,
    + * is backed by a value vector, and is written to by a writer.
    + * Each column also tracks the schema version in which it was added
    + * to detect schema evolution. Each column has an optional overflow
    + * vector that holds overflow record values when a batch becomes
    + * full.
    + * <p>
    + * Overflow vectors require special consideration. The vector class itself
    + * must remain constant as it is bound to the writer. To handle overflow,
    + * the implementation must replace the buffer in the vector with a new
    + * one, saving the full vector to return as part of the final row batch.
    + * This puts the column in one of three states:
    + * <ul>
    + * <li>Normal: only one vector is of concern - the vector for the active
    + * row batch.</li>
    + * <li>Overflow: a write to a vector caused overflow. For all columns,
    + * the data buffer is shifted to a harvested vector, and a new, empty
    + * buffer is put into the active vector.</li>
    + * <li>Excess: a (small) column received values for the row that will
    --- End diff --
    
    The term "excess" is perhaps not as descriptive as we'd like...
    
    Consider a row with three columns: (a, b, c).
    
    We're on row 12,345. We write a value for column a. We then try to write b. But, b is a large VarChar and so were the previous b values. b overflows. We will write the value for b and (later) c to the new, look-ahead vectors. But, we also have to copy the "excess" value for a from the original a vector to the new look-ahead vector for a.
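    
    A toy sketch of that copy step, using lists in place of value vectors (all names here are illustrative, not the Drill classes):
    
    ```java
    import java.util.ArrayList;
    import java.util.List;
    
    // Stand-in for a value vector: a growable list of cell values.
    class ToyVector {
      final List<Object> cells = new ArrayList<>();
      void write(Object v) { cells.add(v); }
      Object removeLast() { return cells.remove(cells.size() - 1); }
    }
    
    class OverflowSketch {
      // Called when column `overflowCol` overflows on the current row.
      // Columns written earlier in the row (a, in the example above) hold
      // an "excess" value in the active vector; move it into the
      // look-ahead vector so the harvested batch ends at the prior row.
      static void rollOver(ToyVector[] active, ToyVector[] lookAhead,
                           int overflowCol) {
        for (int col = 0; col < overflowCol; col++) {
          lookAhead[col].write(active[col].removeLast());
        }
        // The overflowing column and later columns (b and c) then write
        // the rest of the row directly to the look-ahead vectors.
      }
    }
    ```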



[GitHub] drill pull request #866: DRILL-5657: Implement size-aware result set loader

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/866#discussion_r133617740
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/rowSet/ResultSetLoader.java ---
    @@ -0,0 +1,170 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.drill.exec.physical.rowSet;
    +
    +import org.apache.drill.exec.record.VectorContainer;
    +
    +/**
    + * Builds a result set (series of zero or more row sets) based on a defined
    + * schema which may
    + * evolve (expand) over time. Automatically rolls "overflow" rows over
    + * when a batch fills.
    + * <p>
    + * Many of the methods in this interface are verify that the loader is
    --- End diff --
    
    Fixed



[GitHub] drill pull request #866: DRILL-5657: Implement size-aware result set loader

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/866#discussion_r133618354
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/rowSet/impl/ResultVectorCache.java ---
    @@ -0,0 +1,181 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.drill.exec.physical.rowSet.impl;
    +
    +import java.util.ArrayList;
    +import java.util.HashMap;
    +import java.util.List;
    +import java.util.Map;
    +
    +import org.apache.drill.common.types.TypeProtos.MajorType;
    +import org.apache.drill.exec.expr.TypeHelper;
    +import org.apache.drill.exec.memory.BufferAllocator;
    +import org.apache.drill.exec.record.MaterializedField;
    +import org.apache.drill.exec.vector.ValueVector;
    +
    +/**
    + * Manages an inventory of value vectors used across row batch readers.
    + * Drill semantics for batches are complex. Each operator logically returns
    + * a batch of records on each call of the Drill Volcano iterator protocol
    + * <tt>next()</tt> operation. However, the batches "returned" are not
    + * separate objects. Instead, Drill enforces the following semantics:
    + * <ul>
    + * <li>If a <tt>next()</tt> call returns <tt>OK</tt> then the set of vectors
    + * in the "returned" batch must be identical to those in the prior batch. Not
    + * just the same type; they must be the same <tt>ValueVector</tt> objects.
    + * (The buffers within the vectors will be different.)</li>
    + * <li>If the set of vectors changes in any way (add a vector, remove a
    + * vector, change the type of a vector), then the <tt>next()</tt> call
    + * <b>must</b> return <tt>OK_NEW_SCHEMA</tt>.</li>
    + * </ul>
    + * These rules create interesting constraints for the scan operator.
    + * Conceptually, each batch is distinct. But, it must share vectors. The
    + * {@link ResultSetLoader} class handles this by managing the set of vectors
    + * used by a single reader.
    + * <p>
    + * Readers are independent: each may read a distinct schema (as in JSON.)
    + * Yet, the Drill protocol requires minimizing spurious <tt>OK_NEW_SCHEMA</tt>
    + * events. As a result, two readers run by the same scan operator must
    + * share the same set of vectors, despite the fact that they may have
    + * different schemas and thus different <tt>ResultSetLoader</tt>s.
    + * <p>
    + * The purpose of this inventory is to persist vectors across readers, even
    + * when, say, reader B does not use a vector that reader A created.
    + * <p>
    + * The semantics supported by this class include:
    + * <ul>
    + * <li>Ability to "pre-declare" columns based on columns that appear in
    + * an explicit select list. This ensures that the columns are known (but
    + * not their types).</li>
    + * <li>Ability to reuse a vector across readers if the column retains the same
    + * name and type (minor type and mode.)</li>
    + * <li>Ability to flush unused vectors for readers with changing schemas
    + * if a schema change occurs.</li>
    + * <li>Support schema "hysteresis"; that is, a "sticky" schema that
    + * minimizes spurious changes. Once a vector is declared, it can be included
    + * in all subsequent batches (provided the column is nullable or an array.)</li>
    + * </ul>
    + */
    +public class ResultVectorCache {
    +
    +  /**
    +   * State of a projected vector. At first all we have is a name.
    +   * Later, we'll discover the type.
    +   */
    +
    +  private static class VectorState {
    +    protected final String name;
    +    protected ValueVector vector;
    +    protected boolean touched;
    +
    +    public VectorState(String name) {
    +      this.name = name;
    +    }
    +
    +    public boolean satisfies(MaterializedField colSchema) {
    +      if (vector == null) {
    +        return false;
    +      }
    +      MaterializedField vectorSchema = vector.getField();
    +      return vectorSchema.getType().equals(colSchema.getType());
    +    }
    +  }
    +
    +  private final BufferAllocator allocator;
    +  private final Map<String, VectorState> vectors = new HashMap<>();
    +
    +  public ResultVectorCache(BufferAllocator allocator) {
    +    this.allocator = allocator;
    +  }
    +
    +  public void predefine(List<String> selected) {
    +    for (String colName : selected) {
    +      addVector(colName);
    +    }
    +  }
    +
    +  private VectorState addVector(String colName) {
    +    VectorState vs = new VectorState(colName);
    +    vectors.put(vs.name, vs);
    +    return vs;
    +  }
    +
    +  public void newBatch() {
    +    for (VectorState vs : vectors.values()) {
    +      vs.touched = false;
    +    }
    +  }
    +
    +  public void trimUnused() {
    --- End diff --
    
    This was one attempt to handle changing schemas by removing vectors that were created in one batch but then not used in a later batch. The original design of some of Drill's code enforced the idea that required vectors are required. So, if a required vector is added but no data is written to it, what should happen? The thought here was to remove the vector.
    
    Later, it became clear that the only workable solution is to support "fill-empties" even for required vectors: we must fill unused vectors with zeros. It is then up to the client to decide whether this makes sense; the underlying framework no longer has an opinion.
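    
    A minimal sketch of the "fill-empties" idea for a required column, with an array standing in for an INT vector (`fillEmpties` is an illustrative name):
    
    ```java
    class FillEmptiesSketch {
      // A required (non-nullable) column that received no writes for some
      // rows must still hold a value per row. Zero-fill the gap so the row
      // counts stay consistent across all vectors in the batch.
      static void fillEmpties(int[] values, int lastWritten, int rowCount) {
        for (int row = lastWritten + 1; row < rowCount; row++) {
          values[row] = 0; // zero is the "benign" default for required types
        }
      }
    }
    ```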



[GitHub] drill pull request #866: DRILL-5657: Implement size-aware result set loader

Posted by bitblender <gi...@git.apache.org>.
Github user bitblender commented on a diff in the pull request:

    https://github.com/apache/drill/pull/866#discussion_r130429208
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/rowSet/TupleSchema.java ---
    @@ -0,0 +1,91 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.drill.exec.physical.rowSet;
    +
    +import org.apache.drill.exec.physical.rowSet.impl.MaterializedSchema;
    +import org.apache.drill.exec.record.BatchSchema;
    +import org.apache.drill.exec.record.MaterializedField;
    +
    +/**
    + * Defines the schema of a tuple: either the top-level row or a nested
    + * "map" (really structure). A schema is a collection of columns (backed
    + * by vectors in the loader itself.) Columns are accessible by name or
    + * index. New columns may be added at any time; the new column takes the
    + * next available index.
    + */
    +
    +public interface TupleSchema {
    +
    +  public interface TupleColumnSchema {
    +    MaterializedField schema();
    +
    +    /**
    +     * Report if a column is selected.
    +     * @param colIndex index of the column to check
    +     * @return true if the column is selected (data is collected),
    +     * false if the column is unselected (data is discarded)
    +     */
    +
    +    boolean isSelected();
    --- End diff --
    
    What does it mean for a column to be selected? Selected in the query?



[GitHub] drill pull request #866: DRILL-5657: Implement size-aware result set loader

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/866#discussion_r133618655
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/rowSet/impl/TupleSetImpl.java ---
    @@ -0,0 +1,551 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.drill.exec.physical.rowSet.impl;
    +
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +import org.apache.drill.common.types.TypeProtos.DataMode;
    +import org.apache.drill.exec.expr.TypeHelper;
    +import org.apache.drill.exec.physical.rowSet.ColumnLoader;
    +import org.apache.drill.exec.physical.rowSet.TupleLoader;
    +import org.apache.drill.exec.physical.rowSet.TupleSchema;
    +import org.apache.drill.exec.physical.rowSet.impl.ResultSetLoaderImpl.VectorContainerBuilder;
    +import org.apache.drill.exec.record.BatchSchema;
    +import org.apache.drill.exec.record.BatchSchema.SelectionVectorMode;
    +import org.apache.drill.exec.record.MaterializedField;
    +import org.apache.drill.exec.vector.AllocationHelper;
    +import org.apache.drill.exec.vector.ValueVector;
    +import org.apache.drill.exec.vector.VectorOverflowException;
    +import org.apache.drill.exec.vector.accessor.impl.AbstractColumnWriter;
    +import org.apache.drill.exec.vector.accessor.impl.ColumnAccessorFactory;
    +
    +/**
    + * Implementation of a column when creating a row batch.
    + * Every column resides at an index, is defined by a schema,
    + * is backed by a value vector, and is written to by a writer.
    + * Each column also tracks the schema version in which it was added
    + * to detect schema evolution. Each column has an optional overflow
    + * vector that holds overflow record values when a batch becomes
    + * full.
    + * <p>
    + * Overflow vectors require special consideration. The vector class itself
    + * must remain constant as it is bound to the writer. To handle overflow,
    + * the implementation must replace the buffer in the vector with a new
    + * one, saving the full vector to return as part of the final row batch.
    + * This puts the column in one of three states:
    + * <ul>
    + * <li>Normal: only one vector is of concern - the vector for the active
    + * row batch.</li>
    + * <li>Overflow: a write to a vector caused overflow. For all columns,
    + * the data buffer is shifted to a harvested vector, and a new, empty
    + * buffer is put into the active vector.</li>
    + * <li>Excess: a (small) column received values for the row that will
    + * overflow due to a later column. When overflow occurs, the excess
    + * column value, from the overflow record, resides in the active
    + * vector. It must be shifted from the active vector into the new
    + * overflow buffer.</li>
    + * </ul>
    + */
    +
    +public class TupleSetImpl implements TupleSchema {
    +
    +  public static class TupleLoaderImpl implements TupleLoader {
    +
    +    public TupleSetImpl tupleSet;
    +
    +    public TupleLoaderImpl(TupleSetImpl tupleSet) {
    +      this.tupleSet = tupleSet;
    +    }
    +
    +    @Override
    +    public TupleSchema schema() { return tupleSet; }
    +
    +    @Override
    +    public ColumnLoader column(int colIndex) {
    +      // TODO: Cache loaders here
    +      return tupleSet.columnImpl(colIndex).writer;
    +    }
    +
    +    @Override
    +    public ColumnLoader column(String colName) {
    +      ColumnImpl col = tupleSet.columnImpl(colName);
    +      if (col == null) {
    +        throw new UndefinedColumnException(colName);
    +      }
    +      return col.writer;
    +    }
    +
    +    @Override
    +    public TupleLoader loadRow(Object... values) {
    --- End diff --
    
    Type validation is done as the values wend their way through the writer tree. Eventually, we'll notice that the value is, say, an Integer and call the `setInt()` method on the writer. If the writer is really for, say, a VarChar, then an unsupported operation exception will be thrown at that time. Similarly, if the object type is not one we know how to parse, an exception will be thrown then as well.
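    
    Roughly, the dispatch works like the sketch below. The classes are hypothetical stand-ins for the writer tree, which is more elaborate in practice:
    
    ```java
    // Writers reject, by default, the types they do not support.
    interface ColumnWriter {
      default void setInt(int v) {
        throw new UnsupportedOperationException("setInt");
      }
      default void setString(String v) {
        throw new UnsupportedOperationException("setString");
      }
    }
    
    class VarCharWriter implements ColumnWriter {
      @Override public void setString(String v) { /* write bytes to vector */ }
    }
    
    class DispatchSketch {
      // loadRow()-style dispatch: the runtime type of the value picks the
      // setter. Passing an Integer to a VarCharWriter throws
      // UnsupportedOperationException, as described above.
      static void setObject(ColumnWriter writer, Object value) {
        if (value instanceof Integer) {
          writer.setInt((Integer) value);
        } else if (value instanceof String) {
          writer.setString((String) value);
        } else {
          throw new UnsupportedOperationException(
              "No conversion for " + value.getClass().getName());
        }
      }
    }
    ```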



[GitHub] drill pull request #866: DRILL-5657: Implement size-aware result set loader

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/866#discussion_r133617837
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/rowSet/TupleSchema.java ---
    @@ -0,0 +1,91 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.drill.exec.physical.rowSet;
    +
    +import org.apache.drill.exec.physical.rowSet.impl.MaterializedSchema;
    +import org.apache.drill.exec.record.BatchSchema;
    +import org.apache.drill.exec.record.MaterializedField;
    +
    +/**
    + * Defines the schema of a tuple: either the top-level row or a nested
    + * "map" (really structure). A schema is a collection of columns (backed
    + * by vectors in the loader itself.) Columns are accessible by name or
    + * index. New columns may be added at any time; the new column takes the
    + * next available index.
    + */
    +
    +public interface TupleSchema {
    +
    +  public interface TupleColumnSchema {
    +    MaterializedField schema();
    +
    +    /**
    +     * Report if a column is selected.
    +     * @param colIndex index of the column to check
    +     * @return true if the column is selected (data is collected),
    +     * false if the column is unselected (data is discarded)
    +     */
    +
    +    boolean isSelected();
    --- End diff --
    
    Changed. Should be "isProjected". This means that the column (which, in this case, exists in a table) is projected into the result set. Assume a table has the schema
    
    (a, b, c)
    
    But the query is:
    
    ```SELECT c, a FROM ...```
    
    Then a and c are "projected", while b is unprojected.
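    
    A sketch of how a reader might use this flag; it assumes the rename to `isProjected()`, and `readColumn`/`skipColumn` are placeholders for the reader's own work:
    
    ```java
    void readRow(TupleSchema schema) {
      for (int i = 0; i < schema.columnCount(); i++) {
        if (schema.metadata(i).isSelected()) { // isProjected() after rename
          readColumn(i); // decode the value and write it via the loader
        } else {
          skipColumn(i); // no vector is created; reading may be skipped too
        }
      }
    }
    ```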



[GitHub] drill pull request #866: DRILL-5657: Implement size-aware result set loader

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/866#discussion_r133618058
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/rowSet/impl/ResultSetLoaderImpl.java ---
    @@ -0,0 +1,412 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.drill.exec.physical.rowSet.impl;
    +
    +import java.util.Collection;
    +
    +import org.apache.drill.common.exceptions.UserException;
    +import org.apache.drill.exec.memory.BufferAllocator;
    +import org.apache.drill.exec.physical.rowSet.ResultSetLoader;
    +import org.apache.drill.exec.physical.rowSet.TupleLoader;
    +import org.apache.drill.exec.record.BatchSchema.SelectionVectorMode;
    +import org.apache.drill.exec.record.VectorContainer;
    +import org.apache.drill.exec.vector.ValueVector;
    +
    +/**
    + * Implementation of the result set loader.
    + * @see {@link ResultSetLoader}
    + */
    +
    +public class ResultSetLoaderImpl implements ResultSetLoader, WriterIndexImpl.WriterIndexListener {
    +
    +  public static class ResultSetOptions {
    +    public final int vectorSizeLimit;
    +    public final int rowCountLimit;
    +    public final boolean caseSensitive;
    +    public final ResultVectorCache inventory;
    --- End diff --
    
    Renamed to vectorCache.



[GitHub] drill pull request #866: DRILL-5657: Implement size-aware result set loader

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/866#discussion_r133617874
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/rowSet/impl/ColumnLoaderImpl.java ---
    @@ -0,0 +1,31 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.drill.exec.physical.rowSet.impl;
    +
    +import org.apache.drill.exec.physical.rowSet.ColumnLoader;
    +
    +/**
    + * Implementation interface for a column loader. Adds to the public interface
    + * a number of methods needed to coordinate batch overflow.
    + */
    +
    +public interface ColumnLoaderImpl extends ColumnLoader {
    --- End diff --
    
    Agreed. Fixed.



[GitHub] drill pull request #866: DRILL-5657: Implement size-aware result set loader

Posted by bitblender <gi...@git.apache.org>.
Github user bitblender commented on a diff in the pull request:

    https://github.com/apache/drill/pull/866#discussion_r131284158
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/rowSet/impl/LogicalTupleLoader.java ---
    @@ -0,0 +1,204 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.drill.exec.physical.rowSet.impl;
    +
    +import java.util.ArrayList;
    +import java.util.Collection;
    +import java.util.HashSet;
    +import java.util.List;
    +import java.util.Set;
    +
    +import org.apache.drill.exec.physical.rowSet.ColumnLoader;
    +import org.apache.drill.exec.physical.rowSet.TupleLoader;
    +import org.apache.drill.exec.physical.rowSet.TupleSchema;
    +import org.apache.drill.exec.physical.rowSet.TupleSchema.TupleColumnSchema;
    +import org.apache.drill.exec.record.BatchSchema;
    +import org.apache.drill.exec.record.BatchSchema.SelectionVectorMode;
    +import org.apache.drill.exec.record.MaterializedField;
    +
    +/**
    + * Shim inserted between an actual tuple loader and the client to remove columns
    + * that are not projected from input to output. The underlying loader handles only
    + * the projected columns in order to improve efficiency. This class presents the
    + * full table schema, but returns null for the non-projected columns. This allows
    + * the reader to work with the table schema as defined by the data source, but
    + * skip those columns which are not projected. Skipping non-projected columns avoids
    + * creating value vectors which are immediately discarded. It may also save the reader
    + * from reading unwanted data.
    + */
    +public class LogicalTupleLoader implements TupleLoader {
    +
    +  public static final int UNMAPPED = -1;
    +
    +  private static class MappedColumn implements TupleColumnSchema {
    +
    +    private final MaterializedField schema;
    +    private final int mapping;
    +
    +    public MappedColumn(MaterializedField schema, int mapping) {
    +      this.schema = schema;
    +      this.mapping = mapping;
    +    }
    +
    +    @Override
    +    public MaterializedField schema() { return schema; }
    +
    +    @Override
    +    public boolean isSelected() { return mapping != UNMAPPED; }
    +
    +    @Override
    +    public int vectorIndex() { return mapping; }
    +  }
    +
    +  /**
    +   * Implementation of the tuple schema that describes the full data source
    +   * schema. The underlying loader schema is a subset of these columns. Note
    +   * that the columns appear in the same order in both schemas, but the loader
    +   * schema is a subset of the table schema.
    +   */
    +
    +  private class LogicalTupleSchema implements TupleSchema {
    +
    +    private final Set<String> selection = new HashSet<>();
    +    private final TupleSchema physicalSchema;
    +
    +    private LogicalTupleSchema(TupleSchema physicalSchema, Collection<String> selection) {
    +      this.physicalSchema = physicalSchema;
    +      this.selection.addAll(selection);
    +    }
    +
    +    @Override
    +    public int columnCount() { return logicalSchema.count(); }
    +
    +    @Override
    +    public int columnIndex(String colName) {
    +      return logicalSchema.indexOf(rsLoader.toKey(colName));
    +    }
    +
    +    @Override
    +    public TupleColumnSchema metadata(int colIndex) { return logicalSchema.get(colIndex); }
    +
    +    @Override
    +    public MaterializedField column(int colIndex) { return logicalSchema.get(colIndex).schema(); }
    +
    +    @Override
    +    public TupleColumnSchema metadata(String colName) { return logicalSchema.get(colName); }
    +
    +    @Override
    +    public MaterializedField column(String colName) { return logicalSchema.get(colName).schema(); }
    +
    +    @Override
    +    public int addColumn(MaterializedField columnSchema) {
    +      String key = rsLoader.toKey(columnSchema.getName());
    +      int pIndex;
    +      if (selection.contains(key)) {
    --- End diff --
    
    Is selection already normalized when caseSensitive is false?



[GitHub] drill pull request #866: DRILL-5657: Implement size-aware result set loader

Posted by bitblender <gi...@git.apache.org>.
Github user bitblender commented on a diff in the pull request:

    https://github.com/apache/drill/pull/866#discussion_r131554894
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/rowSet/impl/ResultSetLoaderImpl.java ---
    @@ -0,0 +1,412 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.drill.exec.physical.rowSet.impl;
    +
    +import java.util.Collection;
    +
    +import org.apache.drill.common.exceptions.UserException;
    +import org.apache.drill.exec.memory.BufferAllocator;
    +import org.apache.drill.exec.physical.rowSet.ResultSetLoader;
    +import org.apache.drill.exec.physical.rowSet.TupleLoader;
    +import org.apache.drill.exec.record.BatchSchema.SelectionVectorMode;
    +import org.apache.drill.exec.record.VectorContainer;
    +import org.apache.drill.exec.vector.ValueVector;
    +
    +/**
    + * Implementation of the result set loader.
    + * @see {@link ResultSetLoader}
    + */
    +
    +public class ResultSetLoaderImpl implements ResultSetLoader, WriterIndexImpl.WriterIndexListener {
    +
    +  public static class ResultSetOptions {
    +    public final int vectorSizeLimit;
    +    public final int rowCountLimit;
    +    public final boolean caseSensitive;
    +    public final ResultVectorCache inventory;
    +    private final Collection<String> selection;
    +
    +    public ResultSetOptions() {
    +      vectorSizeLimit = ValueVector.MAX_BUFFER_SIZE;
    +      rowCountLimit = ValueVector.MAX_ROW_COUNT;
    +      caseSensitive = false;
    +      selection = null;
    +      inventory = null;
    +    }
    +
    +    public ResultSetOptions(OptionBuilder builder) {
    +      this.vectorSizeLimit = builder.vectorSizeLimit;
    +      this.rowCountLimit = builder.rowCountLimit;
    +      this.caseSensitive = builder.caseSensitive;
    +      this.selection = builder.selection;
    +      this.inventory = builder.inventory;
    +    }
    +  }
    +
    +  public static class OptionBuilder {
    +    private int vectorSizeLimit;
    +    private int rowCountLimit;
    +    private boolean caseSensitive;
    +    private Collection<String> selection;
    +    private ResultVectorCache inventory;
    +
    +    public OptionBuilder() {
    +      ResultSetOptions options = new ResultSetOptions();
    +      vectorSizeLimit = options.vectorSizeLimit;
    +      rowCountLimit = options.rowCountLimit;
    +      caseSensitive = options.caseSensitive;
    +    }
    +
    +    public OptionBuilder setCaseSensitive(boolean flag) {
    +      caseSensitive = flag;
    +      return this;
    +    }
    +
    +    public OptionBuilder setRowCountLimit(int limit) {
    +      rowCountLimit = Math.min(limit, ValueVector.MAX_ROW_COUNT);
    +      return this;
    +    }
    +
    +    public OptionBuilder setSelection(Collection<String> selection) {
    +      this.selection = selection;
    +      return this;
    +    }
    +
    +    public OptionBuilder setVectorCache(ResultVectorCache inventory) {
    +      this.inventory = inventory;
    +      return this;
    +    }
    +
    +    // TODO: No setter for vector length yet: is hard-coded
    +    // at present in the value vector.
    +
    +    public ResultSetOptions build() {
    +      return new ResultSetOptions(this);
    +    }
    +  }
    +
    +  public static class VectorContainerBuilder {
    +    private final ResultSetLoaderImpl rowSetMutator;
    +    private int lastUpdateVersion = -1;
    +    private VectorContainer container;
    +
    +    public VectorContainerBuilder(ResultSetLoaderImpl rowSetMutator) {
    +      this.rowSetMutator = rowSetMutator;
    +      container = new VectorContainer(rowSetMutator.allocator);
    +    }
    +
    +    public void update() {
    +      if (lastUpdateVersion < rowSetMutator.schemaVersion()) {
    +        rowSetMutator.rootTuple.buildContainer(this);
    +        container.buildSchema(SelectionVectorMode.NONE);
    +        lastUpdateVersion = rowSetMutator.schemaVersion();
    +      }
    +    }
    +
    +    public VectorContainer container() { return container; }
    +
    +    public int lastUpdateVersion() { return lastUpdateVersion; }
    +
    +    public void add(ValueVector vector) {
    +      container.add(vector);
    +    }
    +  }
    +
    +  private enum State {
    +    /**
    +     * Before the first batch.
    +     */
    +    START,
    +    /**
    +     * Writing to a batch normally.
    +     */
    +    ACTIVE,
    +    /**
    +     * Batch overflowed a vector while writing. Can continue
    +     * to write to a temporary "overflow" batch until the
    +     * end of the current row.
    +     */
    +    OVERFLOW,
    +    /**
    +     * Batch is full due to reaching the row count limit
    +     * when saving a row.
    +     * No more writes allowed until harvesting the current batch.
    +     */
    +    FULL_BATCH,
    +
    +    /**
    +     * Current batch was harvested: data is gone. A lookahead
    +     * row may exist for the next batch.
    +     */
    +    HARVESTED,
    +    /**
    +     * Mutator is closed: no more operations are allowed.
    +     */
    +    CLOSED
    +  }
    +
    +  private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(ResultSetLoaderImpl.class);
    +
    +  private final ResultSetOptions options;
    +  private final BufferAllocator allocator;
    +  private final TupleSetImpl rootTuple;
    +  private final TupleLoader rootWriter;
    +  private final WriterIndexImpl writerIndex;
    +  private final ResultVectorCache inventory;
    +  private ResultSetLoaderImpl.State state = State.START;
    +  private int activeSchemaVersion = 0;
    +  private int harvestSchemaVersion = 0;
    --- End diff --
    
    It is not obvious how the schema version is used. Comments would be helpful.



[GitHub] drill pull request #866: DRILL-5657: Implement size-aware result set loader

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers closed the pull request at:

    https://github.com/apache/drill/pull/866



[GitHub] drill pull request #866: DRILL-5657: Implement size-aware result set loader

Posted by bitblender <gi...@git.apache.org>.
Github user bitblender commented on a diff in the pull request:

    https://github.com/apache/drill/pull/866#discussion_r131685349
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/rowSet/impl/TupleSetImpl.java ---
    @@ -0,0 +1,551 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.drill.exec.physical.rowSet.impl;
    +
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +import org.apache.drill.common.types.TypeProtos.DataMode;
    +import org.apache.drill.exec.expr.TypeHelper;
    +import org.apache.drill.exec.physical.rowSet.ColumnLoader;
    +import org.apache.drill.exec.physical.rowSet.TupleLoader;
    +import org.apache.drill.exec.physical.rowSet.TupleSchema;
    +import org.apache.drill.exec.physical.rowSet.impl.ResultSetLoaderImpl.VectorContainerBuilder;
    +import org.apache.drill.exec.record.BatchSchema;
    +import org.apache.drill.exec.record.BatchSchema.SelectionVectorMode;
    +import org.apache.drill.exec.record.MaterializedField;
    +import org.apache.drill.exec.vector.AllocationHelper;
    +import org.apache.drill.exec.vector.ValueVector;
    +import org.apache.drill.exec.vector.VectorOverflowException;
    +import org.apache.drill.exec.vector.accessor.impl.AbstractColumnWriter;
    +import org.apache.drill.exec.vector.accessor.impl.ColumnAccessorFactory;
    +
    +/**
    + * Implementation of a column when creating a row batch.
    + * Every column resides at an index, is defined by a schema,
    + * is backed by a value vector, and is written to by a writer.
    + * Each column also tracks the schema version in which it was added
    + * to detect schema evolution. Each column has an optional overflow
    + * vector that holds overflow record values when a batch becomes
    + * full.
    + * <p>
    + * Overflow vectors require special consideration. The vector class itself
    + * must remain constant as it is bound to the writer. To handle overflow,
    + * the implementation must replace the buffer in the vector with a new
    + * one, saving the full vector to return as part of the final row batch.
    + * This puts the column in one of three states:
    + * <ul>
    + * <li>Normal: only one vector is of concern - the vector for the active
    + * row batch.</li>
    + * <li>Overflow: a write to a vector caused overflow. For all columns,
    + * the data buffer is shifted to a harvested vector, and a new, empty
    + * buffer is put into the active vector.</li>
    + * <li>Excess: a (small) column received values for the row that will
    --- End diff --
    
    'Excess' is the LOOK_AHEAD state, correct? I think it would be better if the comments used the same terminology as the code.



[GitHub] drill pull request #866: DRILL-5657: Implement size-aware result set loader

Posted by bitblender <gi...@git.apache.org>.
Github user bitblender commented on a diff in the pull request:

    https://github.com/apache/drill/pull/866#discussion_r131564509
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/rowSet/impl/ResultVectorCache.java ---
    @@ -0,0 +1,181 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.drill.exec.physical.rowSet.impl;
    +
    +import java.util.ArrayList;
    +import java.util.HashMap;
    +import java.util.List;
    +import java.util.Map;
    +
    +import org.apache.drill.common.types.TypeProtos.MajorType;
    +import org.apache.drill.exec.expr.TypeHelper;
    +import org.apache.drill.exec.memory.BufferAllocator;
    +import org.apache.drill.exec.record.MaterializedField;
    +import org.apache.drill.exec.vector.ValueVector;
    +
    +/**
    + * Manages an inventory of value vectors used across row batch readers.
    + * Drill semantics for batches are complex. Each operator logically returns
    + * a batch of records on each call of the Drill Volcano iterator protocol
    + * <tt>next()</tt> operation. However, the batches "returned" are not
    + * separate objects. Instead, Drill enforces the following semantics:
    + * <ul>
    + * <li>If a <tt>next()</tt> call returns <tt>OK</tt> then the set of vectors
    + * in the "returned" batch must be identical to those in the prior batch. Not
    + * just the same type; they must be the same <tt>ValueVector</tt> objects.
    + * (The buffers within the vectors will be different.)</li>
    + * <li>If the set of vectors changes in any way (add a vector, remove a
    + * vector, change the type of a vector), then the <tt>next()</tt> call
    + * <b>must</b> return <tt>OK_NEW_SCHEMA</tt>.</li>
    + * </ul>
    + * These rules create interesting constraints for the scan operator.
    + * Conceptually, each batch is distinct. But, it must share vectors. The
    + * {@link ResultSetLoader} class handles this by managing the set of vectors
    + * used by a single reader.
    + * <p>
    + * Readers are independent: each may read a distinct schema (as in JSON.)
    + * Yet, the Drill protocol requires minimizing spurious <tt>OK_NEW_SCHEMA</tt>
    + * events. As a result, two readers run by the same scan operator must
    + * share the same set of vectors, despite the fact that they may have
    + * different schemas and thus different <tt>ResultSetLoader</tt>s.
    + * <p>
    + * The purpose of this inventory is to persist vectors across readers, even
    + * when, say, reader B does not use a vector that reader A created.
    + * <p>
    + * The semantics supported by this class include:
    + * <ul>
    + * <li>Ability to "pre-declare" columns based on columns that appear in
    + * an explicit select list. This ensures that the columns are known (but
    + * not their types).</li>
    + * <li>Ability to reuse a vector across readers if the column retains the same
    + * name and type (minor type and mode.)</li>
    + * <li>Ability to flush unused vectors for readers with changing schemas
    + * if a schema change occurs.</li>
    + * <li>Support schema "hysteresis"; that is, a "sticky" schema that
    + * minimizes spurious changes. Once a vector is declared, it can be included
    + * in all subsequent batches (provided the column is nullable or an array.)</li>
    + * </ul>
    + */
    +public class ResultVectorCache {
    +
    +  /**
    +   * State of a projected vector. At first all we have is a name.
    +   * Later, we'll discover the type.
    +   */
    +
    +  private static class VectorState {
    +    protected final String name;
    +    protected ValueVector vector;
    +    protected boolean touched;
    +
    +    public VectorState(String name) {
    +      this.name = name;
    +    }
    +
    +    public boolean satisfies(MaterializedField colSchema) {
    +      if (vector == null) {
    +        return false;
    +      }
    +      MaterializedField vectorSchema = vector.getField();
    +      return vectorSchema.getType().equals(colSchema.getType());
    +    }
    +  }
    +
    +  private final BufferAllocator allocator;
    +  private final Map<String, VectorState> vectors = new HashMap<>();
    +
    +  public ResultVectorCache(BufferAllocator allocator) {
    +    this.allocator = allocator;
    +  }
    +
    +  public void predefine(List<String> selected) {
    +    for (String colName : selected) {
    +      addVector(colName);
    +    }
    +  }
    +
    +  private VectorState addVector(String colName) {
    +    VectorState vs = new VectorState(colName);
    +    vectors.put(vs.name, vs);
    +    return vs;
    +  }
    +
    +  public void newBatch() {
    +    for (VectorState vs : vectors.values()) {
    +      vs.touched = false;
    +    }
    +  }
    +
    +  public void trimUnused() {
    --- End diff --
    
    What is this for?


---

[GitHub] drill pull request #866: DRILL-5657: Implement size-aware result set loader

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/866#discussion_r133618125
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/rowSet/impl/ResultSetLoaderImpl.java ---
    @@ -0,0 +1,412 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.drill.exec.physical.rowSet.impl;
    +
    +import java.util.Collection;
    +
    +import org.apache.drill.common.exceptions.UserException;
    +import org.apache.drill.exec.memory.BufferAllocator;
    +import org.apache.drill.exec.physical.rowSet.ResultSetLoader;
    +import org.apache.drill.exec.physical.rowSet.TupleLoader;
    +import org.apache.drill.exec.record.BatchSchema.SelectionVectorMode;
    +import org.apache.drill.exec.record.VectorContainer;
    +import org.apache.drill.exec.vector.ValueVector;
    +
    +/**
    + * Implementation of the result set loader.
    + * @see {@link ResultSetLoader}
    + */
    +
    +public class ResultSetLoaderImpl implements ResultSetLoader, WriterIndexImpl.WriterIndexListener {
    +
    +  public static class ResultSetOptions {
    +    public final int vectorSizeLimit;
    +    public final int rowCountLimit;
    +    public final boolean caseSensitive;
    +    public final ResultVectorCache inventory;
    +    private final Collection<String> selection;
    +
    +    public ResultSetOptions() {
    +      vectorSizeLimit = ValueVector.MAX_BUFFER_SIZE;
    +      rowCountLimit = ValueVector.MAX_ROW_COUNT;
    +      caseSensitive = false;
    +      selection = null;
    +      inventory = null;
    +    }
    +
    +    public ResultSetOptions(OptionBuilder builder) {
    +      this.vectorSizeLimit = builder.vectorSizeLimit;
    +      this.rowCountLimit = builder.rowCountLimit;
    +      this.caseSensitive = builder.caseSensitive;
    +      this.selection = builder.selection;
    +      this.inventory = builder.inventory;
    +    }
    +  }
    +
    +  public static class OptionBuilder {
    +    private int vectorSizeLimit;
    +    private int rowCountLimit;
    +    private boolean caseSensitive;
    +    private Collection<String> selection;
    +    private ResultVectorCache inventory;
    +
    +    public OptionBuilder() {
    +      ResultSetOptions options = new ResultSetOptions();
    +      vectorSizeLimit = options.vectorSizeLimit;
    +      rowCountLimit = options.rowCountLimit;
    +      caseSensitive = options.caseSensitive;
    +    }
    +
    +    public OptionBuilder setCaseSensitive(boolean flag) {
    +      caseSensitive = flag;
    +      return this;
    +    }
    +
    +    public OptionBuilder setRowCountLimit(int limit) {
    +      rowCountLimit = Math.min(limit, ValueVector.MAX_ROW_COUNT);
    +      return this;
    +    }
    +
    +    public OptionBuilder setSelection(Collection<String> selection) {
    +      this.selection = selection;
    +      return this;
    +    }
    +
    +    public OptionBuilder setVectorCache(ResultVectorCache inventory) {
    +      this.inventory = inventory;
    +      return this;
    +    }
    +
    +    // TODO: No setter for vector length yet: is hard-coded
    +    // at present in the value vector.
    +
    +    public ResultSetOptions build() {
    +      return new ResultSetOptions(this);
    +    }
    +  }
    +
    +  public static class VectorContainerBuilder {
    +    private final ResultSetLoaderImpl rowSetMutator;
    +    private int lastUpdateVersion = -1;
    +    private VectorContainer container;
    +
    +    public VectorContainerBuilder(ResultSetLoaderImpl rowSetMutator) {
    +      this.rowSetMutator = rowSetMutator;
    +      container = new VectorContainer(rowSetMutator.allocator);
    +    }
    +
    +    public void update() {
    +      if (lastUpdateVersion < rowSetMutator.schemaVersion()) {
    +        rowSetMutator.rootTuple.buildContainer(this);
    +        container.buildSchema(SelectionVectorMode.NONE);
    +        lastUpdateVersion = rowSetMutator.schemaVersion();
    +      }
    +    }
    +
    +    public VectorContainer container() { return container; }
    +
    +    public int lastUpdateVersion() { return lastUpdateVersion; }
    +
    +    public void add(ValueVector vector) {
    +      container.add(vector);
    +    }
    +  }
    +
    +  private enum State {
    +    /**
    +     * Before the first batch.
    +     */
    +    START,
    +    /**
    +     * Writing to a batch normally.
    +     */
    +    ACTIVE,
    +    /**
    +     * Batch overflowed a vector while writing. Can continue
    +     * to write to a temporary "overflow" batch until the
    +     * end of the current row.
    +     */
    +    OVERFLOW,
    +    /**
    +     * Batch is full due to reaching the row count limit
    +     * when saving a row.
    +     * No more writes allowed until harvesting the current batch.
    +     */
    +    FULL_BATCH,
    +
    +    /**
    +     * Current batch was harvested: data is gone. A lookahead
    +     * row may exist for the next batch.
    +     */
    +    HARVESTED,
    +    /**
    +     * Mutator is closed: no more operations are allowed.
    +     */
    +    CLOSED
    +  }
    +
    +  private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(ResultSetLoaderImpl.class);
    +
    +  private final ResultSetOptions options;
    +  private final BufferAllocator allocator;
    +  private final TupleSetImpl rootTuple;
    +  private final TupleLoader rootWriter;
    +  private final WriterIndexImpl writerIndex;
    +  private final ResultVectorCache inventory;
    +  private ResultSetLoaderImpl.State state = State.START;
    +  private int activeSchemaVersion = 0;
    +  private int harvestSchemaVersion = 0;
    --- End diff --
    
    Added comments. See also the unit tests of this feature.


---

[GitHub] drill issue #866: DRILL-5657: Implement size-aware result set loader

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/866
  
    Let's defer this one so we can focus on the lower layer: the column accessors for maps and lists (DRILL-5688). Once that PR is done, we'll come back and update this one with those revisions. Please continue to get familiar with the concepts here. However, the details will change a bit to allow support for repeated maps and lists.


---

[GitHub] drill issue #866: DRILL-5657: Implement size-aware result set loader

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/866
  
    Closing as this PR is now superseded by #914.


---

[GitHub] drill pull request #866: DRILL-5657: Implement size-aware result set loader

Posted by bitblender <gi...@git.apache.org>.
Github user bitblender commented on a diff in the pull request:

    https://github.com/apache/drill/pull/866#discussion_r131684173
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/rowSet/impl/TupleSetImpl.java ---
    @@ -0,0 +1,551 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.drill.exec.physical.rowSet.impl;
    +
    +import java.util.ArrayList;
    +import java.util.List;
    +
    +import org.apache.drill.common.types.TypeProtos.DataMode;
    +import org.apache.drill.exec.expr.TypeHelper;
    +import org.apache.drill.exec.physical.rowSet.ColumnLoader;
    +import org.apache.drill.exec.physical.rowSet.TupleLoader;
    +import org.apache.drill.exec.physical.rowSet.TupleSchema;
    +import org.apache.drill.exec.physical.rowSet.impl.ResultSetLoaderImpl.VectorContainerBuilder;
    +import org.apache.drill.exec.record.BatchSchema;
    +import org.apache.drill.exec.record.BatchSchema.SelectionVectorMode;
    +import org.apache.drill.exec.record.MaterializedField;
    +import org.apache.drill.exec.vector.AllocationHelper;
    +import org.apache.drill.exec.vector.ValueVector;
    +import org.apache.drill.exec.vector.VectorOverflowException;
    +import org.apache.drill.exec.vector.accessor.impl.AbstractColumnWriter;
    +import org.apache.drill.exec.vector.accessor.impl.ColumnAccessorFactory;
    +
    +/**
    + * Implementation of a column when creating a row batch.
    + * Every column resides at an index, is defined by a schema,
    + * is backed by a value vector, and is written to by a writer.
    + * Each column also tracks the schema version in which it was added
    + * to detect schema evolution. Each column has an optional overflow
    + * vector that holds overflow record values when a batch becomes
    + * full.
    + * <p>
    + * Overflow vectors require special consideration. The vector class itself
    + * must remain constant as it is bound to the writer. To handle overflow,
    + * the implementation must replace the buffer in the vector with a new
    + * one, saving the full vector to return as part of the final row batch.
    + * This puts the column in one of three states:
    + * <ul>
    + * <li>Normal: only one vector is of concern - the vector for the active
    + * row batch.</li>
    + * <li>Overflow: a write to a vector caused overflow. For all columns,
    + * the data buffer is shifted to a harvested vector, and a new, empty
    + * buffer is put into the active vector.</li>
    + * <li>Excess: a (small) column received values for the row that will
    + * overflow due to a later column. When overflow occurs, the excess
    + * column value, from the overflow record, resides in the active
    + * vector. It must be shifted from the active vector into the new
    + * overflow buffer.</li></ul>
    + */
    +
    +public class TupleSetImpl implements TupleSchema {
    +
    +  public static class TupleLoaderImpl implements TupleLoader {
    +
    +    public TupleSetImpl tupleSet;
    +
    +    public TupleLoaderImpl(TupleSetImpl tupleSet) {
    +      this.tupleSet = tupleSet;
    +    }
    +
    +    @Override
    +    public TupleSchema schema() { return tupleSet; }
    +
    +    @Override
    +    public ColumnLoader column(int colIndex) {
    +      // TODO: Cache loaders here
    +      return tupleSet.columnImpl(colIndex).writer;
    +    }
    +
    +    @Override
    +    public ColumnLoader column(String colName) {
    +      ColumnImpl col = tupleSet.columnImpl(colName);
    +      if (col == null) {
    +        throw new UndefinedColumnException(colName);
    +      }
    +      return col.writer;
    +    }
    +
    +    @Override
    +    public TupleLoader loadRow(Object... values) {
    --- End diff --
    
    Is there a need to verify the types of the incoming args?


---

[GitHub] drill issue #866: Drill 5657: Implement size-aware result set loader

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/866
  
    This commit provides another two levels of foundation for size-aware vector writers in the Drill record readers.
    
    Much of the material below appears in Javadoc throughout the code. But, it is gathered here for quick reference to speed the code review.
    
    The PR is broken into commits by layer of function. It may be easier to review each commit one by one rather than looking at the whole mess in one big diff.
    
    ## Column Accessors
    
    A recent extension to Drill's set of test tools created a "row set" abstraction to allow us to create, and verify, record batches with very few lines of code. Part of this work involved creating a set of "column accessors" in the vector subsystem. Column readers provide a uniform API to obtain data from columns (vectors), while column writers provide a uniform writing interface.
    
    DRILL-5211 discusses a set of changes to limit value vectors to 16 MB in size (to avoid memory fragmentation due to Drill's two memory allocators.) The column accessors have proven to be so useful that they will be the basis for the new, size-aware writers used by Drill's record readers.
    
    Changes include:
    
    * Implement fill-empties logic for vectors that do not provide it.
    * Use the new size-aware methods, throwing vector overflow exceptions, which can now occur. (A short sketch follows this list.)
    * Some fiddling to handle the non-standard names of vector functions.
    * Modify strings to use a default type of byte[], but offer a String version for convenience.
    * Add “finish batch” logic to handle values omitted at the end of a batch. (This is a bug in some existing record readers.)
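    
    To make the size-aware behavior concrete, here is a minimal sketch of writing through a column accessor. The factory and bind() calls are assumptions made for illustration; only the accessor classes and VectorOverflowException come from the PR.
    
    ```
    // Minimal sketch; the setup calls are illustrative, not the final API.
    AbstractColumnWriter writer = ColumnAccessorFactory.newWriter(field.getType());
    writer.bind(vector);
    try {
      writer.setString("some value");   // uniform setter across vector types
    } catch (VectorOverflowException e) {
      // New in this PR: size-aware setters signal a full vector rather
      // than letting it grow past the size limit.
    }
    ```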
    
    ## Result Set Loader
    
    The second layer of this commit is the new “result set loader.” This abstraction is an evolution of the “Mutator” class in the scan batch, when used with the existing column writers (which some readers use and others do not.)
    
    A result set loader loads a set of tuples (AKA records, rows) from any source (such as a record reader) into a set of record batches. The loader:
    
    * Divides records into batches based on a maximum row count or a maximum vector size, whichever occurs first. (Later revisions may limit overall batch size.)
    * Tracks the start and end of each batch.
    * Tracks the start and end of each row.
    * Provides column loaders to write each column value.
    * Handles overflow when a vector becomes full, but the client still must finish writing the current row. (A typical reader loop is sketched after this list.)
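    
    Putting the pieces together, a reader's per-batch loop looks roughly like the sketch below. The method names (startBatch, startRow, saveRow, isFull, harvest) are illustrative of the flow rather than verbatim from the interface.
    
    ```
    // Sketch of a record reader driving the result set loader.
    rsLoader.startBatch();
    TupleLoader writer = rsLoader.writer();
    while (!rsLoader.isFull() && reader.hasMore()) {
      rsLoader.startRow();
      writer.column("a").setInt(reader.readInt("a"));
      writer.column("b").setString(reader.readString("b"));
      rsLoader.saveRow();
    }
    VectorContainer batch = rsLoader.harvest();  // hand off to the scan operator
    ```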
    
    The original Mutator class divided up responsibilities:
    
    * The Mutator handled the entire record batch
    * An optional VectorContainerWriter writes each record
    
    The result set loader follows this same general pattern.
    
    * The result set loader handles the entire record batch (really, a series of batches that make up the entire result set: hence the name.)
    * The TupleLoader class provides per-tuple services, which mostly consist of access to the column loaders.
    * A tuple schema defines the schema for the result set (see below.)
    
    To hide this complexity from the client, a ResultSetLoader interface defines the public API. Then, a ResultSetLoaderImpl class implements the interface with all the gory details. Separate classes handle each column, the result set schema, and so on.
    
    This class is pretty complex, with a state machine per batch and per column, so take your time reviewing it.
    
    ## Column Loaders
    
    The column writers are low-level classes that interface between a consumer and a value vector. To create the tuple loader we need a higher-level abstraction: the column loader. (Note that there is no equivalent for reading columns at this time: generated code does the reading in its own special way for each operator.)
    
    Column loaders have a number of responsibilities:
    
    * Single class used for all data types. No more casting.
    * Transparently handle vector overflow and rollover.
    * Provide generic (Object-based) setters, most useful for testing.
    
    Because this commit seeks to prove the concept, the column loader supports only a subset of types. Adding the other types is simply a matter of copy & paste, and will be done once things settle down. For now, the focus is on int and Varchar types (though the generic version supports all types.)
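    
    The generic setter is essentially a type dispatch onto the typed setters; a minimal sketch, assuming typed setters such as setInt and setString:
    
    ```
    // Sketch of the Object-based setter: handy in tests, but slower than
    // the typed setters because of the per-value dispatch.
    public void set(Object value) {
      if (value == null) {
        setNull();
      } else if (value instanceof Integer) {
        setInt((Integer) value);
      } else if (value instanceof String) {
        setString((String) value);
      } else {
        throw new IllegalArgumentException(
            "Unsupported type: " + value.getClass().getName());
      }
    }
    ```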
    
    To handle vector overflow, each “set” method:
    
    * Tries to write the value into the current vector (using a column writer)
    * If overflow occurs, tells the listener (the row set mutator) to create a new vector
    * Tries the write a second time using the new vector (see the sketch below)
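    
    In code, the retry sequence is roughly the following sketch; the listener callback name is an assumption:
    
    ```
    // Sketch of the write / overflow / retry pattern described above.
    public void setInt(int value) {
      try {
        writer.setInt(value);              // fast path
      } catch (VectorOverflowException e) {
        listener.overflowed();             // swap in a fresh vector
        try {
          writer.setInt(value);            // retry on the new vector
        } catch (VectorOverflowException e2) {
          // A single value cannot overflow an empty vector.
          throw new IllegalStateException("Overflow on empty vector", e2);
        }
      }
    }
    ```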
    
    The set of column writers must be synchronized (not in a multi-thread sense) on the current row position. As in the row set test utilities, a WriterIndex performs this task. (In fact, this is derived from the same writer index used for the row set test code and is defined by the column accessor code.)
    
    As with the row set version, a variety of column loader implementations exist depending on whether the underlying column is a scalar, an array, a map (not yet supported), etc. All this is transparent to the client of the tuple loader.
    
    ## Vector Overflow Logic
    
    The heart of this abstraction is that last point: the ability to detect when a vector overflows, switch in a new vector, and continue writing. Several tricks are involved.
    
    Suppose we have a row of five columns: a through e. The code writes a and b. Then, c overflows. The code can’t rewrite a and b. To handle this, the tuple loader:
    
    * Creates a new, small set of vectors called the “overflow batch”
    * Copies columns a and b from the current batch to the overflow batch.
    * Writes column c to the overflow batch.
    * Allows the client code to finish writing columns d and e (to the overflow batch).
    * Reports to the client that the batch is full.
    
    Note that the client is completely unaware that any of the above occurred: it just writes a row and asks if it can write another.
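    
    Internally, the rollover amounts to the following sketch for the five-column example. All names here are illustrative; the real logic lives in the tuple set and column classes.
    
    ```
    // Sketch: roll the in-flight row into the overflow batch.
    private void rollover(int overflowRow) {
      for (ColumnImpl col : columns) {
        // Keep the full buffer for the harvested batch; give the writer a
        // fresh, empty buffer inside the same ValueVector instance.
        col.swapToOverflowBuffer();
        // Columns a and b already hold a value for the in-flight row:
        // copy it to position 0 of the new buffer.
        if (col.hasValueAt(overflowRow)) {
          col.copyValue(overflowRow, 0);
        }
      }
    }
    ```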
    
    ## Skipping Columns
    
    The loader must also handle a reader, such as Parquet, that skips columns if they are null. There were bugs in Drill’s vectors for this case, and temporary patches were made in a number of places to make this work. The trick should also work for arrays (a null array is allowed; Drill represents it as an empty array), but this code was also broken. For good measure, the code now also allows skipping non-null columns if a good “empty” value is available: 0 for numbers, blank for strings. This behavior is needed for the CSV reader; if a line is missing a field, the CSV reader treats it as an empty (not null) field.
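    
    On the reader side, skipping a column is just skipping the set call. A sketch for the CSV case (field layout and back-fill values as described above, not actual API):
    
    ```
    // Sketch: a short CSV line omits the trailing "comment" field.
    ColumnLoader comment = writer.column("comment");
    if (fields.length > 2) {
      comment.setString(fields[2]);
    }
    // Else write nothing: when the row is saved, the loader back-fills an
    // empty string (or 0 for numbers, null for nullable columns).
    ```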
    
    ## Result Set Schema
    
    The tuple loader is designed to handle two kinds of tables: “early schema” tables (such as Parquet and CSV) define the schema up front, while “late schema” tables (such as JSON) discover the schema during reading. The tuple loader allows either form and, in fact, uses the same mechanism. (The only caveat is that issues occur when adding a non-null column after the first row has been loaded.)
    
    Consumers of batches will, of course, want to know that the schema changed. Providing a simple flag is muddy: when should it be reset? A better solution is to provide a schema version that is incremented each time a column is added. (Columns cannot be removed or changed — at least not yet.)
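    
    A consumer can then detect schema changes with a simple version check. The schemaVersion() call follows the usage visible in the diffs above; the rest is a sketch:
    
    ```
    // Sketch: detect a schema change between harvested batches.
    VectorContainer batch = rsLoader.harvest();
    if (rsLoader.schemaVersion() != lastSchemaVersion) {
      lastSchemaVersion = rsLoader.schemaVersion();
      // Downstream sees OK_NEW_SCHEMA instead of OK for this batch.
    }
    ```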
    
    ## Internal Vectors vs. Vector Container
    
    The result set loader uses its own mechanism to manage vectors within the loader. Vectors are stored on each column to allow quick, indexed access and to simplify creating new columns.
    
    However, the consumer of the batch (eventually, a new scan batch), wants a vector container. A special class handles this translation, including incrementally modifying the container as new columns are added.
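    
    The translation is driven by the schema version, mirroring the VectorContainerBuilder.update() method quoted earlier in this thread (lightly adapted here):
    
    ```
    // Rebuild the container only when the loader's schema version has
    // moved past the version of the last update.
    if (lastUpdateVersion < loader.schemaVersion()) {
      rootTuple.buildContainer(this);
      container.buildSchema(SelectionVectorMode.NONE);
      lastUpdateVersion = loader.schemaVersion();
    }
    ```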
    
    ## Logical Tuples
    
    As if the above were not complex enough, we must deal with another layer of complexity. Suppose we have a query of the form:
    
    ```
    SELECT * FROM myTable
    ```
    
    In such a query, the reader will read all columns using the tuple loader. Very simple. But, many queries are of the form:
    
    ```
    SELECT a, b FROM myTable
    ```
    
    Where “myTable” contains columns (a, b, c, d, e). There is no point in reading columns c, d and e: we’d just throw them away. Instead, we want to define a “logical tuple” that contains just (a, b) and not even read the others.
    
    Each Drill record reader does this in its own way. The tuple loader provides a new, standard solution in the form of a logical tuple loader.
    
    The logical tuple loader works just like the regular one: but it knows which columns are projected and which are not. If the reader asks for a projected column, the logical loader returns a column loader to load the value. But, when the reader asks for a non-projected column, the logical loader simply returns null, telling the application to discard that column (or, better, to not read it at all.)
    
    The logical loader is needed because the regular loader will create columns on the fly: the logical loader intercepts the column request and returns null instead.
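    
    The reader-side pattern is a simple null check; a sketch with illustrative names:
    
    ```
    // Sketch: probe the logical loader once per table column.
    ColumnLoader cLoader = logicalWriter.column("c");
    if (cLoader == null) {
      // "c" is not projected: skip it in the file; don't load it.
    } else {
      cLoader.setInt(readColumnC());
    }
    ```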
    
    ## Materialized Schema
    
    For reasons that will become clear in the next PR, the scan batch ends up doing quite a bit of semantic analysis to map from the select list and the table schema to the result schema. Drill provides a BatchSchema class that is useful, but limited in this context. To solve this problem, a new class, MaterializedSchema, does what BatchSchema does, but allows fast access by both name and position, and allows the schema to grow dynamically.
    
    The row set abstractions for testing already had a concept of a tuple schema, so this was extracted and extended to act as the foundation for the materialized schema.
    
    ## Result Vector Cache
    
    Above we mentioned that the tuple loader allows schema changes on the fly. As the next PR will make clearer, downstream operators want a fixed set of vectors. To assist with this, the tuple loader uses a “result vector cache”. Let’s say a scanner reads two JSON files with the same schema. The first creates the schema and vectors. The second is obligated to use the same vectors. This is a royal pain. But, the vector cache does it automatically: when the tuple loader adds a new column, it checks if the vector already exists in the cache and reuses it. If not, the cache creates the vector, adds it, and returns it so that it is there the next time around.
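    
    The reuse test follows the VectorState.satisfies() logic quoted earlier in this thread: same name plus the same minor type and mode. A sketch of the lookup:
    
    ```
    // Sketch based on the quoted VectorState.satisfies() logic.
    VectorState vs = vectors.get(colSchema.getName());
    if (vs != null && vs.satisfies(colSchema)) {
      return vs.vector;        // a later reader reuses the earlier vector
    }
    ValueVector v = TypeHelper.getNewVector(colSchema, allocator);
    // ... cache the new vector so the next reader finds it.
    ```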
    
    ## Map, List, Union and Other Complex Support
    
    This commit does not yet address complex types such as maps, lists, union vectors, and so on. The idea is to get the basics to work first. The commit does, however, support arrays of primitive and Varchar types.
    
    ## Row Set Test Classes
    
    The row set test classes and the above new classes share the same column accessors. The test classes were updated to catch the new overflow exception. Because the test code is used to create small batches as test input data, the overflow exception is translated to an unchecked exception to keep test code simple.
    
    Several row set index classes were moved and adjusted to use the revised form needed for the tuple loader.
    
    A few names were changed to reduce confusion (mine) over what they meant.
    
    ## Unit Tests
    
    All of the above is pretty thoroughly tested via unit tests. In fact, the unit tests are a good place to start (now I tell you!) in order to see how client code uses the various abstractions.
    
    The bit of unit test structure that handled system options turned out to be wrong. Modified it to use the defaults defined in the system option manager, which required changing the visibility of the defaults table.
    
    ## Other
    
    Some unit tests were updated to use new features which become available in this PR. See TestFillEmpties and TestVectorLimits.
    
    The `equals()` method in BatchSchema is badly broken. Cleaned it up some but didn’t want to change it too much in case anything depends on the current, broken semantics. So, added a new `isEquivalent()` method to provide the correct semantics. Added an `isEquivalent()` method to MaterializedField as well that ignores the “implementation” columns that hang off of types such as nullables, repeated, etc. That is, two repeated columns are identical if their types are identical, regardless of whether one has the “$offsets” child or not.
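    
    A sketch of the intended isEquivalent() semantics (not the actual implementation): compare name and type while ignoring internal children such as “$offsets”:
    
    ```
    // Sketch of the intended semantics only.
    public boolean isEquivalent(MaterializedField other) {
      return getName().equalsIgnoreCase(other.getName())
          && getType().equals(other.getType());
    }
    ```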


---

[GitHub] drill pull request #866: DRILL-5657: Implement size-aware result set loader

Posted by bitblender <gi...@git.apache.org>.
Github user bitblender commented on a diff in the pull request:

    https://github.com/apache/drill/pull/866#discussion_r131459203
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/rowSet/impl/ResultSetLoaderImpl.java ---
    @@ -0,0 +1,412 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.drill.exec.physical.rowSet.impl;
    +
    +import java.util.Collection;
    +
    +import org.apache.drill.common.exceptions.UserException;
    +import org.apache.drill.exec.memory.BufferAllocator;
    +import org.apache.drill.exec.physical.rowSet.ResultSetLoader;
    +import org.apache.drill.exec.physical.rowSet.TupleLoader;
    +import org.apache.drill.exec.record.BatchSchema.SelectionVectorMode;
    +import org.apache.drill.exec.record.VectorContainer;
    +import org.apache.drill.exec.vector.ValueVector;
    +
    +/**
    + * Implementation of the result set loader.
    + * @see {@link ResultSetLoader}
    + */
    +
    +public class ResultSetLoaderImpl implements ResultSetLoader, WriterIndexImpl.WriterIndexListener {
    +
    +  public static class ResultSetOptions {
    +    public final int vectorSizeLimit;
    +    public final int rowCountLimit;
    +    public final boolean caseSensitive;
    +    public final ResultVectorCache inventory;
    --- End diff --
    
    The name 'inventory' does not convey the intent clearly.


---

[GitHub] drill pull request #866: DRILL-5657: Implement size-aware result set loader

Posted by bitblender <gi...@git.apache.org>.
Github user bitblender commented on a diff in the pull request:

    https://github.com/apache/drill/pull/866#discussion_r131216670
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/rowSet/impl/ColumnLoaderImpl.java ---
    @@ -0,0 +1,31 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.drill.exec.physical.rowSet.impl;
    +
    +import org.apache.drill.exec.physical.rowSet.ColumnLoader;
    +
    +/**
    + * Implementation interface for a column loader. Adds to the public interface
    + * a number of methods needed to coordinate batch overflow.
    + */
    +
    +public interface ColumnLoaderImpl extends ColumnLoader {
    --- End diff --
    
    "Impl"  in an interface name sounds odd.


---

[GitHub] drill pull request #866: DRILL-5657: Implement size-aware result set loader

Posted by bitblender <gi...@git.apache.org>.
Github user bitblender commented on a diff in the pull request:

    https://github.com/apache/drill/pull/866#discussion_r130250994
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/rowSet/ResultSetLoader.java ---
    @@ -0,0 +1,170 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.drill.exec.physical.rowSet;
    +
    +import org.apache.drill.exec.record.VectorContainer;
    +
    +/**
    + * Builds a result set (series of zero or more row sets) based on a defined
    + * schema which may
    + * evolve (expand) over time. Automatically rolls "overflow" rows over
    + * when a batch fills.
    + * <p>
    + * Many of the methods in this interface are verify that the loader is
    --- End diff --
    
    "to verify"


---