You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by paul-rogers <gi...@git.apache.org> on 2018/03/12 00:35:05 UTC

[GitHub] drill pull request #1161: DRILL-6230: Extend row set readers to handle hyper...

GitHub user paul-rogers opened a pull request:

    https://github.com/apache/drill/pull/1161

    DRILL-6230: Extend row set readers to handle hyper vectors

    The current row set readers have incomplete support for hyper-vectors. To add full support, we need an interface that supports either single batches or hyper batches. Accessing vectors in hyper batches differs depending on whether the vector is at the top level or is nested. See this post for details. Also includes a simpler reader template: replaces the original three classes with one, in parallel with the writers.
    
    Key changes:
    
    * Refactor the readers to generate just the required reader, then build up the optional and repeated readers as layers on top of the generated reader. This is the same structure that the writers already use.
    * Add and test support for hyper-vectors.
    * Extend the existing "vector accessor" abstraction to fully support the highly complex process of locating nested vectors (those within a map or union) in a hyper-batch.
    * Introduce the idea of a "null state" abstraction to handle the messy null handling in unions and repeated lists.
    * Modifies tests as needed for the new internal format of vector readers.
    
    To keep the PR from getting overly large, this PR strips out the actual union and list support. That support will be added in a future PR. Similarly, there are matching changes to writers that will also be done in a separate PR.
    
    Other minor changes:
    
    * Revises the previous utility PR. In some cases, it turns out to be cleaner to use a separate `mapValue()` function instead of `objArray()`, even though both produce an object array. Calling it `mapValue()` makes it a bit clearer what we're trying to accomplish.
    
    This PR is not needed for Drill 1.13; it can go into Drill 1.14.
    
    See [this post](https://github.com/paul-rogers/drill/wiki/Batch-Handling-Upgrades) for details of the end-state toward which this PR is one step.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/paul-rogers/drill DRILL-6230

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/1161.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1161
    
----
commit e9891c561088d2c79ab1758dc857a8a52ec253ac
Author: Paul Rogers <pr...@...>
Date:   2018-03-11T07:43:36Z

    Accessor revisions

commit 6f6e3eb803793d71a5e8dba8362737bac66d923c
Author: Paul Rogers <pr...@...>
Date:   2018-03-11T22:41:42Z

    Merge of exec row set readers & tests

commit 65cd6205ea8e85ac4e001634ffa24268a57ce273
Author: Paul Rogers <pr...@...>
Date:   2018-03-12T00:23:35Z

    Fixed tests to remove work not in this PR

----


---

[GitHub] drill issue #1161: DRILL-6230: Extend row set readers to handle hyper vector...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/1161
  
    @ppadma, I believe you are working in this area, can you please take a look at your convenience? Thanks. 


---

[GitHub] drill issue #1161: DRILL-6230: Extend row set readers to handle hyper vector...

Posted by ppadma <gi...@git.apache.org>.
Github user ppadma commented on the issue:

    https://github.com/apache/drill/pull/1161
  
    @paul-rogers ran the tests. they all pass. 


---

[GitHub] drill issue #1161: DRILL-6230: Extend row set readers to handle hyper vector...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/1161
  
    Rebased on latest master. Ran Maven unit tests; they passed up to Mongo.
    
    @ppadma, please run against the pre-commit tests
    
    @parthchandra, can you give this one a committer review? It builds on some commits you looked at previously.
    
    Thanks!  


---

[GitHub] drill pull request #1161: DRILL-6230: Extend row set readers to handle hyper...

Posted by ppadma <gi...@git.apache.org>.
Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1161#discussion_r177564887
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/rowSet/model/ReaderIndex.java ---
    @@ -28,26 +28,30 @@
     
     public abstract class ReaderIndex implements ColumnReaderIndex {
     
    -  protected int rowIndex = -1;
    +  protected int position = -1;
       protected final int rowCount;
     
       public ReaderIndex(int rowCount) {
         this.rowCount = rowCount;
       }
     
    -  public int position() { return rowIndex; }
    -  public void set(int index) { rowIndex = index; }
    +  public void set(int index) {
    +    assert position >= -1 && position <= rowCount;
    +    position = index;
    +  }
    +
    +  @Override
    +  public int logicalIndex() { return position; }
    +
    +  @Override
    +  public int size() { return rowCount; }
     
    +  @Override
       public boolean next() {
    -    if (++rowIndex < rowCount ) {
    +    if (++position < rowCount) {
           return true;
    -    } else {
    -      rowIndex--;
    -      return false;
         }
    +    position = rowCount;
    --- End diff --
    
    is there a need to set position to rowcount ? It will come here when position = rowcount 


---

[GitHub] drill pull request #1161: DRILL-6230: Extend row set readers to handle hyper...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1161#discussion_r177846274
  
    --- Diff: exec/java-exec/src/test/java/org/apache/drill/test/rowSet/HyperRowSetImpl.java ---
    @@ -45,8 +50,67 @@ public RowSetReader buildReader(HyperRowSet rowSet, SelectionVector4 sv4) {
           TupleMetadata schema = rowSet.schema();
           HyperRowIndex rowIndex = new HyperRowIndex(sv4);
           return new RowSetReaderImpl(schema, rowIndex,
    -          buildContainerChildren(rowSet.container(),
    -              new MetadataRetrieval(schema)));
    +          buildContainerChildren(rowSet.container(), schema));
    +    }
    +  }
    +
    +  public static class HyperRowSetBuilderImpl implements HyperRowSetBuilder {
    +
    +    private final BufferAllocator allocator;
    +    private final List<VectorContainer> batches = new ArrayList<>();
    +    private int totalRowCount;
    +
    +    public HyperRowSetBuilderImpl(BufferAllocator allocator) {
    +      this.allocator = allocator;
    +    }
    +
    +    @Override
    +    public void addBatch(SingleRowSet rowSet) {
    +      if (rowSet.rowCount() == 0) {
    +        return;
    +      }
    +      if (rowSet.indirectionType() != SelectionVectorMode.NONE) {
    +        throw new IllegalArgumentException("Batches must not have a selection vector.");
    +      }
    +      batches.add(rowSet.container());
    +      totalRowCount += rowSet.rowCount();
    +    }
    +
    +    @Override
    +    public void addBatch(VectorContainer container) {
    +      if (container.getRecordCount() == 0) {
    +        return;
    +      }
    +      if (container.getSchema().getSelectionVectorMode() != SelectionVectorMode.NONE) {
    +        throw new IllegalArgumentException("Batches must not have a selection vector.");
    +      }
    +      batches.add(container);
    +      totalRowCount += container.getRecordCount();
    +    }
    +
    +    @SuppressWarnings("resource")
    +    @Override
    +    public HyperRowSet build() throws SchemaChangeException {
    +      SelectionVector4 sv4 = new SelectionVector4(allocator, totalRowCount);
    +      ExpandableHyperContainer hyperContainer = new ExpandableHyperContainer();
    +      for (VectorContainer container : batches) {
    +        hyperContainer.addBatch(container);
    +      }
    +
    +      // TODO: This has a bug. If the hyperset has two batches with unions,
    +      // and the first union contains only VARCHAR, while the second contains
    --- End diff --
    
    As noted in the remainder of the comment: "This is only a theoretical bug as Drill does not support unions completely." That is, trying to make this case work now would require us to write special code just to set up the case, as no existing scan operators could ever produce this case. What we need is a more general JIRA: either fix unions, or deprecate them. Having a half-baked feature is quite hard to reason about.


---

[GitHub] drill issue #1161: DRILL-6230: Extend row set readers to handle hyper vector...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/1161
  
    @ppadma , any partial comments that I can start to take a look at? There are still a number of PRs in this chain and it would be great if we could keep things ticking along... Thanks!


---

[GitHub] drill pull request #1161: DRILL-6230: Extend row set readers to handle hyper...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1161#discussion_r178445424
  
    --- Diff: exec/java-exec/src/test/java/org/apache/drill/test/rowSet/HyperRowSetImpl.java ---
    @@ -45,8 +50,67 @@ public RowSetReader buildReader(HyperRowSet rowSet, SelectionVector4 sv4) {
           TupleMetadata schema = rowSet.schema();
           HyperRowIndex rowIndex = new HyperRowIndex(sv4);
           return new RowSetReaderImpl(schema, rowIndex,
    -          buildContainerChildren(rowSet.container(),
    -              new MetadataRetrieval(schema)));
    +          buildContainerChildren(rowSet.container(), schema));
    +    }
    +  }
    +
    +  public static class HyperRowSetBuilderImpl implements HyperRowSetBuilder {
    +
    +    private final BufferAllocator allocator;
    +    private final List<VectorContainer> batches = new ArrayList<>();
    +    private int totalRowCount;
    +
    +    public HyperRowSetBuilderImpl(BufferAllocator allocator) {
    +      this.allocator = allocator;
    +    }
    +
    +    @Override
    +    public void addBatch(SingleRowSet rowSet) {
    +      if (rowSet.rowCount() == 0) {
    +        return;
    +      }
    +      if (rowSet.indirectionType() != SelectionVectorMode.NONE) {
    +        throw new IllegalArgumentException("Batches must not have a selection vector.");
    +      }
    +      batches.add(rowSet.container());
    +      totalRowCount += rowSet.rowCount();
    +    }
    +
    +    @Override
    +    public void addBatch(VectorContainer container) {
    +      if (container.getRecordCount() == 0) {
    +        return;
    +      }
    +      if (container.getSchema().getSelectionVectorMode() != SelectionVectorMode.NONE) {
    +        throw new IllegalArgumentException("Batches must not have a selection vector.");
    +      }
    +      batches.add(container);
    +      totalRowCount += container.getRecordCount();
    +    }
    +
    +    @SuppressWarnings("resource")
    +    @Override
    +    public HyperRowSet build() throws SchemaChangeException {
    +      SelectionVector4 sv4 = new SelectionVector4(allocator, totalRowCount);
    +      ExpandableHyperContainer hyperContainer = new ExpandableHyperContainer();
    +      for (VectorContainer container : batches) {
    +        hyperContainer.addBatch(container);
    +      }
    +
    +      // TODO: This has a bug. If the hyperset has two batches with unions,
    +      // and the first union contains only VARCHAR, while the second contains
    --- End diff --
    
    [DRILL-6304](https://issues.apache.org/jira/browse/DRILL-6304)


---

[GitHub] drill pull request #1161: DRILL-6230: Extend row set readers to handle hyper...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/drill/pull/1161


---

[GitHub] drill pull request #1161: DRILL-6230: Extend row set readers to handle hyper...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1161#discussion_r178445472
  
    --- Diff: exec/java-exec/src/test/java/org/apache/drill/test/rowSet/test/TestHyperVectorReaders.java ---
    @@ -0,0 +1,365 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.drill.test.rowSet.test;
    +
    +import static org.apache.drill.test.rowSet.RowSetUtilities.mapArray;
    +import static org.apache.drill.test.rowSet.RowSetUtilities.mapValue;
    +import static org.apache.drill.test.rowSet.RowSetUtilities.strArray;
    +import static org.junit.Assert.assertEquals;
    +import static org.junit.Assert.assertFalse;
    +import static org.junit.Assert.assertTrue;
    +
    +import org.apache.drill.common.types.TypeProtos.MinorType;
    +import org.apache.drill.exec.record.metadata.TupleMetadata;
    +import org.apache.drill.exec.record.selection.SelectionVector4;
    +import org.apache.drill.test.SubOperatorTest;
    +import org.apache.drill.test.rowSet.HyperRowSetImpl;
    +import org.apache.drill.test.rowSet.RowSet.ExtendableRowSet;
    +import org.apache.drill.test.rowSet.RowSet.HyperRowSet;
    +import org.apache.drill.test.rowSet.RowSet.SingleRowSet;
    +import org.apache.drill.test.rowSet.RowSetBuilder;
    +import org.apache.drill.test.rowSet.RowSetReader;
    +import org.apache.drill.test.rowSet.RowSetUtilities;
    +import org.apache.drill.test.rowSet.RowSetWriter;
    +import org.apache.drill.test.rowSet.schema.SchemaBuilder;
    +import org.junit.Test;
    +
    +/**
    + * Test the reader mechanism that reads rows indexed via an SV4.
    + * SV4's introduce an additional level of indexing: each row may
    + * come from a different batch. The readers use the SV4 to find
    + * the root batch and vector, then must navigate downward from that
    + * vector for maps, repeated maps, lists, unions, repeated lists,
    + * nullable vectors and variable-length vectors.
    + * <p>
    + * This test does not cover repeated vectors; those tests should be added.
    --- End diff --
    
    [DRILL-6305](https://issues.apache.org/jira/browse/DRILL-6305)


---

[GitHub] drill pull request #1161: DRILL-6230: Extend row set readers to handle hyper...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1161#discussion_r177833383
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/rowSet/model/ReaderIndex.java ---
    @@ -28,26 +28,30 @@
     
     public abstract class ReaderIndex implements ColumnReaderIndex {
     
    -  protected int rowIndex = -1;
    +  protected int position = -1;
       protected final int rowCount;
     
       public ReaderIndex(int rowCount) {
         this.rowCount = rowCount;
       }
     
    -  public int position() { return rowIndex; }
    -  public void set(int index) { rowIndex = index; }
    +  public void set(int index) {
    +    assert position >= -1 && position <= rowCount;
    +    position = index;
    +  }
    +
    +  @Override
    +  public int logicalIndex() { return position; }
    +
    +  @Override
    +  public int size() { return rowCount; }
     
    +  @Override
       public boolean next() {
    -    if (++rowIndex < rowCount ) {
    +    if (++position < rowCount) {
           return true;
    -    } else {
    -      rowIndex--;
    -      return false;
         }
    +    position = rowCount;
    --- End diff --
    
    Not entirely necessary. This is more for safety to force the position to a known good value. But, I can remove it if desired.


---

[GitHub] drill pull request #1161: DRILL-6230: Extend row set readers to handle hyper...

Posted by ppadma <gi...@git.apache.org>.
Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1161#discussion_r177182331
  
    --- Diff: exec/java-exec/src/test/java/org/apache/drill/test/rowSet/test/TestHyperVectorReaders.java ---
    @@ -0,0 +1,365 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.drill.test.rowSet.test;
    +
    +import static org.apache.drill.test.rowSet.RowSetUtilities.mapArray;
    +import static org.apache.drill.test.rowSet.RowSetUtilities.mapValue;
    +import static org.apache.drill.test.rowSet.RowSetUtilities.strArray;
    +import static org.junit.Assert.assertEquals;
    +import static org.junit.Assert.assertFalse;
    +import static org.junit.Assert.assertTrue;
    +
    +import org.apache.drill.common.types.TypeProtos.MinorType;
    +import org.apache.drill.exec.record.metadata.TupleMetadata;
    +import org.apache.drill.exec.record.selection.SelectionVector4;
    +import org.apache.drill.test.SubOperatorTest;
    +import org.apache.drill.test.rowSet.HyperRowSetImpl;
    +import org.apache.drill.test.rowSet.RowSet.ExtendableRowSet;
    +import org.apache.drill.test.rowSet.RowSet.HyperRowSet;
    +import org.apache.drill.test.rowSet.RowSet.SingleRowSet;
    +import org.apache.drill.test.rowSet.RowSetBuilder;
    +import org.apache.drill.test.rowSet.RowSetReader;
    +import org.apache.drill.test.rowSet.RowSetUtilities;
    +import org.apache.drill.test.rowSet.RowSetWriter;
    +import org.apache.drill.test.rowSet.schema.SchemaBuilder;
    +import org.junit.Test;
    +
    +/**
    + * Test the reader mechanism that reads rows indexed via an SV4.
    + * SV4's introduce an additional level of indexing: each row may
    + * come from a different batch. The readers use the SV4 to find
    + * the root batch and vector, then must navigate downward from that
    + * vector for maps, repeated maps, lists, unions, repeated lists,
    + * nullable vectors and variable-length vectors.
    + * <p>
    + * This test does not cover repeated vectors; those tests should be added.
    --- End diff --
    
    please file a JIRA for this.


---

[GitHub] drill pull request #1161: DRILL-6230: Extend row set readers to handle hyper...

Posted by ppadma <gi...@git.apache.org>.
Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1161#discussion_r177175974
  
    --- Diff: exec/java-exec/src/test/java/org/apache/drill/test/rowSet/HyperRowSetImpl.java ---
    @@ -45,8 +50,67 @@ public RowSetReader buildReader(HyperRowSet rowSet, SelectionVector4 sv4) {
           TupleMetadata schema = rowSet.schema();
           HyperRowIndex rowIndex = new HyperRowIndex(sv4);
           return new RowSetReaderImpl(schema, rowIndex,
    -          buildContainerChildren(rowSet.container(),
    -              new MetadataRetrieval(schema)));
    +          buildContainerChildren(rowSet.container(), schema));
    +    }
    +  }
    +
    +  public static class HyperRowSetBuilderImpl implements HyperRowSetBuilder {
    +
    +    private final BufferAllocator allocator;
    +    private final List<VectorContainer> batches = new ArrayList<>();
    +    private int totalRowCount;
    +
    +    public HyperRowSetBuilderImpl(BufferAllocator allocator) {
    +      this.allocator = allocator;
    +    }
    +
    +    @Override
    +    public void addBatch(SingleRowSet rowSet) {
    +      if (rowSet.rowCount() == 0) {
    +        return;
    +      }
    +      if (rowSet.indirectionType() != SelectionVectorMode.NONE) {
    +        throw new IllegalArgumentException("Batches must not have a selection vector.");
    +      }
    +      batches.add(rowSet.container());
    +      totalRowCount += rowSet.rowCount();
    +    }
    +
    +    @Override
    +    public void addBatch(VectorContainer container) {
    +      if (container.getRecordCount() == 0) {
    +        return;
    +      }
    +      if (container.getSchema().getSelectionVectorMode() != SelectionVectorMode.NONE) {
    +        throw new IllegalArgumentException("Batches must not have a selection vector.");
    +      }
    +      batches.add(container);
    +      totalRowCount += container.getRecordCount();
    +    }
    +
    +    @SuppressWarnings("resource")
    +    @Override
    +    public HyperRowSet build() throws SchemaChangeException {
    +      SelectionVector4 sv4 = new SelectionVector4(allocator, totalRowCount);
    +      ExpandableHyperContainer hyperContainer = new ExpandableHyperContainer();
    +      for (VectorContainer container : batches) {
    +        hyperContainer.addBatch(container);
    +      }
    +
    +      // TODO: This has a bug. If the hyperset has two batches with unions,
    +      // and the first union contains only VARCHAR, while the second contains
    --- End diff --
    
    is there a JIRA for this bug ?


---

[GitHub] drill issue #1161: DRILL-6230: Extend row set readers to handle hyper vector...

Posted by ppadma <gi...@git.apache.org>.
Github user ppadma commented on the issue:

    https://github.com/apache/drill/pull/1161
  
    @paul-rogers I started the review. will get back soon.


---