You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by ilooner <gi...@git.apache.org> on 2018/03/14 00:26:34 UTC

[GitHub] drill pull request #1164: DRILL-6234: Improved documentation for VariableWid...

GitHub user ilooner opened a pull request:

    https://github.com/apache/drill/pull/1164

    DRILL-6234: Improved documentation for VariableWidthVector mutators

    I had some confusion about how setValueCount should behave for variable width vectors. I added some documentation and unit tests which explain its behavior so that others don't waste time in the future.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ilooner/drill DRILL-6234

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/1164.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1164
    
----
commit 4f13fd2873a9b510a9d0105ad87f72792aa46314
Author: Timothy Farkas <ti...@...>
Date:   2018-03-14T00:24:28Z

    DRILL-6234: Improved documentation for VariableWidthVector mutators, and added simple unit tests demonstrating mutator behavior.

----


---

[GitHub] drill issue #1164: DRILL-6234: Improved documentation for VariableWidthVecto...

Posted by ilooner <gi...@git.apache.org>.
Github user ilooner commented on the issue:

    https://github.com/apache/drill/pull/1164
  
    @paul-rogers Applied review comments.


---

[GitHub] drill pull request #1164: DRILL-6234: Improved documentation for VariableWid...

Posted by ilooner <gi...@git.apache.org>.
Github user ilooner commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1164#discussion_r174349801
  
    --- Diff: exec/vector/src/main/codegen/templates/VariableLengthVectors.java ---
    @@ -514,6 +516,22 @@ public boolean isNull(int index){
        *   The equivalent Java primitive is '${minor.javaType!type.javaType}'
        *
        * NB: this class is automatically generated from ValueVectorTypes.tdd using FreeMarker.
    +   * </p>
    +   * <h2>Contract</h2>
    +   * <p>
    +   *   Variable length vectors do not support random writes. All set methods must be called for with a monotonically increasing consecutive sequence of indexes.
    +   *   It is possible to trim the vector by setting the value count to be less than the number of values currently container in the vector with {@link #setValueCount(int)}, then
    +   *   the process of setting values starts with the index after the last index.
    +   * </p>
    +   * <p>
    +   *   It is also possible to back track and set the value at an index earlier than the current index, however, the caller must assume that all data container after the last
    --- End diff --
    
    Updated


---

[GitHub] drill pull request #1164: DRILL-6234: Improved documentation for VariableWid...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1164#discussion_r174327511
  
    --- Diff: exec/vector/src/main/codegen/templates/VariableLengthVectors.java ---
    @@ -514,6 +516,22 @@ public boolean isNull(int index){
        *   The equivalent Java primitive is '${minor.javaType!type.javaType}'
        *
        * NB: this class is automatically generated from ValueVectorTypes.tdd using FreeMarker.
    +   * </p>
    +   * <h2>Contract</h2>
    +   * <p>
    +   *   Variable length vectors do not support random writes. All set methods must be called for with a monotonically increasing consecutive sequence of indexes.
    --- End diff --
    
    This is very important to know. This is why spill-to-disk for hash agg will eventually cause a serious customer failure. Aggregate UDFs write to vectors to store intermediate group values. A "max" string can't. Instead, it writes to a Java object. That object will be lost on spill and reread. Will result in loosing prior max values and the aggregate starting over.
    
    So, this little note is not just a nuisance, it is the fatal flaw in how we handle the (albeit obscure) string aggregate values.


---

[GitHub] drill pull request #1164: DRILL-6234: Improved documentation for VariableWid...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1164#discussion_r174328662
  
    --- Diff: exec/vector/src/test/java/org/apache/drill/exec/vector/VariableLengthVectorTest.java ---
    @@ -0,0 +1,152 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.drill.exec.vector;
    +
    +import org.apache.drill.common.types.TypeProtos;
    +import org.apache.drill.common.types.Types;
    +import org.apache.drill.exec.memory.RootAllocator;
    +import org.apache.drill.exec.record.MaterializedField;
    +import org.junit.Assert;
    +import org.junit.Test;
    +
    +/**
    + * This test uses {@link VarCharVector} to test the template code in VariableLengthVector.
    + */
    +public class VariableLengthVectorTest
    +{
    +  /**
    +   * If the vector contains 1000 records, setting a value count of 1000 should work.
    +   */
    +  @Test
    +  public void testSettingSameValueCount()
    +  {
    +    try (RootAllocator allocator = new RootAllocator(10_000_000)) {
    +      final MaterializedField field = MaterializedField.create("stringCol", Types.required(TypeProtos.MinorType.VARCHAR));
    +      final VarCharVector vector = new VarCharVector(field, allocator);
    +
    +      vector.allocateNew();
    +
    +      try {
    +        final int size = 1000;
    +        final VarCharVector.Mutator mutator = vector.getMutator();
    +        final VarCharVector.Accessor accessor = vector.getAccessor();
    +
    +        setSafeIndexStrings("", 0, size, mutator);
    +
    +        mutator.setValueCount(size);
    +        Assert.assertEquals(size, accessor.getValueCount());
    +        checkIndexStrings("", 0, size, accessor);
    +      } finally {
    +        vector.clear();
    +      }
    +    }
    +  }
    +
    +  /**
    +   * Test trunicating data. If you have 10000 records, reduce the vector to 1000 records.
    +   */
    +  @Test
    +  public void testTrunicateVectorSetValueCount()
    +  {
    +    try (RootAllocator allocator = new RootAllocator(10_000_000)) {
    +      final MaterializedField field = MaterializedField.create("stringCol", Types.required(TypeProtos.MinorType.VARCHAR));
    +      final VarCharVector vector = new VarCharVector(field, allocator);
    +
    +      vector.allocateNew();
    +
    +      try {
    +        final int size = 1000;
    +        final int fluffSize = 10000;
    +        final VarCharVector.Mutator mutator = vector.getMutator();
    +        final VarCharVector.Accessor accessor = vector.getAccessor();
    +
    +        setSafeIndexStrings("", 0, size, mutator);
    +        setSafeIndexStrings("first cut ", size, fluffSize, mutator);
    +
    +        mutator.setValueCount(fluffSize);
    +        Assert.assertEquals(fluffSize, accessor.getValueCount());
    +
    +        mutator.setValueCount(size);
    +        Assert.assertEquals(size, accessor.getValueCount());
    +        setSafeIndexStrings("redone cut ", size, fluffSize, mutator);
    +        mutator.setValueCount(fluffSize);
    +        Assert.assertEquals(fluffSize, accessor.getValueCount());
    +
    +        mutator.setValueCount(size);
    +        Assert.assertEquals(size, accessor.getValueCount());
    +
    +        checkIndexStrings("", 0, size, accessor);
    +
    +      } finally {
    +        vector.clear();
    +      }
    +    }
    +  }
    +
    +  /**
    +   * Set 10000 values. Then go back and set new values starting at the 1001 the record.
    --- End diff --
    
    Just FYI: the vector writers handle all this stuff for you. They allow overwriting the most recent value. They keep track of vector counts and data offsets. And so on. This is why I can offer such detailed comments: I learned how all this works when creating those classes. Would be very wonderful to start reusing that work rather than creating multiple implementations of the same thing... (Here, however, I recognize that the random access pattern wanted in your project does not match the sequential approach taken by the writers. But, as you are seeing, we can't really implement random access anyway...)


---

[GitHub] drill pull request #1164: DRILL-6234: Improved documentation for VariableWid...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/drill/pull/1164


---

[GitHub] drill pull request #1164: DRILL-6234: Improved documentation for VariableWid...

Posted by ilooner <gi...@git.apache.org>.
Github user ilooner commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1164#discussion_r174349826
  
    --- Diff: exec/vector/src/main/codegen/templates/VariableLengthVectors.java ---
    @@ -514,6 +516,22 @@ public boolean isNull(int index){
        *   The equivalent Java primitive is '${minor.javaType!type.javaType}'
        *
        * NB: this class is automatically generated from ValueVectorTypes.tdd using FreeMarker.
    +   * </p>
    +   * <h2>Contract</h2>
    +   * <p>
    +   *   Variable length vectors do not support random writes. All set methods must be called for with a monotonically increasing consecutive sequence of indexes.
    +   *   It is possible to trim the vector by setting the value count to be less than the number of values currently container in the vector with {@link #setValueCount(int)}, then
    +   *   the process of setting values starts with the index after the last index.
    +   * </p>
    +   * <p>
    +   *   It is also possible to back track and set the value at an index earlier than the current index, however, the caller must assume that all data container after the last
    +   *   set index is corrupted.
    --- End diff --
    
    Added


---

[GitHub] drill pull request #1164: DRILL-6234: Improved documentation for VariableWid...

Posted by ilooner <gi...@git.apache.org>.
Github user ilooner commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1164#discussion_r174349986
  
    --- Diff: exec/vector/src/main/codegen/templates/VariableLengthVectors.java ---
    @@ -514,6 +516,22 @@ public boolean isNull(int index){
        *   The equivalent Java primitive is '${minor.javaType!type.javaType}'
        *
        * NB: this class is automatically generated from ValueVectorTypes.tdd using FreeMarker.
    +   * </p>
    +   * <h2>Contract</h2>
    --- End diff --
    
    This is a good way to layout the information. I switched the javadoc to follow this outline.


---

[GitHub] drill pull request #1164: DRILL-6234: Improved documentation for VariableWid...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1164#discussion_r174327027
  
    --- Diff: exec/vector/src/main/codegen/templates/VariableLengthVectors.java ---
    @@ -506,6 +506,8 @@ public boolean isNull(int index){
       }
     
       /**
    +   * <h2>Overview</h2>
    --- End diff --
    
    Nit, but I think that h4 is the usual heading level used in Java doc comments. The higher levels are used in the surrounding generated HTML.


---

[GitHub] drill pull request #1164: DRILL-6234: Improved documentation for VariableWid...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1164#discussion_r174327132
  
    --- Diff: exec/vector/src/main/codegen/templates/VariableLengthVectors.java ---
    @@ -514,6 +516,22 @@ public boolean isNull(int index){
        *   The equivalent Java primitive is '${minor.javaType!type.javaType}'
        *
        * NB: this class is automatically generated from ValueVectorTypes.tdd using FreeMarker.
    +   * </p>
    +   * <h2>Contract</h2>
    +   * <p>
    +   *   Variable length vectors do not support random writes. All set methods must be called for with a monotonically increasing consecutive sequence of indexes.
    +   *   It is possible to trim the vector by setting the value count to be less than the number of values currently container in the vector with {@link #setValueCount(int)}, then
    +   *   the process of setting values starts with the index after the last index.
    +   * </p>
    +   * <p>
    +   *   It is also possible to back track and set the value at an index earlier than the current index, however, the caller must assume that all data container after the last
    --- End diff --
    
    "all data container after" --> "all data after the updated index" ?


---

[GitHub] drill pull request #1164: DRILL-6234: Improved documentation for VariableWid...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1164#discussion_r174328025
  
    --- Diff: exec/vector/src/main/codegen/templates/VariableLengthVectors.java ---
    @@ -514,6 +516,22 @@ public boolean isNull(int index){
        *   The equivalent Java primitive is '${minor.javaType!type.javaType}'
        *
        * NB: this class is automatically generated from ValueVectorTypes.tdd using FreeMarker.
    +   * </p>
    +   * <h2>Contract</h2>
    --- End diff --
    
    Might be worth summarizing how to use this:
    
    1) Write to values sequentially. Fixed-width vectors allow random access, but special care is needed.
    2) Keep track in client code of the total time count. Call `setValueCount()` once the vector is full to set the final value. (The vector does not know its count while a write is in progress.)
    3) Either take responsibility for allocating enough memory, or call the `setSafe()` methods to automatically extend the vector.
    4) Once vectors are written, they are immutable; no additional writes of any kind are allowed to that vector.


---

[GitHub] drill pull request #1164: DRILL-6234: Improved documentation for VariableWid...

Posted by ilooner <gi...@git.apache.org>.
Github user ilooner commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1164#discussion_r174349707
  
    --- Diff: exec/vector/src/main/codegen/templates/VariableLengthVectors.java ---
    @@ -506,6 +506,8 @@ public boolean isNull(int index){
       }
     
       /**
    +   * <h2>Overview</h2>
    --- End diff --
    
    Fixed


---

[GitHub] drill pull request #1164: DRILL-6234: Improved documentation for VariableWid...

Posted by ilooner <gi...@git.apache.org>.
Github user ilooner commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1164#discussion_r174348633
  
    --- Diff: exec/vector/src/test/java/org/apache/drill/exec/vector/VariableLengthVectorTest.java ---
    @@ -0,0 +1,152 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.drill.exec.vector;
    +
    +import org.apache.drill.common.types.TypeProtos;
    +import org.apache.drill.common.types.Types;
    +import org.apache.drill.exec.memory.RootAllocator;
    +import org.apache.drill.exec.record.MaterializedField;
    +import org.junit.Assert;
    +import org.junit.Test;
    +
    +/**
    + * This test uses {@link VarCharVector} to test the template code in VariableLengthVector.
    + */
    +public class VariableLengthVectorTest
    +{
    +  /**
    +   * If the vector contains 1000 records, setting a value count of 1000 should work.
    +   */
    +  @Test
    +  public void testSettingSameValueCount()
    +  {
    +    try (RootAllocator allocator = new RootAllocator(10_000_000)) {
    +      final MaterializedField field = MaterializedField.create("stringCol", Types.required(TypeProtos.MinorType.VARCHAR));
    +      final VarCharVector vector = new VarCharVector(field, allocator);
    +
    +      vector.allocateNew();
    +
    +      try {
    +        final int size = 1000;
    +        final VarCharVector.Mutator mutator = vector.getMutator();
    +        final VarCharVector.Accessor accessor = vector.getAccessor();
    +
    +        setSafeIndexStrings("", 0, size, mutator);
    +
    +        mutator.setValueCount(size);
    +        Assert.assertEquals(size, accessor.getValueCount());
    +        checkIndexStrings("", 0, size, accessor);
    +      } finally {
    +        vector.clear();
    +      }
    +    }
    +  }
    +
    +  /**
    +   * Test trunicating data. If you have 10000 records, reduce the vector to 1000 records.
    +   */
    +  @Test
    +  public void testTrunicateVectorSetValueCount()
    +  {
    +    try (RootAllocator allocator = new RootAllocator(10_000_000)) {
    +      final MaterializedField field = MaterializedField.create("stringCol", Types.required(TypeProtos.MinorType.VARCHAR));
    +      final VarCharVector vector = new VarCharVector(field, allocator);
    +
    +      vector.allocateNew();
    +
    +      try {
    +        final int size = 1000;
    +        final int fluffSize = 10000;
    +        final VarCharVector.Mutator mutator = vector.getMutator();
    +        final VarCharVector.Accessor accessor = vector.getAccessor();
    +
    +        setSafeIndexStrings("", 0, size, mutator);
    +        setSafeIndexStrings("first cut ", size, fluffSize, mutator);
    +
    +        mutator.setValueCount(fluffSize);
    +        Assert.assertEquals(fluffSize, accessor.getValueCount());
    +
    +        mutator.setValueCount(size);
    +        Assert.assertEquals(size, accessor.getValueCount());
    +        setSafeIndexStrings("redone cut ", size, fluffSize, mutator);
    +        mutator.setValueCount(fluffSize);
    +        Assert.assertEquals(fluffSize, accessor.getValueCount());
    +
    +        mutator.setValueCount(size);
    +        Assert.assertEquals(size, accessor.getValueCount());
    +
    +        checkIndexStrings("", 0, size, accessor);
    +
    +      } finally {
    +        vector.clear();
    +      }
    +    }
    +  }
    +
    +  /**
    +   * Set 10000 values. Then go back and set new values starting at the 1001 the record.
    --- End diff --
    
    I agree the vector writers should be used. The reason why I was looking into this is that I saw strange behavior in the legacy HashTable where setValueCount was being called with a larger valueCount than there was data in a VarCharVector. I did an ugly (and now I think incorrect work around) for the issue and set about to make setValueCount support setting a valueCount larger than the number elements in the VarCharVector. Now I am realizing the issue is with the HashTableTemplate and I need to look into why it is behaving incorrectly.
    
    Your review has been very helpful, I have a much better understanding of how value vectors should be used now. Thanks!


---

[GitHub] drill issue #1164: DRILL-6234: Improved documentation for VariableWidthVecto...

Posted by ilooner <gi...@git.apache.org>.
Github user ilooner commented on the issue:

    https://github.com/apache/drill/pull/1164
  
    @paul-rogers 


---

[GitHub] drill pull request #1164: DRILL-6234: Improved documentation for VariableWid...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1164#discussion_r174327601
  
    --- Diff: exec/vector/src/main/codegen/templates/VariableLengthVectors.java ---
    @@ -514,6 +516,22 @@ public boolean isNull(int index){
        *   The equivalent Java primitive is '${minor.javaType!type.javaType}'
        *
        * NB: this class is automatically generated from ValueVectorTypes.tdd using FreeMarker.
    +   * </p>
    +   * <h2>Contract</h2>
    +   * <p>
    +   *   Variable length vectors do not support random writes. All set methods must be called for with a monotonically increasing consecutive sequence of indexes.
    +   *   It is possible to trim the vector by setting the value count to be less than the number of values currently container in the vector with {@link #setValueCount(int)}, then
    +   *   the process of setting values starts with the index after the last index.
    +   * </p>
    +   * <p>
    +   *   It is also possible to back track and set the value at an index earlier than the current index, however, the caller must assume that all data container after the last
    +   *   set index is corrupted.
    +   * </p>
    +   * <h2>Notes</h2>
    +   * <p>
    +   *   There is no gaurantee the data buffer for the {@link VariableWidthVector} will have enough space to contain the data you set unless you use setSafe. If you
    +   *   use set you may get array index out of bounds exceptions.
    --- End diff --
    
    Said another way, either 1) be careful to manage your own memory, or 2) call `setSafe()`. That is, in fact, why `setSafe()` exists.


---

[GitHub] drill pull request #1164: DRILL-6234: Improved documentation for VariableWid...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1164#discussion_r174327182
  
    --- Diff: exec/vector/src/main/codegen/templates/VariableLengthVectors.java ---
    @@ -514,6 +516,22 @@ public boolean isNull(int index){
        *   The equivalent Java primitive is '${minor.javaType!type.javaType}'
        *
        * NB: this class is automatically generated from ValueVectorTypes.tdd using FreeMarker.
    +   * </p>
    +   * <h2>Contract</h2>
    +   * <p>
    +   *   Variable length vectors do not support random writes. All set methods must be called for with a monotonically increasing consecutive sequence of indexes.
    +   *   It is possible to trim the vector by setting the value count to be less than the number of values currently container in the vector with {@link #setValueCount(int)}, then
    +   *   the process of setting values starts with the index after the last index.
    +   * </p>
    +   * <p>
    +   *   It is also possible to back track and set the value at an index earlier than the current index, however, the caller must assume that all data container after the last
    +   *   set index is corrupted.
    --- End diff --
    
    Maybe add "changing the index does not release memory after the index."


---

[GitHub] drill pull request #1164: DRILL-6234: Improved documentation for VariableWid...

Posted by ilooner <gi...@git.apache.org>.
Github user ilooner commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1164#discussion_r174343151
  
    --- Diff: exec/vector/src/main/codegen/templates/VariableLengthVectors.java ---
    @@ -514,6 +516,22 @@ public boolean isNull(int index){
        *   The equivalent Java primitive is '${minor.javaType!type.javaType}'
        *
        * NB: this class is automatically generated from ValueVectorTypes.tdd using FreeMarker.
    +   * </p>
    +   * <h2>Contract</h2>
    +   * <p>
    +   *   Variable length vectors do not support random writes. All set methods must be called for with a monotonically increasing consecutive sequence of indexes.
    --- End diff --
    
    Thanks for bringing this up. I'm sharing a design doc on the dev list tomorrow or the day after about how I plan to refactor HashAgg. It will cover how to facilitate unit tests and how to change the memory handling to use a deterministic calculator like the SortMemoryManager and soon to be introduced HashJoinMemoryCalculator (instead of catch OOMs). Perhaps you could comment on the doc about how to set ourselves up to fix this case.


---

[GitHub] drill pull request #1164: DRILL-6234: Improved documentation for VariableWid...

Posted by ilooner <gi...@git.apache.org>.
Github user ilooner commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1164#discussion_r174350127
  
    --- Diff: exec/vector/src/test/java/org/apache/drill/exec/vector/VariableLengthVectorTest.java ---
    @@ -0,0 +1,152 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.drill.exec.vector;
    +
    +import org.apache.drill.common.types.TypeProtos;
    +import org.apache.drill.common.types.Types;
    +import org.apache.drill.exec.memory.RootAllocator;
    +import org.apache.drill.exec.record.MaterializedField;
    +import org.junit.Assert;
    +import org.junit.Test;
    +
    +/**
    + * This test uses {@link VarCharVector} to test the template code in VariableLengthVector.
    + */
    +public class VariableLengthVectorTest
    +{
    +  /**
    +   * If the vector contains 1000 records, setting a value count of 1000 should work.
    +   */
    +  @Test
    +  public void testSettingSameValueCount()
    +  {
    +    try (RootAllocator allocator = new RootAllocator(10_000_000)) {
    +      final MaterializedField field = MaterializedField.create("stringCol", Types.required(TypeProtos.MinorType.VARCHAR));
    +      final VarCharVector vector = new VarCharVector(field, allocator);
    +
    +      vector.allocateNew();
    +
    +      try {
    +        final int size = 1000;
    +        final VarCharVector.Mutator mutator = vector.getMutator();
    +        final VarCharVector.Accessor accessor = vector.getAccessor();
    +
    +        setSafeIndexStrings("", 0, size, mutator);
    +
    +        mutator.setValueCount(size);
    +        Assert.assertEquals(size, accessor.getValueCount());
    +        checkIndexStrings("", 0, size, accessor);
    +      } finally {
    +        vector.clear();
    +      }
    +    }
    +  }
    +
    +  /**
    +   * Test trunicating data. If you have 10000 records, reduce the vector to 1000 records.
    +   */
    +  @Test
    +  public void testTrunicateVectorSetValueCount()
    +  {
    +    try (RootAllocator allocator = new RootAllocator(10_000_000)) {
    +      final MaterializedField field = MaterializedField.create("stringCol", Types.required(TypeProtos.MinorType.VARCHAR));
    +      final VarCharVector vector = new VarCharVector(field, allocator);
    +
    +      vector.allocateNew();
    +
    +      try {
    +        final int size = 1000;
    +        final int fluffSize = 10000;
    +        final VarCharVector.Mutator mutator = vector.getMutator();
    +        final VarCharVector.Accessor accessor = vector.getAccessor();
    +
    +        setSafeIndexStrings("", 0, size, mutator);
    +        setSafeIndexStrings("first cut ", size, fluffSize, mutator);
    +
    +        mutator.setValueCount(fluffSize);
    +        Assert.assertEquals(fluffSize, accessor.getValueCount());
    +
    +        mutator.setValueCount(size);
    +        Assert.assertEquals(size, accessor.getValueCount());
    +        setSafeIndexStrings("redone cut ", size, fluffSize, mutator);
    --- End diff --
    
    Yikes! I didn't know this. Thanks for catching.


---

[GitHub] drill pull request #1164: DRILL-6234: Improved documentation for VariableWid...

Posted by ilooner <gi...@git.apache.org>.
Github user ilooner commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1164#discussion_r174349888
  
    --- Diff: exec/vector/src/main/codegen/templates/VariableLengthVectors.java ---
    @@ -514,6 +516,22 @@ public boolean isNull(int index){
        *   The equivalent Java primitive is '${minor.javaType!type.javaType}'
        *
        * NB: this class is automatically generated from ValueVectorTypes.tdd using FreeMarker.
    +   * </p>
    +   * <h2>Contract</h2>
    +   * <p>
    +   *   Variable length vectors do not support random writes. All set methods must be called for with a monotonically increasing consecutive sequence of indexes.
    +   *   It is possible to trim the vector by setting the value count to be less than the number of values currently container in the vector with {@link #setValueCount(int)}, then
    +   *   the process of setting values starts with the index after the last index.
    +   * </p>
    +   * <p>
    +   *   It is also possible to back track and set the value at an index earlier than the current index, however, the caller must assume that all data container after the last
    +   *   set index is corrupted.
    +   * </p>
    +   * <h2>Notes</h2>
    +   * <p>
    +   *   There is no gaurantee the data buffer for the {@link VariableWidthVector} will have enough space to contain the data you set unless you use setSafe. If you
    +   *   use set you may get array index out of bounds exceptions.
    --- End diff --
    
    Liked this refactored phrasing


---

[GitHub] drill pull request #1164: DRILL-6234: Improved documentation for VariableWid...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1164#discussion_r174328361
  
    --- Diff: exec/vector/src/test/java/org/apache/drill/exec/vector/VariableLengthVectorTest.java ---
    @@ -0,0 +1,152 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.drill.exec.vector;
    +
    +import org.apache.drill.common.types.TypeProtos;
    +import org.apache.drill.common.types.Types;
    +import org.apache.drill.exec.memory.RootAllocator;
    +import org.apache.drill.exec.record.MaterializedField;
    +import org.junit.Assert;
    +import org.junit.Test;
    +
    +/**
    + * This test uses {@link VarCharVector} to test the template code in VariableLengthVector.
    + */
    +public class VariableLengthVectorTest
    +{
    +  /**
    +   * If the vector contains 1000 records, setting a value count of 1000 should work.
    +   */
    +  @Test
    +  public void testSettingSameValueCount()
    +  {
    +    try (RootAllocator allocator = new RootAllocator(10_000_000)) {
    +      final MaterializedField field = MaterializedField.create("stringCol", Types.required(TypeProtos.MinorType.VARCHAR));
    +      final VarCharVector vector = new VarCharVector(field, allocator);
    +
    +      vector.allocateNew();
    +
    +      try {
    +        final int size = 1000;
    +        final VarCharVector.Mutator mutator = vector.getMutator();
    +        final VarCharVector.Accessor accessor = vector.getAccessor();
    +
    +        setSafeIndexStrings("", 0, size, mutator);
    +
    +        mutator.setValueCount(size);
    +        Assert.assertEquals(size, accessor.getValueCount());
    +        checkIndexStrings("", 0, size, accessor);
    +      } finally {
    +        vector.clear();
    +      }
    +    }
    +  }
    +
    +  /**
    +   * Test trunicating data. If you have 10000 records, reduce the vector to 1000 records.
    +   */
    +  @Test
    +  public void testTrunicateVectorSetValueCount()
    +  {
    +    try (RootAllocator allocator = new RootAllocator(10_000_000)) {
    +      final MaterializedField field = MaterializedField.create("stringCol", Types.required(TypeProtos.MinorType.VARCHAR));
    +      final VarCharVector vector = new VarCharVector(field, allocator);
    +
    +      vector.allocateNew();
    +
    +      try {
    +        final int size = 1000;
    +        final int fluffSize = 10000;
    +        final VarCharVector.Mutator mutator = vector.getMutator();
    +        final VarCharVector.Accessor accessor = vector.getAccessor();
    +
    +        setSafeIndexStrings("", 0, size, mutator);
    +        setSafeIndexStrings("first cut ", size, fluffSize, mutator);
    +
    +        mutator.setValueCount(fluffSize);
    +        Assert.assertEquals(fluffSize, accessor.getValueCount());
    +
    +        mutator.setValueCount(size);
    +        Assert.assertEquals(size, accessor.getValueCount());
    +        setSafeIndexStrings("redone cut ", size, fluffSize, mutator);
    --- End diff --
    
    While this works, we are actually violating the vector contract which is "once the value count is set, the vector becomes immutable." If the client is not done writing to the vector, the client should maintain the value count until it is finally done.


---