You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2017/05/19 20:42:04 UTC

[jira] [Created] (DRILL-5529) Repeated vectors missing "fill empties" logic

Paul Rogers created DRILL-5529:
----------------------------------

             Summary: Repeated vectors missing "fill empties" logic
                 Key: DRILL-5529
                 URL: https://issues.apache.org/jira/browse/DRILL-5529
             Project: Apache Drill
          Issue Type: Bug
    Affects Versions: 1.8.0
            Reporter: Paul Rogers
            Assignee: Paul Rogers
             Fix For: 1.11.0


Consider the Drill {{OptionalVarCharVector}} type. This vector is composed of three buffers (also called vectors):

* Is-set (bit) vector: contains 1 if the value is set, 0 if it is null.
* Data vector; effectively a byte array in which each value is packed one after another.
*  Offset vector, in which the entry for each row points to the first byte of the value in the data vector.

Suppose we have the values "foo", null, "bar". Then, the vectors contain:

{code}
Is-Set: [1 0 1]
Offsets: [0 3 3 6]
Data:  [f o o b a r]
{code}

(Yes, there is one more offset entry than rows.)

Suppose that the code creating the vector writes values for rows 1 and 3, but omits 2 (it is null, which is the default). How do we get that required value of 3 in the entry for row 2? The answer is that the logic for setting a value keeps track of the last write position and "backfills" missing offset values:

{code}
    public void setSafe(int index, ByteBuffer value, int start, int length) {
      if (index > lastSet + 1) {
        fillEmpties(index);
      }
      ...
{code}

So, when we write the value for row 3 ("bar") we back-fill the missing offset for row 2. So far so good.

We can now generalize. We must to the same trick any time that we use a vector that uses an offset vector. There are three other cases:

* Required variable-width vectors (where a missing value is the same as an empty string).
* A repeated fixed-width vector.
* A repeated variable-width vector (which has *two* offset vectors).

The problem is, none of these actually provide the required code. The caller must implement its own back-fill logic else the offset vectors become corrupted.

Consider the required {{VarCharVector}}:

{code}
    protected void set(int index, byte[] bytes, int start, int length) {
      assert index >= 0;
      final int currentOffset = offsetVector.getAccessor().get(index);
      offsetVector.getMutator().set(index + 1, currentOffset + length);
      data.setBytes(currentOffset, bytes, start, length);
    }
{code}

As a result of this omission, any client which skips null values will corrupt offset vectors. Consider an example: "try", "foo", "", "bar". We omit writing record 2 (empty string). Desired result:

{code}
Data: [t r y f o o b a r]
Offsets: [0 3 6 6 9]
{code}

Actual result:

{code}
Data: [t r y f o o b a r]
Offsets: [0 3 6 0 9]
{code}

The result is that we compute the width of field 2 as -6, not 3. The value of the empty field is 9, not 0.

A similar issue arrises with repeated vectors. Consider {{RepeatedVarCharVector}}:

{code}
    public void addSafe(int index, byte[] bytes, int start, int length) {
      final int nextOffset = offsets.getAccessor().get(index+1);
      values.getMutator().setSafe(nextOffset, bytes, start, length);
      offsets.getMutator().setSafe(index+1, nextOffset+1);
    }
{code}

Consider this example: (\["a", "b"], \[ ], \["d", "e"]).

Expected:
{code}
Array Offset: [0 2 2 4]
Value Offset: [0 1 2 3 4]
Data: [a b d e]
{code}

Actual:
{code}
Array Offset: [0 2 0 4]
Value Offset: [0 1 2 3 4]
Data: [a b d e]
{code}

The entry for the (unwritten) position 2 is missing.

This bug may be the root cause of several other issues found recently. (Potentially DRILL-5470 -- need to verify.)

Two resolutions are possible:

* Require that client code write all values, backfilling empty or null values as needed.
* Generalize the mutators to back-fill in all cases, not just a nullable var char.

A related issue occurs when a reader fails to do a "final fill" at the end of a batch (DRILL-5487).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)