You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2017/05/19 21:21:04 UTC

[jira] [Comment Edited] (DRILL-5529) Fixed width, repeated vector mutators missing "fill empties" logic

    [ https://issues.apache.org/jira/browse/DRILL-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018036#comment-16018036 ] 

Paul Rogers edited comment on DRILL-5529 at 5/19/17 9:20 PM:
-------------------------------------------------------------

Demonstration of failure case for {{VarCharVector}}. Code similar to above.

{code}
    VarCharVector.Mutator mutator = vector.getMutator();
    byte[] value = makeValue( "foo" );
    mutator.setSafe(0, value, 0, value.length);

    value = makeValue("bar");
    mutator.setSafe(2, value, 0, value.length);
{code}

Result is vector corruption:

{code}
Offsets: [0 3 0 3]
Data: [b a r]
{code}

The actual result is different from that suggested earlier because the mutator is for row 3 is looking in the offset for row 2 to find the start position. Since the start is 0, data was written at 0, overwriting the value at 1. This is an even more serious data corruption than the invalid offsets.


was (Author: paul-rogers):
Demonstration of failure case for {{VarCharVector}}. Code similar to above. Result is vector corruption:

{code}
    VarCharVector.Mutator mutator = vector.getMutator();
    byte[] value = makeValue( "foo" );
    mutator.setSafe(0, value, 0, value.length);

    value = makeValue("bar");
    mutator.setSafe(2, value, 0, value.length);
{code}

Results:

{code}
Offsets: [0 3 0 3]
Data: [b a r]
{code}

The actual result is different from that suggested earlier because the mutator is for row 3 is looking in the offset for row 2 to find the start position. Since the start is 0, data was written at 0, overwriting the value at 1. This is an even more serious data corruption than the invalid offsets.

> Fixed width, repeated vector mutators missing "fill empties" logic
> ------------------------------------------------------------------
>
>                 Key: DRILL-5529
>                 URL: https://issues.apache.org/jira/browse/DRILL-5529
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.8.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>             Fix For: 1.11.0
>
>
> Consider the Drill {{NullableVarCharVector}} type. This vector is composed of three buffers (also called vectors):
> * Is-set (bit) vector: contains 1 if the value is set, 0 if it is null.
> * Data vector; effectively a byte array in which each value is packed one after another.
> *  Offset vector, in which the entry for each row points to the first byte of the value in the data vector.
> Suppose we have the values "foo", null, "bar". Then, the vectors contain:
> {code}
> Is-Set: [1 0 1]
> Offsets: [0 3 3 6]
> Data:  [f o o b a r]
> {code}
> (Yes, there is one more offset entry than rows.)
> Suppose that the code creating the vector writes values for rows 1 and 3, but omits 2 (it is null, which is the default). How do we get that required value of 3 in the entry for row 2? The answer is that the logic for setting a value keeps track of the last write position and "backfills" missing offset values:
> {code}
>     public void setSafe(int index, ByteBuffer value, int start, int length) {
>       if (index > lastSet + 1) {
>         fillEmpties(index);
>       }
>       ...
> {code}
> So, when we write the value for row 3 ("bar") we back-fill the missing offset for row 2. So far so good.
> We can now generalize. We must to the same trick any time that we use a vector that uses an offset vector. There are three other cases:
> * Required variable-width vectors (where a missing value is the same as an empty string).
> * A repeated fixed-width vector.
> * A repeated variable-width vector (which has *two* offset vectors).
> The problem is, none of these actually provide the required code. The caller must implement its own back-fill logic else the offset vectors become corrupted.
> Consider the required {{VarCharVector}}:
> {code}
>     protected void set(int index, byte[] bytes, int start, int length) {
>       assert index >= 0;
>       final int currentOffset = offsetVector.getAccessor().get(index);
>       offsetVector.getMutator().set(index + 1, currentOffset + length);
>       data.setBytes(currentOffset, bytes, start, length);
>     }
> {code}
> As a result of this omission, any client which skips null values will corrupt offset vectors. Consider an example: "try", "foo", "", "bar". We omit writing record 2 (empty string). Desired result:
> {code}
> Data: [t r y f o o b a r]
> Offsets: [0 3 6 6 9]
> {code}
> Actual result:
> {code}
> Data: [t r y f o o b a r]
> Offsets: [0 3 6 0 9]
> {code}
> The result is that we compute the width of field 2 as -6, not 3. The value of the empty field is 9, not 0.
> A similar issue arrises with repeated vectors. Consider {{RepeatedVarCharVector}}:
> {code}
>     public void addSafe(int index, byte[] bytes, int start, int length) {
>       final int nextOffset = offsets.getAccessor().get(index+1);
>       values.getMutator().setSafe(nextOffset, bytes, start, length);
>       offsets.getMutator().setSafe(index+1, nextOffset+1);
>     }
> {code}
> Consider this example: (\["a", "b"], \[ ], \["d", "e"]).
> Expected:
> {code}
> Array Offset: [0 2 2 4]
> Value Offset: [0 1 2 3 4]
> Data: [a b d e]
> {code}
> Actual:
> {code}
> Array Offset: [0 2 0 4]
> Value Offset: [0 1 2 3 4]
> Data: [a b d e]
> {code}
> The entry for the (unwritten) position 2 is missing.
> This bug may be the root cause of several other issues found recently. (Potentially DRILL-5470 -- need to verify.)
> Two resolutions are possible:
> * Require that client code write all values, backfilling empty or null values as needed.
> * Generalize the mutators to back-fill in all cases, not just a nullable var char.
> A related issue occurs when a reader fails to do a "final fill" at the end of a batch (DRILL-5487).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)