You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2017/07/08 01:10:00 UTC

[jira] [Updated] (DRILL-5529) Fixed width, repeated vector mutators missing "fill empties" logic

     [ https://issues.apache.org/jira/browse/DRILL-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Rogers updated DRILL-5529:
-------------------------------
    Fix Version/s:     (was: 1.11.0)
                   Future

> Fixed width, repeated vector mutators missing "fill empties" logic
> ------------------------------------------------------------------
>
>                 Key: DRILL-5529
>                 URL: https://issues.apache.org/jira/browse/DRILL-5529
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.8.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>             Fix For: Future
>
>
> Consider the Drill {{NullableVarCharVector}} type. This vector is composed of three buffers (also called vectors):
> * Is-set (bit) vector: contains 1 if the value is set, 0 if it is null.
> * Data vector; effectively a byte array in which each value is packed one after another.
> *  Offset vector, in which the entry for each row points to the first byte of the value in the data vector.
> Suppose we have the values "foo", null, "bar". Then, the vectors contain:
> {code}
> Is-Set: [1 0 1]
> Offsets: [0 3 3 6]
> Data:  [f o o b a r]
> {code}
> (Yes, there is one more offset entry than rows.)
> Suppose that the code creating the vector writes values for rows 1 and 3, but omits 2 (it is null, which is the default). How do we get that required value of 3 in the entry for row 2? The answer is that the logic for setting a value keeps track of the last write position and "backfills" missing offset values:
> {code}
>     public void setSafe(int index, ByteBuffer value, int start, int length) {
>       if (index > lastSet + 1) {
>         fillEmpties(index);
>       }
>       ...
> {code}
> So, when we write the value for row 3 ("bar") we back-fill the missing offset for row 2. So far so good.
> We can now generalize. We must to the same trick any time that we use a vector that uses an offset vector. There are three other cases:
> * Required variable-width vectors (where a missing value is the same as an empty string).
> * A repeated fixed-width vector.
> * A repeated variable-width vector (which has *two* offset vectors).
> The problem is, none of these actually provide the required code. The caller must implement its own back-fill logic else the offset vectors become corrupted.
> Consider the required {{VarCharVector}}:
> {code}
>     protected void set(int index, byte[] bytes, int start, int length) {
>       assert index >= 0;
>       final int currentOffset = offsetVector.getAccessor().get(index);
>       offsetVector.getMutator().set(index + 1, currentOffset + length);
>       data.setBytes(currentOffset, bytes, start, length);
>     }
> {code}
> As a result of this omission, any client which skips null values will corrupt offset vectors. Consider an example: "try", "foo", "", "bar". We omit writing record 2 (empty string). Desired result:
> {code}
> Data: [t r y f o o b a r]
> Offsets: [0 3 6 6 9]
> {code}
> Actual result:
> {code}
> Data: [t r y f o o b a r]
> Offsets: [0 3 6 0 9]
> {code}
> The result is that we compute the width of field 2 as -6, not 3. The value of the empty field is 9, not 0.
> A similar issue arrises with repeated vectors. Consider {{RepeatedVarCharVector}}:
> {code}
>     public void addSafe(int index, byte[] bytes, int start, int length) {
>       final int nextOffset = offsets.getAccessor().get(index+1);
>       values.getMutator().setSafe(nextOffset, bytes, start, length);
>       offsets.getMutator().setSafe(index+1, nextOffset+1);
>     }
> {code}
> Consider this example: (\["a", "b"], \[ ], \["d", "e"]).
> Expected:
> {code}
> Array Offset: [0 2 2 4]
> Value Offset: [0 1 2 3 4]
> Data: [a b d e]
> {code}
> Actual:
> {code}
> Array Offset: [0 2 0 4]
> Value Offset: [0 1 2 3 4]
> Data: [a b d e]
> {code}
> The entry for the (unwritten) position 2 is missing.
> This bug may be the root cause of several other issues found recently. (Potentially DRILL-5470 -- need to verify.)
> Two resolutions are possible:
> * Require that client code write all values, backfilling empty or null values as needed.
> * Generalize the mutators to back-fill in all cases, not just a nullable var char.
> A related issue occurs when a reader fails to do a "final fill" at the end of a batch (DRILL-5487).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)