You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "Sergey Shelukhin (JIRA)" <ji...@apache.org> on 2018/02/14 03:46:00 UTC

[jira] [Comment Edited] (HIVE-18622) Vectorization: IF Statements, Comparisons, and more do not handle NULLs correctly

    [ https://issues.apache.org/jira/browse/HIVE-18622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16363446#comment-16363446 ] 

Sergey Shelukhin edited comment on HIVE-18622 at 2/14/18 3:45 AM:
------------------------------------------------------------------

I looked at most of the iteration 7 on RB (pages 10-13 remaining to go thru), and 7-8 diff.
There's one correctness issue that I found, with 0/i mixup.
Some removals of setting isNull are not clear to me.
My main concern is that either I'm missing the big picture, or there is no unified semantic approach to noNulls, and to the pattern of setting and unsetting it.
The idea of noNulls is to avoid looking at isNull, so it's not clear why certain places in the code that fill the meaningful parts of the batch with non-nulls still don't set noNulls (I commented on one or two, there are many). Seems like noNulls should be set every time there are no nulls, and isNull array should be set correctly by whoever sets noNulls to false.
So, an example uniform approach could be:
0) The only parts of the batch that matter are those included in batch size.
1) set isNull to false every time we set a non-null value and noNulls is not true.
2) when flipping noNulls to false, make sure that isNull is correct; when flipping it as part of larger loop that always sets isNull unconditionally (e.g. in TreeReader::nextVector in ORC) no additional action necessary; when looping thru a bunch of non-nulls and finding a null it may be necessary to backfill the false-s in the preceding values and rely on (1) to fill the following values once noNulls is flipped; when setting individual elements without context it may be necessary to fill the array entirely (done only once when actually flipping noNulls).

Right now it seems like some places are too conservative and fill isNull even when noNulls is true; and some actually remove isNull-setting when setting values to non-nulls, without checking for noNulls (so, presumably if noNulls is true isNull could be incorrect and it's not clear the next set that happens to be a null doesn't flip noNulls and renders the previous value invalid.
Or at least the pattern in these approaches is not clear to me - could be because the patch is so large; perhaps it should be described somewhere in one place.




was (Author: sershe):
I looked at most of the iteration 7 on RB (pages 10-13 remaining to go thru), and 7-8 diff.
There's one correctness issue that I found, with 0/i mixup.
Some removals of setting isNull are not clear to me.
My main concern is that either I'm missing the big picture, or there is no unified semantic approach to noNulls, and to the pattern of setting and unsetting it.
The idea of noNulls is to avoid looking at isNull, so it's not clear why certain places in the code that fill the meaningful parts of the batch with non-nulls still don't set noNulls (I commented on one or two, there are many). Seems like noNulls should be set every time there are noNulls, and isNull array should be set correctly by whoever sets noNulls to false.
So, the approach could be:
1) set isNull to false every time we set a non-null value and noNulls is not true.
2) when flipping noNulls to false, make sure that isNull is correct; when flipping it as part of larger loop that always sets isNull unconditionally (e.g. in TreeReader::nextVector in ORC) it's not necessary; when looping thru a bunch of non-nulls and finding a null it may be necessary to backfill the false-s in the preceding values and rely on (1) to fill the following values once noNulls is flipped; when setting individual elements without context it may be necessary to fill the array entirely (done only once when actually flipping noNulls).

Right now it seems like some places are too conservative and fill isNull even when noNulls is true; and some actually remove isNull-setting when setting values to non-nulls, without checking for noNulls (so, presumably if noNulls is true isNull could be incorrect and it's not clear the next set that happens to be a null doesn't flip noNulls and renders the previous value invalid.
Or at least the pattern in these approaches is not clear to me - could be because the patch is so large; perhaps it should be described somewhere in one place.



> Vectorization: IF Statements, Comparisons, and more do not handle NULLs correctly
> ---------------------------------------------------------------------------------
>
>                 Key: HIVE-18622
>                 URL: https://issues.apache.org/jira/browse/HIVE-18622
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>            Reporter: Matt McCline
>            Assignee: Matt McCline
>            Priority: Critical
>             Fix For: 3.0.0
>
>         Attachments: HIVE-18622.03.patch, HIVE-18622.04.patch, HIVE-18622.05.patch, HIVE-18622.06.patch, HIVE-18622.07.patch, HIVE-18622.08.patch, HIVE-18622.09.patch, HIVE-18622.091.patch, HIVE-18622.092.patch, HIVE-18622.093.patch, HIVE-18622.094.patch, HIVE-18622.095.patch, HIVE-18622.096.patch
>
>
>  
>  Many vector expression classes are setting noNulls to true which does not work if the VRB is a scratch column being reused. The previous use may have set noNulls to false and the isNull array will have some rows marked as NULL. The result is wrong query results and sometimes NPEs (for BytesColumnVector).
> So, many vector expressions need this:
> {code:java}
>       // Carefully handle NULLs...
>       /*
>        * For better performance on LONG/DOUBLE we don't want the conditional
>        * statements inside the for loop.
>        */
>       outputColVector.noNulls = false;
>  {code}
> And, vector expressions need to make sure the isNull array entry is set when outputColVector.noNulls is false.
> And, all place that assign column value need to set noNulls to false when the value is NULL.
> Almost all cases where noNulls is set to true are incorrect.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)