You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@arrow.apache.org by Ben Kietzman <be...@rstudio.com> on 2019/08/29 13:36:49 UTC

[DISCUSS] Ternary logic

To my knowledge, there isn't explicit documentation on how null slots in an
array should be interpreted. SQL uses Kleene logic, wherein a null is
explicitly an unknown rather than a special value. This yields for example
`(null AND false) -> false`, since `(x AND false) -> false` for all
possible values of x. This is also the behavior of Gandiva's boolean
expressions.

By contrast the boolean kernels implement something closer to the behavior
of NaN: `(null AND false) -> null`. I think this is simply an error in the
boolean kernels but in any case I think explicit documentation should be
added to prevent future confusion.

https://issues.apache.org/jira/browse/ARROW-6386

Re: [DISCUSS] Ternary logic

Posted by Francois Saint-Jacques <fs...@gmail.com>.

I created the ticket https://issues.apache.org/jira/browse/ARROW-6396,
I think we can offer both.

François


On Thu, Aug 29, 2019 at 5:10 PM Ben Kietzman <be...@rstudio.com> wrote:
>
> Indeed it's not about sanitizing nulls; it's about how nulls should
> interact with boolean (and other) expressions.
>
> For purposes of discussion, I'm naming the current approach of propagating
> null "NaN logic" (since all expressions involving NaN evaluate to NaN).
>
> To give some context for this discussion, I'm currently working on support
> for filter expressions (ARROW-6243).
>
> As an example of when this would come into play, let there be a dataset
> spanning several files. The older files have an IPV4 column while the newer
> files have IPV6 as well.
> With NaN logic the expression (IPV4=="127.0.0.1" or IPV6=="::1") yields
> null for all of the older files since they lack an IPV6 column (regardless
> of their IPV4 column) which
> seems undesirable.
>
> Could you explain what you mean by "safest"?
> Under NaN logic, the Kleene result can be recovered with
> (coalesce(IPV4=="127.0.0.1", false) or coalesce(IPV6=="::1", false))
> Under Kleene logic, the NaN result can be recovered with (case IPV4 is null
> or IPV6 is null when 1 then null else IPV4=="127.0.0.1" or IPV6=="::1" end)
> I don't think we're losing information either way.
>
> I'm not attached to either system but I'd like to understand and document
> the rationale behind our choice.
>
> On Thu, Aug 29, 2019 at 1:14 PM Antoine Pitrou <an...@python.org> wrote:
>
> >
> > IIUC it's not about sanitizing to false.  Ben explained it in more
> > detail in private to me, perhaps he want to copy that explanation here ;-)
> >
> > Regards
> >
> > Antoine.
> >
> >
> > Le 29/08/2019 à 19:05, Wes McKinney a écrit :
> > > hi Ben,
> > >
> > > My instinct is that always propagating null (at least by default) is
> > > the safest choice. Applications can choose to sanitize null to false
> > > if that's what they want semantically.
> > >
> > > - Wes
> > >
> > > On Thu, Aug 29, 2019 at 8:37 AM Ben Kietzman <be...@rstudio.com>
> > wrote:
> > >>
> > >> To my knowledge, there isn't explicit documentation on how null slots
> > in an
> > >> array should be interpreted. SQL uses Kleene logic, wherein a null is
> > >> explicitly an unknown rather than a special value. This yields for
> > example
> > >> `(null AND false) -> false`, since `(x AND false) -> false` for all
> > >> possible values of x. This is also the behavior of Gandiva's boolean
> > >> expressions.
> > >>
> > >> By contrast the boolean kernels implement something closer to the
> > behavior
> > >> of NaN: `(null AND false) -> null`. I think this is simply an error in
> > the
> > >> boolean kernels but in any case I think explicit documentation should be
> > >> added to prevent future confusion.
> > >>
> > >> https://issues.apache.org/jira/browse/ARROW-6386
> >

Re: [DISCUSS] Ternary logic

Posted by Ben Kietzman <be...@rstudio.com>.

Indeed it's not about sanitizing nulls; it's about how nulls should
interact with boolean (and other) expressions.

For purposes of discussion, I'm naming the current approach of propagating
null "NaN logic" (since all expressions involving NaN evaluate to NaN).

To give some context for this discussion, I'm currently working on support
for filter expressions (ARROW-6243).

As an example of when this would come into play, let there be a dataset
spanning several files. The older files have an IPV4 column while the newer
files have IPV6 as well.
With NaN logic the expression (IPV4=="127.0.0.1" or IPV6=="::1") yields
null for all of the older files since they lack an IPV6 column (regardless
of their IPV4 column) which
seems undesirable.

Could you explain what you mean by "safest"?
Under NaN logic, the Kleene result can be recovered with
(coalesce(IPV4=="127.0.0.1", false) or coalesce(IPV6=="::1", false))
Under Kleene logic, the NaN result can be recovered with (case IPV4 is null
or IPV6 is null when 1 then null else IPV4=="127.0.0.1" or IPV6=="::1" end)
I don't think we're losing information either way.

I'm not attached to either system but I'd like to understand and document
the rationale behind our choice.

On Thu, Aug 29, 2019 at 1:14 PM Antoine Pitrou <an...@python.org> wrote:

>
> IIUC it's not about sanitizing to false.  Ben explained it in more
> detail in private to me, perhaps he want to copy that explanation here ;-)
>
> Regards
>
> Antoine.
>
>
> Le 29/08/2019 à 19:05, Wes McKinney a écrit :
> > hi Ben,
> >
> > My instinct is that always propagating null (at least by default) is
> > the safest choice. Applications can choose to sanitize null to false
> > if that's what they want semantically.
> >
> > - Wes
> >
> > On Thu, Aug 29, 2019 at 8:37 AM Ben Kietzman <be...@rstudio.com>
> wrote:
> >>
> >> To my knowledge, there isn't explicit documentation on how null slots
> in an
> >> array should be interpreted. SQL uses Kleene logic, wherein a null is
> >> explicitly an unknown rather than a special value. This yields for
> example
> >> `(null AND false) -> false`, since `(x AND false) -> false` for all
> >> possible values of x. This is also the behavior of Gandiva's boolean
> >> expressions.
> >>
> >> By contrast the boolean kernels implement something closer to the
> behavior
> >> of NaN: `(null AND false) -> null`. I think this is simply an error in
> the
> >> boolean kernels but in any case I think explicit documentation should be
> >> added to prevent future confusion.
> >>
> >> https://issues.apache.org/jira/browse/ARROW-6386
>

Re: [DISCUSS] Ternary logic

Posted by Antoine Pitrou <an...@python.org>.

IIUC it's not about sanitizing to false.  Ben explained it in more
detail in private to me, perhaps he want to copy that explanation here ;-)

Regards

Antoine.


Le 29/08/2019 à 19:05, Wes McKinney a écrit :
> hi Ben,
> 
> My instinct is that always propagating null (at least by default) is
> the safest choice. Applications can choose to sanitize null to false
> if that's what they want semantically.
> 
> - Wes
> 
> On Thu, Aug 29, 2019 at 8:37 AM Ben Kietzman <be...@rstudio.com> wrote:
>>
>> To my knowledge, there isn't explicit documentation on how null slots in an
>> array should be interpreted. SQL uses Kleene logic, wherein a null is
>> explicitly an unknown rather than a special value. This yields for example
>> `(null AND false) -> false`, since `(x AND false) -> false` for all
>> possible values of x. This is also the behavior of Gandiva's boolean
>> expressions.
>>
>> By contrast the boolean kernels implement something closer to the behavior
>> of NaN: `(null AND false) -> null`. I think this is simply an error in the
>> boolean kernels but in any case I think explicit documentation should be
>> added to prevent future confusion.
>>
>> https://issues.apache.org/jira/browse/ARROW-6386

Re: [DISCUSS] Ternary logic

Posted by Wes McKinney <we...@gmail.com>.

hi Ben,

My instinct is that always propagating null (at least by default) is
the safest choice. Applications can choose to sanitize null to false
if that's what they want semantically.

- Wes

On Thu, Aug 29, 2019 at 8:37 AM Ben Kietzman <be...@rstudio.com> wrote:
>
> To my knowledge, there isn't explicit documentation on how null slots in an
> array should be interpreted. SQL uses Kleene logic, wherein a null is
> explicitly an unknown rather than a special value. This yields for example
> `(null AND false) -> false`, since `(x AND false) -> false` for all
> possible values of x. This is also the behavior of Gandiva's boolean
> expressions.
>
> By contrast the boolean kernels implement something closer to the behavior
> of NaN: `(null AND false) -> null`. I think this is simply an error in the
> boolean kernels but in any case I think explicit documentation should be
> added to prevent future confusion.
>
> https://issues.apache.org/jira/browse/ARROW-6386