You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@cassandra.apache.org by Benjamin Lerer <bl...@apache.org> on 2024/04/04 09:52:47 UTC

Re: [DISCUSS] NULL handling and the unfrozen collection issue

>
> Now, in Cassandra setting a column to null means deleting it and if *all*
> columns in a row are null the row is deleted. This might be another edge
> case...


It is slightly more complicated than that as the primary key columns count
in (*all* columns)

For example if you have the following table: CREATE TABLE tlb (pk int, c
int, v int, PRIMARY KEY (pk, c)) and the following row: INSERT INTO tlb
(pk, c, v) VALUES (1, 1, 1)
deleting the column v (DELETE v FROM %s WHERE pk = 1 AND c = 1) will not
delete the row as the primary key columns have a timestamp and therefore do
exist. So the row will still exist with a null value for column v.

If the row was created through an UPDATE (UPDATE tlb SET v = 1 WHERE pk = 1
AND c = 1) things will be different as an UPDATE statement do not create a
timestamp for the clustering columns. By consequence, if V is deleted (set
to null) the row will not have any columns anymore and will be deleted.

The issue here is that in practice we never consider partition keys or
clustering columns as null if the database returns a row for it. Whether
the primary key columns have a timestamp or not.
I believe that we should ignore that fact too as far as IS NULL/IS NOT NULL
are concerned. If a row exists, its primary columns should be considered as
not null. Otherwise we are getting on a really slippery slope. The
INSERT/UPDATE logic is confusing enough in my opinion without adding
another layer to it.

One other issue that we have though is that the code for != LWT does not
work with the Three-Valued logic. If you have: [...] IF col != 1  and col
is null then in the TVL the value should be UNKNOWN therefore the condition
should not match.
It feels to me that we should probably keep that behavior for backward
compatibility reasons but probably change the behavior in Accord if it is
not already done.



Le jeu. 21 mars 2024 à 01:10, German Eichberger via dev <
dev@cassandra.apache.org> a écrit :

> Hi,
>
> +1 I like doing it the SQL way. This makes sense to me.
>
> Now, in Cassandra setting a column to null means deleting it and if *all*​
> columns in a row are null the row is deleted. This might be another edge
> case...
>
> German
> ------------------------------
> *From:* Benjamin Lerer <b....@gmail.com>
> *Sent:* Wednesday, March 20, 2024 9:15 AM
> *To:* dev@cassandra.apache.org <de...@cassandra.apache.org>
> *Subject:* [EXTERNAL] [DISCUSS] NULL handling and the unfrozen collection
> issue
>
> You don't often get email from b.lerer@gmail.com. Learn why this is
> important <https://aka.ms/LearnAboutSenderIdentification>
> Hi everybody,
>
> CEP-29 (CQL NOT Operator) is hitting the grey area of how we want as a
> community to handle NULL including for things like unfrozen (multi-cell)
> collections and I would like to make a proposal for moving forward with
> NULL related issues.
>
> We have currently 2 tickets open about NULL handling (I might have missed
> others):
>
>    1. CASSANDRA-10715
>    <https://issues.apache.org/jira/browse/CASSANDRA-10715>: Allowing
>    Filtering on NULL
>    2. CASSANDRA-17762
>    <https://issues.apache.org/jira/browse/CASSANDRA-17762>: LWT IF col =
>    NULL is inconsistent with SQL NULL
>
> We also had previously some discussion on which we touched the subject:
>
>    - [DISCUSS] LWT UPDATE semantics with + and - when null
>    - CEP-15 multi key transaction syntax
>
> In all those tickets and discussions the consensus was to have a behavior
> similar to SQL.
>
> For null comparisons, SQL uses the three-value logic (
> https://modern-sql.com/concept/three-valued-logic) introducing the need
> for IS NULL and IS NOT NULL operators. Those conflict with the col = NULL
> predicate supported in LWT conditions (CASSANDRA-17762
> <https://issues.apache.org/jira/browse/CASSANDRA-17762>).
>
> So far, as Cassandra was only using inclusive operators, comparisons were
> behaving in an expected way. According to three-valued logic NULL CONTAINS
> 'foo' should return UNKNOWN and the filtering behavior should exclude
> everything which is not true.Therefore the row should not be returned as
> expected. With exclusive operators things are more tricky. NULL NOT
> CONTAINS 'foo' will also return UNKNOWN causing the row to not be returned
> which might not match people's expectations.
> This behavior can be even more confusing once you take into account empty
> and null collections. NOT CONTAINS on an empty collection will return true
> while it will return UNKNOWN on a NULL collection. Unfortunately, for
> unfrozen (multicell) collections we are unable to differentiate between an
> empty and null collection and therefore always treat empty collections as
> NULL.
> For predicates such as map[myKey] != 'foo' when myKey is not present the
> result can also be surprising as it will end up comparing NULL to 'foo'
> returning once more UNKNOWN and ignoring the row.
> In order to respect the SQL three-valued logic and be able to allow the
> user to fetch all the rows which do not contains a specific value we would
> need support IS NULL, IS NOT NULL and OR to allow query like:
> WHERE c IS NULL OR c NOT CONTAINS 'foo' / WHERE m IS NULL OR m[myKey] !=
> foo
>
> Supporting the three-valued logic makes sense to me even if some behavior
> might end up being confusing. In which case we can easily fix
> CASSANDRA-10715 and deprectate support for col = NULL/col != NULL in LWT.
>
> What is people's opinion? Should we go for the three-valued logic
> everywhere? Should we try something else?
>
>
>
>
>
>

Re: [DISCUSS] NULL handling and the unfrozen collection issue

Posted by Benjamin Lerer <b....@gmail.com>.
I found some other confusing behavior in LWT around null value and empty
multicell collection. I opened CASSANDRA-19637
<https://issues.apache.org/jira/browse/CASSANDRA-19637>for those interested.

Le jeu. 4 avr. 2024 à 18:34, Caleb Rackliffe <ca...@gmail.com> a
écrit :

> The easiest way to check out how Accord uses IS NULL and IS NOT NULL is to
> look at the examples in the cep-15-accord branch:
>
>
> https://github.com/apache/cassandra/blob/cep-15-accord/test/distributed/org/apache/cassandra/distributed/test/accord/AccordCQLTestBase.java
>
> tl;dr We did indeed try to go with an approach that more closely matches
> SQL, although there may still be some edges we didn't test.
>
> I'd have no problem w/ moving to 3-value logic everywhere, I guess, but
> "everywhere" could just mean filtering queries and Accord. (i.e. If we want
> to deprecate LWT col = NULL syntax, do we really want people rewriting
> those LWTs...or just moving to the new Accord syntax, which obviously
> supports it? We should "spend" our user query rewrite budget wisely.)
>
> On Thu, Apr 4, 2024 at 4:53 AM Benjamin Lerer <bl...@apache.org> wrote:
>
>> Now, in Cassandra setting a column to null means deleting it and if *all*
>>> columns in a row are null the row is deleted. This might be another edge
>>> case...
>>
>>
>> It is slightly more complicated than that as the primary key columns
>> count in (*all* columns)
>>
>> For example if you have the following table: CREATE TABLE tlb (pk int, c
>> int, v int, PRIMARY KEY (pk, c)) and the following row: INSERT INTO tlb
>> (pk, c, v) VALUES (1, 1, 1)
>> deleting the column v (DELETE v FROM %s WHERE pk = 1 AND c = 1) will not
>> delete the row as the primary key columns have a timestamp and therefore do
>> exist. So the row will still exist with a null value for column v.
>>
>> If the row was created through an UPDATE (UPDATE tlb SET v = 1 WHERE pk =
>> 1 AND c = 1) things will be different as an UPDATE statement do not create
>> a timestamp for the clustering columns. By consequence, if V is deleted
>> (set to null) the row will not have any columns anymore and will be deleted.
>>
>> The issue here is that in practice we never consider partition keys or
>> clustering columns as null if the database returns a row for it. Whether
>> the primary key columns have a timestamp or not.
>> I believe that we should ignore that fact too as far as IS NULL/IS NOT
>> NULL are concerned. If a row exists, its primary columns should be
>> considered as not null. Otherwise we are getting on a really slippery
>> slope. The INSERT/UPDATE logic is confusing enough in my opinion without
>> adding another layer to it.
>>
>> One other issue that we have though is that the code for != LWT does not
>> work with the Three-Valued logic. If you have: [...] IF col != 1  and col
>> is null then in the TVL the value should be UNKNOWN therefore the condition
>> should not match.
>> It feels to me that we should probably keep that behavior for backward
>> compatibility reasons but probably change the behavior in Accord if it is
>> not already done.
>>
>>
>>
>> Le jeu. 21 mars 2024 à 01:10, German Eichberger via dev <
>> dev@cassandra.apache.org> a écrit :
>>
>>> Hi,
>>>
>>> +1 I like doing it the SQL way. This makes sense to me.
>>>
>>> Now, in Cassandra setting a column to null means deleting it and if
>>> *all*​ columns in a row are null the row is deleted. This might be
>>> another edge case...
>>>
>>> German
>>> ------------------------------
>>> *From:* Benjamin Lerer <b....@gmail.com>
>>> *Sent:* Wednesday, March 20, 2024 9:15 AM
>>> *To:* dev@cassandra.apache.org <de...@cassandra.apache.org>
>>> *Subject:* [EXTERNAL] [DISCUSS] NULL handling and the unfrozen
>>> collection issue
>>>
>>> You don't often get email from b.lerer@gmail.com. Learn why this is
>>> important <https://aka.ms/LearnAboutSenderIdentification>
>>> Hi everybody,
>>>
>>> CEP-29 (CQL NOT Operator) is hitting the grey area of how we want as a
>>> community to handle NULL including for things like unfrozen (multi-cell)
>>> collections and I would like to make a proposal for moving forward with
>>> NULL related issues.
>>>
>>> We have currently 2 tickets open about NULL handling (I might have
>>> missed others):
>>>
>>>    1. CASSANDRA-10715
>>>    <https://issues.apache.org/jira/browse/CASSANDRA-10715>: Allowing
>>>    Filtering on NULL
>>>    2. CASSANDRA-17762
>>>    <https://issues.apache.org/jira/browse/CASSANDRA-17762>: LWT IF col
>>>    = NULL is inconsistent with SQL NULL
>>>
>>> We also had previously some discussion on which we touched the subject:
>>>
>>>    - [DISCUSS] LWT UPDATE semantics with + and - when null
>>>    - CEP-15 multi key transaction syntax
>>>
>>> In all those tickets and discussions the consensus was to have a
>>> behavior similar to SQL.
>>>
>>> For null comparisons, SQL uses the three-value logic (
>>> https://modern-sql.com/concept/three-valued-logic) introducing the need
>>> for IS NULL and IS NOT NULL operators. Those conflict with the col = NULL
>>> predicate supported in LWT conditions (CASSANDRA-17762
>>> <https://issues.apache.org/jira/browse/CASSANDRA-17762>).
>>>
>>> So far, as Cassandra was only using inclusive operators, comparisons
>>> were behaving in an expected way. According to three-valued logic NULL
>>> CONTAINS 'foo' should return UNKNOWN and the filtering behavior should
>>> exclude everything which is not true.Therefore the row should not be
>>> returned as expected. With exclusive operators things are more tricky.
>>> NULL NOT CONTAINS 'foo' will also return UNKNOWN causing the row to not be
>>> returned which might not match people's expectations.
>>> This behavior can be even more confusing once you take into account
>>> empty and null collections. NOT CONTAINS on an empty collection will return
>>> true while it will return UNKNOWN on a NULL collection. Unfortunately, for
>>> unfrozen (multicell) collections we are unable to differentiate between an
>>> empty and null collection and therefore always treat empty collections as
>>> NULL.
>>> For predicates such as map[myKey] != 'foo' when myKey is not present the
>>> result can also be surprising as it will end up comparing NULL to 'foo'
>>> returning once more UNKNOWN and ignoring the row.
>>> In order to respect the SQL three-valued logic and be able to allow the
>>> user to fetch all the rows which do not contains a specific value we would
>>> need support IS NULL, IS NOT NULL and OR to allow query like:
>>> WHERE c IS NULL OR c NOT CONTAINS 'foo' / WHERE m IS NULL OR m[myKey] !=
>>> foo
>>>
>>> Supporting the three-valued logic makes sense to me even if some
>>> behavior might end up being confusing. In which case we can easily fix
>>> CASSANDRA-10715 and deprectate support for col = NULL/col != NULL in LWT.
>>>
>>> What is people's opinion? Should we go for the three-valued logic
>>> everywhere? Should we try something else?
>>>
>>>
>>>
>>>
>>>
>>>

Re: [DISCUSS] NULL handling and the unfrozen collection issue

Posted by Caleb Rackliffe <ca...@gmail.com>.
The easiest way to check out how Accord uses IS NULL and IS NOT NULL is to
look at the examples in the cep-15-accord branch:

https://github.com/apache/cassandra/blob/cep-15-accord/test/distributed/org/apache/cassandra/distributed/test/accord/AccordCQLTestBase.java

tl;dr We did indeed try to go with an approach that more closely matches
SQL, although there may still be some edges we didn't test.

I'd have no problem w/ moving to 3-value logic everywhere, I guess, but
"everywhere" could just mean filtering queries and Accord. (i.e. If we want
to deprecate LWT col = NULL syntax, do we really want people rewriting
those LWTs...or just moving to the new Accord syntax, which obviously
supports it? We should "spend" our user query rewrite budget wisely.)

On Thu, Apr 4, 2024 at 4:53 AM Benjamin Lerer <bl...@apache.org> wrote:

> Now, in Cassandra setting a column to null means deleting it and if *all*
>> columns in a row are null the row is deleted. This might be another edge
>> case...
>
>
> It is slightly more complicated than that as the primary key columns count
> in (*all* columns)
>
> For example if you have the following table: CREATE TABLE tlb (pk int, c
> int, v int, PRIMARY KEY (pk, c)) and the following row: INSERT INTO tlb
> (pk, c, v) VALUES (1, 1, 1)
> deleting the column v (DELETE v FROM %s WHERE pk = 1 AND c = 1) will not
> delete the row as the primary key columns have a timestamp and therefore do
> exist. So the row will still exist with a null value for column v.
>
> If the row was created through an UPDATE (UPDATE tlb SET v = 1 WHERE pk =
> 1 AND c = 1) things will be different as an UPDATE statement do not create
> a timestamp for the clustering columns. By consequence, if V is deleted
> (set to null) the row will not have any columns anymore and will be deleted.
>
> The issue here is that in practice we never consider partition keys or
> clustering columns as null if the database returns a row for it. Whether
> the primary key columns have a timestamp or not.
> I believe that we should ignore that fact too as far as IS NULL/IS NOT
> NULL are concerned. If a row exists, its primary columns should be
> considered as not null. Otherwise we are getting on a really slippery
> slope. The INSERT/UPDATE logic is confusing enough in my opinion without
> adding another layer to it.
>
> One other issue that we have though is that the code for != LWT does not
> work with the Three-Valued logic. If you have: [...] IF col != 1  and col
> is null then in the TVL the value should be UNKNOWN therefore the condition
> should not match.
> It feels to me that we should probably keep that behavior for backward
> compatibility reasons but probably change the behavior in Accord if it is
> not already done.
>
>
>
> Le jeu. 21 mars 2024 à 01:10, German Eichberger via dev <
> dev@cassandra.apache.org> a écrit :
>
>> Hi,
>>
>> +1 I like doing it the SQL way. This makes sense to me.
>>
>> Now, in Cassandra setting a column to null means deleting it and if *all*​
>> columns in a row are null the row is deleted. This might be another edge
>> case...
>>
>> German
>> ------------------------------
>> *From:* Benjamin Lerer <b....@gmail.com>
>> *Sent:* Wednesday, March 20, 2024 9:15 AM
>> *To:* dev@cassandra.apache.org <de...@cassandra.apache.org>
>> *Subject:* [EXTERNAL] [DISCUSS] NULL handling and the unfrozen
>> collection issue
>>
>> You don't often get email from b.lerer@gmail.com. Learn why this is
>> important <https://aka.ms/LearnAboutSenderIdentification>
>> Hi everybody,
>>
>> CEP-29 (CQL NOT Operator) is hitting the grey area of how we want as a
>> community to handle NULL including for things like unfrozen (multi-cell)
>> collections and I would like to make a proposal for moving forward with
>> NULL related issues.
>>
>> We have currently 2 tickets open about NULL handling (I might have missed
>> others):
>>
>>    1. CASSANDRA-10715
>>    <https://issues.apache.org/jira/browse/CASSANDRA-10715>: Allowing
>>    Filtering on NULL
>>    2. CASSANDRA-17762
>>    <https://issues.apache.org/jira/browse/CASSANDRA-17762>: LWT IF col =
>>    NULL is inconsistent with SQL NULL
>>
>> We also had previously some discussion on which we touched the subject:
>>
>>    - [DISCUSS] LWT UPDATE semantics with + and - when null
>>    - CEP-15 multi key transaction syntax
>>
>> In all those tickets and discussions the consensus was to have a behavior
>> similar to SQL.
>>
>> For null comparisons, SQL uses the three-value logic (
>> https://modern-sql.com/concept/three-valued-logic) introducing the need
>> for IS NULL and IS NOT NULL operators. Those conflict with the col = NULL
>> predicate supported in LWT conditions (CASSANDRA-17762
>> <https://issues.apache.org/jira/browse/CASSANDRA-17762>).
>>
>> So far, as Cassandra was only using inclusive operators, comparisons were
>> behaving in an expected way. According to three-valued logic NULL CONTAINS
>> 'foo' should return UNKNOWN and the filtering behavior should exclude
>> everything which is not true.Therefore the row should not be returned as
>> expected. With exclusive operators things are more tricky. NULL NOT
>> CONTAINS 'foo' will also return UNKNOWN causing the row to not be returned
>> which might not match people's expectations.
>> This behavior can be even more confusing once you take into account empty
>> and null collections. NOT CONTAINS on an empty collection will return true
>> while it will return UNKNOWN on a NULL collection. Unfortunately, for
>> unfrozen (multicell) collections we are unable to differentiate between an
>> empty and null collection and therefore always treat empty collections as
>> NULL.
>> For predicates such as map[myKey] != 'foo' when myKey is not present the
>> result can also be surprising as it will end up comparing NULL to 'foo'
>> returning once more UNKNOWN and ignoring the row.
>> In order to respect the SQL three-valued logic and be able to allow the
>> user to fetch all the rows which do not contains a specific value we would
>> need support IS NULL, IS NOT NULL and OR to allow query like:
>> WHERE c IS NULL OR c NOT CONTAINS 'foo' / WHERE m IS NULL OR m[myKey] !=
>> foo
>>
>> Supporting the three-valued logic makes sense to me even if some behavior
>> might end up being confusing. In which case we can easily fix
>> CASSANDRA-10715 and deprectate support for col = NULL/col != NULL in LWT.
>>
>> What is people's opinion? Should we go for the three-valued logic
>> everywhere? Should we try something else?
>>
>>
>>
>>
>>
>>