Posted to dev@cassandra.apache.org by Piotr Kołaczkowski <pk...@datastax.com> on 2023/04/04 09:28:15 UTC

[DISCUSS] CEP-29 CQL NOT Operator

Hi everyone!

I created a new CEP for adding NOT support to the query language and
want to start discussion around it:
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-29%3A+CQL+NOT+operator

Happy to get your feedback.
--
Piotr

Re: [DISCUSS] CEP-29 CQL NOT Operator

Posted by Andrés de la Peña <ad...@apache.org>.

Indeed requiring AF for "select * from ks.tb where p1 = 1 and c1 = 2 and
col2 = 1", where p1 and c1 are all the columns in the primary key, sounds
like a bug.

I think the criterion in the code is that we require AF if there is any
column restriction that cannot be processed by the primary key or a
secondary index. The error message indeed seems to reject any kind of
filtering, independently of primary key filters. We can see this even
without defined clustering keys:

CREATE TABLE t (k int PRIMARY KEY, v int);
SELECT * FROM t WHERE k = 1 AND v = 1; -- requires AF

That clashes with documentation, where it's said that AF is required for
filters that require scanning all partitions. If we were to adapt the code
to the behaviour described in documentation we shouldn't require AF if
there are restrictions specifying a partition key, or possibly a group of
partition keys if an IN restriction is used. So neither within-row nor
within-partition filtering would require AF.
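
The two criteria can be contrasted with a toy model. The Python sketch below is only an illustration under simplifying assumptions (every restriction is an equality, and a query is reduced to the set of restricted column names); the function names are hypothetical and this is not Cassandra's actual restriction-checking code.

```python
def needs_af_current(restricted, partition_key, clustering_key, indexed=frozenset()):
    """Toy model of the behaviour observed in this thread."""
    primary_key = set(partition_key) | set(clustering_key)
    unindexed = set(restricted) - set(indexed)
    if not unindexed:
        return False
    # Any restriction that neither the primary key nor an index can serve
    # triggers ALLOW FILTERING, even when the partition key is fully given.
    if unindexed - primary_key:
        return True
    # Restrictions on primary key columns only still need the whole partition key.
    return not set(partition_key) <= set(restricted)


def needs_af_documented(restricted, partition_key, indexed=frozenset()):
    """Toy model of the documented rule: AF only when all partitions must be scanned."""
    unindexed = set(restricted) - set(indexed)
    if not unindexed:
        return False
    return not set(partition_key) <= set(restricted)


# Table t (k int PRIMARY KEY, v int), query WHERE k = 1 AND v = 1
print(needs_af_current({"k", "v"}, ["k"], []))   # current behaviour
print(needs_af_documented({"k", "v"}, ["k"]))    # documented behaviour
```

For the table t above, the first function says AF is needed while the second says it is not, which is the mismatch between code and documentation.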

Regarding adding a new ALLOW FILTERING WITHIN PARTITION, I think we could
just add a guardrail to directly disallow those queries, without needing to
add the WITHIN PARTITION clause to the CQL grammar.

On Thu, 13 Apr 2023 at 11:11, Henrik Ingo <he...@datastax.com> wrote:

>
>
> On Thu, Apr 13, 2023 at 10:20 AM Miklosovic, Stefan <
> Stefan.Miklosovic@netapp.com> wrote:
>
>> Somebody correct me if I am wrong but "partition key" itself is not
>> enough (primary keys = partition keys + clustering columns). It will
>> require ALLOW FILTERING when clustering columns are not specified either.
>>
>> create table ks.tb (p1 int, c1 int, col1 int, col2 int, primary key (p1,
>> c1));
>> select * from ks.tb where p1 = 1 and col1 = 2;     // this will require
>> allow filtering
>>
>> The documentation seems to omit this fact.
>>
>
> It does seem so.
>
> That said, personally I was assuming, and would still argue it's the
> optimal choice, that the documentation was right and reality is wrong.
>
> If there is a partition key, then the query can avoid scanning the entire
> table, across all nodes, potentially petabytes.
>
> If a query specifies a partition key but not the full clustering key, of
> course there will be some scanning needed, but this is marginal compared to
> the need to scan the entire table. Even in the worst case, a partition with
> 2 billion cells, we are talking about seconds to filter the result from the
> single partition.
>
> > Aha I get what you all mean:
>
> No, I actually think both are unnecessary. But yeah, certainly this latter
> case is a bug?
>
> henrik

Re: [DISCUSS] CEP-29 CQL NOT Operator

Posted by Henrik Ingo <he...@datastax.com>.

On Thu, Apr 13, 2023 at 10:20 AM Miklosovic, Stefan <
Stefan.Miklosovic@netapp.com> wrote:

> Somebody correct me if I am wrong but "partition key" itself is not enough
> (primary keys = partition keys + clustering columns). It will require ALLOW
> FILTERING when clustering columns are not specified either.
>
> create table ks.tb (p1 int, c1 int, col1 int, col2 int, primary key (p1,
> c1));
> select * from ks.tb where p1 = 1 and col1 = 2;     // this will require
> allow filtering
>
> The documentation seems to omit this fact.
>

It does seem so.

That said, personally I was assuming, and would still argue it's the
optimal choice, that the documentation was right and reality is wrong.

If there is a partition key, then the query can avoid scanning the entire
table, across all nodes, potentially petabytes.

If a query specifies a partition key but not the full clustering key, of
course there will be some scanning needed, but this is marginal compared to
the need to scan the entire table. Even in the worst case, a partition with
2 billion cells, we are talking about seconds to filter the result from the
single partition.

> Aha I get what you all mean:

No, I actually think both are unnecessary. But yeah, certainly this latter
case is a bug?

henrik

Re: [DISCUSS] CEP-29 CQL NOT Operator

Posted by "Miklosovic, Stefan" <St...@netapp.com>.

Aha I get what you all mean:

select * from ks.tb where p1 = 1 and c1 = 2 and col2 = 1;

this will require ALLOW FILTERING too.

I agree that this is a little bit too much: we provided all keys but it still complains. We could probably relax the restrictions here.

________________________________________
From: Miklosovic, Stefan <St...@netapp.com>
Sent: Thursday, April 13, 2023 9:20
To: dev@cassandra.apache.org
Subject: Re: [DISCUSS] CEP-29 CQL NOT Operator

Somebody correct me if I am wrong but "partition key" itself is not enough (primary keys = partition keys + clustering columns). It will require ALLOW FILTERING when clustering columns are not specified either.

create table ks.tb (p1 int, c1 int, col1 int, col2 int, primary key (p1, c1));
select * from ks.tb where p1 = 1 and col1 = 2;     // this will require allow filtering

The documentation seems to omit this fact.

________________________________________
From: Henrik Ingo <he...@datastax.com>
Sent: Thursday, April 13, 2023 8:53
To: dev@cassandra.apache.org
Subject: Re: [DISCUSS] CEP-29 CQL NOT Operator

Wait... Why would anything require ALLOW FILTERING if the partition key is defined? That seems to contradict documentation: https://cassandra.apache.org/doc/latest/cassandra/cql/dml.html#allow-filtering

Also my intuition / expectation matches what the manual says.

henrik

On Fri, Apr 7, 2023 at 12:01 AM Jeremy Hanna <je...@gmail.com> wrote:
Considering all of the examples require using ALLOW FILTERING with the partition key specified, I think it's appropriate to consider separating out use of ALLOW FILTERING within a partition versus ALLOW FILTERING across the whole table.  A few years back we had a discussion about this in ASF slack in the context of capability restrictions and it seems relevant here.  That is, we don't want people to get comfortable using ALLOW FILTERING across the whole table.  However, there are times when ALLOW FILTERING within a partition is reasonable.

Ticket to discuss separating them out: https://issues.apache.org/jira/browse/CASSANDRA-15803
Summary: Perhaps add an optional [WITHIN PARTITION] or something similar to make it backwards compatible and indicate that this is purely within the specified partition.

This also gives us the ability to disallow table-scan types of ALLOW FILTERING from a guardrail perspective, because the intent is explicit.  That is, operators could disallow ALLOW FILTERING but allow ALLOW FILTERING WITHIN PARTITION, or whatever is decided.

I do NOT want to hijack a good discussion but I thought this separation could be useful within this context.

Jeremy

On Apr 6, 2023, at 3:00 PM, Patrick McFadin <pm...@gmail.com> wrote:

I love that this is finally coming to Cassandra. Absolutely hate that, once again, we'll be endorsing the use of ALLOW FILTERING. This is an anti-pattern that keeps getting legitimized.

Hot take: Should we just not do Milestones 1 and 2 and wait for an index-only Milestone 3?

Patrick

On Thu, Apr 6, 2023 at 10:04 AM David Capwell <dc...@apple.com> wrote:
Overall I welcome this feature, was trying to use this around 1-2 months back and found we didn’t support it, so glad to see it coming!

From a testing point of view, I think we would want to have good fuzz testing covering complex types (frozen/non-frozen collections, tuples, UDTs, etc.), and reverse ordering; both areas tend to cause the most problems for new features (and existing ones).

We also will want a way to disable this feature, and optionally disable at different sections (such as m2’s NOT IN for partition keys).

Re: [DISCUSS] CEP-29 CQL NOT Operator

Posted by "Miklosovic, Stefan" <St...@netapp.com>.

Somebody correct me if I am wrong but "partition key" itself is not enough (primary keys = partition keys + clustering columns). It will require ALLOW FILTERING when clustering columns are not specified either.

create table ks.tb (p1 int, c1 int, col1 int, col2 int, primary key (p1, c1));
select * from ks.tb where p1 = 1 and col1 = 2;     // this will require allow filtering

The documentation seems to omit this fact.

Re: [DISCUSS] CEP-29 CQL NOT Operator

Posted by Henrik Ingo <he...@datastax.com>.

Wait... Why would anything require ALLOW FILTERING if the partition key is
defined? That seems to contradict documentation:
https://cassandra.apache.org/doc/latest/cassandra/cql/dml.html#allow-filtering

Also my intuition / expectation matches what the manual says.

henrik

Re: [DISCUSS] CEP-29 CQL NOT Operator

Posted by Piotr Kołaczkowski <pk...@datastax.com>.

Ok, overall I think the discussion has settled and the feature is
non-controversial, except the approach to ALLOW FILTERING.
I added a note to the non-goals saying that we don't want to change the
approach to ALLOW FILTERING here, and that this proposal stays
consistent with the current approach.
We can always rethink ALLOW FILTERING in a separate CEP, as I can't
see any reason why the NOT operator should be special in that case: AF
applies to all operators.

I'll start a VOTE.

Thanks,
Piotr

Piotr Kołaczkowski
e. pkolaczk@datastax.com
w. www.datastax.com


On Fri, 28 Apr 2023 at 11:22, Piotr Kołaczkowski <pk...@datastax.com> wrote:
>
> > It's easy for an inverted index to find matches efficiently, but not so easy for it to find non-matches.
>
> Yes, I agree, it is not easy for an *index* to do that.
> But I think at least in SAI we could do that by using the index to
> find the matches, and, because they are always returned in the row-id
> order, just iterate all row identifiers skipping the ones found in the
> index (so computing the complement of the set of row ids).
> So we could do that at the iterator level, not at the index level,
> which is IMHO a good thing because that wouldn't need any storage
> format changes.
>
> Piotr Kołaczkowski
> e. pkolaczk@datastax.com
> w. www.datastax.com
>
> On Tue, 11 Apr 2023 at 21:55, Caleb Rackliffe <ca...@gmail.com> wrote:
> >
> > +1 to the proposal from a CQL perspective
> >
> > However, whether we do this in the context of simple partition restriction, a global index query, or a partition-restricted index query, the NOT operator is most likely to be useful only in a post-filtering capacity. (ex. WHERE indexed_set CONTAINS { 'foo'} AND indexed_set NOT CONTAINS { 'bar' })
> >
> > Using Lucene as an example, you might remember that it doesn't (at least IIRC) allow single predicate NOT queries. (See https://stackoverflow.com/questions/3604771/not-query-in-lucene) It's easy for an inverted index to find matches efficiently, but not so easy for it to find non-matches. This is similar to, but even less-straightforward than, the issue you have w/ boolean queries when you query the less selective of the two possible values. You can create an accompanying "negated" index, but that's not free, of course.
> >
> > Again, not necessarily a problem w/ the CEP, but want to call out the potential complication...

Re: [DISCUSS] CEP-29 CQL NOT Operator

Posted by Piotr Kołaczkowski <pk...@datastax.com>.

> It's easy for an inverted index to find matches efficiently, but not so easy for it to find non-matches.

Yes, I agree, it is not easy for an *index* to do that.
But I think at least in SAI we could do that by using the index to
find the matches, and, because they are always returned in the row-id
order, just iterate all row identifiers skipping the ones found in the
index (so computing the complement of the set of row ids).
So we could do that at the iterator level, not at the index level,
which is IMHO a good thing because that wouldn't need any storage
format changes.
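
A rough Python sketch of that iterator-level complement follows. It assumes, purely for illustration, that row ids are dense integers from 0 to total_rows - 1 and that the index hands back matches as an ascending iterator; the function name and both parameters are hypothetical and not taken from the SAI code.

```python
def complement(matches, total_rows):
    """Yield every row id in [0, total_rows) that is absent from `matches`.

    `matches` must be an iterable of row ids in ascending order, as an
    index that returns results in row-id order would produce them.
    """
    it = iter(matches)
    nxt = next(it, None)
    for row_id in range(total_rows):
        # Advance the match iterator until it catches up with row_id.
        while nxt is not None and nxt < row_id:
            nxt = next(it, None)
        if nxt == row_id:
            continue  # row is in the index result, so NOT excludes it
        yield row_id


print(list(complement([1, 3, 4], 6)))
```

Because both streams are consumed in row-id order, the merge is a single pass and needs no extra storage beyond the current match.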

Piotr Kołaczkowski
e. pkolaczk@datastax.com
w. www.datastax.com

Re: [DISCUSS] CEP-29 CQL NOT Operator

Posted by Caleb Rackliffe <ca...@gmail.com>.

+1 to the proposal from a CQL perspective

*However*, whether we do this in the context of simple partition
restriction, a global index query, or a partition-restricted index query,
the NOT operator is most likely to be useful only in a post-filtering
capacity. (ex. WHERE indexed_set CONTAINS { 'foo'} AND indexed_set NOT
CONTAINS { 'bar' })

Using Lucene as an example, you might remember that it doesn't (at least
IIRC) allow single predicate NOT queries. (See
https://stackoverflow.com/questions/3604771/not-query-in-lucene) It's easy
for an inverted index to find matches efficiently, but not so easy for it
to find non-matches. This is similar to, but even less-straightforward
than, the issue you have w/ boolean queries when you query the less
selective of the two possible values. You can create an accompanying
"negated" index, but that's not free, of course.
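
To illustrate the difference, here is a small Python sketch with a dictionary standing in for an inverted index. It is a toy under stated assumptions (rows held in memory as sets of values, hypothetical function names), not how Lucene or SAI are implemented.

```python
def build_inverted_index(rows):
    """Map each value to the set of row ids whose collection contains it."""
    index = {}
    for row_id, values in rows.items():
        for value in values:
            index.setdefault(value, set()).add(row_id)
    return index


def contains_and_not_contains(rows, index, wanted, unwanted):
    """CONTAINS wanted AND NOT CONTAINS unwanted: index lookup, then post-filter."""
    candidates = index.get(wanted, set())
    return sorted(r for r in candidates if unwanted not in rows[r])


def not_contains_only(rows, index, unwanted):
    """A lone NOT CONTAINS: the index only tells us what to skip, so every row is visited."""
    excluded = index.get(unwanted, set())
    return sorted(r for r in rows if r not in excluded)


rows = {1: {"foo"}, 2: {"foo", "bar"}, 3: {"baz"}}
index = build_inverted_index(rows)
print(contains_and_not_contains(rows, index, "foo", "bar"))
print(not_contains_only(rows, index, "bar"))
```

In the first query the positive predicate narrows the candidates and NOT only filters them; in the second there is nothing to narrow with, so the work is proportional to the number of rows.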

Again, not necessarily a problem w/ the CEP, but want to call out the
potential complication...

Re: [DISCUSS] CEP-29 CQL NOT Operator

Posted by Jeremy Hanna <je...@gmail.com>.

Considering all of the examples require using ALLOW FILTERING with the partition key specified, I think it's appropriate to consider separating out use of ALLOW FILTERING within a partition versus ALLOW FILTERING across the whole table.  A few years back we had a discussion about this in ASF slack in the context of capability restrictions and it seems relevant here.  That is, we don't want people to get comfortable using ALLOW FILTERING across the whole table.  However, there are times when ALLOW FILTERING within a partition is reasonable.

Ticket to discuss separating them out: https://issues.apache.org/jira/browse/CASSANDRA-15803
Summary: Perhaps add an optional [WITHIN PARTITION] or something similar to make it backwards compatible and indicate that this is purely within the specified partition.

This also gives us the ability to disallow table scan types of ALLOW FILTERING from a guard rail perspective, because the intent is explicit.  That is, operators could disallow ALLOW FILTERING but allow ALLOW FILTERING WITHIN PARTITION, or whatever is decided.
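
As a rough illustration of the separation described above, here is a minimal Python sketch of how a guardrail might tell the two cases apart. Everything in it is hypothetical: the `Query` model, `FilteringScope`, `classify` and `guardrail_permits` are names made up for this example and are not Cassandra's internal representation or an existing guardrail. It assumes a query counts as "within partition" when every partition key column is restricted with = or IN.

```python
from dataclasses import dataclass
from enum import Enum


class FilteringScope(Enum):
    NONE = "none"                    # statement does not use ALLOW FILTERING
    WITHIN_PARTITION = "partition"   # every partition key column is pinned
    WHOLE_TABLE = "table"            # filtering may have to scan all partitions


@dataclass(frozen=True)
class Query:
    # Hypothetical, simplified model of a SELECT statement.
    partition_key: frozenset         # partition key columns of the table
    eq_or_in_restricted: frozenset   # columns restricted with = or IN
    allow_filtering: bool            # statement carries ALLOW FILTERING


def classify(query):
    """Return the scope in which the query would filter."""
    if not query.allow_filtering:
        return FilteringScope.NONE
    # Subset check: are all partition key columns restricted?
    if query.partition_key <= query.eq_or_in_restricted:
        return FilteringScope.WITHIN_PARTITION
    return FilteringScope.WHOLE_TABLE


def guardrail_permits(query, allow_table_scan_filtering):
    """Reject only whole-table filtering when the (assumed) flag is off."""
    scope = classify(query)
    return scope is not FilteringScope.WHOLE_TABLE or allow_table_scan_filtering
```

With the assumed flag set to False, a query that pins the partition key and filters on a regular column would pass, while the same filter without a partition key restriction would be rejected.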

I do NOT want to hijack a good discussion but I thought this separation could be useful within this context.

Jeremy

> On Apr 6, 2023, at 3:00 PM, Patrick McFadin <pm...@gmail.com> wrote:
> 
> I love that this is finally coming to Cassandra. Absolutely hate that, once again, we'll be endorsing the use of ALLOW FILTERING. This is an anti-pattern that keeps getting legitimized.
> 
> Hot take: Should we just not do Milestones 1 and 2 and wait for an index-only Milestone 3? 
> 
> Patrick
> 
> On Thu, Apr 6, 2023 at 10:04 AM David Capwell <dc...@apple.com> wrote:
>> Overall I welcome this feature. I was trying to use this around 1-2 months back and found we didn’t support it, so I'm glad to see it coming!
>> 
>> From a testing point of view, I think we would want to have good fuzz testing covering complex types (frozen/non-frozen collections, tuples, udt, etc.), and reverse ordering; both areas tend to cause the most problems for new features (and existing ones).
>> 
>> We also will want a way to disable this feature, and optionally to disable parts of it (such as m2’s NOT IN for partition keys).
>> 
>> > On Apr 4, 2023, at 2:28 AM, Piotr Kołaczkowski <pk...@datastax.com> wrote:
>> > 
>> > Hi everyone!
>> > 
>> > I created a new CEP for adding NOT support to the query language and
>> > want to start discussion around it:
>> > https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-29%3A+CQL+NOT+operator
>> > 
>> > Happy to get your feedback.
>> > --
>> > Piotr
>>

Re: [DISCUSS] CEP-29 CQL NOT Operator

Posted by Patrick McFadin <pm...@gmail.com>.

I love that this is finally coming to Cassandra. Absolutely hate that, once
again, we'll be endorsing the use of ALLOW FILTERING. This is an
anti-pattern that keeps getting legitimized.

Hot take: Should we just not do Milestones 1 and 2 and wait for an
index-only Milestone 3?

Patrick

On Thu, Apr 6, 2023 at 10:04 AM David Capwell <dc...@apple.com> wrote:

> Overall I welcome this feature. I was trying to use this around 1-2 months
> back and found we didn’t support it, so I'm glad to see it coming!
>
> From a testing point of view, I think we would want to have good fuzz
> testing covering complex types (frozen/non-frozen collections, tuples, udt,
> etc.), and reverse ordering; both areas tend to cause the most problems
> for new features (and existing ones).
>
> We also will want a way to disable this feature, and optionally to disable
> parts of it (such as m2’s NOT IN for partition keys).
>
> > On Apr 4, 2023, at 2:28 AM, Piotr Kołaczkowski <pk...@datastax.com>
> wrote:
> >
> > Hi everyone!
> >
> > I created a new CEP for adding NOT support to the query language and
> > want to start discussion around it:
> >
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-29%3A+CQL+NOT+operator
> >
> > Happy to get your feedback.
> > --
> > Piotr
>
>

Re: [DISCUSS] CEP-29 CQL NOT Operator

Posted by David Capwell <dc...@apple.com>.

Overall I welcome this feature. I was trying to use this around 1-2 months back and found we didn’t support it, so I'm glad to see it coming!

From a testing point of view, I think we would want to have good fuzz testing covering complex types (frozen/non-frozen collections, tuples, udt, etc.), and reverse ordering; both areas tend to cause the most problems for new features (and existing ones).
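
To make the idea concrete, here is a very small Python sketch of that kind of fuzz test: random data, a random NOT IN exclusion set and a random clustering order, checked against a trivial reference model. It is only an illustration under stated assumptions. `run_query` is a hypothetical callable standing in for whatever actually executes the query, the function names are made up, and it covers plain ints only, not the complex types mentioned above.

```python
import random


def reference_not_in(rows, excluded, descending):
    """Reference model: drop excluded values, then apply the clustering order."""
    kept = [r for r in rows if r not in excluded]
    return sorted(kept, reverse=descending)


def fuzz_not_in(run_query, iterations=200, seed=0):
    """Compare run_query (hypothetical system under test) with the model."""
    rng = random.Random(seed)  # fixed seed so failures are reproducible
    for _ in range(iterations):
        # Distinct clustering values, a random exclusion set, a random order.
        rows = rng.sample(range(50), rng.randint(0, 20))
        excluded = set(rng.sample(range(50), rng.randint(0, 10)))
        descending = rng.choice([False, True])
        expected = reference_not_in(rows, excluded, descending)
        actual = run_query(rows, excluded, descending)
        assert actual == expected, (rows, excluded, descending, actual)
    return iterations
```

An implementation that filters correctly but ignores the reversed order would trip the assertion, which is the class of bug reverse ordering tends to expose.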

We also will want a way to disable this feature, and optionally to disable parts of it (such as m2’s NOT IN for partition keys).

> On Apr 4, 2023, at 2:28 AM, Piotr Kołaczkowski <pk...@datastax.com> wrote:
> 
> Hi everyone!
> 
> I created a new CEP for adding NOT support to the query language and
> want to start discussion around it:
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-29%3A+CQL+NOT+operator
> 
> Happy to get your feedback.
> --
> Piotr